
Bug: wasm log aws.appmesh.ingress_http_stats and aws.appmesh.egress_http_stats panicked #484

Closed
tnsardesai opened this issue Jan 4, 2024 · 9 comments
Labels: Bug (Something isn't working)

@tnsardesai

Summary
What are you observing that doesn't seem right?
Seeing a panic

[2024-01-03 23:46:27.247][92][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.ingress_http_stats: panicked at /codebuild/output/src3353/src/s3/00/wasm/cargo-project/vendor/proxy-wasm/src/hostcalls.rs:1192:42:
called `Result::unwrap()` on an `Err` value: FromUtf8Error { bytes: [49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32, 49, 53, 52, 46, 51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53, 46, 49, 53, 56, 46, 52, 55, 46, 49, 52, 49], error: Utf8Error { valid_up_to: 2, error_len: Some(1) } }
[2024-01-03 23:46:27.247][92][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_request_headers failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0: 0x385ad - rust_panic
1: 0x37842 - std::panicking::rust_panic_with_hook::h3aa054d35a0817d7
2: 0x38a56 - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h2f73e4cf6cd6319a
3: 0x3898f - std::sys_common::backtrace::__rust_end_short_backtrace::h98ac61a6abbff7e9
4: 0x24033 - rust_begin_unwind
5: 0x24d1 - core::panicking::panic_fmt::h3e1dd3d08288569e
6: 0x425f - core::result::unwrap_failed::h8b3db0f11171b57b
7: 0x103f3 - proxy_wasm::hostcalls::get_map::hfc84c18de12c4fd4
8: 0xff0e - _$LT$amzn_appmesh_wasm..reader..host..Host$u20$as$u20$amzn_appmesh_wasm..reader..HttpMapReader$GT$::get_request_headers::h60ed0437583308a0
9: 0xd219 - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..HttpContext$GT$::on_http_request_headers::h706d8f144b9ef225

I also see a similar log for [93][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at /codebuild/output/src3353/src/s3/00/wasm/cargo-project/vendor/proxy-wasm/src/hostcalls.rs:1192:42:

Steps to Reproduce
What are the steps you can take to reproduce this issue?
No idea, but I'm pretty sure this happens only under high load. Looking at our logs, I see this error only on services that receive a high amount of load.

Are you currently working around this issue?
How are you currently solving this problem?
Even after this log appears, the container keeps functioning and responding to requests, so I am not sure what the impact of this panic is. We are also seeing that Envoy metrics are not getting reported, so that might be the impact.

Additional context
Anything else we should know?
This started after we updated our sidecar to 1.27.2.0.

tnsardesai added the Bug label on Jan 4, 2024
@tnsardesai (Author)

Saw a slightly different error today

[2024-01-09 16:07:37.590][91][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at amzn-appmesh-aggregate-stats/src/http/egress.rs:302:29:
called `Option::unwrap()` on a `None` value
[2024-01-09 16:07:37.591][91][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_done failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0:  0x385ad - rust_panic
1:  0x37842 - std::panicking::rust_panic_with_hook::h3aa054d35a0817d7
2:  0x38a24 - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h2f73e4cf6cd6319a
3:  0x3898f - std::sys_common::backtrace::__rust_end_short_backtrace::h98ac61a6abbff7e9
4:  0x24033 - rust_begin_unwind
5:  0x24d1 - core::panicking::panic_fmt::h3e1dd3d08288569e
6:  0xced1 - core::panicking::panic::h0f0c05b20da93dd7
7:  0x1179d - _$LT$aws_appmesh_aggregate_stats..http..egress..EgressHttpStatsFilter$u20$as$u20$amzn_appmesh_wasm..plugin..Filter$GT$::on_done::h8b7afc1a5cddb214
8:  0xcfa5 - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..Context$GT$::on_done::h61f448a25b9c1222
9:  0x2be71 - proxy_on_done

I rolled back a service and gateway to v1.25.1.0. This is not happening in that version of Envoy.

@ytsssun commented Jan 10, 2024

Hi @tnsardesai ,

Thanks for the bug report. We are looking into it. To help us better understand the issue, could you share a bit more information about your mesh setup?

You mentioned this is only observed under high load; do you by any chance have a snapshot of the Envoy stats (since Envoy crashed, you might not be able to retrieve them)?

Have you tried any other versions of the Envoy image, for example v1.25.3.0 and onward?

Lastly, if you have Premium Support, feel free to engage them to cut us a ticket.

@ytsssun commented Jan 12, 2024

The crash is from within proxy-wasm-rust-sdk. The returned bytes are not valid UTF-8.

bytes: [49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32, 49, 53, 52, 46, 51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53, 46, 49, 53, 56, 46, 52, 55, 46, 49, 52, 49]

This is easily reproducible with the following code (see the playground):

#![allow(unused)]
fn main() {
  use std::str;

  // The bytes from the panic message, in a vector.
  let problematic_bytes = vec![49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32, 49, 53, 52, 46, 51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53, 46, 49, 53, 56, 46, 52, 55, 46, 49, 52, 49];

  // Panics with Utf8Error { valid_up_to: 2, error_len: Some(1) }.
  let result = str::from_utf8(&problematic_bytes).unwrap();
}
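
For comparison, here is a minimal sketch of what a non-panicking path could look like (illustrative only, not the SDK's actual handling or the eventual fix): String::from_utf8_lossy never fails, it substitutes U+FFFD for invalid sequences, so the header value would be preserved instead of aborting the whole filter.

fn main() {
  // First few bytes of the reported value: "1 " followed by the invalid bytes 0xC0 0xA7 0xC0 0xA2.
  let invalid: Vec<u8> = vec![49, 32, 192, 167, 192, 162];

  // UTF-8 validation fails on bytes like these; unwrapping that Err is what panics in hostcalls.rs.
  assert!(std::str::from_utf8(&invalid).is_err());

  // A lossy conversion keeps going, replacing the invalid bytes with U+FFFD.
  let lossy = String::from_utf8_lossy(&invalid);
  println!("{}", lossy); // prints "1 ����"
}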

I submitted a bug report to the proxy-wasm Rust SDK: proxy-wasm/proxy-wasm-rust-sdk#217.

Meanwhile, I wonder if there is any breaking change introduced in Envoy that would produce non-UTF-8 encoded bytes.

@tnsardesai (Author)

Thanks for your response. I previously thought it was due to high load, but looking at the data over 7 days, it is happening at all times of day. I believe this is triggered by specific bot or attacker requests with bad headers. All of these errors are on our externally exposed virtual gateways and applications; we are not seeing this in any internal applications.

I have rolled back all of our applications and gateways that see this error to v1.25.1.0. I could try upgrading some of our applications next week to help pinpoint what version broke this, if that will aid your investigation.

I have already engaged AWS Support; the case ID is 170303481101460 for your reference.

Can you also take a look at the error in #484 (comment)? It has a different stack trace.

I also saw this other log today, which is different from the other two.

[2024-01-11 22:24:28.980][91][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at 'called `Option::unwrap()` on a `None` value', amzn-appmesh-aggregate-stats/src/http/egress.rs:172:9
[2024-01-11 22:24:28.980][91][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_done failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0: 0x2db69 - rust_panic
1: 0x2c9a3 - std::panicking::rust_panic_with_hook::hbafe3e603d331223
2: 0x2e0f1 - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h8ab6ee68d5b4c391
3: 0x2e059 - std::sys_common::backtrace::__rust_end_short_backtrace::h008f69666d134159
4: 0x1f5cc - rust_begin_unwind
5: 0x40ed - core::panicking::panic_fmt::h1d17fc068f528130
6: 0xc34f - core::panicking::panic::h27b5eefa3e4ff738
7: 0x1802e - _$LT$aws_appmesh_aggregate_stats..http..egress..EgressHttpStatsFilter$u20$as$u20$amzn_appmesh_wasm..plugin..Filter$GT$::on_done::h90cbcd72c8e2cb58
8: 0xc3f3 - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..Context$GT$::on_done::h30819f4e590de618
9: 0x26138 - proxy_on_done

> The crash is from within proxy-wasm-rust-sdk. The returned bytes are not valid UTF-8.
> Meanwhile, I wonder if there is any breaking change introduced in Envoy that would produce non-UTF-8 encoded bytes.

We have noticed an increase in non-UTF-8 bytes in our access logs. For example, right around the time of this stack trace I see this access log:
[2024-01-11T22:24:28.968Z] "GET /oauth/lookup-ip HTTP/1.1" 400 - 0 0 17 17 "����%2527%2522\'\", 45.115.27.18, 70.132.38.113,127.0.0.1" "Mozilla/5.0 (Linux; U; Android 4.0.3; en-us; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30" "09c51df3-6b4d-9e15-8bf6-472c96e8b355" "production--oauth--833ed5a3.prod.mesh" "127.0.0.1:80"
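
Rendering the bytes from the original panic with non-printable bytes escaped (a quick inspection sketch, not part of the plugin) gives the same shape as that access-log value: a forwarded-for style header whose first entry contains the raw bytes 0xC0 0xA7 0xC0 0xA2, overlong two-byte sequences that UTF-8 validation rejects.

fn main() {
  // Bytes from the FromUtf8Error in the first comment.
  let bytes: Vec<u8> = vec![
    49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32, 49, 53, 52, 46,
    51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53, 46, 49, 53, 56, 46, 52, 55, 46, 49,
    52, 49,
  ];

  // Keep printable ASCII as-is and escape everything else as \xNN.
  let rendered: String = bytes
    .iter()
    .map(|&b| {
      if b.is_ascii_graphic() || b == b' ' {
        (b as char).to_string()
      } else {
        format!("\\x{:02X}", b)
      }
    })
    .collect();

  println!("{}", rendered);
  // -> 1 \xC0\xA7\xC0\xA2%2527%2522, 154.38.172.243, 15.158.47.141
}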

@PiotrSikora

> [2024-01-09 16:07:37.590][91][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at amzn-appmesh-aggregate-stats/src/http/egress.rs:302:29:
> called `Option::unwrap()` on a `None` value
> [2024-01-09 16:07:37.591][91][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_done failed: Uncaught RuntimeError: unreachable
> [...]

@tnsardesai this is different from the UTF-8 bug in the initial comment, and it looks like a bug in amzn-appmesh-aggregate-stats, so it probably makes sense to open a separate issue for that.

@tnsardesai (Author)

I created #485 for the other bug! Over the weekend I saw that bug even on v1.25.1.0, so it is definitely different, since I haven't seen the bug in this issue on v1.25.1.0 over the same period.

This week I plan on upgrading Envoy versions to help pinpoint the exact version of Envoy where things broke!

@tnsardesai (Author)

Hey, I upgraded a service from public.ecr.aws/appmesh/aws-appmesh-envoy:v1.25.1.0-prod to public.ecr.aws/appmesh/aws-appmesh-envoy:v1.25.3.0-prod and got this error in that service today. Other services on v1.25.1.0 are not seeing this error, so v1.25.3.0 must be the first version with this error.

[2024-01-17 01:35:05.995][77][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.ingress_http_stats: panicked at 'called `Result::unwrap()` on an `Err` value: FromUtf8Error { bytes: [97, 116, 116, 97, 99, 104, 109, 101, 110, 116, 59, 32, 102, 105, 108, 101, 110, 97, 109, 101, 61, 34, 83, 237, 108, 97, 98, 97, 115, 32, 84, 243, 110, 105, 99, 97, 115, 32, 121, 32, 97, 116, 243, 110, 97, 115, 32, 49, 54, 32, 101, 110, 101, 114, 111, 32, 50, 48, 50, 52, 46, 112, 110, 103, 34], error: Utf8Error { valid_up_to: 23, error_len: Some(1) } }', /codebuild/output/src573/src/s3/00/wasm/cargo-project/vendor/proxy-wasm/src/hostcalls.rs:1192:42
[2024-01-17 01:35:05.995][77][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_response_headers failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0: 0x37468 - rust_panic
1: 0x3652d - std::panicking::rust_panic_with_hook::ha23c8ba67d558e34
2: 0x3794b - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h17240ff11865f8cf
3: 0x37881 - std::sys_common::backtrace::__rust_end_short_backtrace::h4c1464e549336429
4: 0x22abb - rust_begin_unwind
5: 0x24b3 - core::panicking::panic_fmt::h573ffd15dee74109
6: 0x419f - core::result::unwrap_failed::h26e3bef5a756b770
7: 0x11077 - proxy_wasm::hostcalls::get_map::h042a2451be115839
8: 0x11111 - _$LT$amzn_appmesh_wasm..reader..host..Host$u20$as$u20$amzn_appmesh_wasm..reader..HttpMapReader$GT$::get_response_headers::h7c3e1959a20473cb
9: 0xcb7f - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..HttpContext$GT$::on_http_response_headers::hb63fc0cb769db28f
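
Side note: unlike the first panic, these bytes decode cleanly if each byte is treated as ISO-8859-1 (Latin-1), where every byte maps directly to the Unicode code point of the same value; the result looks like an ordinary Content-Disposition filename with accented characters, so any non-UTF-8 response header seems to be enough to trigger this. A quick inspection sketch (not part of the plugin):

fn main() {
  // Bytes from the FromUtf8Error in the panic above (a response header value).
  let bytes: Vec<u8> = vec![
    97, 116, 116, 97, 99, 104, 109, 101, 110, 116, 59, 32, 102, 105, 108, 101, 110, 97, 109, 101,
    61, 34, 83, 237, 108, 97, 98, 97, 115, 32, 84, 243, 110, 105, 99, 97, 115, 32, 121, 32, 97,
    116, 243, 110, 97, 115, 32, 49, 54, 32, 101, 110, 101, 114, 111, 32, 50, 48, 50, 52, 46, 112,
    110, 103, 34,
  ];

  // Casting each byte to char is a complete Latin-1 decode, since ISO-8859-1
  // code points coincide with the first 256 Unicode code points.
  let decoded: String = bytes.iter().map(|&b| b as char).collect();

  println!("{}", decoded);
  // -> attachment; filename="Sílabas Tónicas y atónas 16 enero 2024.png"
}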

@ytsssun commented Feb 16, 2024

Hi @tnsardesai, just so you know, we are working on an Envoy patch release. It will contain the fix for this issue and #485. I cannot provide a specific date, but it will be soon.

@liubnu commented Feb 28, 2024

This issue is fixed by Envoy release v1.27.3.0 (#486).
