
Bug: wasm log aws.appmesh.ingress_http_stats and aws.appmesh.egress_http_stats panicked #484

Closed
tnsardesai opened this issue Jan 4, 2024 · 9 comments
Labels: Bug (Something isn't working)

@tnsardesai

Summary
What are you observing that doesn't seem right?
Seeing a panic

[2024-01-03 23:46:27.247][92][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.ingress_http_stats: panicked at /codebuild/output/src3353/src/s3/00/wasm/cargo-project/vendor/proxy-wasm/src/hostcalls.rs:1192:42:
called `Result::unwrap()` on an `Err` value: FromUtf8Error { bytes: [49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32, 49, 53, 52, 46, 51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53, 46, 49, 53, 56, 46, 52, 55, 46, 49, 52, 49], error: Utf8Error { valid_up_to: 2, error_len: Some(1) } }
[2024-01-03 23:46:27.247][92][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_request_headers failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0: 0x385ad - rust_panic
1: 0x37842 - std::panicking::rust_panic_with_hook::h3aa054d35a0817d7
2: 0x38a56 - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h2f73e4cf6cd6319a
3: 0x3898f - std::sys_common::backtrace::__rust_end_short_backtrace::h98ac61a6abbff7e9
4: 0x24033 - rust_begin_unwind
5: 0x24d1 - core::panicking::panic_fmt::h3e1dd3d08288569e
6: 0x425f - core::result::unwrap_failed::h8b3db0f11171b57b
7: 0x103f3 - proxy_wasm::hostcalls::get_map::hfc84c18de12c4fd4
8: 0xff0e - _$LT$amzn_appmesh_wasm..reader..host..Host$u20$as$u20$amzn_appmesh_wasm..reader..HttpMapReader$GT$::get_request_headers::h60ed0437583308a0
9: 0xd219 - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..HttpContext$GT$::on_http_request_headers::h706d8f144b9ef225

I also see a similar log for [93][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at /codebuild/output/src3353/src/s3/00/wasm/cargo-project/vendor/proxy-wasm/src/hostcalls.rs:1192:42:

Steps to Reproduce
What are the steps you can take to reproduce this issue?
No idea, but I'm pretty sure this happens only under high load. Looking at our logs, I see this error only on services that receive a high amount of load.

Are you currently working around this issue?
How are you currently solving this problem?
Even after this log appears, the container keeps functioning and responding to requests, so I am not sure what the impact of this panic is. We are also seeing that Envoy metrics are not getting reported, so that might be the impact.

Additional context
Anything else we should know?
This started after we updated our sidecar to 1.27.2.0.

tnsardesai added the Bug label on Jan 4, 2024
@tnsardesai (Author)

Saw a slightly different error today

[2024-01-09 16:07:37.590][91][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at amzn-appmesh-aggregate-stats/src/http/egress.rs:302:29:
called `Option::unwrap()` on a `None` value
[2024-01-09 16:07:37.591][91][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_done failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0:  0x385ad - rust_panic
1:  0x37842 - std::panicking::rust_panic_with_hook::h3aa054d35a0817d7
2:  0x38a24 - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h2f73e4cf6cd6319a
3:  0x3898f - std::sys_common::backtrace::__rust_end_short_backtrace::h98ac61a6abbff7e9
4:  0x24033 - rust_begin_unwind
5:  0x24d1 - core::panicking::panic_fmt::h3e1dd3d08288569e
6:  0xced1 - core::panicking::panic::h0f0c05b20da93dd7
7:  0x1179d - _$LT$aws_appmesh_aggregate_stats..http..egress..EgressHttpStatsFilter$u20$as$u20$amzn_appmesh_wasm..plugin..Filter$GT$::on_done::h8b7afc1a5cddb214
8:  0xcfa5 - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..Context$GT$::on_done::h61f448a25b9c1222
9:  0x2be71 - proxy_on_done

I rolled back a service and gateway to v1.25.1.0. This is not happening in that version of Envoy.

@ytsssun commented Jan 10, 2024

Hi @tnsardesai ,

Thanks for the bug report. We are looking into it. To help us better understand the issue, could you share a bit more information about your mesh setup?

You mentioned this is only observed under high load; do you by any chance have a snapshot of the Envoy stats (since Envoy crashed, you might not be able to retrieve them)?

Have you tried any other versions of the Envoy image, for example v1.25.3.0 and onward?

Lastly, if you have Premium Support, feel free to engage them to cut us a ticket.

@ytsssun commented Jan 12, 2024

The crash is from within proxy-wasm-rust-sdk. The returned bytes are not valid UTF-8.

bytes: [49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32, 49, 53, 52, 46, 51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53, 46, 49, 53, 56, 46, 52, 55, 46, 49, 52, 49]

This is easily reproducible with the following code (see the playground):

#![allow(unused)]
fn main() {
  use std::str;

  // The bytes from the panic message, in a vector.
  let problematic_bytes = vec![49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32, 49, 53, 52, 46, 51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53, 46, 49, 53, 56, 46, 52, 55, 46, 49, 52, 49];

  // Panics with Utf8Error { valid_up_to: 2, error_len: Some(1) }.
  let result = str::from_utf8(&problematic_bytes).unwrap();
}
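
For comparison, here is a minimal sketch of what a non-panicking path could look like (illustrative only, not the SDK's actual handling or the eventual fix): String::from_utf8_lossy never fails, it substitutes U+FFFD for invalid sequences, so the header value would be preserved instead of aborting the whole filter.

fn main() {
  // First few bytes of the reported value: "1 " followed by the invalid bytes 0xC0 0xA7 0xC0 0xA2.
  let invalid: Vec<u8> = vec![49, 32, 192, 167, 192, 162];

  // UTF-8 validation fails on bytes like these; unwrapping that Err is what panics in hostcalls.rs.
  assert!(std::str::from_utf8(&invalid).is_err());

  // A lossy conversion keeps going, replacing the invalid bytes with U+FFFD.
  let lossy = String::from_utf8_lossy(&invalid);
  println!("{}", lossy); // prints "1 ����"
}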

I submitted a bug report to the proxy-wasm Rust SDK: proxy-wasm/proxy-wasm-rust-sdk#217.

Meanwhile, I wonder if there is any breaking change introduced in Envoy that would produce non-UTF-8 encoded bytes.

@tnsardesai (Author)

Thanks for your response. I previously thought it was due to high load, but looking at the data over 7 days, it is happening at all times of day. I believe this is triggered by specific bot or attacker requests with bad headers. All of these errors are on our externally exposed virtual gateways and applications; we are not seeing this in any internal applications.

I have rolled back all of our applications and gateways that see this error to v1.25.1.0. I could try upgrading some of our applications next week to help pinpoint what version broke this, if that will aid your investigation.

I have already engaged AWS Support; the case ID is 170303481101460 for your reference.

Can you also take a look at the error in #484 (comment)? It has a different stack trace.

I also saw this other log today, which is different from the other two.

[2024-01-11 22:24:28.980][91][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at 'called `Option::unwrap()` on a `None` value', amzn-appmesh-aggregate-stats/src/http/egress.rs:172:9
[2024-01-11 22:24:28.980][91][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_done failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0: 0x2db69 - rust_panic
1: 0x2c9a3 - std::panicking::rust_panic_with_hook::hbafe3e603d331223
2: 0x2e0f1 - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h8ab6ee68d5b4c391
3: 0x2e059 - std::sys_common::backtrace::__rust_end_short_backtrace::h008f69666d134159
4: 0x1f5cc - rust_begin_unwind
5: 0x40ed - core::panicking::panic_fmt::h1d17fc068f528130
6: 0xc34f - core::panicking::panic::h27b5eefa3e4ff738
7: 0x1802e - _$LT$aws_appmesh_aggregate_stats..http..egress..EgressHttpStatsFilter$u20$as$u20$amzn_appmesh_wasm..plugin..Filter$GT$::on_done::h90cbcd72c8e2cb58
8: 0xc3f3 - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..Context$GT$::on_done::h30819f4e590de618
9: 0x26138 - proxy_on_done

> The crash is from within proxy-wasm-rust-sdk. The returned bytes are not valid UTF-8.
> Meanwhile, I wonder if there is any breaking change introduced in Envoy that would produce non-UTF-8 encoded bytes.

We have noticed an increase in non-UTF-8 bytes in our access logs. For example, right around the time of this stack trace I see this access log:
[2024-01-11T22:24:28.968Z] "GET /oauth/lookup-ip HTTP/1.1" 400 - 0 0 17 17 "����%2527%2522\'\", 45.115.27.18, 70.132.38.113,127.0.0.1" "Mozilla/5.0 (Linux; U; Android 4.0.3; en-us; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30" "09c51df3-6b4d-9e15-8bf6-472c96e8b355" "production--oauth--833ed5a3.prod.mesh" "127.0.0.1:80"
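
Rendering the bytes from the original panic with non-printable bytes escaped (a quick inspection sketch, not part of the plugin) gives the same shape as that access-log value: a forwarded-for style header whose first entry contains the raw bytes 0xC0 0xA7 0xC0 0xA2, overlong two-byte sequences that UTF-8 validation rejects.

fn main() {
  // Bytes from the FromUtf8Error in the first comment.
  let bytes: Vec<u8> = vec![
    49, 32, 192, 167, 192, 162, 37, 50, 53, 50, 55, 37, 50, 53, 50, 50, 44, 32, 49, 53, 52, 46,
    51, 56, 46, 49, 55, 50, 46, 50, 52, 51, 44, 32, 49, 53, 46, 49, 53, 56, 46, 52, 55, 46, 49,
    52, 49,
  ];

  // Keep printable ASCII as-is and escape everything else as \xNN.
  let rendered: String = bytes
    .iter()
    .map(|&b| {
      if b.is_ascii_graphic() || b == b' ' {
        (b as char).to_string()
      } else {
        format!("\\x{:02X}", b)
      }
    })
    .collect();

  println!("{}", rendered);
  // -> 1 \xC0\xA7\xC0\xA2%2527%2522, 154.38.172.243, 15.158.47.141
}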

@PiotrSikora

> [2024-01-09 16:07:37.590][91][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.egress_http_stats: panicked at amzn-appmesh-aggregate-stats/src/http/egress.rs:302:29:
> called `Option::unwrap()` on a `None` value
> [2024-01-09 16:07:37.591][91][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_done failed: Uncaught RuntimeError: unreachable
> [...]

@tnsardesai this is different from the UTF-8 bug in the initial comment, and it looks like a bug in amzn-appmesh-aggregate-stats, so it probably makes sense to open a separate issue for that.

@tnsardesai (Author)

I created #485 for the other bug! Over the weekend I saw that bug even on v1.25.1.0, so it is definitely different, since I haven't seen the bug in this issue on v1.25.1.0 over the same period.

This week I plan on upgrading Envoy versions to help pinpoint the exact version of Envoy where things broke!

@tnsardesai (Author)

Hey, I upgraded a service from public.ecr.aws/appmesh/aws-appmesh-envoy:v1.25.1.0-prod to public.ecr.aws/appmesh/aws-appmesh-envoy:v1.25.3.0-prod and got this error in that service today. Other services on v1.25.1.0 are not seeing this error, so v1.25.3.0 must be the first version with this error.

[2024-01-17 01:35:05.995][77][critical][wasm] [source/extensions/common/wasm/context.cc:1157] wasm log aws.appmesh.ingress_http_stats: panicked at 'called `Result::unwrap()` on an `Err` value: FromUtf8Error { bytes: [97, 116, 116, 97, 99, 104, 109, 101, 110, 116, 59, 32, 102, 105, 108, 101, 110, 97, 109, 101, 61, 34, 83, 237, 108, 97, 98, 97, 115, 32, 84, 243, 110, 105, 99, 97, 115, 32, 121, 32, 97, 116, 243, 110, 97, 115, 32, 49, 54, 32, 101, 110, 101, 114, 111, 32, 50, 48, 50, 52, 46, 112, 110, 103, 34], error: Utf8Error { valid_up_to: 23, error_len: Some(1) } }', /codebuild/output/src573/src/s3/00/wasm/cargo-project/vendor/proxy-wasm/src/hostcalls.rs:1192:42
[2024-01-17 01:35:05.995][77][error][wasm] [source/extensions/common/wasm/wasm_vm.cc:38] Function: proxy_on_response_headers failed: Uncaught RuntimeError: unreachable
Proxy-Wasm plugin in-VM backtrace:
0: 0x37468 - rust_panic
1: 0x3652d - std::panicking::rust_panic_with_hook::ha23c8ba67d558e34
2: 0x3794b - std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h17240ff11865f8cf
3: 0x37881 - std::sys_common::backtrace::__rust_end_short_backtrace::h4c1464e549336429
4: 0x22abb - rust_begin_unwind
5: 0x24b3 - core::panicking::panic_fmt::h573ffd15dee74109
6: 0x419f - core::result::unwrap_failed::h26e3bef5a756b770
7: 0x11077 - proxy_wasm::hostcalls::get_map::h042a2451be115839
8: 0x11111 - _$LT$amzn_appmesh_wasm..reader..host..Host$u20$as$u20$amzn_appmesh_wasm..reader..HttpMapReader$GT$::get_response_headers::h7c3e1959a20473cb
9: 0xcb7f - _$LT$amzn_appmesh_wasm..plugin..context..http..HttpFilterContext$u20$as$u20$proxy_wasm..traits..HttpContext$GT$::on_http_response_headers::hb63fc0cb769db28f
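
Side note: unlike the first panic, these bytes decode cleanly if each byte is treated as ISO-8859-1 (Latin-1), where every byte maps directly to the Unicode code point of the same value; the result looks like an ordinary Content-Disposition filename with accented characters, so any non-UTF-8 response header seems to be enough to trigger this. A quick inspection sketch (not part of the plugin):

fn main() {
  // Bytes from the FromUtf8Error in the panic above (a response header value).
  let bytes: Vec<u8> = vec![
    97, 116, 116, 97, 99, 104, 109, 101, 110, 116, 59, 32, 102, 105, 108, 101, 110, 97, 109, 101,
    61, 34, 83, 237, 108, 97, 98, 97, 115, 32, 84, 243, 110, 105, 99, 97, 115, 32, 121, 32, 97,
    116, 243, 110, 97, 115, 32, 49, 54, 32, 101, 110, 101, 114, 111, 32, 50, 48, 50, 52, 46, 112,
    110, 103, 34,
  ];

  // Casting each byte to char is a complete Latin-1 decode, since ISO-8859-1
  // code points coincide with the first 256 Unicode code points.
  let decoded: String = bytes.iter().map(|&b| b as char).collect();

  println!("{}", decoded);
  // -> attachment; filename="Sílabas Tónicas y atónas 16 enero 2024.png"
}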

@ytsssun commented Feb 16, 2024

Hi @tnsardesai, just so you know, we are working on an Envoy patch release. It will contain the fix for this issue and #485. I cannot provide a specific date, but it will be soon.

@liubnu commented Feb 28, 2024

This issue is fixed by Envoy release v1.27.3.0 (#486).
