Bug: Intermittent 503 with UC,URX response flags when making cross region/mesh requests via virtual gateway #460

tnsardesai · 2023-04-24T15:56:30Z

Summary
What are you observing that doesn't seem right?
We are seeing a high number of intermittent 503s (response flags UC,URX) when making cross-region requests via a virtual gateway.
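
For readers unfamiliar with the flags in the title: Envoy's access log documentation describes `UC` as upstream connection termination and `URX` as the upstream retry limit being reached. Below is a minimal Python sketch for decoding that field from an access-log entry. The helper name `decode_flags` is hypothetical, and the table covers only the two flags from this issue.

```python
# Meanings paraphrased from Envoy's access log documentation.
# Only the two flags seen in this issue are listed.
RESPONSE_FLAGS = {
    "UC": "upstream connection termination",
    "URX": "upstream retry limit (HTTP) or max connect attempts (TCP) reached",
}


def decode_flags(field: str) -> list[str]:
    """Split a comma-separated response-flags value and describe each flag."""
    return [
        f"{flag}: {RESPONSE_FLAGS.get(flag, 'not covered by this sketch')}"
        for flag in (part.strip() for part in field.split(","))
        if flag
    ]


for line in decode_flags("UC,URX"):
    print(line)
```

Read together, the two flags are consistent with an upstream connection being torn down and the retries that followed also failing.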

Steps to Reproduce
What are the steps you can take to reproduce this issue?
The diagram below should describe our architecture

service-A in us-east-1 wants to call service-B in us-west-1
us-west-1 service-B is set up using a virtual service, virtual router, and virtual node. The virtual node uses the Cloud Map service discovery type and is backed by an ECS Fargate service.
us-east-1 service-B is also set up using a virtual service, virtual router, and virtual node. This virtual node uses the DNS service discovery type and points to the address of the NLB for the virtual gateway.
us-east-1 service-A's virtual node has service-B set up as a backend.
us-east-1 has a dummy Cloud Map service to sidestep issue #65 (Intercept and respond to DNS queries for Virtual Services using Envoy's DNS filter).
Errors are intermittent. We have not noticed any pattern so far. We are also seeing this in dev, so I doubt this has anything to do with scale.
NLB metrics show a high number of load balancer resets and a small number of client resets.
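
The setup above can be sketched as App Mesh virtual node specs, written here as Python dicts in the shape the App Mesh API uses (`serviceDiscovery.awsCloudMap`, `serviceDiscovery.dns`, `backends`). This is an illustration only: every namespace, hostname, and service name is a hypothetical placeholder, and listeners and other fields are omitted.

```python
# All names and hostnames are hypothetical placeholders, not the reporter's values.
west_service_b_node = {
    # us-west-1: discovered through Cloud Map, backed by the ECS Fargate service
    "serviceDiscovery": {
        "awsCloudMap": {
            "namespaceName": "west.example.internal",
            "serviceName": "service-b",
        }
    },
}

east_service_b_node = {
    # us-east-1: DNS discovery pointing at the NLB in front of the virtual gateway
    "serviceDiscovery": {"dns": {"hostname": "vgw-nlb.example.com"}},
}

east_service_a_node = {
    # us-east-1: the caller, with service-B's virtual service as a backend
    "backends": [
        {"virtualService": {"virtualServiceName": "service-b.example.internal"}}
    ],
}


def discovery_type(node: dict) -> str:
    """Return the service discovery key of a virtual node spec, or 'none'."""
    return next(iter(node.get("serviceDiscovery", {})), "none")


for name, node in [
    ("us-west-1 service-B", west_service_b_node),
    ("us-east-1 service-B", east_service_b_node),
    ("us-east-1 service-A", east_service_a_node),
]:
    print(name, "->", discovery_type(node))
```

In this sketch a request from service-A resolves service-B through the us-east-1 virtual node, whose DNS hostname is the NLB, so every cross-region call passes through a connection between the client Envoy and the NLB.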

Are you currently working around this issue?
How are you currently solving this problem?
We are not :(. We are trying different configurations of timeouts, retries, and outlier detection to minimize the number of errors.

Additional context
Anything else we should know?
internal support case id - 12565299101

Attachments
Envoy debug logs from one of the failed cross-region requests.
extract-2023-04-21T16_31_02.134Z.csv.txt
The main error I see is:
remote address:10.24.19.114:80,TLS error: 33554536:system library:OPENSSL_internal:Connection reset by peer 33554464:system library:OPENSSL_internal:Broken pipe
This suggests that Envoy in us-east-1 (the client) is not closing the connection with the NLB, even though the idle timeout in Envoy is set to 150s. The NLB then sends a TCP RST when a new request arrives on a connection that has been idle for more than 350 seconds. Any help debugging this would be appreciated.
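
The reasoning behind that hypothesis can be written out as a small Python model. It is a deliberately simplified sketch: it assumes the 350-second NLB idle timeout quoted above, ignores TCP keepalives, and the function name `reuse_outcome` is hypothetical.

```python
from typing import Optional

NLB_IDLE_TIMEOUT_S = 350  # idle timeout the issue attributes to the NLB


def reuse_outcome(idle_for_s: float, client_idle_timeout_s: Optional[float]) -> str:
    """Classify what happens when a pooled connection is reused after being idle.

    client_idle_timeout_s=None models a client whose idle timeout never fires.
    """
    if client_idle_timeout_s is not None and idle_for_s >= client_idle_timeout_s:
        return "client closed it earlier; a new connection is opened"
    if idle_for_s >= NLB_IDLE_TIMEOUT_S:
        return "NLB dropped the flow; reuse gets a TCP RST"
    return "connection still tracked; request goes through"


# With an effective 150s client timeout, the RST branch is unreachable.
print(reuse_outcome(400, 150))
# If the client timeout is not taking effect, the same reuse hits the RST.
print(reuse_outcome(400, None))
```

Under this model, a working 150s idle timeout means no connection survives long enough to reach the NLB's limit, so the observed resets would point to the Envoy timeout not applying to these connections.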

tnsardesai added the Bug Something isn't working label Apr 24, 2023

tnsardesai mentioned this issue Apr 26, 2023

Feature Request: How to complete cross-VPC access through App Mesh. #386

Open
