Bug: Intermittent 503 with UC,URX response flags when making cross region/mesh requests via virtual gateway #460

Open
tnsardesai opened this issue Apr 24, 2023 · 0 comments
Labels
Bug Something isn't working

Comments

@tnsardesai

Summary
What are you observing that doesn't seem right?
We are seeing a high number of intermittent 503s (response flags UC,URX) when making cross-region requests via a virtual gateway.
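
For readers less familiar with the response flags in the title, the lookup below is a minimal sketch of how they decode. The meanings are paraphrased from Envoy's access log documentation; the helper name `describe_flags` is made up for illustration and only covers the two flags seen here.

```python
# Meanings paraphrased from Envoy's access log documentation.
# Only the two flags from this report are included.
RESPONSE_FLAGS = {
    "UC": "upstream connection termination",
    "URX": "upstream retry limit (HTTP) or maximum connect attempts (TCP) reached",
}


def describe_flags(flags: str) -> list[str]:
    """Split a comma-separated response-flags value and describe each flag."""
    return [
        f"{flag}: {RESPONSE_FLAGS.get(flag, 'not covered in this sketch')}"
        for flag in (part.strip() for part in flags.split(","))
        if flag
    ]


for line in describe_flags("UC,URX"):
    print(line)
```

Read together, the two flags suggest that the upstream connection was terminated and that the configured retries were exhausted before a request succeeded.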

Steps to Reproduce
What are the steps you can take to reproduce this issue?
The diagram below describes our architecture:

  • service-A in us-east-1 wants to call service-B in us-west-1
  • In us-west-1, service-B is set up using a virtual service, virtual router, and virtual node. The virtual node provider uses the Cloud Map service discovery type and is set up using an ECS Fargate service.
  • In us-east-1, service-B is also set up using a virtual service, virtual router, and virtual node. This virtual node provider uses the DNS service discovery type and points to the address of the NLB for the virtual gateway.
  • In us-east-1, service-A's virtual node has service-B set up as a backend.
  • us-east-1 has a dummy Cloud Map service to sidestep #65 (Intercept and respond to DNS queries for Virtual Services using Envoy's DNS filter).
  • The errors are intermittent, and we have not noticed any pattern so far. We are also seeing this in dev, so I doubt it has anything to do with scale.
  • NLB metrics show a high number of load balancer resets and a small number of client resets.

Screen Shot 2023-04-24 at 11 34 13 AM

Are you currently working around this issue?
How are you currently solving this problem?
We are not :(. We are trying different configurations of timeouts, retries, and outlier detection to minimize the number of errors.
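
As an illustration of the kind of retry configuration being experimented with, here is a minimal sketch of a retry policy shaped like an App Mesh HTTP route `retryPolicy`. The field names follow my understanding of the App Mesh API and should be checked against the API reference; the function name and the default values are hypothetical, not the settings actually used in this report.

```python
def build_retry_policy(max_retries: int = 2, per_retry_timeout_ms: int = 2000) -> dict:
    """Build a dict shaped like an App Mesh HTTP route retryPolicy.

    The values are placeholders for illustration, not tuned recommendations.
    """
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    return {
        "maxRetries": max_retries,
        "perRetryTimeout": {"unit": "ms", "value": per_retry_timeout_ms},
        # gateway-error is documented as covering 502, 503, and 504 responses.
        "httpRetryEvents": ["gateway-error"],
        # connection-error targets TCP-level failures such as resets.
        "tcpRetryEvents": ["connection-error"],
    }


policy = build_retry_policy()
print(policy)
```

A policy like this can only mask the resets, and retrying is only safe for requests that are idempotent, so it is a mitigation rather than a fix for the underlying connection behavior.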

Additional context
Anything else we should know?
internal support case id - 12565299101

Attachments
Envoy debug logs from one of the failed cross-region requests:
extract-2023-04-21T16_31_02.134Z.csv.txt
The main error I see is:

remote address:10.24.19.114:80,TLS error: 33554536:system library:OPENSSL_internal:Connection reset by peer 33554464:system library:OPENSSL_internal:Broken pipe

This suggests that Envoy in us-east-1 (the client) is not closing the connection with the NLB, even though the idle timeout in Envoy is set to 150s. The NLB therefore sends a TCP RST when it receives a new request after 350 seconds have passed. Any help debugging this would be appreciated.
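
The hypothesis above comes down to comparing two idle timers, which can be sketched as follows. This is a deliberately simplified model, assuming each side drops a connection once its own idle timeout elapses and that the NLB only signals this with a reset when the next packet arrives. The function name and its parameters are hypothetical; the 150s and 350s values are the ones quoted in this report.

```python
ENVOY_IDLE_TIMEOUT_S = 150  # idle timeout configured in Envoy, per this report
NLB_IDLE_TIMEOUT_S = 350    # NLB idle timeout quoted in this report


def reuse_outcome(
    idle_seconds: float,
    client_timeout_s: float = ENVOY_IDLE_TIMEOUT_S,
    nlb_timeout_s: float = NLB_IDLE_TIMEOUT_S,
) -> str:
    """Classify a request sent on a connection that sat idle for idle_seconds."""
    if idle_seconds >= client_timeout_s:
        # The client closed the idle connection first, so it opens a new one.
        return "new connection"
    if idle_seconds >= nlb_timeout_s:
        # The client kept the connection, but the NLB has already dropped it.
        return "reset by NLB"
    return "connection reused"


# With the configured timeouts, a reset should never be reachable.
print(reuse_outcome(100))   # connection reused
print(reuse_outcome(400))   # new connection

# If the client-side idle timeout is not actually applied, resets appear.
print(reuse_outcome(400, client_timeout_s=float("inf")))  # reset by NLB
```

Under this model, a client timeout shorter than the NLB timeout rules out resets entirely, so observing them points toward the 150s idle timeout not taking effect on the connection in question.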
