🐛 [BUG] - Synthetic Probes show a high number of 504 errors #441
Example log entry:

{
  "textPayload": "The request has been terminated because it has reached the maximum request timeout. To change this limit, see https://cloud.google.com/run/docs/configuring/request-timeout",
  "insertId": "66042246000dd918ac9c23d5",
  "httpRequest": {
    "requestMethod": "POST",
    "requestUrl": "<REDACTED>",
    "requestSize": "494",
    "status": 504,
    "responseSize": "713",
    "userAgent": "Google-Cloud-Scheduler",
    "remoteIp": "107.178.196.96",
    "serverIp": "216.239.32.53",
    "latency": "29.999816421s",
    "protocol": "HTTP/1.1"
  },
  "resource": {
    "type": "cloud_run_revision",
    "labels": {
      "configuration_name": "slo-generator",
      "service_name": "slo-generator",
      "revision_name": "slo-generator-00154-2gv",
      "project_id": "<REDACTED>",
      "location": "<REDACTED>"
    }
  },
  "timestamp": "2024-03-27T13:42:00.906092Z",
  "severity": "ERROR",
  "labels": {
    "instanceId": "<REDACTED>"
  },
  "logName": "projects/<REDACTED>/logs/run.googleapis.com%2Frequests",
  "trace": "projects/<REDACTED>/traces/de38c4e5fb97cd35e44c685f67a1768b",
  "receiveTimestamp": "2024-03-27T13:42:30.966597414Z",
  "spanId": "16424922216080379185"
}
Additional info: At one point we tried increasing the timeout to the maximum of 1 hour. This did not seem to help: if the SLO evaluation doesn't finish executing in 5 minutes, it won't finish in 1 hour either. To make things worse, this actually ran up our GCP bill, as we were charged for the compute time while the instance was waiting to time out. We also tried increasing the Cloud Run max concurrency and max instances. Max instances helped a little, as it would sometimes allow SLOs to be processed by additional instances, but eventually all instances would be stuck waiting to time out.
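One mitigation worth sketching here (not part of slo-generator itself, just an illustration): enforce an application-level deadline shorter than the Cloud Run request timeout, so a stuck evaluation returns an error response quickly instead of holding the instance until the platform terminates the request at 30s (or 1h). The `compute_slo_report` function below is a hypothetical placeholder for whatever the handler actually runs.

```python
# Sketch: bound a long-running SLO evaluation with an application-level
# deadline so the handler responds before Cloud Run's request timeout fires.
# compute_slo_report() is a hypothetical placeholder for the real work.
import concurrent.futures

DEADLINE_SECONDS = 25  # keep below the Cloud Run request timeout (30s here)

def compute_slo_report(config: dict) -> dict:
    ...  # placeholder for the actual SLO evaluation

def handle_request(config: dict) -> tuple[dict, int]:
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(compute_slo_report, config)
    try:
        return future.result(timeout=DEADLINE_SECONDS), 200
    except concurrent.futures.TimeoutError:
        # Respond promptly instead of waiting for the platform to return a 504.
        return {"error": "SLO evaluation exceeded deadline"}, 503
    finally:
        # Don't block on a possibly stuck worker thread (requires Python 3.9+
        # for cancel_futures). With request-based billing, ending the request
        # early should also stop the billable compute time.
        executor.shutdown(wait=False, cancel_futures=True)
```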
Thank you @skinlayers. Same conclusions here. The requests to Cloud Run are timing out. However, increasing the timeout value does not help; the issue lies somewhere in between. I have not found a way to reproduce it consistently though. I will enable debug mode and wait for the next occurrence.
@skinlayers Does this happen for a specific backend/exporter, or all backends/exporters in general?
I also found that deploying a new revision of the Cloud Run service removes most (all?) of the 504 errors. But they come back after some time. I ruled out memory leaks as memory utilization never goes above 35% (with 512MiB allocated to each instance of the container). Quite the opposite actually. 504 errors consistently show up when memory utilization is at its lowest point, around 22%. |
In my CI environment, I just:
I will keep monitoring the 504 error rate, and introduce new uptime checks or synthetic probes on a regular basis. |
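For tracking the 504 rate outside the Cloud Run console, here is a minimal sketch that counts matching entries in the same run.googleapis.com/requests log shown above. It assumes the google-cloud-logging client library is installed, Application Default Credentials are configured, and the service is named slo-generator as in the log sample.

```python
# Sketch: count Cloud Run 504s from the request log over the last hour.
from datetime import datetime, timedelta, timezone

from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
since = (datetime.now(timezone.utc) - timedelta(hours=1)).strftime("%Y-%m-%dT%H:%M:%SZ")
log_filter = (
    'resource.type="cloud_run_revision" '
    'AND resource.labels.service_name="slo-generator" '
    'AND httpRequest.status=504 '
    f'AND timestamp>="{since}"'
)
count = sum(1 for _ in client.list_entries(filter_=log_filter))
print(f"504s in the last hour: {count}")
```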
SLO Generator Version
v2.5.2
Python Version
3.9
What happened?
After adding synthetic probes and availability/latency SLOs in the CI environment, a lot of 504 errors started to show up in the logs in a seemingly random pattern. As a result, the error budget burns quite fast during these outages. Each incident usually resolves itself in under 10 minutes, then recurs later on.
What did you expect?
No 504 errors and no timeouts when calling the SLO Generator.
Screenshots
No response
Relevant log output
No response