Multiwindow, Multi-Burn-Rate Alerts: is it achievable? #376
-
Hi, I'm a bit stuck implementing multiwindow, multi-burn-rate alerts with the slo-generator, so I was hoping for some clarification. Thanks for the help.
-
Hi @pgichevski and thanks for reaching out. From what I understand, the SLO Generator implements "5: Multiple Burn Rate Alerts" in the SRE book. SLO definitions (for example this one) only configure a threshold. The error budget policy looks like this:

    error_budget_policies:
      default:
        steps:
        - name: 1 hour
          burn_rate_threshold: 9
          alert: true
          message_alert: Page to defend the SLO
          message_ok: Last hour on track
          window: 3600
        - name: 12 hours
          burn_rate_threshold: 3
          alert: true
          message_alert: Page to defend the SLO
          message_ok: Last 12 hours on track
          window: 43200
        - name: 7 days
          burn_rate_threshold: 1.5
          alert: false
          message_alert: Dev team dedicates 25% of engineers to the reliability backlog
          message_ok: Last week on track
          window: 604800
        - name: 28 days
          burn_rate_threshold: 1
          alert: false
          message_alert: Freeze release, unless related to reliability or security
          message_ok: Unfreeze release, per the agreed roll-out policy
          window: 2419200

This corresponds to the 1-hour, 6-hour and 3-day time windows in table 5.6 of the SRE book, with burn rates respectively equal to 14.4, 6 and 1. As a result, given this SLO definition and this config file, the SLO Generator will generate as many outputs as there are steps in the error budget policy:

    $ slo-generator compute --slo-config=./samples/cloud_monitoring/slo_gae_app_availability.yaml --config=./samples/config.yaml
    INFO - gae-app-availability | 1 hour | SLI: 86.2069 % | SLO: 95.0 % | Gap: -8.79 % | BR: 2.8 / 9.0 | Alert: 0 | Good: 100 | Bad: 16
    INFO - gae-app-availability | 12 hours | SLI: 87.7437 % | SLO: 95.0 % | Gap: -7.26 % | BR: 2.5 / 3.0 | Alert: 0 | Good: 1260 | Bad: 176
    INFO - gae-app-availability | 7 days | SLI: 87.4147 % | SLO: 95.0 % | Gap: -7.59 % | BR: 2.5 / 1.5 | Alert: 1 | Good: 17045 | Bad: 2454
    INFO - gae-app-availability | 28 days | SLI: 79.4544 % | SLO: 95.0 % | Gap: -15.55% | BR: 4.1 / 1.0 | Alert: 1 | Good: 52343 | Bad: 13535
    INFO - Run finished successfully in 10.3s.
    INFO - Run summary | SLO Configs: 1 | Duration: 10.3s

For the first two windows, the actual burn rate (BR) is lower than the configured threshold (2.8 vs. 9.0, and 2.5 vs. 3.0), so there is no alert. The last two windows trigger an alert because BR is higher than the configured threshold (2.5 vs. 1.5, and 4.1 vs. 1.0).
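
For reference, the numbers in this report can be reproduced by hand. The sketch below is not the SLO Generator's code. It assumes the usual definitions (SLI = good / (good + bad), burn rate = observed error rate divided by the error budget 1 - SLO), assumes a strict "greater than" comparison against the threshold, and uses hypothetical names (`Step`, `burn_rate`, `evaluate`). The good/bad counts are copied from the output above.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    burn_rate_threshold: float
    good: int
    bad: int


def burn_rate(good: int, bad: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    error_rate = bad / (good + bad)
    error_budget = 1 - slo_target
    return error_rate / error_budget


def evaluate(steps, slo_target):
    """Return (name, rounded burn rate, alert flag) for each step.

    Every step is evaluated on its own; no step looks at another window.
    The strict '>' comparison is an assumption of this sketch.
    """
    results = []
    for step in steps:
        br = burn_rate(step.good, step.bad, slo_target)
        results.append((step.name, round(br, 1), br > step.burn_rate_threshold))
    return results


# Thresholds from the error budget policy, counts from the report above.
STEPS = [
    Step("1 hour", 9.0, 100, 16),
    Step("12 hours", 3.0, 1260, 176),
    Step("7 days", 1.5, 17045, 2454),
    Step("28 days", 1.0, 52343, 13535),
]

if __name__ == "__main__":
    for name, br, alert in evaluate(STEPS, slo_target=0.95):
        print(f"{name:>8} | BR: {br} | Alert: {int(alert)}")
```

With a 95 % SLO the error budget is 5 %, so the 1-hour step works out to (16 / 116) / 0.05, roughly 2.8, which matches the BR column.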
-
@ocervell Anything to add/amend?
-
Thanks for reaching out @lvaylet. Can you please explain something that confuses me? When alerting, shouldn't we monitor long and short periods in which the burn rate (of the SLO target window, usually 7 or 30 days) is above a threshold, and then trigger an alert? That is, the burn rate of the SLO target window, not the burn rate of the short/long periods/windows. Or is it the same thing? Thanks again.
-
There are multiple approaches. I guess that is exactly why Chapter 5 was written in the first place: to present these strategies. There is no right or wrong answer. The SLO Generator alerts on the burn rate over a single window. In the example above, we want to be alerted if we are consuming the error budget 9 times faster than expected over a 1-hour window, so we can page the SREs on call and they can defend the SLO. Each line/alert of the report is computed separately, independently of the other lines/windows. Here is the code that loops over the N policies. And here is the code that computes the N burn rates.
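
To make the single-window semantics concrete: a burn rate is measured over the step's own window, but it is expressed relative to the error budget, so it can be translated into a share of the budget for a longer SLO period. The sketch below is only an illustration. It assumes a 28-day SLO period (the longest window in the policy above; that choice is an assumption), roughly constant traffic, and a hypothetical helper named `budget_consumed`. It shows how much of the 28-day budget is spent if the burn rate sits exactly at the threshold for the whole window.

```python
SLO_PERIOD_SECONDS = 2419200  # assumption: 28-day SLO period

# (name, burn_rate_threshold, window in seconds), from the policy above
STEPS = [
    ("1 hour", 9, 3600),
    ("12 hours", 3, 43200),
    ("7 days", 1.5, 604800),
    ("28 days", 1, 2419200),
]


def budget_consumed(burn_rate, window_seconds, period_seconds=SLO_PERIOD_SECONDS):
    """Fraction of the period's error budget spent at `burn_rate` over the window.

    Assumes traffic is roughly constant across the SLO period.
    """
    return burn_rate * window_seconds / period_seconds


if __name__ == "__main__":
    for name, threshold, window in STEPS:
        share = budget_consumed(threshold, window)
        print(f"{name:>8}: {share:.2%} of the 28-day budget")
```

Under these assumptions, a burn rate of 9 held for one hour uses about 1.3 % of the 28-day budget, while a burn rate of 1 over the full 28 days uses exactly all of it.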
-
It was on our roadmap at some point, but we didn't get enough traction at the time. It is definitely the next stage mentioned in the SRE handbook, and I think it could be implemented easily by having two windows in the error budget policy YAML (like a `windows` list instead of the current `window` int). Note that it would incur additional computations on the monitoring backend, but I guess you are ready for it ;)
-
Thanks a lot @ocervell for the insights. @pgichevski does that answer your questions?
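
A rough sketch of what the `windows` list idea could look like, purely for illustration: the `windows` key does not exist in the SLO Generator today, and `MultiWindowStep` and `should_alert` are hypothetical names. The idea follows the multiwindow strategy from the SRE book: alert only when the burn rate exceeds the threshold over every window in the list (for example a long window and a much shorter one), so the alert resets quickly once the problem stops.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MultiWindowStep:
    name: str
    burn_rate_threshold: float
    windows: List[int]  # hypothetical key: window lengths in seconds


def should_alert(step: MultiWindowStep, burn_rates: Dict[int, float]) -> bool:
    """Alert only if the burn rate is above the threshold on every window.

    `burn_rates` maps a window (seconds) to the burn rate measured over it.
    Each window still needs its own query to the monitoring backend.
    """
    return all(burn_rates[w] > step.burn_rate_threshold for w in step.windows)


# Long window of 1 hour plus a short window of 5 minutes (illustrative values).
step = MultiWindowStep("1 hour", 9.0, windows=[3600, 300])

if __name__ == "__main__":
    # Ongoing incident: both windows burn fast, so page.
    print(should_alert(step, {3600: 10.2, 300: 12.0}))  # True
    # Incident already over: the long window is still high, the short one recovered.
    print(should_alert(step, {3600: 10.2, 300: 0.4}))   # False
```

Each extra window means one more burn rate to compute per step, which is the additional backend cost mentioned above.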