Multiwindow, Multi-Burn-Rate Alerts: is it achievable? #376

pgichevski · 2023-11-08T13:45:28Z

pgichevski
Nov 8, 2023

Hi, I'm a bit stuck on implementing multiwindow, multi-burn-rate alerts through the slo-generator, so I was hoping for some clarification.
Suppose I want an alert when the burn rate is greater than 14.4 over both a long window of 1h and a short window of 5m, where the SLO target period is 30 days.
If I try to achieve this through error budget policies, I can set the windows to 1h and 5m, but that means the SLO is computed over those windows, and not over the 30-day window.
For reference, the multiwindow, multi-burn-rate alert is explained at https://sre.google/workbook/alerting-on-slos/ (point 6).

Thanks for the help

Answered by ocervell

Nov 16, 2023


lvaylet · 2023-11-10T13:24:52Z

lvaylet
Nov 10, 2023
Maintainer

Hi @pgichevski and thanks for reaching out.

From what I understand, the SLO Generator implements approach 5, Multiple Burn Rate Alerts, from the SRE Workbook chapter linked above.

SLO definitions (for example this one) only configure a threshold (goal: 0.95). Then the time windows over which the error budget burn rates are computed are defined in the global configuration file. For example, this sample configuration file defines 4 time windows for 1 hour, 12 hours, 7 days and 28 days. Then each time window has a specific burn rate to monitor:

error_budget_policies:
  default:
    steps:
    - name: 1 hour
      burn_rate_threshold: 9
      alert: true
      message_alert: Page to defend the SLO
      message_ok: Last hour on track
      window: 3600
    - name: 12 hours
      burn_rate_threshold: 3
      alert: true
      message_alert: Page to defend the SLO
      message_ok: Last 12 hours on track
      window: 43200
    - name: 7 days
      burn_rate_threshold: 1.5
      alert: false
      message_alert: Dev team dedicates 25% of engineers to the reliability backlog
      message_ok: Last week on track
      window: 604800
    - name: 28 days
      burn_rate_threshold: 1
      alert: false
      message_alert: Freeze release, unless related to reliability or security
      message_ok: Unfreeze release, per the agreed roll-out policy
      window: 2419200

This is analogous to the 1-hour, 6-hour and 3-day time windows in table 5.6 of the SRE Workbook, whose burn rates are 14.4, 6 and 1 respectively.
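For reference, the 14.4, 6 and 1 figures can be derived from how much of the error budget one accepts to spend within a window. The following is a minimal sketch, assuming a 30-day SLO period and budget fractions of 2%, 5% and 10%; the function name `burn_rate_threshold` is illustrative and not part of the slo-generator.

```python
def burn_rate_threshold(budget_fraction: float,
                        window_hours: float,
                        period_hours: float = 30 * 24) -> float:
    """Burn rate at which `budget_fraction` of the error budget
    is consumed within `window_hours` (assumed 30-day period)."""
    return budget_fraction * period_hours / window_hours


# Assumed budget fractions: 2% in 1 hour, 5% in 6 hours, 10% in 3 days.
thresholds = {
    "1 hour": burn_rate_threshold(0.02, 1),
    "6 hours": burn_rate_threshold(0.05, 6),
    "3 days": burn_rate_threshold(0.10, 72),
}

for name, value in thresholds.items():
    print(f"{name}: burn rate threshold ~ {value:.1f}")
```

The same relation can be applied to the 28-day period and the windows used in the sample configuration to sanity-check thresholds such as 9 and 3.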

As a result, given this SLO definition and this config file, the SLO Generator will generate as many outputs as there are steps in error_budget_policies in config.yaml. So here, 4 outputs:

$ slo-generator compute --slo-config=./samples/cloud_monitoring/slo_gae_app_availability.yaml --config=./samples/config.yaml
INFO - gae-app-availability             | 1 hour   | SLI: 86.2069 % | SLO: 95.0 % | Gap: -8.79 % | BR: 2.8 / 9.0 | Alert: 0 | Good: 100      | Bad: 16      
INFO - gae-app-availability             | 12 hours | SLI: 87.7437 % | SLO: 95.0 % | Gap: -7.26 % | BR: 2.5 / 3.0 | Alert: 0 | Good: 1260     | Bad: 176     
INFO - gae-app-availability             | 7 days   | SLI: 87.4147 % | SLO: 95.0 % | Gap: -7.59 % | BR: 2.5 / 1.5 | Alert: 1 | Good: 17045    | Bad: 2454    
INFO - gae-app-availability             | 28 days  | SLI: 79.4544 % | SLO: 95.0 % | Gap: -15.55% | BR: 4.1 / 1.0 | Alert: 1 | Good: 52343    | Bad: 13535   
INFO - Run finished successfully in 10.3s.
INFO - Run summary | SLO Configs: 1 | Duration: 10.3s

For the first two windows, the actual burn rate (BR) is lower than the configured threshold (2.8 vs. 9.0, and 2.5 vs. 3.0), so there is no alerting. The last two windows trigger alerting because BR is higher than the configured threshold (2.5 vs. 1.5, and 4.1 vs. 1.0).
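The BR column can be reproduced from the Good and Bad counts in the output above. This is a rough sketch rather than the SLO Generator's actual code: it assumes the burn rate is the observed error rate divided by the error budget (1 - SLO), and that an alert fires when the burn rate is strictly greater than the threshold. The helper name `burn_rate` is made up for the example.

```python
def burn_rate(good: int, bad: int, slo_target: float) -> float:
    """Observed error rate divided by the allowed error rate (1 - SLO)."""
    error_rate = bad / (good + bad)
    error_budget = 1 - slo_target
    return error_rate / error_budget


SLO_TARGET = 0.95

# (window name, good events, bad events, burn rate threshold),
# copied from the command output and the error budget policy above.
rows = [
    ("1 hour", 100, 16, 9.0),
    ("12 hours", 1260, 176, 3.0),
    ("7 days", 17045, 2454, 1.5),
    ("28 days", 52343, 13535, 1.0),
]

results = []
for name, good, bad, threshold in rows:
    br = burn_rate(good, bad, SLO_TARGET)
    alert = br > threshold  # assumption: strict comparison
    results.append((name, round(br, 1), alert))
    print(f"{name:<8} | BR: {br:.1f} / {threshold} | Alert: {int(alert)}")
```

Each row is evaluated on its own, so a window never influences the alert state of another window.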

0 replies

lvaylet · 2023-11-10T13:44:39Z

lvaylet
Nov 10, 2023
Maintainer

@ocervell Anything to add/amend?

0 replies

pgichevski · 2023-11-10T14:29:55Z

pgichevski
Nov 10, 2023
Author

Thanks for reaching out, @lvaylet. Can you please explain something that confuses me? When alerting, shouldn't we monitor long and short windows in which the burn rate (relative to the SLO target window, usually 7 or 30 days) is above a threshold, and then trigger an alert? That is, the burn rate of the SLO target, and not the burn rate of the short/long windows. Or is it the same thing? Thanks again.

0 replies

lvaylet · 2023-11-13T15:31:11Z

lvaylet
Nov 13, 2023
Maintainer

There are multiple approaches. I guess that is exactly why Chapter 5 was written in the first place: to present these strategies. There is no right or wrong answer.

The SLO Generator alerts on the burn rate over a single window. In the example above, we want to be alerted if we are consuming the error budget 9 times faster than expected over a 1-hour window, so we can page the on-call SREs and they can defend the SLO.

Each line/alert of the report is computed separately, independently of the other lines/windows. Here is the code that loops over the N policies. And here is the code that computes the N Burn Rates.

0 replies

ocervell · 2023-11-16T10:40:27Z

ocervell
Nov 16, 2023
Collaborator

It was on our roadmap at some point, but we didn't get enough traction at the time. It is definitely the next stage mentioned in the SRE Workbook, and I think it could be implemented easily by having two windows in the error budget policy YAML (for example a windows list instead of the current window int). Note that it would incur additional computations on the monitoring backend, but I guess you are ready for it ;)
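To make the idea concrete, here is a hypothetical sketch of what evaluating a step with a `windows` list could look like. None of this exists in the SLO Generator today: the `windows` field, the `multiwindow_alert` function and the measured burn rates are all assumptions made for illustration. The alert fires only if the burn rate exceeds the threshold in every window of the step, which is the multiwindow condition described in the workbook.

```python
from typing import Callable, Sequence


def multiwindow_alert(burn_rate_for_window: Callable[[int], float],
                      windows: Sequence[int],
                      threshold: float) -> bool:
    """Alert only when every window (in seconds) burns faster than `threshold`."""
    return all(burn_rate_for_window(w) > threshold for w in windows)


# Hypothetical step: long window of 1h and short window of 5m, threshold 14.4.
step = {"name": "1 hour / 5 minutes", "burn_rate_threshold": 14.4,
        "windows": [3600, 300]}

# Made-up burn rates, standing in for one backend query per window.
still_burning = {3600: 15.2, 300: 16.0}
already_recovered = {3600: 15.2, 300: 3.0}

alert_burning = multiwindow_alert(
    still_burning.__getitem__, step["windows"], step["burn_rate_threshold"])
alert_recovered = multiwindow_alert(
    already_recovered.__getitem__, step["windows"], step["burn_rate_threshold"])

print(alert_burning, alert_recovered)
```

The second case shows the purpose of the short window: once the last 5 minutes are healthy again, the alert stops firing even though the 1-hour burn rate is still high. It also shows the cost mentioned above, since each step now needs one backend query per window.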

0 replies

lvaylet · 2023-11-16T10:47:45Z

lvaylet
Nov 16, 2023
Maintainer

Thanks a lot @ocervell for the insights. @pgichevski does that answer your questions?

1 reply

pgichevski Nov 16, 2023
Author

Thank you @lvaylet @ocervell. If I understood correctly, the current implementation does its calculations based on a single window. Even with multiple windows, as @ocervell suggests, that will still not be enough to implement multiwindow, multi-burn-rate alerts, as those require computing the burn rate with respect to the SLO target period (usually 7 or 30 days). If you are interested, you can take a look at Datadog's documentation on burn rates: https://docs.datadoghq.com/service_management/service_level_objectives/burn_rate/

Thank you for your time :)
