🐛 [BUG] - incorrect calculation for prometheus provider #319

Open
maksim-paskal opened this issue Jan 24, 2023 · 7 comments
Labels: bug, triage

Comments


maksim-paskal commented Jan 24, 2023

SLO Generator Version

v2.3.3

Python Version

3.9.13

What happened?

I am using a ServiceLevelObjective in sre.google.com/v2 with this spec:

spec:
  backend: prometheus
  method: good_bad_ratio
  service_level_indicator:
    filter_good: envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z",envoy_response_code!~"5.."}
    filter_valid: envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z"}

The calculation ends with SLI is not between 0 and 1 (value = 1.000091).

With DEBUG=1 it seems that, when the SLI is effectively 100%, Prometheus sometimes returns filter_good > filter_valid. The DEBUG logs:

....
slo_generator.backends.prometheus - DEBUG - Query: sum(increase(envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z",envoy_response_code!~"5.."}[3600s]))
slo_generator.backends.prometheus - DEBUG - {'data': {'result': [{'metric': {},
                      'value': [1674577674.326, '44.197872287004095']}],
          'resultType': 'vector'},
 'status': 'success'}
...
slo_generator.backends.prometheus - DEBUG - Query: sum(increase(envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z"}[3600s]))
slo_generator.backends.prometheus - DEBUG - {'data': {'result': [{'metric': {},
                      'value': [1674577674.959, '44.193849641808484']}],
          'resultType': 'vector'},
 'status': 'success'}
slo_generator.backends.prometheus - DEBUG - Good events: 44.197872287004095 | Bad events: -0.004022645195611574
slo_generator.report - DEBUG - paket-static-availability        | 1 hour   | Backend response: (44.197872287004095, -0.004022645195611574)
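
In other words, reading the two responses above:

good  = 44.197872287004095
valid = 44.193849641808484
bad   = valid - good = -0.004022645195611574
SLI   = good / valid ≈ 1.000091

so the bad event count goes negative and the SLI check fails.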

What did you expect?

The SLI calculation should return 1.

Code of Conduct

  • I agree to follow this project's Code of Conduct
maksim-paskal added the bug and triage labels Jan 24, 2023
lvaylet (Collaborator) commented Jan 25, 2023

Hi @maksim-paskal, thanks for raising this issue.

Any chance you could export the timeseries that generated these results and share them here (assuming there is nothing confidential) so I can reproduce the issue on my machine?

Then can you also confirm the filter_good query has a double quote after envoy_cluster_name="y, just before ,kubernetes_namespace="z"? Otherwise I am afraid the query might not give the expected results.

Finally, I am a bit surprised by the values returned by Prometheus. An SLI is usually computed by dividing the number of good events by the number of valid (= good + bad) events. These two numbers are usually integers. Here the logs show floating-point values. I am not a Prometheus expert but is it possible a count or sum is missing?

maksim-paskal (Author) commented Jan 25, 2023

@lvaylet, thanks for the quick response.

Sorry for the typo in filter_good, I have fixed it in the issue description. This is the real PromQL; it works sometimes, and sometimes returns this error.

My example data from Prometheus (we are actually using Thanos Query v0.26.0):

Query:
envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z"}[1m]

Returns:
envoy_cluster_external_upstream_rq{app="x", envoy_cluster_name="y",kubernetes_namespace="z"} 

253117 @1674638934.622
253125 @1674638940.809
253127 @1674638955.809
253162 @1674638970.809
253197 @1674638985.809

Query:
increase(envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z"}[1m])

Returns:
195.75282786645047

The result is sometimes an integer and sometimes a float, depending on the window. increase() and rate() in PromQL compute per-second rates and extrapolate them over the window, so I think a float is expected.
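
For example, over the five raw samples above the counter grows from 253117 to 253197, a delta of 80, in roughly 51 seconds; increase(...[1m]) then extrapolates that delta toward the full 60s window, so a fractional result is normal even though the underlying counter only ever holds integers (the 195.75 shown above was presumably evaluated at a different moment):

raw delta      = 253197 - 253117 = 80, observed over ~51s
increase([1m]) ≈ that delta scaled toward the full 60s window, hence a non-integer value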

lvaylet (Collaborator) commented Jan 25, 2023

Thanks @maksim-paskal. I need to investigate.

For the record, what type of SLI are you computing here? Availability? Also, what does envoy_cluster_external_upstream_rq represent? Requests with response class or response code, as hinted in this documentation?

maksim-paskal (Author)

envoy_cluster_external_upstream_rq is an upstream counter of specific HTTP response codes (e.g., 201, 302, etc.). We plan to use it to calculate the availability of our service. You can simulate this environment with the files below (you need Docker):

docker-compose.yml
version: '3.7'
services:
  prometheus:
    image: prom/prometheus:v2.36.2
    volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
    - 9090:9090
    command:
    - '--config.file=/etc/prometheus/prometheus.yml'
  envoy:
    image: envoyproxy/envoy:v1.21.5
    volumes:
    - ./envoy.yml:/etc/envoy/envoy.yaml:ro
    ports:
    - 10000:10000
prometheus.yml
global:
  scrape_interval:     5s
  evaluation_interval: 5s
scrape_configs:
- job_name: 'prometheus'
  static_configs:
  - targets: ['localhost:9090']
- job_name: 'envoy'
  metrics_path: /stats/prometheus
  static_configs:
  - targets: ['envoy:9901']
envoy.yml
admin:
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }

static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 10000 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match: { prefix: "/" }
                route: { cluster: some_service }
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
  - name: some_service
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: some_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 9901
# run prometheus and envoy
docker-compose up

# generate some records, for example with https://github.com/tsenart/vegeta
echo "GET http://localhost:10000/ready" | vegeta attack -duration=60s -output=/dev/null

Open Prometheus at http://127.0.0.1:9090 and run this query:

increase(envoy_cluster_external_upstream_rq{envoy_response_code="200"}[1d])
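
If it helps, a minimal SLO spec against this local setup could look something like the sketch below (the metadata name, goal and label matchers are only illustrative; adjust them to whatever your local Envoy actually exposes):

apiVersion: sre.google.com/v2
kind: ServiceLevelObjective
metadata:
  name: envoy-local-availability   # illustrative name
spec:
  description: Availability of the local Envoy test service
  backend: prometheus
  method: good_bad_ratio
  service_level_indicator:
    filter_good: envoy_cluster_external_upstream_rq{envoy_response_code!~"5.."}
    filter_valid: envoy_cluster_external_upstream_rq
  goal: 0.99   # illustrative target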

mveroone (Contributor) commented Jun 14, 2023

I think the issue is that if a new datapoint is added to Prometheus' TSDB between the good and the valid query, you get an offset between them, leading to this kind of behaviour. You would need to send a finite timeframe to Prometheus and make sure that this timeframe ends further from "now" than the scrape interval of these metrics.

The only alternative is to make Prometheus perform the division and only query an SLI from it, to ensure consistency (which may require development, depending on the backend's current implementation). A sketch of that option is shown below. The downside is that you can no longer export good and bad event metrics.

In my opinion, this issue is probably similar to #343 (although with different backends)
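
For illustration, "make Prometheus perform the division" could look roughly like the sketch below. The method and field names (query_sli, expression) and the window placeholder are assumptions to double-check against the current Prometheus backend implementation:

spec:
  backend: prometheus
  method: query_sli                # assumed method name
  service_level_indicator:
    # assumed field name and window placeholder
    expression: >
      sum(increase(envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z",envoy_response_code!~"5.."}[window]))
      /
      sum(increase(envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z"}[window]))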

bkamin29

A workaround could be to use good/bad instead of good/valid.

lvaylet (Collaborator) commented Jul 5, 2023

@maksim-paskal I just discussed the issue with @bkamin29 and @mveroone.

We are pretty sure this behavior is caused by the tiny delay between the two requests (one for good and another one for valid). They are called and executed a few milliseconds apart, resulting in the same window length but slightly different start/end times. As a consequence, as we are looking at two different time horizons, the most recent one might have a slightly different number of good/valid events. Also note that the backend itself can be busy ingesting and persisting "old" data points between the two calls, and account for more data points during the second call.

Two options to mitigate this behavior:

  • I can work on integrating the @ modifier into the queries sent to Prometheus, so both calls have the exact same time horizon. I can use the timestamp of the report generation request for this value. Just note that this addition will not prevent the provider from ingesting/persisting new data points between the two calls. A sketch of the resulting queries is shown after this list.
  • You can use the query_sli method instead of the good_bad_ratio one, and delegate the computation of the good/valid ratio to Prometheus itself. That would result in a single request and a single call. However, with this approach, you give up on the ability to export the number of good and bad events.
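
For reference, the queries generated for option 1 could look roughly like this, pinning both calls to the same evaluation timestamp with the PromQL @ modifier (the timestamp comes from the DEBUG log above and is purely illustrative; the generator, not the user config, would build these queries):

Query (good):  sum(increase(envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z",envoy_response_code!~"5.."}[3600s] @ 1674577674))
Query (valid): sum(increase(envoy_cluster_external_upstream_rq{app="x",envoy_cluster_name="y",kubernetes_namespace="z"}[3600s] @ 1674577674))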
