🐛 [BUG] - incorrect calculation for prometheus provider #319
Comments
Hi @maksim-paskal, thanks for raising this issue. Any chance you could export the timeseries that generated these results and share them here (assuming there is nothing confidential) so I can reproduce the issue on my machine? Then can you also confirm the

Finally, I am a bit surprised by the values returned by Prometheus. An SLI is usually computed by dividing the number of good events by the number of valid (= good + bad) events. These two numbers are usually integers, yet here the logs show floating-point values. I am not a Prometheus expert, but is it possible a
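For reference, a minimal worked example of that good/valid definition (the numbers below are made up for illustration, not taken from this issue's data):

```python
# Minimal illustration of the SLI definition above (numbers are made up):
# SLI = good events / valid events, where valid = good + bad.
good_events = 9_990
bad_events = 10
valid_events = good_events + bad_events   # 10_000

sli = good_events / valid_events
print(sli)   # 0.999 -> the SLI always lands in [0, 1] as long as good <= valid
```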
@lvaylet, thanks for the quick response. Sorry for the typo in

My example data from Prometheus (we are actually using Thanos Query v0.26.0):

It's sometimes an int, sometimes a float in different windows.
Thanks @maksim-paskal. I need to investigate. For the record, what type of SLI are you computing here? Availability? Also, what does
docker-compose.yml

```yaml
version: '3.7'
services:
  prometheus:
    image: prom/prometheus:v2.36.2
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - 9090:9090
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
  envoy:
    image: envoyproxy/envoy:v1.21.5
    volumes:
      - ./envoy.yml:/etc/envoy/envoy.yaml:ro
    ports:
      - 10000:10000
```

prometheus.yml

```yaml
global:
  scrape_interval: 5s
  evaluation_interval: 5s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'envoy'
    metrics_path: /stats/prometheus
    static_configs:
      - targets: ['envoy:9901']
```

envoy.yml

```yaml
admin:
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
    - name: listener_0
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                codec_type: AUTO
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: local_service
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: some_service }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: some_service
      connect_timeout: 0.25s
      type: STATIC
      lb_policy: ROUND_ROBIN
      load_assignment:
        cluster_name: some_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 9901
```

```sh
# run prometheus and envoy
docker-compose up

# generate some records, for example with https://github.com/tsenart/vegeta
echo "GET http://localhost:10000/ready" | vegeta attack -duration=60s -output=/dev/null

# open Prometheus at http://127.0.0.1:9090
```
I think the issue is that if a new datapoint is added to Prometheus' TSDB between the Good and the Valid query, you get an offset between them, leading to this kind of behaviour. The only alternative is to make Prometheus perform the division and only query an SLI from it, to ensure consistency (which may require development, depending on the backend's current implementation). The downside is that you can no longer export good and bad event metrics. In my opinion, this issue is probably similar to #343 (although with different backends).
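If the backend ever goes that way, a minimal sketch of the single-query approach could look like this. It is not slo-generator's backend code; the Prometheus URL and the Envoy metric names are assumptions based on the repro above:

```python
# Sketch only (not the slo-generator backend): have Prometheus compute the
# ratio itself in a single instant query, so the numerator and denominator
# are evaluated against the exact same TSDB state.
import requests

PROMETHEUS_URL = "http://localhost:9090/api/v1/query"  # assumption: local repro

# Illustrative expression based on the Envoy metrics scraped in the repro.
RATIO_QUERY = (
    'sum(rate(envoy_http_downstream_rq_xx{envoy_response_code_class="2"}[1h]))'
    ' / sum(rate(envoy_http_downstream_rq_xx[1h]))'
)

response = requests.get(PROMETHEUS_URL, params={"query": RATIO_QUERY}, timeout=10)
response.raise_for_status()
result = response.json()["data"]["result"]
sli = float(result[0]["value"][1]) if result else None
print(f"SLI from a single ratio query: {sli}")
```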
A workaround could be to use good/bad instead of good/valid.
@maksim-paskal I just discussed the issue with @bkamin29 and @mveroone. We are pretty sure this behavior is caused by the tiny delay between the two requests (one for the good events and one for the valid events). Two options to mitigate this behavior:
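One way to remove that delay entirely is to pin both instant queries to the same evaluation timestamp via the Prometheus HTTP API's time parameter. This is only a sketch and not necessarily one of the commenters' two options; the Prometheus URL and the PromQL expressions are illustrative assumptions:

```python
# Sketch only: evaluate both instant queries at the same timestamp by passing
# an explicit 'time' parameter to the Prometheus HTTP API, so a scrape landing
# between the two HTTP calls cannot skew the good/valid ratio.
import time

import requests

PROMETHEUS_URL = "http://localhost:9090/api/v1/query"  # assumption: local repro


def instant_query(expr: str, ts: float) -> float:
    """Run an instant query pinned to the evaluation timestamp 'ts'."""
    resp = requests.get(
        PROMETHEUS_URL, params={"query": expr, "time": ts}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


# Illustrative expressions; the real ones come from the SLO spec's filters.
GOOD = 'sum(increase(envoy_http_downstream_rq_xx{envoy_response_code_class="2"}[1h]))'
VALID = 'sum(increase(envoy_http_downstream_rq_xx[1h]))'

now = time.time()  # one timestamp shared by both queries
good = instant_query(GOOD, now)
valid = instant_query(VALID, now)
print(f"good={good} valid={valid} sli={good / valid if valid else None}")
```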
SLO Generator Version
v2.3.3
Python Version
3.9.13
What happened?
I am using ServiceLevelObjective in sre.google.com/v2 with this spec.

Calculation ends with:
SLI is not between 0 and 1 (value = 1.000091)
With DEBUG=1 it seems that for a 100% SLI Prometheus sometimes returns filter_good > filter_valid.

the DEBUG logs

What did you expect?
The SLI calculation should return 1.
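For illustration only, a hypothetical bounds check of this kind reproduces the reported message as soon as filter_good drifts a single event ahead of filter_valid. This is a sketch, not slo-generator's actual implementation:

```python
# Sketch only, not slo-generator's actual code: the kind of sanity check that
# rejects an SLI outside [0, 1], which is what filter_good > filter_valid
# ends up triggering.
def validate_sli(filter_good: float, filter_valid: float) -> float:
    sli = filter_good / filter_valid
    if not 0 <= sli <= 1:
        raise ValueError(f"SLI is not between 0 and 1 (value = {sli:.6f})")
    return sli


print(validate_sli(10_994.0, 10_994.0))  # 1.0 -> accepted
print(validate_sli(10_995.0, 10_994.0))  # raises ValueError (value = 1.000091)
```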