Improve monitoring errors #2785

lambdanis · 2024-08-09T02:53:00Z

- Review tetragon_errors_total and other "errors" metrics.
- Stop reporting non-errors (conditions that need no action) as errors. Define separate metrics for "casual fails" if needed.
- Refactor the error metrics to make them more useful (all of these are subject to review):
  - Delete tetragon_errors_total{type="handler_error"}, since it duplicates handler_errors_total
  - Consider further splitting tetragon_errors_total
  - Consider merging the event cache error metrics into one with event_type and entry_type labels
  - Consider merging the kprobe ok and errors metrics into one with a status label
- Standardize on the error label (rather than error_type, type, etc.)
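
To illustrate the "merge ok and errors into one metric with a status label" idea, here is a minimal sketch in standard-library Python. It only models the label schema. It is not Tetragon code and does not use a Prometheus client library. The `LabeledCounter` class, the `error_ratio` helper, and the metric name `example_kprobe_merge_total` are hypothetical names made up for this example.

```python
from collections import Counter


class LabeledCounter:
    """Tiny stand-in for a labeled counter (hypothetical, not a Prometheus client)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = Counter()

    def inc(self, **labels):
        # Reject label sets that do not match the declared schema, so a
        # stray "type" or "error_type" label cannot creep back in.
        if set(labels) != set(self.label_names):
            raise ValueError(f"{self.name}: expected labels {self.label_names}")
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += 1

    def get(self, **labels):
        # Return the current count for one exact label combination.
        key = tuple(labels[n] for n in self.label_names)
        return self._values[key]

    def total(self):
        # Sum over all label combinations of this metric.
        return sum(self._values.values())


def error_ratio(counter):
    """Share of samples with status="error"; 0.0 if nothing was recorded."""
    total = counter.total()
    return counter.get(status="error") / total if total else 0.0


# One merged metric (hypothetical name) instead of separate ok and errors metrics.
kprobe_merge = LabeledCounter("example_kprobe_merge_total", ["status"])
kprobe_merge.inc(status="ok")
kprobe_merge.inc(status="error")
```

With a single metric, the success and failure counts share one name and one label, so a ratio such as `error_ratio(kprobe_merge)` needs no join across two differently named metrics.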

lambdanis added the area/metrics Related to prometheus metrics label Aug 9, 2024

lambdanis added this to Tetragon metrics hardening Aug 9, 2024

This was referenced Sep 1, 2024

Replace process cache evictions and misses metrics #2857

Merged

Refactor and rename eventcache metrics #2861

Merged

Remove tetragon_errors_total{type="handler_error"} metric #2862

Merged

Replace missing process info metric #2863

Merged
