The NetObserv operator uses flowlogs-pipeline to generate metrics from flow logs. These metrics are meant to be collected by a Prometheus instance (not part of the NetObserv deployment). In OpenShift, they are collected either by Cluster Monitoring or User Workload Monitoring.
There are two ways to configure metrics:
- By enabling or disabling any of the predefined metrics
- Using the FlowMetrics API to create custom metrics
The predefined metrics are configured in the `FlowCollector` custom resource, via `spec.processor.metrics.includeList`. It is a list of metric names that tells which ones to generate. The names correspond to the names in Prometheus without their prefix. For example, `namespace_egress_packets_total` shows up as `netobserv_namespace_egress_packets_total` in Prometheus.
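For example, a minimal `FlowCollector` snippet enabling only a few of these metrics could look like the following (the API version and resource name shown here are assumptions based on a typical installation):

```yaml
apiVersion: flows.netobserv.io/v1beta2
kind: FlowCollector
metadata:
  name: cluster
spec:
  processor:
    metrics:
      # Only the metrics listed here are generated; names omit the "netobserv_" prefix.
      includeList:
        - namespace_flows_total
        - node_ingress_bytes_total
        - workload_ingress_bytes_total
```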
Note that the more metrics you add, the bigger the impact on Prometheus workload resources. Some metrics have a higher cardinality than others, such as all metrics starting with `workload_`, which may stress Prometheus if too many of them are enabled. It is recommended to monitor the impact on Prometheus when adding metrics.
Available names are listed below (names followed by * are enabled by default; names followed by ** are also enabled by default when Loki is disabled):
- `namespace_egress_bytes_total`
- `namespace_egress_packets_total`
- `namespace_ingress_bytes_total`
- `namespace_ingress_packets_total`
- `namespace_flows_total` *
- `node_egress_bytes_total`
- `node_egress_packets_total`
- `node_ingress_bytes_total` *
- `node_ingress_packets_total`
- `node_flows_total`
- `workload_egress_bytes_total`
- `workload_egress_packets_total`
- `workload_ingress_bytes_total` *
- `workload_ingress_packets_total`
- `workload_flows_total`
When the `PacketDrop` feature is enabled in `spec.agent.ebpf.features` (with privileged mode), additional metrics are available (see the configuration sketch after these lists):
- `namespace_drop_bytes_total`
- `namespace_drop_packets_total` *
- `node_drop_bytes_total`
- `node_drop_packets_total`
- `workload_drop_bytes_total` **
- `workload_drop_packets_total` **
When the `FlowRTT` feature is enabled in `spec.agent.ebpf.features`, additional metrics are available:
- `namespace_rtt_seconds` *
- `node_rtt_seconds`
- `workload_rtt_seconds` **
When the `DNSTracking` feature is enabled in `spec.agent.ebpf.features`, additional metrics are available:
- `namespace_dns_latency_seconds` *
- `node_dns_latency_seconds`
- `workload_dns_latency_seconds` **
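As a sketch, enabling these features in the `FlowCollector` could look like the following (only `PacketDrop` requires privileged mode; enable just the features you need):

```yaml
# ...
spec:
  agent:
    ebpf:
      # privileged mode is required for the PacketDrop feature
      privileged: true
      features:
        - PacketDrop
        - FlowRTT
        - DNSTracking
```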
The FlowMetrics API (spec reference) is designed to give you full control over the metrics generated from NetObserv's enriched NetFlow data. It allows you to create counters or histograms with any set of fields as Prometheus labels, and with any filters on those fields. One recommendation: be careful about the metrics cardinality when creating new metrics, as high-cardinality metrics can stress the Prometheus instance.
The full list of fields is available there. The "Cardinality" column gives information about the implied metrics cardinality. Fields flagged as `fine` are safe to use as labels. Fields flagged as `careful` need extra attention: if you want to use them as labels, it is recommended to narrow down the cardinality with filters. For example, you may safely use `DstPort` as a label if you also restrict which `DstPort` values are allowed with a `MatchRegex` filter, as illustrated below.
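Here is a hedged sketch of such a metric (the resource name, metric name and regex are hypothetical; check the spec reference for the exact filter syntax):

```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  # hypothetical name, for illustration only
  name: flowmetric-well-known-ingress-traffic
spec:
  metricName: well_known_ingress_bytes_total
  type: Counter
  valueField: Bytes
  direction: Ingress
  # DstPort is flagged "careful": the filter below keeps its cardinality low
  labels: [DstPort]
  filters:
  - field: DstPort
    matchType: MatchRegex
    value: "^(80|443|8080)$"
```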
Also be aware that for each field used as a label, the cardinality is potentially multiplied, and this is especially true when mixing Source and Destination fields. For instance, using `SrcK8S_Name` or `DstK8S_Name` (i.e. Pod/Node/Service names) alone as a label might be reasonable, but using both `SrcK8S_Name` and `DstK8S_Name` in the same metric potentially generates the square of the cardinality of Pods/Nodes/Services.
Don't hesitate to reach out if you need more guidance.
Some of those fields require special features to be enabled in the `FlowCollector`, such as `TimeFlowRttNs` via `spec.agent.ebpf.features` or `Src/DstK8S_Zone` via `spec.processor.addZone`, as sketched below.
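For instance (a minimal sketch, keeping only the relevant parts of the `FlowCollector` spec):

```yaml
# ...
spec:
  agent:
    ebpf:
      features:
        - FlowRTT        # makes TimeFlowRttNs available
  processor:
    addZone: true        # makes SrcK8S_Zone / DstK8S_Zone available
```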
Currently, `FlowMetric` resources need to be created in the namespace defined in the `FlowCollector` `spec.namespace`, which is `netobserv` by default. This may change in the future.
Here is an example of a FlowMetric resource that generates a metric tracking ingress bytes received from cluster external sources, labeled by destination host and workload:
```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: flowmetric-cluster-external-ingress-traffic
spec:
  metricName: cluster_external_ingress_bytes_total
  type: Counter
  valueField: Bytes
  direction: Ingress
  labels: [DstK8S_HostName,DstK8S_Namespace,DstK8S_OwnerName,DstK8S_OwnerType]
  filters:
  - field: SrcSubnetLabel
    matchType: Absence
```
In this example, selecting just the cluster-external traffic is done by matching only flows where `SrcSubnetLabel` is absent. This assumes the subnet labels feature is enabled (via `spec.processor.subnetLabels`) and configured to recognize IP ranges used in the cluster. In OpenShift, this is enabled and configured by default.
Refer to the spec reference for more information about each field.
Here is a similar example for a histogram. Histograms are typically used for latencies. This example shows the RTT latency for cluster-external ingress traffic.
```yaml
apiVersion: flows.netobserv.io/v1alpha1
kind: FlowMetric
metadata:
  name: flowmetric-cluster-external-ingress-rtt
spec:
  metricName: cluster_external_ingress_rtt_seconds
  type: Histogram
  valueField: TimeFlowRttNs
  direction: Ingress
  labels: [DstK8S_HostName,DstK8S_Namespace,DstK8S_OwnerName,DstK8S_OwnerType]
  filters:
  - field: SrcSubnetLabel
    matchType: Absence
  - field: TimeFlowRttNs
    matchType: Presence
  divider: "1000000000"
  buckets: [".001", ".005", ".01", ".02", ".03", ".04", ".05", ".075", ".1", ".25", "1"]
```
The `type` here is `Histogram` since it looks for a latency value (`TimeFlowRttNs`), and we define custom buckets that should offer decent precision on RTTs ranging roughly between 5ms and 250ms. Since the RTT is provided as nanoseconds in flows, we use a divider of 1 billion (1000000000) to convert it into seconds, which is the standard unit per the Prometheus guidelines.
You can find more examples in https://github.com/netobserv/network-observability-operator/tree/main/config/samples/flowmetrics.
Optionally, you can generate charts for dashboards in the OpenShift Console (administrator view, Dashboard menu), by filling the `charts` section of the `FlowMetric` resources. Here is an example for the `flowmetric-cluster-external-ingress-traffic` resource described above:
```yaml
# ...
charts:
- dashboardName: Main
  title: External ingress traffic
  unit: Bps
  type: SingleStat
  queries:
  - promQL: "sum(rate($METRIC[2m]))"
    legend: ""
- dashboardName: Main
  sectionName: External
  title: Top external ingress traffic per workload
  unit: Bps
  type: StackArea
  queries:
  - promQL: "sum(rate($METRIC{DstK8S_Namespace!=\"\"}[2m])) by (DstK8S_Namespace, DstK8S_OwnerName)"
    legend: "{{DstK8S_Namespace}} / {{DstK8S_OwnerName}}"
```
This creates two panels:
- a textual "single stat" that shows global external ingress rate summed across all dimensions
- a timeseries graph showing the same metric per destination workload
For more information about the query language, refer to the Prometheus documentation. And again, refer to the spec reference for more information about each field.
Another example for histograms:
```yaml
# ...
charts:
- dashboardName: Main
  title: External ingress TCP latency
  unit: seconds
  type: SingleStat
  queries:
  - promQL: "histogram_quantile(0.99, sum(rate($METRIC_bucket[2m])) by (le)) > 0"
    legend: "p99"
- dashboardName: Main
  sectionName: External
  title: "Top external ingress sRTT per workload, p50 (ms)"
  unit: seconds
  type: Line
  queries:
  - promQL: "histogram_quantile(0.5, sum(rate($METRIC_bucket{DstK8S_Namespace!=\"\"}[2m])) by (le,DstK8S_Namespace,DstK8S_OwnerName))*1000 > 0"
    legend: "{{DstK8S_Namespace}} / {{DstK8S_OwnerName}}"
- dashboardName: Main
  sectionName: External
  title: "Top external ingress sRTT per workload, p99 (ms)"
  unit: seconds
  type: Line
  queries:
  - promQL: "histogram_quantile(0.99, sum(rate($METRIC_bucket{DstK8S_Namespace!=\"\"}[2m])) by (le,DstK8S_Namespace,DstK8S_OwnerName))*1000 > 0"
    legend: "{{DstK8S_Namespace}} / {{DstK8S_OwnerName}}"
```
This example uses the `histogram_quantile` function to show p50 and p99.
You may also be interested in showing averages of histograms: this is done by dividing the `$METRIC_sum` by the `$METRIC_count` metrics, both of which are automatically generated when you create a histogram. With the above example, it would be:

```yaml
promQL: "(sum(rate($METRIC_sum{DstK8S_Namespace!=\"\"}[2m])) by (DstK8S_Namespace,DstK8S_OwnerName) / sum(rate($METRIC_count{DstK8S_Namespace!=\"\"}[2m])) by (DstK8S_Namespace,DstK8S_OwnerName))*1000"
```