NETOBSERV-1994: remove unneeded bpf map update calls #466

Open · wants to merge 1 commit into base: main

Conversation

msherif1234
Contributor

Description

The BPF code was making unnecessary calls to bpf_map_update_elem even though the map entry had already been updated in place, which also added unneeded CPU load.
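
For illustration, here is a minimal sketch of the pattern this change relies on; the flow_id / flow_metrics types and the map definition below are simplified placeholders, not the agent's actual definitions. The value returned by bpf_map_lookup_elem is a pointer into the map, so an existing entry can be mutated in place, and bpf_map_update_elem only needs to be called when the key is missing:

// Minimal sketch (simplified placeholder types, not the agent's real code):
// mutate an existing hashmap entry through the pointer returned by
// bpf_map_lookup_elem(), and call bpf_map_update_elem() only to create it.
#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

typedef struct { __u32 src_ip, dst_ip; __u16 src_port, dst_port; } flow_id;
typedef struct { __u64 packets, bytes; } flow_metrics;

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, flow_id);
    __type(value, flow_metrics);
    __uint(max_entries, 1 << 16);
} aggregated_flows SEC(".maps");

static __always_inline int account_packet(flow_id *id, __u64 len) {
    flow_metrics *m = bpf_map_lookup_elem(&aggregated_flows, id);
    if (m != NULL) {
        // Entry already exists: update it in place, no extra map update call.
        __sync_fetch_and_add(&m->packets, 1);
        __sync_fetch_and_add(&m->bytes, len);
        return 0;
    }
    // Entry missing: only now pay for a map update, and only to create it.
    flow_metrics fresh = { .packets = 1, .bytes = len };
    return bpf_map_update_elem(&aggregated_flows, id, &fresh, BPF_NOEXIST);
}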

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test label (To set manually when a PR is safe to test. Triggers image build on PR.) Nov 25, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:9868e0f

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=9868e0f make set-agent-image

bpf/flows.c Outdated
@@ -152,7 +139,8 @@ static inline int flow_monitor(struct __sk_buff *skb, u8 direction) {
if (trace_messages) {
    bpf_printk("error adding flow %d\n", ret);
}

// Update global counter for hashmap update errors
increase_counter(HASHMAP_FLOWS_DROPPED);
Member

We shouldn't report drops here, since the flows are still sent via the RB; that would be misleading.
If the intent is to track hashmap update errors, we can already know that from the existing RB usage metric (i.e. if the RB is used, it's because a map update failed).

Contributor Author

OK yeah, I wanted to know when the hashmap isn't used and we fall back to the RB, but I agree it's a duplicate; I will remove it.

@github-actions github-actions bot removed the ok-to-test label Nov 26, 2024
Member

@jotak jotak left a comment

LGTM

@jotak
Member

jotak commented Nov 26, 2024

(I didn't find a noticeable improvement in CPU usage though ...)


codecov bot commented Nov 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 29.57%. Comparing base (294ae3f) to head (ab8e306).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #466   +/-   ##
=======================================
  Coverage   29.56%   29.57%           
=======================================
  Files          50       50           
  Lines        4867     4866    -1     
=======================================
  Hits         1439     1439           
+ Misses       3322     3321    -1     
  Partials      106      106           
Flag         Coverage Δ
unittests    29.57% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines      Coverage Δ
pkg/ebpf/bpf_x86_bpfel.go     0.00% <ø> (ø)
pkg/tracer/tracer.go          0.00% <ø> (ø)


openshift-ci bot commented Nov 26, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jotak. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@tohojo tohojo left a comment


A few nits, but also a more general comment: you're not handling failure of bpf_map_update(...,BPF_NOEXIST), which means that if two threads end up doing this concurrently, one of the creation attempts will be lost.

This is no worse than before the patch, though (before the patch, one attempt would silently overwrite the other, now one will just fail). But, well, it's possible to do better, so I would suggest handling the errors where possible :)
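
For concreteness, here is a rough sketch of what handling that failure could look like. It reuses identifiers visible in this PR's diff (aggregated_flows, update_existing_flow, flow_metrics), but the surrounding control flow is an assumption, not the final code: when bpf_map_update_elem(..., BPF_NOEXIST) returns -EEXIST because another CPU created the entry first, re-look it up and apply this packet's update in place so neither creation attempt is lost.

// Sketch only (fragment; assumes the surrounding flow_monitor() code and that
// EEXIST is available via <linux/errno.h>): recover from a lost BPF_NOEXIST race.
long ret = bpf_map_update_elem(&aggregated_flows, &id, &new_flow, BPF_NOEXIST);
if (ret == -EEXIST) {
    // Another CPU created the entry first: update it in place instead of
    // overwriting or dropping this packet's contribution.
    flow_metrics *aggregate_flow =
        (flow_metrics *)bpf_map_lookup_elem(&aggregated_flows, &id);
    if (aggregate_flow != NULL) {
        update_existing_flow(aggregate_flow, &pkt, len, dns_errno);
    }
}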

bpf/dns_tracker.h Outdated (resolved)
bpf/flows.c Outdated (resolved)
@msherif1234
Contributor Author

Are you referring to the DNS update, or is this more of a general comment and you want better handling for map update errors? Can you expand more if that is the case?

@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test label Nov 26, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:2920e5b

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=2920e5b make set-agent-image

@github-actions github-actions bot removed the ok-to-test label Nov 26, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test label Nov 26, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:4a20325

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=4a20325 make set-agent-image

@github-actions github-actions bot removed the ok-to-test label Nov 26, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test label Nov 26, 2024
    (flow_metrics *)bpf_map_lookup_elem(&aggregated_flows, &id);
if (aggregate_flow != NULL) {
    update_existing_flow(aggregate_flow, &pkt, dns_errno, len);
}

You could keep the HASHMAP_FLOWS_DROPPED counter and update it in an "else" branch here. I would expect this to basically never happen, but just to be on the safe side, and since you already have that counter.
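
A sketch of that suggestion, reusing the PR's identifiers (the exact placement is an assumption): count a drop only when the lookup that follows the EEXIST error unexpectedly finds nothing.

// Sketch only: after bpf_map_update_elem() reported EEXIST, the entry should
// be there; if it is not, record it with the existing drop counter.
flow_metrics *aggregate_flow =
    (flow_metrics *)bpf_map_lookup_elem(&aggregated_flows, &id);
if (aggregate_flow != NULL) {
    update_existing_flow(aggregate_flow, &pkt, len, dns_errno);
} else {
    // Should basically never happen, but keep the counter as a safety net.
    increase_counter(HASHMAP_FLOWS_DROPPED);
}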

Contributor Author

Yeah, that is why I felt it's less useful, but I can reuse it when the lookup after EEXIST fails.

// In this case, we send the single-packet flow via ringbuffer as in the worst case we can have
// a repeated INTERSECTION of flows (different flows aggregating different packets),
// which can be re-aggregated at userspace.
// other possible values https://chromium.googlesource.com/chromiumos/docs/+/master/constants/errnos.md
if (trace_messages) {
    bpf_printk("error adding flow %d\n", ret);
}

I would not personally consider EEXIST an error that needs logging (here, or in other places you have trace logging). Only if the subsequent lookup then fails to return an entry (see comment below).

Contributor Author

The thing is, a map update can fail for many reasons, e.g. EBUSY or E2BIG, so I need to capture those as well. Now, if the EEXIST error happens a lot, I can filter the trace to exclude EEXIST, and with the counter back we will know when we drop flows.


Yeah, so that's basically what I meant: Use the trace message only for errors other than EEXIST (just move the log statement a bit further down where you're handling those anyway)
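
Roughly what that could look like as a sketch (again reusing the PR's identifiers; the surrounding code is assumed): keep the trace message, but only for errors other than EEXIST, right where the ring-buffer fallback is taken.

// Sketch only: EEXIST is expected under concurrency and is handled by the
// in-place update path above, so only genuine failures (EBUSY, E2BIG, ...)
// are traced before falling back to the ring buffer.
if (ret != 0 && ret != -EEXIST) {
    if (trace_messages) {
        bpf_printk("error adding flow %d\n", ret);
    }
    // ... existing ring-buffer fallback for the single-packet flow ...
}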


New image:
quay.io/netobserv/netobserv-ebpf-agent:6042039

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=6042039 make set-agent-image

@github-actions github-actions bot removed the ok-to-test label Nov 26, 2024
@jotak jotak added the ok-to-test label Nov 27, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:b156a29

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=b156a29 make set-agent-image

Member

@jotak jotak left a comment

LGTM
I tested the first version, which worked well; I will need to retest when you tell me it's stable.

@openshift-ci openshift-ci bot added the lgtm label Nov 28, 2024
@jotak jotak changed the title remove unneeded bpf map update calls NETOBSERV-1994: remove unneeded bpf map update calls Nov 28, 2024
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Nov 28, 2024

@msherif1234: This pull request references NETOBSERV-1994 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

        // Update global counter for hashmap update errors
        increase_counter(HASHMAP_FLOWS_DROPPED);
    }
    update_existing_flow(aggregate_flow, &pkt, dns_errno, len);
Member

must swap len and dns_errno

Suggested change
- update_existing_flow(aggregate_flow, &pkt, dns_errno, len);
+ update_existing_flow(aggregate_flow, &pkt, len, dns_errno);

flow_metrics *aggregate_flow =
    (flow_metrics *)bpf_map_lookup_elem(&aggregated_flows, &id);
if (aggregate_flow != NULL) {
    update_existing_flow(aggregate_flow, &pkt, dns_errno, len);
Member

here as well

Suggested change
- update_existing_flow(aggregate_flow, &pkt, dns_errno, len);
+ update_existing_flow(aggregate_flow, &pkt, len, dns_errno);

Member

@jotak jotak left a comment

While testing, it was showing unexpectedly low Bps on my workload test ... that's because of the wrong argument ordering.

@openshift-ci openshift-ci bot removed the lgtm label Nov 29, 2024
Labels: jira/valid-reference, ok-to-test

4 participants