NETOBSERV-1994: remove unneeded bpf map update calls #466

Open · wants to merge 1 commit into base: main

Conversation

msherif1234
Contributor

Description

The BPF code was making unnecessary calls to bpf_map_update_elem even though the map entry had already been updated in place, which also added unneeded CPU load.
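
For illustration, here is a minimal sketch of the pattern this change relies on; the flow_id / flow_metrics types and the map definition below are simplified placeholders, not the agent's actual definitions. The value returned by bpf_map_lookup_elem is a pointer into the map, so an existing entry can be mutated in place, and bpf_map_update_elem only needs to be called when the key is missing:

// Minimal sketch (simplified placeholder types, not the agent's real code):
// mutate an existing hashmap entry through the pointer returned by
// bpf_map_lookup_elem(), and call bpf_map_update_elem() only to create it.
#include <linux/types.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

typedef struct { __u32 src_ip, dst_ip; __u16 src_port, dst_port; } flow_id;
typedef struct { __u64 packets, bytes; } flow_metrics;

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, flow_id);
    __type(value, flow_metrics);
    __uint(max_entries, 1 << 16);
} aggregated_flows SEC(".maps");

static __always_inline int account_packet(flow_id *id, __u64 len) {
    flow_metrics *m = bpf_map_lookup_elem(&aggregated_flows, id);
    if (m != NULL) {
        // Entry already exists: update it in place, no extra map update call.
        __sync_fetch_and_add(&m->packets, 1);
        __sync_fetch_and_add(&m->bytes, len);
        return 0;
    }
    // Entry missing: only now pay for a map update, and only to create it.
    flow_metrics fresh = { .packets = 1, .bytes = len };
    return bpf_map_update_elem(&aggregated_flows, id, &fresh, BPF_NOEXIST);
}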

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test label (To set manually when a PR is safe to test. Triggers image build on PR.) Nov 25, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:9868e0f

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=9868e0f make set-agent-image

bpf/flows.c Outdated
@@ -152,7 +139,8 @@ static inline int flow_monitor(struct __sk_buff *skb, u8 direction) {
if (trace_messages) {
    bpf_printk("error adding flow %d\n", ret);
}

// Update global counter for hashmap update errors
increase_counter(HASHMAP_FLOWS_DROPPED);
Member

We shouldn't report drops here, since the flows are still sent via the RB; that would be misleading.
If the intent is to track hashmap update errors, we can already know that from the existing RB usage metric (i.e. if the RB is used, it's because a map update failed).

Contributor Author

OK yeah, I wanted to know when the hashmap isn't used and we fall back to the RB, but I agree it's a duplicate; I will remove it.

@github-actions github-actions bot removed the ok-to-test label Nov 26, 2024
Member

@jotak jotak left a comment

LGTM

@jotak
Member

jotak commented Nov 26, 2024

(I didn't find a noticeable improvement in CPU usage though ...)


codecov bot commented Nov 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 29.57%. Comparing base (294ae3f) to head (ab8e306).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #466   +/-   ##
=======================================
  Coverage   29.56%   29.57%           
=======================================
  Files          50       50           
  Lines        4867     4866    -1     
=======================================
  Hits         1439     1439           
+ Misses       3322     3321    -1     
  Partials      106      106           
Flag         Coverage Δ
unittests    29.57% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines      Coverage Δ
pkg/ebpf/bpf_x86_bpfel.go     0.00% <ø> (ø)
pkg/tracer/tracer.go          0.00% <ø> (ø)


openshift-ci bot commented Nov 26, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jotak. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@tohojo tohojo left a comment


A few nits, but also a more general comment: you're not handling failure of bpf_map_update(...,BPF_NOEXIST), which means that if two threads end up doing this concurrently, one of the creation attempts will be lost.

This is no worse than before the patch, though (before the patch, one attempt would silently overwrite the other, now one will just fail). But, well, it's possible to do better, so I would suggest handling the errors where possible :)
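
For concreteness, here is a rough sketch of what handling that failure could look like. It reuses identifiers visible in this PR's diff (aggregated_flows, update_existing_flow, flow_metrics), but the surrounding control flow is an assumption, not the final code: when bpf_map_update_elem(..., BPF_NOEXIST) returns -EEXIST because another CPU created the entry first, re-look it up and apply this packet's update in place so neither creation attempt is lost.

// Sketch only (fragment; assumes the surrounding flow_monitor() code and that
// EEXIST is available via <linux/errno.h>): recover from a lost BPF_NOEXIST race.
long ret = bpf_map_update_elem(&aggregated_flows, &id, &new_flow, BPF_NOEXIST);
if (ret == -EEXIST) {
    // Another CPU created the entry first: update it in place instead of
    // overwriting or dropping this packet's contribution.
    flow_metrics *aggregate_flow =
        (flow_metrics *)bpf_map_lookup_elem(&aggregated_flows, &id);
    if (aggregate_flow != NULL) {
        update_existing_flow(aggregate_flow, &pkt, len, dns_errno);
    }
}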

bpf/dns_tracker.h Outdated (resolved)
bpf/flows.c Outdated (resolved)
@msherif1234
Contributor Author

Are you referring to the DNS update, or is this more of a general comment and you want better handling for map update errors? Can you expand more if that is the case?

@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test label Nov 26, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:2920e5b

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=2920e5b make set-agent-image

@github-actions github-actions bot removed the ok-to-test label Nov 26, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test label Nov 26, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:4a20325

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=4a20325 make set-agent-image

@github-actions github-actions bot removed the ok-to-test label Nov 26, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test label Nov 26, 2024
    (flow_metrics *)bpf_map_lookup_elem(&aggregated_flows, &id);
if (aggregate_flow != NULL) {
    update_existing_flow(aggregate_flow, &pkt, dns_errno, len);
}

You could keep the HASHMAP_FLOWS_DROPPED counter and update it in an "else" branch here. I would expect this to basically never happen, but just to be on the safe side, and since you already have that counter.
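
A sketch of that suggestion, reusing the PR's identifiers (the exact placement is an assumption): count a drop only when the lookup that follows the EEXIST error unexpectedly finds nothing.

// Sketch only: after bpf_map_update_elem() reported EEXIST, the entry should
// be there; if it is not, record it with the existing drop counter.
flow_metrics *aggregate_flow =
    (flow_metrics *)bpf_map_lookup_elem(&aggregated_flows, &id);
if (aggregate_flow != NULL) {
    update_existing_flow(aggregate_flow, &pkt, len, dns_errno);
} else {
    // Should basically never happen, but keep the counter as a safety net.
    increase_counter(HASHMAP_FLOWS_DROPPED);
}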

Contributor Author

Yeah, that is why I felt it's less useful, but I can reuse it when the lookup after EEXIST fails.

// In this case, we send the single-packet flow via ringbuffer as in the worst case we can have
// a repeated INTERSECTION of flows (different flows aggregating different packets),
// which can be re-aggregated at userspace.
// other possible values https://chromium.googlesource.com/chromiumos/docs/+/master/constants/errnos.md
if (trace_messages) {
    bpf_printk("error adding flow %d\n", ret);
}

I would not personally consider EEXIST an error that needs logging (here, or in other places you have trace logging). Only if the subsequent lookup then fails to return an entry (see comment below).

Contributor Author

The thing is, a map update can fail for many reasons, e.g. EBUSY or E2BIG, so I need to capture those as well. Now, if the EEXIST error happens a lot, I can filter the trace to exclude EEXIST, and with the counter back we will know when we drop flows.


Yeah, so that's basically what I meant: Use the trace message only for errors other than EEXIST (just move the log statement a bit further down where you're handling those anyway)
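
Roughly what that could look like as a sketch (again reusing the PR's identifiers; the surrounding code is assumed): keep the trace message, but only for errors other than EEXIST, right where the ring-buffer fallback is taken.

// Sketch only: EEXIST is expected under concurrency and is handled by the
// in-place update path above, so only genuine failures (EBUSY, E2BIG, ...)
// are traced before falling back to the ring buffer.
if (ret != 0 && ret != -EEXIST) {
    if (trace_messages) {
        bpf_printk("error adding flow %d\n", ret);
    }
    // ... existing ring-buffer fallback for the single-packet flow ...
}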


New image:
quay.io/netobserv/netobserv-ebpf-agent:6042039

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=6042039 make set-agent-image

@github-actions github-actions bot removed the ok-to-test label Nov 26, 2024
@jotak jotak added the ok-to-test label Nov 27, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:b156a29

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=b156a29 make set-agent-image

Member

@jotak jotak left a comment

LGTM
I tested the first version, which worked well; I will need to retest when you tell me it's stable.

@openshift-ci openshift-ci bot added the lgtm label Nov 28, 2024
@jotak jotak changed the title remove unneeded bpf map update calls NETOBSERV-1994: remove unneeded bpf map update calls Nov 28, 2024
@openshift-ci-robot
Collaborator

openshift-ci-robot commented Nov 28, 2024

@msherif1234: This pull request references NETOBSERV-1994 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

        // Update global counter for hashmap update errors
        increase_counter(HASHMAP_FLOWS_DROPPED);
    }
    update_existing_flow(aggregate_flow, &pkt, dns_errno, len);
Member

must swap len and dns_errno

Suggested change
- update_existing_flow(aggregate_flow, &pkt, dns_errno, len);
+ update_existing_flow(aggregate_flow, &pkt, len, dns_errno);

flow_metrics *aggregate_flow =
    (flow_metrics *)bpf_map_lookup_elem(&aggregated_flows, &id);
if (aggregate_flow != NULL) {
    update_existing_flow(aggregate_flow, &pkt, dns_errno, len);
Member

here as well

Suggested change
- update_existing_flow(aggregate_flow, &pkt, dns_errno, len);
+ update_existing_flow(aggregate_flow, &pkt, len, dns_errno);

Member

@jotak jotak left a comment

While testing, it was showing unexpectedly low Bps on my workload test ... that's because of the wrong argument ordering.

@openshift-ci openshift-ci bot removed the lgtm label Nov 29, 2024
Labels: jira/valid-reference, ok-to-test

4 participants