-
Notifications
You must be signed in to change notification settings - Fork 378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: fix for cluster aggregation performance #631
Open
ssg2526
wants to merge
4
commits into
siimon:master
Choose a base branch
from
ssg2526:fix/cluster-performance
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really hesitant to involve files, since that suddenly involves the page cache, file system, more syscalls and possibly other surprises like filesystem permissions, running out of disk space and cleaning up. How much of an improvement did this yield? Did you benchmark before or after Node.js v18.6.0, which has this optimization? How much disk I/O load did your system have when benchmarking? I think IPC is "supposed" to be faster than involving the filesystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmarks that I have performed are on our own application. The screenshots of the results are shared in the
this Issue Link. The benchmarks were done on a 16 core machine and approximately 250-300 metrics (total size ~500 Kbs). At P99 and above levels we are getting ~10x improvement in app performance. The only disk I/Os we have is logging but that is also not happening on /tmp mounted drive on which we are writing our metrics.
The major choking point was with IPC and creating the map and hashing the object. The detailed bifurcation that I have done on my local machine which is 8 core machine is given In the screenshot below. and the code used is in below given zip file. The zip file contains the node modules as well because I have added some logs in prom-client to get the time data of each parts. Here in screenshot if we see the worst IPC time is always higher. I had run multiple iterations and same/similar results are found. The total aggregation time which is 72ms in this screenshot out of which 68ms is taken for hashing and building the map. I haven't tested with node v18.6.0 yet. will test the same code with that as well and share the results in the same thread
cluster_test.zip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In these benchmarks (issue link) the scraping interval used was every 5 seconds and the throughput on the app was about 1100-1200 RPS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @zbjornson
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @zbjornson @SimenB we went live in production with these changes about 3 months ago and everything is working fine for us. We are not seeing any additional latencies post this change. We serve more than 100,000 RPS at peak for this service. If required I can share the production results as well.
I understand that introducing file system can pose different challenges for the library, But we can explore communication with HTTP or some other mode of communication (Not IPC) when we have larger size of the metrics as the current solution will cause very high tail latencies (>P99) at high throughput and larger metrics sizes.
Requesting your inputs on this, In terms of how we can take this forward as we don't want to diverge from the library maintaining our own custom version.