-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use invoke_one when possible #448
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Shall I run some benchmarks comparing perf on H100?
Please do. I expect a maximum difference of 0.5% to 1%. |
🙅 Something's not right:
|
About 50% slower for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if the perf regression is worth investigating. If we can create a minimal repro we can ping our CG devs. I'm also ok with shelving the PR for now.
I cannot reproduce the performance regression with my local RTX8000:
Could this issue be H100-specific? |
That's expected on a <sm_90 arch. From sm_90 going forward the function will use a different code path leveraging the new |
Update: I ran the same exact benchmarks on another H100 node (CTK 12.3) today and wasn't able to reproduce the regression I initially reported. Will run another test on a different node and report back. |
@sleeepyjack Any updates on H100 perf results? |
auto const tile = cg::tiled_partition<CGSize>(block); | ||
auto const found = ref.find(tile, key); | ||
|
||
#if defined(CUCO_HAS_CG_INVOKE_ONE) | ||
cg::invoke_one(tile, [&]() { | ||
*(output_begin + idx) = found == ref.end() ? ref.empty_key_sentinel() : *found; | ||
}); | ||
#else | ||
if (tile.thread_rank() == 0) { | ||
*(output_begin + idx) = found == ref.end() ? ref.empty_key_sentinel() : *found; | ||
} | ||
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For context: This is the section I've been focussing on in my investigation.
I tested another H100 HBM3 node with CTK 12.3 and the performance regression is still present although much less pronounced (around 5-6% less throughput). The instruction diff between the baseline and this version is straightforward: if (tile.thread_rank() == 0) {...} compiles to
which sets The
Compared to the baseline, the new version additionally computes the mask of participating threads ( I have some profiles which I will share via Slack since I had to use an internal toolchain. |
This PR updates new open addressing implementations to use
cg::invoke_one
when possible.It doesn't change legacy implementations like multimap or dynamic map, etc.