Hyper log log plus plus(HLL++) #2522
base: branch-24.12
b6f5cf5 to 526a61f
src/main/cpp/src/HLLPP.cu
Outdated
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr)
{
  CUDF_EXPECTS(precision >= 4 && precision <= 18, "HLL++ requires precision in range: [4, 18]");
We can use std::numeric_limits<>::digits instead of the hardcoded values 4 and 18.
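The comment does not say which type's digits is meant, so the following is only a minimal sketch of the general idea of replacing the literals with named bounds. The names min_precision, max_precision and expect_valid_precision are hypothetical and not part of the PR; the values 4 and 18 are copied from the check under review. The static_asserts only record what the standard trait yields for std::int64_t, for comparison.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Named bounds (hypothetical names); the values come from the check under review.
constexpr int min_precision = 4;
constexpr int max_precision = 18;

// For reference only: what the standard trait reports for a 64-bit signed integer.
static_assert(std::numeric_limits<std::int64_t>::digits == 63, "value bits in int64_t");
static_assert(std::numeric_limits<std::int64_t>::digits10 == 18, "decimal digits in int64_t");

// Host-side stand-in for the CUDF_EXPECTS check: throws when precision is out of range.
void expect_valid_precision(int precision)
{
  if (precision < min_precision || precision > max_precision) {
    throw std::invalid_argument("HLL++ requires precision in range: [4, 18]");
  }
}
```

Whether either trait value is the right source for these bounds is an open question for the author; named constants at least keep the check and the error message in one place.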
  auto input_cols = std::vector<int64_t const*>(input_iter, input_iter + input.num_children());
  auto d_inputs = cudf::detail::make_device_uvector_async(input_cols, stream, mr);
  auto result = cudf::make_numeric_column(
    cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream);
Do we need an all-valid null mask here? How about cudf::mask_state::UNALLOCATED?
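To illustrate why the two states are interchangeable for readers of the column, here is a minimal host-side sketch. It does not use the cudf API: toy_column, is_valid and null_count are hypothetical stand-ins, with std::nullopt modelling an unallocated mask. In cudf, a column without a null mask is treated as having every row valid, so the only difference from ALL_VALID is that no mask buffer is allocated and filled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a nullable column; not the cudf API.
struct toy_column {
  std::vector<std::int64_t> data;
  std::optional<std::vector<bool>> validity;  // std::nullopt models an unallocated mask
};

// A row is valid when there is no mask at all, or when its mask bit is set.
bool is_valid(toy_column const& col, std::size_t row)
{
  assert(row < col.data.size());
  return !col.validity.has_value() || (*col.validity)[row];
}

// Count the rows whose mask bit is unset; zero when no mask exists.
std::size_t null_count(toy_column const& col)
{
  std::size_t nulls = 0;
  for (std::size_t row = 0; row < col.data.size(); ++row) {
    if (!is_valid(col, row)) { ++nulls; }
  }
  return nulls;
}
```

In this model a column built with std::nullopt and one built with a mask of all true bits answer every validity query identically, which is the point of the suggestion.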
  auto result = cudf::make_numeric_column(
    cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream);
  // evaluate from struct<long, ..., long>
  thrust::for_each_n(rmm::exec_policy(stream),
Try to use exec_policy_nosync as much as possible.
-  thrust::for_each_n(rmm::exec_policy(stream),
+  thrust::for_each_n(rmm::exec_policy_nosync(stream),
 * The input sketch values must be given in the format `LIST<INT8>`.
 *
 * @param input The sketch column which contains `LIST<INT8>` values.
INT8 or INT64?
In addition, in estimate_from_hll_sketches I see that the input is STRUCT<LONG, LONG, ....> instead of LIST<>. Why?
It's STRUCT<LONG, LONG, ....>, consistent with Spark. The input is columnar data: for example, sketch 0 is composed of the values at index 0 of every child column.
Updated the function comments, refer to commit
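The columnar layout described in this reply can be sketched on the host as follows. This is a minimal illustration, not the PR's device code: struct_of_longs and gather_sketch are hypothetical names, and each inner vector stands in for one LONG child column of the struct.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a STRUCT<LONG, ..., LONG> column:
// children[c][row] is the value of child column c at row `row`.
using struct_of_longs = std::vector<std::vector<std::int64_t>>;

// Collect the longs that make up one sketch: element `row` of every child.
std::vector<std::int64_t> gather_sketch(struct_of_longs const& children, std::size_t row)
{
  std::vector<std::int64_t> sketch;
  sketch.reserve(children.size());
  for (auto const& child : children) {
    assert(row < child.size());  // the row must exist in every child
    sketch.push_back(child[row]);
  }
  return sketch;
}
```

With three children holding {10, 11}, {20, 21} and {30, 31}, gather_sketch(children, 0) yields {10, 20, 30}: sketch 0 is spread across the children rather than stored contiguously, as it would be in a LIST<> layout.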