Hyper log log plus plus(HLL++) #2522

res-life · 2024-10-21T12:45:50Z

No description provided.

ttnghia · 2024-11-01T03:14:10Z

src/main/cpp/src/HLLPP.cu

+                                                         rmm::cuda_stream_view stream,
+                                                         rmm::device_async_resource_ref mr)
+{
+  CUDF_EXPECTS(precision >= 4 && precision <= 18, "HLL++ requires precision in range: [4, 18]");


We can use std::numeric_limits<>::digits instead of hardcoded values 4 and 18.

cuCo hardcodes 4, and Spark also hardcodes 4.
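The thread does not settle which `std::numeric_limits` type was meant, so the following is only a minimal sketch of the alternative that keeps the bounds but names them. The constant names `kMinPrecision` and `kMaxPrecision` and the helper `num_registers` are hypothetical and do not come from the PR; the only facts relied on are the `[4, 18]` range from the diff and the standard HyperLogLog relation that a precision of `p` yields `2^p` registers.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical names for the bounds that the diff writes as the literals 4 and 18.
constexpr int kMinPrecision = 4;
constexpr int kMaxPrecision = 18;

// Validates the precision and returns the number of HLL registers, m = 2^precision.
// Plain C++ stand-in for the CUDF_EXPECTS check; not the PR's implementation.
inline std::int64_t num_registers(int precision)
{
  if (precision < kMinPrecision || precision > kMaxPrecision) {
    throw std::invalid_argument("HLL++ requires precision in range: [4, 18]");
  }
  return std::int64_t{1} << precision;
}
```

Naming the bounds keeps the error message and the check in one place, which is one way to address the concern about hardcoded values without changing the accepted range.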

ttnghia · 2024-11-01T03:16:48Z

src/main/cpp/src/HLLPP.cu

+  auto input_cols = std::vector<int64_t const*>(input_iter, input_iter + input.num_children());
+  auto d_inputs   = cudf::detail::make_device_uvector_async(input_cols, stream, mr);
+  auto result     = cudf::make_numeric_column(
+    cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream);


Do we need an all-valid null mask here? How about cudf::mask_state::UNALLOCATED?

ttnghia · 2024-11-01T03:17:16Z

src/main/cpp/src/HLLPP.cu

+  auto result     = cudf::make_numeric_column(
+    cudf::data_type{cudf::type_id::INT64}, input.size(), cudf::mask_state::ALL_VALID, stream);
+  // evaluate from struct<long, ..., long>
+  thrust::for_each_n(rmm::exec_policy(stream),


Try to use exec_policy_nosync as much as possible.

Suggested change

-  thrust::for_each_n(rmm::exec_policy(stream),
+  thrust::for_each_n(rmm::exec_policy_nosync(stream),

ttnghia · 2024-11-01T03:19:15Z

src/main/java/com/nvidia/spark/rapids/jni/HLLPP.java

+   * The input sketch values must be given in the format `LIST<INT8>`.
+   *
+   * @param input         The sketch column which constains `LIST<INT8> values.


INT8 or INT64?

In addition, in estimate_from_hll_sketches I see that the input is STRUCT<LONG, LONG, ....> instead of LIST<>. Why?

It's STRUCT<LONG, LONG, ....>, consistent with Spark. The input is columnar data: for example, sketch 0 is composed of the values at index 0 of all the children.
Updated the function comments, refer to commit
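To make the columnar layout described in the reply concrete, here is a minimal host-side sketch in plain C++. It models each LONG child of the struct as a `std::vector<std::int64_t>`; the alias `child_columns` and the helper `gather_sketch` are hypothetical names used only for illustration and are not part of the PR or of cudf.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Each inner vector stands in for one LONG child column of the STRUCT.
// All children are assumed to have the same number of rows.
using child_columns = std::vector<std::vector<std::int64_t>>;

// Collects the words of sketch `row`: element `row` of every child, in child order.
inline std::vector<std::int64_t> gather_sketch(child_columns const& children, std::size_t row)
{
  std::vector<std::int64_t> sketch;
  sketch.reserve(children.size());
  for (auto const& child : children) {
    sketch.push_back(child.at(row));  // at() throws std::out_of_range for a bad row
  }
  return sketch;
}
```

With three children holding `{10, 11}`, `{20, 21}`, and `{30, 31}`, sketch 0 is `{10, 20, 30}` and sketch 1 is `{11, 21, 31}`: one sketch per row, one word per child.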

res-life requested a review from ttnghia October 21, 2024 12:45

res-life force-pushed the hll branch 3 times, most recently from b6f5cf5 to 526a61f Compare October 31, 2024 11:34

res-life changed the title ~~[Do not review] Hyper log log plus plus(HLL++)~~ Hyper log log plus plus(HLL++) Oct 31, 2024

res-life force-pushed the hll branch from 526a61f to b7abf6e Compare October 31, 2024 12:47

ttnghia reviewed Nov 1, 2024


Chong Gao added 3 commits November 21, 2024 13:26

Add HLL++ evaluation function

03c0f5a

Update function comments

df8b223

Fix

2daca3f

res-life force-pushed the hll branch from 11e97a9 to 2daca3f Compare November 21, 2024 07:33


res-life commented Oct 21, 2024

ttnghia Nov 1, 2024

res-life Nov 4, 2024

ttnghia Nov 1, 2024

ttnghia Nov 1, 2024

ttnghia Nov 1, 2024

ttnghia Nov 1, 2024 • edited

res-life Nov 4, 2024
