[improvement] Some performance related changes to evaluate #1426

mdonkers · 2024-10-17T08:59:20Z

Summary

Based on some CPU and memory analysis we ran on our production systems, we noticed a few things that could be improved.

I'd like to present these and discuss whether it makes sense to merge them into the project (otherwise we'll likely keep them on a local fork). Some of the changes may still need a bit of work or could be done in another way. All feedback welcome!

Allow String column provider to take the name, so that it can use different settings

For String column types it's possible to define a ColStrProvider, but only once and globally. While ColStrProvider is great for pre-sizing buffers when creating batches, not every column needs as much space. Some may be limited to a few characters, while others (e.g. log bodies) can be hundreds of characters.
Passing the name of the column to the ColStrProvider lets its internal logic size buffers differently per column, or apply other per-column logic.
This change does not affect users of the API.
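As a rough illustration of the idea, here is a minimal sketch of a name-aware provider. It does not use the driver's real types: `colStr`, `namedColStrProvider`, `bufferHints`, `defaultBufferCap` and the column names are all hypothetical stand-ins, and the actual signature in the PR may differ.

```go
package main

import "fmt"

// colStr is a simplified stand-in for the driver's string column buffer.
type colStr struct {
	Buf []byte
}

// namedColStrProvider is a hypothetical provider signature that receives
// the column name, as proposed in this PR.
type namedColStrProvider func(name string) colStr

// bufferHints holds illustrative per-column initial capacities in bytes.
var bufferHints = map[string]int{
	"Body":     1 << 20, // log bodies tend to be long
	"Severity": 4 << 10, // short, fixed vocabulary
}

// defaultBufferCap is used for columns without an explicit hint.
const defaultBufferCap = 64 << 10

// sizedProvider pre-sizes the buffer from the hint table and falls back
// to the default capacity for unknown columns.
func sizedProvider(name string) colStr {
	capacity, ok := bufferHints[name]
	if !ok {
		capacity = defaultBufferCap
	}
	return colStr{Buf: make([]byte, 0, capacity)}
}

func main() {
	var provider namedColStrProvider = sizedProvider
	for _, name := range []string{"Body", "Severity", "TraceId"} {
		fmt.Printf("%s: %d bytes\n", name, cap(provider(name).Buf))
	}
}
```

A lookup table keeps the sizing policy in one place; the same hook could just as well consult measured averages from previous batches.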

Prevent hashing enums just for checking validity

We are creating quite big batches, and have quite a few enums in our records. In CPU profiles we noticed a lot of time was spent in map lookups, just for adding the enum to the batch. The lookup is done to check whether the enum value is valid, since enum values don't need to form a continuous range in ClickHouse.

The optimization adds a bit of overhead up front, by checking whether the Enum definition is a continuous range and capturing the lower and upper bound. If it is continuous, validation only needs to check whether the number falls between the lower and upper bound. This simple comparison is much faster overall than the map lookup.
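The check described above can be sketched roughly as follows. This is a self-contained illustration rather than the code in the PR: `enumValidator`, `newEnumValidator` and `valid` are hypothetical names, and `int16` is used only as an example value type.

```go
package main

import "fmt"

// enumValidator is a hypothetical helper: it keeps the set of defined
// values and, when they form an unbroken range, the range bounds.
type enumValidator struct {
	values     map[int16]struct{}
	contiguous bool
	lower      int16
	upper      int16
}

// newEnumValidator does the one-time up-front work: collect the values,
// track the bounds, and decide whether the range has gaps.
func newEnumValidator(defs map[string]int16) *enumValidator {
	v := &enumValidator{values: make(map[int16]struct{}, len(defs))}
	first := true
	for _, n := range defs {
		v.values[n] = struct{}{}
		if first || n < v.lower {
			v.lower = n
		}
		if first || n > v.upper {
			v.upper = n
		}
		first = false
	}
	// Distinct values fill [lower, upper] exactly when their count
	// equals the width of the range.
	v.contiguous = len(v.values) > 0 &&
		int(v.upper)-int(v.lower)+1 == len(v.values)
	return v
}

// valid uses two comparisons on the fast path and falls back to the
// map lookup only for enums with gaps.
func (v *enumValidator) valid(n int16) bool {
	if v.contiguous {
		return n >= v.lower && n <= v.upper
	}
	_, ok := v.values[n]
	return ok
}

func main() {
	dense := newEnumValidator(map[string]int16{"a": 1, "b": 2, "c": 3})
	sparse := newEnumValidator(map[string]int16{"a": 1, "b": 5})
	fmt.Println(dense.valid(2), dense.valid(4))
	fmt.Println(sparse.valid(3), sparse.valid(5))
}
```

The map is kept even for contiguous enums so that the fallback path stays trivially correct; a real implementation could skip building it when the range check is sufficient.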

Allow tuple type to prevent array creation overhead

This is probably the change with the least impact, though it could perhaps be improved. Normally, when using the Tuple type, the values need to be inserted as an Array / Slice type. Depending on how the data is represented internally before inserting, that might mean new slices need to be allocated.
So I'm wondering if a specific Tuple type makes sense, with specific types for the most common lengths (like Tuple2, Tuple3 and Tuple4).

Usage would look something like this:

type ValueWithTypeTuple struct {
	V string
	T int8
}

// ValueTypeEmpty is assumed to be an int8 constant defined elsewhere in the application.
var EmptyValueTuple = ValueWithTypeTuple{V: "", T: ValueTypeEmpty}

// Get implements the column.Tuple2 interface from ClickHouse. It returns two values that can be inserted as Tuple.
func (avt ValueWithTypeTuple) Get() (any, any) {
	return &avt.V, avt.T
}
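For context, the consuming side might look roughly like the sketch below. It is only an illustration under stated assumptions: `Tuple2` mirrors the interface named in the comment above, while `tupleColumn`, `appendRow` and `pair` are hypothetical and far simpler than the driver's actual Tuple column.

```go
package main

import "fmt"

// Tuple2 mirrors the two-element interface proposed in this PR.
type Tuple2 interface {
	Get() (any, any)
}

// pair is an example application type that implements Tuple2.
type pair struct {
	Key   string
	Count int64
}

func (p pair) Get() (any, any) { return p.Key, p.Count }

// tupleColumn is a simplified stand-in for a two-element Tuple column:
// one slice per sub-column.
type tupleColumn struct {
	first  []any
	second []any
}

// appendRow accepts either a Tuple2 (no intermediate slice needed) or
// the existing []any form, and rejects anything else.
func (c *tupleColumn) appendRow(v any) error {
	switch t := v.(type) {
	case Tuple2:
		a, b := t.Get()
		c.first = append(c.first, a)
		c.second = append(c.second, b)
		return nil
	case []any:
		if len(t) != 2 {
			return fmt.Errorf("expected 2 tuple elements, got %d", len(t))
		}
		c.first = append(c.first, t[0])
		c.second = append(c.second, t[1])
		return nil
	default:
		return fmt.Errorf("unsupported tuple value %T", v)
	}
}

func main() {
	col := &tupleColumn{}
	fmt.Println(col.appendRow(pair{Key: "a", Count: 1}))
	fmt.Println(col.appendRow([]any{"b", int64(2)}))
	fmt.Println(col.appendRow(42))
}
```

The `Tuple2` branch avoids building a `[]any` per row; whether that saves allocations in practice depends on how the element values themselves are boxed.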

Checklist

Delete items not relevant to your PR:

Unit and integration tests covering the common scenarios were added
A human-readable description of the changes was provided to include in CHANGELOG
For significant changes, documentation in https://github.com/ClickHouse/clickhouse-docs was updated with further explanations or tutorials

mdonkers · 2024-10-17T18:44:15Z

@jkaflik by the way, I don't expect the PR to be ready for merging in its current state.
I wanted to discuss the approach first, before spending a lot of time on it. Everything works and compiles though.

jkaflik · 2024-11-04T08:44:46Z

@SpencerTorres could you take a look please?

Some performance related changes to evaluate

362c740

- Allow tuple type to prevent array creation overhead
- Prevent hashing enums just for checking validity
- Allow String column provider to take the name, so that it can use different settings

jkaflik self-requested a review October 17, 2024 10:38

jkaflik added enhancement performance labels Oct 17, 2024

jkaflik requested review from SpencerTorres and removed request for jkaflik November 4, 2024 08:44
