Investigate making trie type small know about constant values for CJK Unified Ideographs and Hangul Syllables #5820

hsivonen · 2024-11-14T07:07:40Z

Trie type small and fast differ in how they handle BMP code points above 0xFFF. The characters in this range that can be very frequent in text (as opposed to e.g. General Punctuation) and that affect a particularly large number of users are CJK characters.

Of these, it is common for all CJK Unified Ideographs to have a single trie value in common and for all Hangul Syllables to have (another) single trie value in common.

Thus, if trie type small special-cased these after branching on the 0xFFF bound but before running the general above-0xFFF code, chances are that most of the performance benefit of trie type fast could be had with trie type small (for Japanese this would obviously only speed up kanji but not kana).

Unfortunately, there isn't an exact one way of how this could go.

In the normalizer case, CJK Unified Ideographs get the default trie value of 0, but the range could be extended to also cover code points before and after the precise CJK Unified Ideographs range. Also, the caller code could statically provide the wanted trie value.

In the normalizer case, Hangul Syllables get the trie value of 1 and the caller could statically provide this value.

In the collator case, the trie is never queried for Hangul Syllables, so it's not worthwhile to branch on that range in the trie code. Also, CJK Unified Ideographs gets a constant trie value in the root collation in the implicithan case, but 1) the exact value cannot be hard-coded in the caller (as it is chosen by the data builder and could change across CLDR versions) and 2) there isn't a constant for the whole block in the typical CJK tailorings (which should probably be built with trie type fast even if the root has trie type small) or in the unihan-mode root.

sffc · 2024-11-14T22:49:22Z

Working group discussion:

@hsivonen The idea is that when we are looking up CJK ideographs or Hangul syllables, instead of having them in separate tries, we could combine them, and the lookup after the first branch to see if the code point is less than 0xFFF, and then the next branch could be to check whether it's in the Hangul range or not (which means it is otherwise a CJK ideograph), and that would potentially give the speed benefits of a fast trie, but for data that otherwise would need extra branching that a small trie requires.
@sffc This is is in the same category as normalize_str,in which after this new check/branch proposed, there's no more branching, correct?
@hsivonen No. We would need to include the combination of start, length, and value, both for range 1 and range 2. Maybe this would be an API hack. If we then wanted to take this to the Collator, then it would be an internal implementation detail.
@sffc OK so basically this is creating another special CodePointTrie API for Normalizer.
@hsivonen We could start with the hack for Normalizer first, yes.
@sffc It's not without cost. With the extra branching, there is extra code size and maintenance costs. It's not clear that the cost is worth it. You could just use trie type "fast".
@hsivonen If you look at the sizes for the NFK and UTS46 tries, there is a fairly substantial difference.

hsivonen added the A-performance Area: Performance (CPU, Memory) label Nov 14, 2024

hsivonen mentioned this issue Nov 14, 2024

Trie-based normalization passthrough is slower than in ICU4C #2431

Open

hsivonen added the discuss-priority Discuss at the next ICU4X meeting label Nov 14, 2024

sffc mentioned this issue Nov 14, 2024

Small-only and Fast-only getters for CodePointTrie #5796

Open

sffc added C-collator Component: Collation, normalization and removed discuss-priority Discuss at the next ICU4X meeting labels Nov 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate making trie type small know about constant values for CJK Unified Ideographs and Hangul Syllables #5820

Investigate making trie type small know about constant values for CJK Unified Ideographs and Hangul Syllables #5820

hsivonen commented Nov 14, 2024

sffc commented Nov 14, 2024

Investigate making trie type small know about constant values for CJK Unified Ideographs and Hangul Syllables #5820

Investigate making trie type small know about constant values for CJK Unified Ideographs and Hangul Syllables #5820

Comments

hsivonen commented Nov 14, 2024

sffc commented Nov 14, 2024