Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate making trie type small know about constant values for CJK Unified Ideographs and Hangul Syllables #5820

Open
hsivonen opened this issue Nov 14, 2024 · 1 comment
Labels
A-performance Area: Performance (CPU, Memory) C-collator Component: Collation, normalization

Comments

@hsivonen
Copy link
Member

Trie type small and fast differ in how they handle BMP code points above 0xFFF. The characters in this range that can be very frequent in text (as opposed to e.g. General Punctuation) and that affect a particularly large number of users are CJK characters.

Of these, it is common for all CJK Unified Ideographs to have a single trie value in common and for all Hangul Syllables to have (another) single trie value in common.

Thus, if trie type small special-cased these after branching on the 0xFFF bound but before running the general above-0xFFF code, chances are that most of the performance benefit of trie type fast could be had with trie type small (for Japanese this would obviously only speed up kanji but not kana).

Unfortunately, there isn't an exact one way of how this could go.

In the normalizer case, CJK Unified Ideographs get the default trie value of 0, but the range could be extended to also cover code points before and after the precise CJK Unified Ideographs range. Also, the caller code could statically provide the wanted trie value.

In the normalizer case, Hangul Syllables get the trie value of 1 and the caller could statically provide this value.

In the collator case, the trie is never queried for Hangul Syllables, so it's not worthwhile to branch on that range in the trie code. Also, CJK Unified Ideographs gets a constant trie value in the root collation in the implicithan case, but 1) the exact value cannot be hard-coded in the caller (as it is chosen by the data builder and could change across CLDR versions) and 2) there isn't a constant for the whole block in the typical CJK tailorings (which should probably be built with trie type fast even if the root has trie type small) or in the unihan-mode root.

@sffc
Copy link
Member

sffc commented Nov 14, 2024

Working group discussion:

  • @hsivonen The idea is that when we are looking up CJK ideographs or Hangul syllables, instead of having them in separate tries, we could combine them, and the lookup after the first branch to see if the code point is less than 0xFFF, and then the next branch could be to check whether it's in the Hangul range or not (which means it is otherwise a CJK ideograph), and that would potentially give the speed benefits of a fast trie, but for data that otherwise would need extra branching that a small trie requires.
  • @sffc This is is in the same category as normalize_str,in which after this new check/branch proposed, there's no more branching, correct?
  • @hsivonen No. We would need to include the combination of start, length, and value, both for range 1 and range 2. Maybe this would be an API hack. If we then wanted to take this to the Collator, then it would be an internal implementation detail.
  • @sffc OK so basically this is creating another special CodePointTrie API for Normalizer.
  • @hsivonen We could start with the hack for Normalizer first, yes.
  • @sffc It's not without cost. With the extra branching, there is extra code size and maintenance costs. It's not clear that the cost is worth it. You could just use trie type "fast".
  • @hsivonen If you look at the sizes for the NFK and UTS46 tries, there is a fairly substantial difference.

@sffc sffc added C-collator Component: Collation, normalization and removed discuss-priority Discuss at the next ICU4X meeting labels Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-performance Area: Performance (CPU, Memory) C-collator Component: Collation, normalization
Projects
None yet
Development

No branches or pull requests

2 participants