UCA 16 move numerics after digits; CLDR stop reordering by gc #762

markusicu · 2024-04-06T00:34:25Z

for PAG issue 99: "re-align the DUCET & CLDR: 20A8 RUPEE SIGN & FDFC RIAL SIGN"
- makes these sort in CLDR like letter sequences, as they already do in the DUCET
for PAG issue 101: "re-align the DUCET & CLDR: order of groups below letters"
- changes both DUCET & CLDR to sort non-digit numerics after digits
- as a result, both sort orders are nearly the same
- exceptions: ten Tibetan contractions, and CLDR tailorings of U+FFFE & U+FFFF

UTC-179: https://www.unicode.org/L2/L2024/24061.htm
PAG report -->
Section 7.4 re-align the DUCET & CLDR: order of groups below letters

[179-C38] Consensus: In the UCA DUCET, move the non-decimal-digit numerics to sort right after decimal digits. For Unicode Version 16.0. See document L2/24-064 item 7.4.
[179-A123] Action Item for Ken Whistler, PAG: In the UCA DUCET, move the non-decimal-digit numerics to sort right after decimal digits. For Unicode Version 16.0. See document L2/24-064 item 7.4.

Remaining differences between the sort orders:

~/unitools/mine/Generated/UCA/16.0.0$ diff -u Ducet/allkeys_DUCET.txt CollationAuxiliary/allkeys_CLDR.txt

--- Ducet/allkeys_DUCET.txt	2024-04-05 17:13:58.981995514 -0700
+++ CollationAuxiliary/allkeys_CLDR.txt	2024-04-05 17:13:50.253316966 -0700
@@ -1,11 +1,12 @@
-# allkeys_DUCET.txt
-# Date: 2024-04-06, 00:13:58 GMT
+# allkeys_CLDR.txt
+# Date: 2024-04-06, 00:13:49 GMT
 # © 2024 Unicode®, Inc.
 # Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
 # For terms of use, see https://www.unicode.org/terms_of_use.html
 # UCA Version: 16.0.0
 # UCD Version: 16.0.0
-# For a description of the format and usage, see CollationTest.html
+# For a description of the format and usage, see
+# http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Data_Files
 
 @version 16.0.0
 
@@ -1620,6 +1621,7 @@
 20E7  ; [.0000.011B.0002] # COMBINING ANNUITY SYMBOL
 20E8  ; [.0000.011C.0002] # COMBINING TRIPLE UNDERDOT
 20E9  ; [.0000.011D.0002] # COMBINING WIDE BRIDGE ABOVE
+FFFE  ; [.0001.0020.0002] # <noncharacter-FFFE>
 0009  ; [*0201.0020.0002] # <CHARACTER TABULATION>
 000A  ; [*0202.0020.0002] # <LINE FEED (LF)>
 000B  ; [*0203.0020.0002] # <LINE TABULATION>
@@ -21467,9 +21469,19 @@
 0F6A  ; [.3793.0020.0004][.0000.011F.0004] # TIBETAN LETTER FIXED-FORM RA
 0FB2  ; [.3794.0020.0002] # TIBETAN SUBJOINED LETTER RA
 0FBC  ; [.3794.0020.0004][.0000.011F.0004] # TIBETAN SUBJOINED LETTER FIXED-FORM RA
+0FB2 0F71 ; [.3794.0020.0002][.37AA.0020.0002] # TIBETAN SUBJOINED LETTER RA, TIBETAN VOWEL SIGN AA
+0FB2 0F71 0F72 ; [.3794.0020.0002][.37AC.0020.0002] # TIBETAN SUBJOINED LETTER RA, TIBETAN VOWEL SIGN AA, TIBETAN VOWEL SIGN I
+0FB2 0F73 ; [.3794.0020.0002][.37AC.0020.0002] # TIBETAN SUBJOINED LETTER RA, TIBETAN VOWEL SIGN II
+0FB2 0F71 0F74 ; [.3794.0020.0002][.37B0.0020.0002] # TIBETAN SUBJOINED LETTER RA, TIBETAN VOWEL SIGN AA, TIBETAN VOWEL SIGN U
+0FB2 0F75 ; [.3794.0020.0002][.37B0.0020.0002] # TIBETAN SUBJOINED LETTER RA, TIBETAN VOWEL SIGN UU
 0F6C  ; [.3795.0020.0002] # TIBETAN LETTER RRA
 0F63  ; [.3796.0020.0002] # TIBETAN LETTER LA
 0FB3  ; [.3797.0020.0002] # TIBETAN SUBJOINED LETTER LA
+0FB3 0F71 ; [.3797.0020.0002][.37AA.0020.0002] # TIBETAN SUBJOINED LETTER LA, TIBETAN VOWEL SIGN AA
+0FB3 0F71 0F72 ; [.3797.0020.0002][.37AC.0020.0002] # TIBETAN SUBJOINED LETTER LA, TIBETAN VOWEL SIGN AA, TIBETAN VOWEL SIGN I
+0FB3 0F73 ; [.3797.0020.0002][.37AC.0020.0002] # TIBETAN SUBJOINED LETTER LA, TIBETAN VOWEL SIGN II
+0FB3 0F71 0F74 ; [.3797.0020.0002][.37B0.0020.0002] # TIBETAN SUBJOINED LETTER LA, TIBETAN VOWEL SIGN AA, TIBETAN VOWEL SIGN U
+0FB3 0F75 ; [.3797.0020.0002][.37B0.0020.0002] # TIBETAN SUBJOINED LETTER LA, TIBETAN VOWEL SIGN UU
 0F64  ; [.3798.0020.0002] # TIBETAN LETTER SHA
 0FB4  ; [.3799.0020.0002] # TIBETAN SUBJOINED LETTER SHA
 0F65  ; [.379A.0020.0002] # TIBETAN LETTER SSA
@@ -39417,6 +39429,5 @@
 2FA14 ; [.FB85.0020.0002][.A291.0000.0000] # CJK COMPATIBILITY IDEOGRAPH-2FA14
 2F88F ; [.FB85.0020.0002][.A392.0000.0000] # CJK COMPATIBILITY IDEOGRAPH-2F88F
 2FA1D ; [.FB85.0020.0002][.A600.0000.0000] # CJK COMPATIBILITY IDEOGRAPH-2FA1D
-FFFE  ; [.FBC1.0020.0002][.FFFE.0000.0000] # <noncharacter-FFFE>
-FFFF  ; [.FBC1.0020.0002][.FFFF.0000.0000] # <noncharacter-FFFF>
 FFFD  ; [.FFFD.0020.0002] # REPLACEMENT CHARACTER
+FFFF  ; [.FFFE.0020.0002] # <noncharacter-FFFF>
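
For readers who want to compare individual entries rather than eyeball the diff, here is a minimal sketch of parsing one mapping line in the format shown above. The class and method names (`AllkeysLineSketch`, `parse`) are hypothetical and not part of unicodetools; the sketch assumes only the line shape visible in the diff: hex code points, a semicolon, one or more bracketed collation elements, where `*` marks a variable element and `.` a non-variable one, then a `#` comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not part of unicodetools. */
final class AllkeysLineSketch {
    /** One collation element: variable flag plus primary/secondary/tertiary weights. */
    static final class CE {
        final boolean variable;
        final int primary;
        final int secondary;
        final int tertiary;

        CE(boolean variable, int primary, int secondary, int tertiary) {
            this.variable = variable;
            this.primary = primary;
            this.secondary = secondary;
            this.tertiary = tertiary;
        }
    }

    /** The code points left of the semicolon and the elements right of it. */
    static final class Entry {
        final List<Integer> codePoints;
        final List<CE> ces;

        Entry(List<Integer> codePoints, List<CE> ces) {
            this.codePoints = codePoints;
            this.ces = ces;
        }
    }

    // Matches one element such as [.0001.0020.0002] or [*0201.0020.0002].
    private static final Pattern CE_PATTERN =
            Pattern.compile("\\[([.*])([0-9A-F]{4,})\\.([0-9A-F]{4})\\.([0-9A-F]{4})\\]");

    static Entry parse(String line) {
        // Drop the trailing "# NAME" comment, then split at the semicolon.
        int hash = line.indexOf('#');
        String data = hash >= 0 ? line.substring(0, hash) : line;
        int semi = data.indexOf(';');
        if (semi < 0) {
            throw new IllegalArgumentException("not a mapping line: " + line);
        }
        List<Integer> codePoints = new ArrayList<>();
        for (String hex : data.substring(0, semi).trim().split("\\s+")) {
            codePoints.add(Integer.parseInt(hex, 16));
        }
        List<CE> ces = new ArrayList<>();
        Matcher m = CE_PATTERN.matcher(data.substring(semi + 1));
        while (m.find()) {
            ces.add(new CE(
                    m.group(1).equals("*"),
                    Integer.parseInt(m.group(2), 16),
                    Integer.parseInt(m.group(3), 16),
                    Integer.parseInt(m.group(4), 16)));
        }
        return new Entry(codePoints, ces);
    }
}
```

Header lines (starting with `#`) and `@version` lines have no semicolon in the data part, so this sketch rejects them with an exception; a caller would skip those before calling `parse`.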

Ken-Whistler · 2024-04-06T01:00:50Z

Markus, you can pick up a small revision of unisift.c from kenfiles/uca160/ to fix the botched edit in the comment.

For PAG issue 101 "re-align the DUCET & CLDR: order of groups below letters"

From Ken: UCA 16.0 delta 17

This implements the move of the range of non-decimal numerics, so they get primary weights *after* 0..9 and are no longer marked as variables.

The change to unidata.txt is diffable, although it involves a large change: 570 lines of input for these numeric entries were moved down from before the extenders to between the entries for DIGIT NINE (and others numerically equivalent to 9) and LATIN SMALL LETTER A. And then there are a few comment lines of explanation added.

The change to allkeys.txt is simply describable, but not really diffable unless you ignore the primary weight assignments. The numerics are no longer variables, but now have primary weights in the range 2187..237F, so sort after DIGIT NINE (with primary weight of 2186), but ahead of LATIN SMALL LETTER A. Primary weights from LATIN SMALL LETTER A onward were unaffected, but the move of the numerics shifted all the primary weights for extenders, currency signs, and digits.

The size of the generated file is identical to the previous one, which is a good sign. The number of primary weights is also identical, as expected. The first non-variable is still U+02D0 MODIFIER LETTER TRIANGULAR COLON, as expected, but its primary weight is 212A, instead of 2323. I assume your code automatically adjusts to identify the weight of the first non-variable.

I also regenerated decomps.txt. It isn't impacted by the numerics rearrangement, but it does pick up the additional synthetic decomposition added for the Tulu-Tigalari looped virama.

You will also need to pick up a small change to the sifter source code in order to be able to replicate this output: sifter/unisift.c

The change is very small -- I simply had to comment out two lines in the branch in the main sift dealing with numerics which set the identified characters to variables. The rest just all falls out automatically given the change in the input file.
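
The primary-weight layout described in this comment can be sketched as a small range check. This is only an illustration: the class, enum, and method names are hypothetical, the four constants are the values quoted in the comment for this delta 17 output, and the group labels are an interpretation of the description rather than anything defined by the sifter or unicodetools.

```java
/** Illustrative sketch only; the boundaries are the values quoted in the comment above. */
final class Uca16PrimaryRangesSketch {
    static final int FIRST_NON_VARIABLE = 0x212A; // U+02D0 MODIFIER LETTER TRIANGULAR COLON
    static final int DIGIT_NINE = 0x2186;
    static final int NUMERICS_FIRST = 0x2187;
    static final int NUMERICS_LAST = 0x237F;

    enum Group {
        BELOW_FIRST_NON_VARIABLE,          // variables, plus anything with a lower primary
        FIRST_NON_VARIABLE_THROUGH_DIGITS, // extenders, currency signs, digits (per the comment)
        NON_DECIMAL_NUMERICS,              // the range that was moved in this change
        AFTER_NUMERICS                     // LATIN SMALL LETTER A onward
    }

    static Group classify(int primary) {
        if (primary < FIRST_NON_VARIABLE) {
            return Group.BELOW_FIRST_NON_VARIABLE;
        }
        if (primary <= DIGIT_NINE) {
            return Group.FIRST_NON_VARIABLE_THROUGH_DIGITS;
        }
        if (primary <= NUMERICS_LAST) {
            // NUMERICS_FIRST is DIGIT_NINE + 1, so nothing falls between the two ranges.
            return Group.NON_DECIMAL_NUMERICS;
        }
        return Group.AFTER_NUMERICS;
    }
}
```

The exact primary of LATIN SMALL LETTER A is not stated in the comment, so the sketch only treats everything above 237F as "after the numerics".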

markusicu · 2024-04-06T02:42:46Z

> Markus, you can pick up a small revision of unisift.c from kenfiles/uca160/ to fix the botched edit in the comment.

Thanks -- I changed the first commit to replace the file there with your fixed version.

- for PAG issue 99: "re-align the DUCET & CLDR: 20A8 RUPEE SIGN & FDFC RIAL SIGN"
- for PAG issue 101: "re-align the DUCET & CLDR: order of groups below letters"

markusicu · 2024-05-01T20:07:48Z

unicodetools/src/main/java/org/unicode/draft/ScriptCount.java

@@ -83,7 +83,7 @@ public int compareTo(SecondaryInfo arg0) {
    }

    static class SecondaryCounts {
-        private final UCA uca = UCA.buildCollator(null);
+        private final UCA uca = UCA.buildDucetCollator();


FYI: The null was for the Remap object which I removed. null meant DUCET, not CLDR.

The internal function now takes two primary weights, which could be -1 for the DUCET. Rather than make several call sites even less readable, I created a function that says "DUCET" and does not take parameters.
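
The pattern described here, an internal builder that takes two primaries with -1 as the DUCET case plus a parameterless convenience entry point, might look roughly like the following. Only the name `buildDucetCollator` comes from the diff; the class `CollatorSketch`, the constant `NOT_TAILORED`, the method `buildTailored`, and the assumption that the two primaries are the variableHigh and the first DUCET non-variable are hypothetical, and this is not the actual UCA.java code.

```java
/** Illustrative sketch only; not the actual org.unicode.text.UCA.UCA class. */
final class CollatorSketch {
    /** Sentinel for "not supplied by the caller; take it from the data" (the DUCET case). */
    static final int NOT_TAILORED = -1;

    private final int variableHigh;
    private final int firstDucetNonVariable;

    private CollatorSketch(int variableHigh, int firstDucetNonVariable) {
        this.variableHigh = variableHigh;
        this.firstDucetNonVariable = firstDucetNonVariable;
    }

    /** Internal builder: takes the two primaries, either of which may be NOT_TAILORED. */
    private static CollatorSketch build(int variableHigh, int firstDucetNonVariable) {
        return new CollatorSketch(variableHigh, firstDucetNonVariable);
    }

    /** Says "DUCET" at the call site and needs no parameters. */
    static CollatorSketch buildDucetCollator() {
        return build(NOT_TAILORED, NOT_TAILORED);
    }

    /** Hypothetical counterpart for a caller that supplies both primaries. */
    static CollatorSketch buildTailored(int variableHigh, int firstDucetNonVariable) {
        if (variableHigh < 0 || firstDucetNonVariable < 0) {
            throw new IllegalArgumentException("tailored primaries must be non-negative");
        }
        return build(variableHigh, firstDucetNonVariable);
    }

    boolean isDucet() {
        return variableHigh == NOT_TAILORED && firstDucetNonVariable == NOT_TAILORED;
    }
}
```

A call site then reads `CollatorSketch.buildDucetCollator()` instead of passing sentinel arguments, which is the readability point made above.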

markusicu · 2024-05-01T20:09:24Z

unicodetools/src/main/java/org/unicode/text/UCA/UCA.java

+     * Initializes the collation from a stream of rules in the allkeys.txt format. If the source is
+     * null, uses the normal Unicode data files, which need to be in BASE_DIR.
+     */
+    public UCA(String sourceFile, String unicodeVersion) throws java.io.IOException {


FYI: Same for the UCA constructor. The implementation function now takes two primaries instead of the obsolete class Remap, but I added a convenience constructor for the DUCET, without additional parameters that would have to be "nulled".

markusicu · 2024-05-01T20:10:33Z

unicodetools/src/main/java/org/unicode/text/UCA/UCA.java

@@ -1799,61 +1809,4 @@ public static UCA buildCollator(Remap primaryRemap) {
    UCA_Statistics getStatistics() {
        return ucaData.statistics;
    }
-
-    public static final class Remap {


FYI: We no longer perform a permutation! 🎉

markusicu · 2024-05-01T20:11:21Z

unicodetools/src/main/java/org/unicode/text/UCA/UCA.java

-        private int variableHigh;
-        private int firstDucetNonVariable;


FYI: We still need to carry these. The code now just moves these two primaries around rather than the otherwise obsolete Remap object.

markusicu · 2024-05-01T20:13:17Z

unicodetools/src/main/java/org/unicode/text/UCA/UCA_Data.java


    public int variableLow = '\uFFFF';
    public int nonVariableLow = '\uFFFF'; // HACK '\u089A';
    public int variableHigh = '\u0000';
+    boolean hasExplicitVariableHigh = false;


FYI: This being true replaces a test for primaryRemap!=null. It's true for CLDR where the caller provides the variableHigh on the last punctuation character, as opposed to false for the DUCET, where the allkeys.txt parser figures it out from the data.
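
A minimal sketch of the behavior described in this comment, assuming the parser reports each variable primary as it reads the data. Only the field names `variableHigh` and `hasExplicitVariableHigh` come from the diff; the class and method names are hypothetical, and the real UCA_Data logic may differ.

```java
/** Illustrative sketch only; not the actual UCA_Data class. */
final class VariableHighSketch {
    private int variableHigh = 0;
    private boolean hasExplicitVariableHigh = false;

    /** CLDR case: the caller provides the variableHigh up front. */
    void setExplicitVariableHigh(int primary) {
        variableHigh = primary;
        hasExplicitVariableHigh = true;
    }

    /** Called by a (hypothetical) parser for every variable primary it encounters. */
    void addVariablePrimary(int primary) {
        // DUCET case: derive the highest variable primary from the data.
        // With an explicit value, the data must not overwrite it.
        if (!hasExplicitVariableHigh && primary > variableHigh) {
            variableHigh = primary;
        }
    }

    int getVariableHigh() {
        return variableHigh;
    }
}
```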

markusicu · 2024-05-01T20:15:10Z

unicodetools/src/main/java/org/unicode/text/UCA/WriteCollationData.java

-                        cldrCollator = buildCldrCollator(false);
-
-                        cldrCollator.overrideCE("\uFFFE", 0x1, 0x20, 2);
-                        cldrCollator.overrideCE("\uFFFF", 0xFFFE, 0x20, 2);


FYI: These overrides are both here and inside buildCldrCollator(boolean). The ones inside are used by passing in true. Seems cleaner inside. (See also issue #794)

markusicu · 2024-05-01T20:16:04Z

unicodetools/src/main/java/org/unicode/text/UCA/WriteCollationData.java

-
-        final int oldVariableHigh = CEList.getPrimary(ducet.getVariableHighCE());
+
+        final int ducetVariableHigh = CEList.getPrimary(ducet.getVariableHighCE());


FYI: Renamed from oldVariableHigh for clarity.

markusicu · 2024-05-01T20:16:52Z

unicodetools/src/main/java/org/unicode/text/UCA/WriteCollationData.java

-        final int oldVariableHigh = CEList.getPrimary(ducet.getVariableHighCE());
+
+        final int ducetVariableHigh = CEList.getPrimary(ducet.getVariableHighCE());
+        int cldrVariableHigh = 0;


FYI: Used to be in class Remap.

markusicu · 2024-05-01T20:17:44Z

unicodetools/src/main/java/org/unicode/text/UCA/WriteCollationData.java

-                case UCD_Types.FORMAT:
-                    if (ducetPrimary >= firstScriptPrimary) {
-                        break;
+                    if (ducetPrimary <= ducetVariableHigh && ducetPrimary > cldrVariableHigh) {


FYI: We no longer reorder, but we need to find the last punctuation character for CLDR's tailored (lower) variable high primary.
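
The search for CLDR's lower variableHigh can be sketched as a single pass that keeps the highest primary among space and punctuation characters that is still within the DUCET variable range, using the same comparison as the added line in the diff. The class and method names are hypothetical, the input map of code point to primary is an assumption about how the data is available, and the exact set of general categories used by WriteCollationData is not shown in this excerpt, so the category list below is a guess.

```java
import java.util.Map;

/** Illustrative sketch only; not the actual WriteCollationData code. */
final class CldrVariableHighSketch {
    /** Assumed category set: separators and all punctuation categories. */
    static boolean isSpaceOrPunctuation(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.SPACE_SEPARATOR:
            case Character.LINE_SEPARATOR:
            case Character.PARAGRAPH_SEPARATOR:
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    /** Returns the highest qualifying primary, or 0 if no character qualifies. */
    static int findCldrVariableHigh(Map<Integer, Integer> primaryByCodePoint, int ducetVariableHigh) {
        int cldrVariableHigh = 0;
        for (Map.Entry<Integer, Integer> e : primaryByCodePoint.entrySet()) {
            int ducetPrimary = e.getValue();
            // Same comparison as in the diff, restricted to spaces and punctuation.
            if (isSpaceOrPunctuation(e.getKey())
                    && ducetPrimary <= ducetVariableHigh
                    && ducetPrimary > cldrVariableHigh) {
                cldrVariableHigh = ducetPrimary;
            }
        }
        return cldrVariableHigh;
    }
}
```

Symbols that are variable in the DUCET, such as a math symbol with a primary below the DUCET variableHigh, do not raise the result, which is what makes the CLDR value lower.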

markusicu · 2024-05-01T20:18:15Z

unicodetools/src/main/java/org/unicode/text/UCA/WriteCollationData.java

-        primaryRemap
-                .addItems(spaces)
-                .addItems(punctuation)
-                .setVariableHigh()


FYI: This is where the old code found the CLDR variableHigh.

echeran

LGTM

macchiati

Looks great. Just one minor note.

macchiati · 2024-05-01T22:12:48Z

unicodetools/src/main/java/org/unicode/text/UCA/WriteCollationData.java

-                        cldrCollator = buildCldrCollator(false);
-
-                        cldrCollator.overrideCE("\uFFFE", 0x1, 0x20, 2);
-                        cldrCollator.overrideCE("\uFFFF", 0xFFFE, 0x20, 2);


Minor: better to use a meaningful enum rather than a boolean, so that people can tell immediately what
cldrCollator = buildCldrCollator(true); means rather than guess ("does false mean 'don't build'??"). Better to have cldrCollator = buildCldrCollator(UCA.Style.addFFFx);

Not a blocker though!
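
To make the suggestion concrete, here is a rough sketch of an enum parameter replacing the boolean. The names `BuildStyleSketch`, `Style`, `PLAIN`, `ADD_FFFX`, and `overridesFor` are hypothetical (the comment only proposes something like UCA.Style.addFFFx); the two override values are the ones from the removed overrideCE calls in the diff above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; not the actual unicodetools API. */
final class BuildStyleSketch {
    /** Replaces the boolean so that call sites say what they mean. */
    enum Style {
        PLAIN,    // no U+FFFE/U+FFFF overrides
        ADD_FFFX  // add the CLDR overrides for U+FFFE and U+FFFF
    }

    /** Maps code point to {primary, secondary, tertiary} for the requested style. */
    static Map<Integer, int[]> overridesFor(Style style) {
        Map<Integer, int[]> overrides = new LinkedHashMap<>();
        if (style == Style.ADD_FFFX) {
            overrides.put(0xFFFE, new int[] {0x1, 0x20, 2});
            overrides.put(0xFFFF, new int[] {0xFFFE, 0x20, 2});
        }
        return overrides;
    }
}
```

A call such as `overridesFor(Style.ADD_FFFX)` is self-describing in a way that `true` is not.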

markusicu · 2024-05-01T22:40:01Z

@macchiati re

> better to use a meaningful enum rather than a boolean

I agree, but the boolean was your idea :-)

I will merge this as is, and I already created an issue for whether we need this option at all -- hopefully not, I would like to remove it. --> issue #794

Ken-Whistler

Changes for sifter look correct. No comment on the complicated unicodetools UCA changes.

markusicu added the uca label Apr 6, 2024

eggrobin added the wait-for-UTC label Apr 6, 2024

markusicu force-pushed the uca16d17-numerics branch from 4cac19b to f13b7cf Compare April 6, 2024 02:41

eggrobin mentioned this pull request Apr 24, 2024

NamesList-16.0.0d17.txt #784

Merged

CLDR root collation: stop reordering primaries by gc

7df3b00

- for PAG issue 99: "re-align the DUCET & CLDR: 20A8 RUPEE SIGN & FDFC RIAL SIGN"
- for PAG issue 101: "re-align the DUCET & CLDR: order of groups below letters"

markusicu mentioned this pull request May 1, 2024

try to move getCollator(type) from WriteCollationData to UCA #793

Open

markusicu force-pushed the uca16d17-numerics branch from f13b7cf to 7df3b00 Compare May 1, 2024 20:02

markusicu removed the wait-for-UTC label May 1, 2024

markusicu requested review from echeran, macchiati, nedley, pedberg-icu, josh-hadley and Ken-Whistler May 1, 2024 20:05

markusicu commented May 1, 2024

markusicu marked this pull request as ready for review May 1, 2024 20:19

echeran approved these changes May 1, 2024

macchiati approved these changes May 1, 2024

markusicu merged commit 5755926 into unicode-org:main May 1, 2024
27 checks passed

markusicu deleted the uca16d17-numerics branch May 1, 2024 22:45

Ken-Whistler reviewed May 1, 2024

markusicu mentioned this pull request Jun 5, 2024

CLDR-17226 UCA 16 beta jun05 unicode-org/cldr#3783

Merged
