-
-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCA 16 move numerics after digits; CLDR stop reordering by gc #762
Conversation
Markus, you can pick up a small revision of unisift.c from kenfiles/uca160/ to fix the botched edit in the comment. |
For PAG issue 101 "re-align the DUCET & CLDR: order of groups below letters" From Ken: UCA 16.0 delta 17 This implements the move of the range of non-decimal numerics, so they get primary weights *after* 0..9 and are no longer marked as variables. The change to unidata.txt is diffable, although it involves a large change: 570 lines of input for these numeric entries were moved down from before the extenders to between the entries for DIGIT NINE (and others numerically equivalent to 9) and LATIN SMALL LETTER A. And then there are a few comment lines of explanation added. The change to allkeys.txt is simply describable, but not really diffable unless you ignore the primary weight assignments. The numerics are no longer variables, but now have primary weights in the range 2187..237F, so sort after DIGIT NINE (with primary weight of 2186), but ahead of LATIN SMALL LETTER A. Primary weights from LATIN SMALL LETTER A onward were unaffected, but the move of the numerics shifted all the primary weights for extenders, currency signs, and digits. The size of the generated file is identical to the previous one, which is a good sign. The number of primary weights is also identical, as expected. The first non-variable is still U+02D0 MODIFIER LETTER TRIANGULAR COLON, as expected, but its primary weight is 212A, instead of 2323. I assume your code automatically adjusts to identify the weight of the first non-variable. I also regenerated decomps.txt. It isn't impacted by the numerics rearrangement, but it does pick up the additional synthetic decomposition added for the Tulu-Tigalari looped virama. You will also need to pick up a small change to the sifter source code in order to be able to replicate this output: sifter/unisift.c The change is very small -- I simply had to comment out two lines in the branch in the main sift dealing with numerics which set the identified characters to variables. The rest just all falls out automatically given the change in the input file.
4cac19b
to
f13b7cf
Compare
Thanks -- I changed the first commit to replace the file there with your fixed version. |
- for PAG issue 99: "re-align the DUCET & CLDR: 20A8 RUPEE SIGN & FDFC RIAL SIGN" - for PAG issue 101: "re-align the DUCET & CLDR: order of groups below letters"
f13b7cf
to
7df3b00
Compare
@@ -83,7 +83,7 @@ public int compareTo(SecondaryInfo arg0) { | |||
} | |||
|
|||
static class SecondaryCounts { | |||
private final UCA uca = UCA.buildCollator(null); | |||
private final UCA uca = UCA.buildDucetCollator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: The null was for the Remap object which I removed. null meant DUCET, not CLDR.
The internal function now takes two primary weights, which could be -1 for the DUCET. Rather than make several call sites even less readable, I created a function that says "DUCET" and does not take parameters.
* Initializes the collation from a stream of rules in the allkeys.txt format. If the source is | ||
* null, uses the normal Unicode data files, which need to be in BASE_DIR. | ||
*/ | ||
public UCA(String sourceFile, String unicodeVersion) throws java.io.IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: Same for the UCA constructor. The implementation function now takes two primaries instead of the obsolete class Remap, but I added a convenience constructor for the DUCET, without additional parameters that would have to be "nulled".
@@ -1799,61 +1809,4 @@ public static UCA buildCollator(Remap primaryRemap) { | |||
UCA_Statistics getStatistics() { | |||
return ucaData.statistics; | |||
} | |||
|
|||
public static final class Remap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: We no longer perform a permutation! 🎉
private int variableHigh; | ||
private int firstDucetNonVariable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: We still need to carry these. The code now just moves these two primaries around rather than the otherwise obsolete Remap object.
|
||
public int variableLow = '\uFFFF'; | ||
public int nonVariableLow = '\uFFFF'; // HACK '\u089A'; | ||
public int variableHigh = '\u0000'; | ||
boolean hasExplicitVariableHigh = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: This being true replaces a test for primaryRemap!=null. It's true for CLDR where the caller provides the variableHigh on the last punctuation character, as opposed to false for the DUCET, where the allkeys.txt parser figures it out from the data.
cldrCollator = buildCldrCollator(false); | ||
|
||
cldrCollator.overrideCE("\uFFFE", 0x1, 0x20, 2); | ||
cldrCollator.overrideCE("\uFFFF", 0xFFFE, 0x20, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: These overrides are both here and inside buildCldrCollator(boolean). The ones inside are used by passing in true. Seems cleaner inside. (See also issue #794)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: better to use an a meaningful enum rather than a boolean, so that people can tell immediately what
cldrCollator = buildCldrCollator(true);
means rather than guess ("does false mean 'don't build'??"). Better to have cldrCollator = buildCldrCollator(UCA.Style.addFFFx);
Not a blocker though!
|
||
final int oldVariableHigh = CEList.getPrimary(ducet.getVariableHighCE()); | ||
|
||
final int ducetVariableHigh = CEList.getPrimary(ducet.getVariableHighCE()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: Renamed from oldVariableHigh for clarity.
final int oldVariableHigh = CEList.getPrimary(ducet.getVariableHighCE()); | ||
|
||
final int ducetVariableHigh = CEList.getPrimary(ducet.getVariableHighCE()); | ||
int cldrVariableHigh = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: Used to be in class Remap.
case UCD_Types.FORMAT: | ||
if (ducetPrimary >= firstScriptPrimary) { | ||
break; | ||
if (ducetPrimary <= ducetVariableHigh && ducetPrimary > cldrVariableHigh) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: We no longer reorder, but we need to find the last punctuation character for CLDR's tailored (lower) variable high primary.
primaryRemap | ||
.addItems(spaces) | ||
.addItems(punctuation) | ||
.setVariableHigh() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: This is where the old code found the CLDR variableHigh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Just one minor note.
cldrCollator = buildCldrCollator(false); | ||
|
||
cldrCollator.overrideCE("\uFFFE", 0x1, 0x20, 2); | ||
cldrCollator.overrideCE("\uFFFF", 0xFFFE, 0x20, 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: better to use an a meaningful enum rather than a boolean, so that people can tell immediately what
cldrCollator = buildCldrCollator(true);
means rather than guess ("does false mean 'don't build'??"). Better to have cldrCollator = buildCldrCollator(UCA.Style.addFFFx);
Not a blocker though!
@macchiati re
I agree, but the boolean was your idea :-) I will merge this as is, and I already created an issue for whether we need this option at all -- hopefully not, I would like to remove it. --> issue #794 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes for sifter look correct. No comment on the complicated unicodetools UCA changes.
UTC-179: https://www.unicode.org/L2/L2024/24061.htm
PAG report -->
Section 7.4 re-align the DUCET & CLDR: order of groups below letters
Remaining differences between the sort orders: