CLDR-16982 Fix the likely test generation to be script-first #3176
This is a draft PR so that Frank can check out the change to the file in ICU
…e need to flesh out data to catch cases where the dominant language in a region has a script with a different default language.
docs/ldml/tr35.md
@@ -2235,19 +2231,21 @@ This allows for implementations that use those denormalized subtags to use the d
The reverse operation removes fields that would be added by the first operation.

1. First get max = AddLikelySubtags(inputLocale). If an error is signaled, return it.
Could we also clarify what "it" refers to in "return it" here? Does "it" mean the error or the inputLocale? For example, if the inputLocale is "qaa-CH-u-nu-latn", what should the algorithm return?
A. "qaa-CH-u-nu-latn" (the inputLocale)
B. "und" (the return value of AddLikelySubtags(inputLocale))
C. an error (the other possible return value of AddLikelySubtags(inputLocale))
ICU4X sample impl: unicode-org/icu4x#3874
- 1. If there is no match, either return
-    1. an error value, or
-    2. the match for "und" (in APIs where a valid language tag is required).
+ 1. If there is no match, signal an error and stop.
I think I found another bug in this algorithm. It does not return early when the input already has language, script, and region (LSR) and needs no lookup to maximize. Because of that, if the lookup finds no match, we still return an error. For example,
qaa-Cyrl-CH ; FAIL ; ;
This is what the algorithm requires (because the spec currently has no early return for LSR),
but the ICU and ICU4X implementations currently just return. Why should we return an error in this case? And what about qaa-Cyrl-CH-u-nu-thai? We should just return qaa-Cyrl-CH-u-nu-thai, right?
  3. Get the components of the max (_languagemax_, _scriptmax_, _regionmax_).
  4. Then for _trial_ in {_languagemax_, _languagemax_regionmax_, _languagemax_scriptmax_}
-    * If AddLikelySubtags(_trial_) = max, then return _trial_ + variants.
- 5. If you do not get a match, return max + variants.
+    * If AddLikelySubtags(_trial_) = max, then return _trial_ + variants + extensions.
Any call to AddLikelySubtags(trial) may signal an error indicating no match inside AddLikelySubtags(trial) itself (which is different from the "= max" comparison failing here), so we need to define how the algorithm handles that case before it handles the comparison against max. Leaving this unspecified may imply that the error should be propagated up and terminate the lookup loop. We should state explicitly that an error from calling AddLikelySubtags(trial) is not propagated but ignored: treat it as no match and continue with the next trial.
For example, consider Remove Likely Subtags("qaa-Cyrl-CH-u-nu-thai") here: what will happen?
I agree that we should just skip out if all three fields are filled. I'd
considered how to do this, but as you point out, it is really necessary.
Will add.
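
To make the two points above concrete, here is a minimal Python sketch, not the CLDR or ICU implementation. It assumes (a) AddLikelySubtags returns its input unchanged when language, script, and region are all present, and (b) an error from AddLikelySubtags(trial) inside the loop is treated as "no match" rather than propagated. The `LIKELY` table, the `NoMatch` exception, and the function names are hypothetical and exist only for illustration.

```python
# Toy table keyed by partial tag; values are illustrative, not CLDR data.
LIKELY = {
    "en": ("en", "Latn", "US"),
    "und-Cyrl": ("ru", "Cyrl", "RU"),
}


class NoMatch(Exception):
    """Hypothetical error signaled when AddLikelySubtags finds no entry."""


def add_likely_subtags(language, script, region):
    """Return a maximized (language, script, region); '' means absent."""
    # Assumed early return: nothing to add when all three fields are filled.
    if language != "und" and script and region:
        return (language, script, region)
    keys = []
    if script and region:
        keys.append(f"{language}-{script}-{region}")
    if script:
        keys.append(f"{language}-{script}")
    if region:
        keys.append(f"{language}-{region}")
    keys.append(language)
    for key in keys:
        if key in LIKELY:
            l, s, r = LIKELY[key]
            return (l if language == "und" else language, script or s, region or r)
    raise NoMatch(f"no likely-subtags entry for {keys[0]}")


def format_tag(language, script, region):
    return "-".join(part for part in (language, script, region) if part)


def remove_likely_subtags(language, script, region, suffix=""):
    """suffix carries the variants and extensions, e.g. '-u-nu-thai'."""
    # An error on the initial maximization is propagated to the caller.
    max_lsr = add_likely_subtags(language, script, region)
    lmax, smax, rmax = max_lsr
    for trial in ((lmax, "", ""), (lmax, "", rmax), (lmax, smax, "")):
        try:
            if add_likely_subtags(*trial) == max_lsr:
                return format_tag(*trial) + suffix
        except NoMatch:
            continue  # a failed trial counts as "no match"; try the next one
    return format_tag(*max_lsr) + suffix
```

Under these assumptions, `remove_likely_subtags("qaa", "Cyrl", "CH", "-u-nu-thai")` falls through all three trials and returns the input tag instead of failing, while `remove_likely_subtags("en", "Latn", "US")` minimizes to `en` with this toy table.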
Fixed the problem that Frank found. So if the tests pass I think we are good to go.
Changes in 43e7e98
Done, please review.
I'd like to go ahead and merge this, then make any necessary changes in a follow-on PR.
CLDR-16982
This arose as the result of ICU-20777 Merge likelysubtag implementation unicode-org/icu#2538.
The original impetus was to correct problems where und-Adlm-BF mapped to fr_Adlm_BF instead of ff_Adlm_BF. The problem is that if the data didn't have both the script and region, it would try the region first. It would get fr as the language — which is clearly wrong, because fr is never written with Adlm. Clearly the script is a much stronger signal.
So it looks like the best way to solve that is to check for the script first (try und-Adlm before und-BF).
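
A small Python sketch of the lookup-order difference, for illustration only. The two table entries and the `script_first` flag are hypothetical stand-ins, not the contents of likelySubtags.xml or the API of LikelySubtags.java; the point is only that the order in which the partial keys are tried decides which language is filled in.

```python
# Toy table with one script-keyed and one region-keyed entry (illustrative values).
LIKELY = {
    "und-Adlm": ("ff", "Adlm", "GN"),
    "und-BF": ("fr", "Latn", "BF"),
}


def add_likely_subtags(language, script, region, script_first=True):
    """Fill missing fields of (language, script, region); '' means absent."""
    keys = []
    if script and region:
        keys.append(f"{language}-{script}-{region}")
    partial = []
    if script:
        partial.append(f"{language}-{script}")
    if region:
        partial.append(f"{language}-{region}")
    if not script_first:
        partial.reverse()  # old behavior: try the region key before the script key
    keys += partial + [language]
    for key in keys:
        if key in LIKELY:
            l, s, r = LIKELY[key]
            # Fields already present in the input always win over table values.
            return (l if language == "und" else language, script or s, region or r)
    raise LookupError(f"no likely-subtags entry for {keys[0]}")
```

With this toy data, the region-first order resolves und-Adlm-BF through `und-BF` and yields `fr`, while the script-first order resolves it through `und-Adlm` and yields `ff`.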
common/supplemental/likelySubtags.xml
common/testData/localeIdentifiers/likelySubtags.txt
common/validity/script.xml
docs/ldml/tr35.md
tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelyTestData.java
tools/cldr-code/src/main/java/org/unicode/cldr/tool/LikelySubtags.java
tools/cldr-code/src/main/java/org/unicode/cldr/util/LanguageTagCanonicalizer.java
tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestPersonNameFormatter.java
tools/cldr-code/src/main/java/org/unicode/cldr/util/SupplementalDataInfo.java
tools/cldr-code/src/test/java/org/unicode/cldr/unittest/LikelySubtagsTest.java
Fix Zanb (mistakenly grouped with Z... special scripts).
This PR completes the ticket.
ALLOW_MANY_COMMITS=true