CLDR-16982 Fix the likely test generation to be script-first #3176

macchiati · 2023-08-11T18:52:09Z

This arose as the result of ICU-20777 Merge likelysubtag implementation unicode-org/icu#2538.

The original impetus was to correct problems where und-Adlm-BF mapped to fr_Adlm_BF instead of ff_Adlm_BF. The problem is that if the data didn't have both the script and region, it would try the region first. It would get fr as the language — which is clearly wrong, because fr is never written with Adlm. Clearly the script is a much stronger signal.

So it looks like the best way to solve that is to check for script first (try und-Adlm before und_BF).

common/supplemental/likelySubtags.xml

Added extra information for und_..., which were needed because of the script change. Note: I realized would be relatively simple to check for superfluous data.
fixed some cases that were mapping to unknown language. That is not going to be helpful in practice.

common/testData/localeIdentifiers/likelySubtags.txt

Many fixes to the test file resulted

common/validity/script.xml

Zanb was wrongly marked as 'special' (caught by a test)

docs/ldml/tr35.md

Modified the spec to:
- return if all fields set
- test for script before region
- clean up error returns
- clear up typos and clarify
- provide heading for Remove Likely Subtags, favoring script, plus better example.

tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelyTestData.java

add various checks
add example (qaa) of language without data in LikelySubtags.xml

tools/cldr-code/src/main/java/org/unicode/cldr/tool/LikelySubtags.java

adjusted to match spec changes, eg to check for script first
added code under flag to check und handling (normally off)

tools/cldr-code/src/main/java/org/unicode/cldr/util/LanguageTagCanonicalizer.java
tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestPersonNameFormatter.java

Small change to catch when maximize fails

tools/cldr-code/src/main/java/org/unicode/cldr/util/SupplementalDataInfo.java

added an origin field for the attribute (missed that last release)
added check for when there is no languageToPopulationMap for a given language

tools/cldr-code/src/test/java/org/unicode/cldr/unittest/LikelySubtagsTest.java

added testUndAllScriptsAndRegions to check for the expected relations between the und_ prefixed likely data.
added testToAttributeValidityStatus to check for unexpected values. One of these (Zanb) needed a data fix (above)
Other results I turned into warnings for now, because it would take more time for a cleanup. Note: There are other tickets for removing the ZZ and 001 lines.
- Bad status=macroregion for [001, 143, 419, EZ]
- Bad status=deprecated for [adp, ajt, blg, daf, drh, dud, ggn, izi, jar, ktr, kwq, kxe, kxl, kzh, kzj, kzt, ppa, rna, smd, snb, swc, tdu, tsf, uok]
- Bad status=unknown for [ZZ]

tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestPersonNameFormatter.java

Fix Zanb (mistakenly grouped with Z... special scripts).
This PR completes the ticket.

ALLOW_MANY_COMMITS=true

macchiati · 2023-08-11T18:53:10Z

This is a draft PR so that Frank can check out the change to the file in ICU

…e need to flesh out data to catch cases where the dominant language in a region has a script with a different default language.

FrankYFTang · 2023-08-15T21:24:09Z

docs/ldml/tr35.md

@@ -2235,19 +2231,21 @@ This allows for implementations that use those denormalized subtags to use the d
 The reverse operation removes fields that would be added by the first operation.

 1. First get max = AddLikelySubtags(inputLocale). If an error is signaled, return it.


Could we also clarify what is the "it" in "return it" here? Is the "it" referring to "error" or referring to the inputLocale? For example, if the inputLocale is "qaa-CH-u-nu-latn", what should the algorithm return ?
A. "qaa-CH-u-nu-latn" (the inputLocale)
B. "und" (the return value of AddLikelySubtags(inputLocale))
C. an error (the other possible return value of AddLikelySubtags(inputLocale))

sffc · 2023-08-15T23:44:27Z

ICU4X sample impl: unicode-org/icu4x#3874

FrankYFTang · 2023-08-16T17:51:21Z

docs/ldml/tr35.md

-   1. If there is no match, either return
-      1.  an error value, or
-      2.  the match for "und" (in APIs where a valid language tag is required).
+   1. If there is no match, signal an error and stop.


I think I find another bug in this algorithm. This algorithm does not return earlier if we already have LSR and do not need to perform lookup to maximize. Because of that , if there are no match following the lookup, we will still return error. For example,

qaa-Cyrl-CH ; FAIL ; ;

This is what the algorithm requested (because the spec currently does not have an early return for LSR)
but in ICU and ICU4X implementation, we currently just reutrn. But why should we returan error for this case? How about if we have qaa-Cyrl-CH-u-nu-thai? we should just return qaa-Cyrl-CH-u-nu-thai right?

FrankYFTang · 2023-08-16T17:55:15Z

docs/ldml/tr35.md

 3. Get the components of the max (_languagemax_, _scriptmax_, _regionmax_).
 4. Then for _trial_ in {_languagemax_, _languagemax_regionmax_, _languagemax_scriptmax_}
-   * If AddLikelySubtags(_trial_) = max, then return _trial_ + variants.
-5. If you do not get a match, return max + variants.
+   * If AddLikelySubtags(_trial_) = max, then return _trial_ + variants + extensions.


Since any call to AddLikelySubtags(trial) may get an error to indicate no match inside the AddLikelySubtags(trial) (which is different than the "no match of " "== max" here), we need to define what the algorithm should do to handle that case first before handle the comparison against max (the == max part). Without specifying that may imply the error should be propagated up and terminate the lookup loop. We should explicitly mention the error from the calling of AddLikelySubtags(trial) should not be propogate up but ignored and consider no matching only and continue the next look up.

For example, consider the case of
Remove Likely Subtags ("qaa-Cyrl-CH-u-nu-thai") here, what will happen?

macchiati · 2023-08-16T20:21:39Z

I agree that we should just skip out if all three fields are filled. I'd considered how to do this, but as you point out, it is really necessary. Will add.

…

On Wed, Aug 16, 2023 at 10:55 AM Frank Yung-Fong Tang < ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In docs/ldml/tr35.md <#3176 (comment)>: > 3. Get the components of the max (_languagemax_, _scriptmax_, _regionmax_). 4. Then for _trial_ in {_languagemax_, _languagemax_regionmax_, _languagemax_scriptmax_} - * If AddLikelySubtags(_trial_) = max, then return _trial_ + variants. -5. If you do not get a match, return max + variants. + * If AddLikelySubtags(_trial_) = max, then return _trial_ + variants + extensions. Since any call to AddLikelySubtags(*trial*) may get an error to indicate no match, we need to define what the algorith should do in that case. Without specifying that may imply the error should be propagated up and terminate the lookup loop. We should explicitly mention the error from the calling of AddLikelySubtags(*trial*) should not be propogate up but ignored and consider no matching only and continue the next look up. — Reply to this email directly, view it on GitHub <#3176 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACJLEMC7TM3GU33C752ALM3XVUCQ5ANCNFSM6AAAAAA3NJEQSY> . You are receiving this because you authored the thread.Message ID: ***@***.***>

macchiati · 2023-08-16T21:26:00Z

Fixed the problem that Frank found. So if the tests pass I think we are good to go.

macchiati · 2023-08-16T21:27:17Z

Changes in 43e7e98

macchiati · 2023-08-16T21:52:44Z

Done, please review.

…

On Wed, Aug 16, 2023 at 1:21 PM Mark Davis Ⓤ ***@***.***> wrote: I agree that we should just skip out if all three fields are filled. I'd considered how to do this, but as you point out, it is really necessary. Will add. On Wed, Aug 16, 2023 at 10:55 AM Frank Yung-Fong Tang < ***@***.***> wrote: > ***@***.**** commented on this pull request. > ------------------------------ > > In docs/ldml/tr35.md > <#3176 (comment)>: > > > 3. Get the components of the max (_languagemax_, _scriptmax_, _regionmax_). > 4. Then for _trial_ in {_languagemax_, _languagemax_regionmax_, _languagemax_scriptmax_} > - * If AddLikelySubtags(_trial_) = max, then return _trial_ + variants. > -5. If you do not get a match, return max + variants. > + * If AddLikelySubtags(_trial_) = max, then return _trial_ + variants + extensions. > > Since any call to AddLikelySubtags(*trial*) may get an error to indicate > no match, we need to define what the algorith should do in that case. > Without specifying that may imply the error should be propagated up and > terminate the lookup loop. We should explicitly mention the error from the > calling of AddLikelySubtags(*trial*) should not be propogate up but > ignored and consider no matching only and continue the next look up. > > — > Reply to this email directly, view it on GitHub > <#3176 (review)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ACJLEMC7TM3GU33C752ALM3XVUCQ5ANCNFSM6AAAAAA3NJEQSY> > . > You are receiving this because you authored the thread.Message ID: > ***@***.***> >

macchiati · 2023-08-16T22:43:34Z

I'd like to go ahead and merge this, then make any necessary changes in a follow-on PR.

CLDR-16982 Fix the likely test generation to be script-first

25a3723

macchiati requested review from FrankYFTang, sffc, pedberg-icu and markusicu August 11, 2023 18:52

macchiati mentioned this pull request Aug 11, 2023

ICU-20777 Merge likelysubtag implementation unicode-org/icu#2538

Merged

7 tasks

macchiati added 8 commits August 11, 2023 17:30

CLDR-16982 Test catches a case that needs additional data. That is, w…

8346748

…e need to flesh out data to catch cases where the dominant language in a region has a script with a different default language.

CLDR-16982 Fix typo in data

2f7f9d6

CLDR-16982 Fix typo.

f64a556

CLDR-16982 Update tests & generation

54ba208

CLDR-16982 test fixes

848fe96

CLDR-16982 Test validity

e96ebb2

CLDR-16982 minor updates

f75650f

CLDR-16982 Proposed spec fix

a9a7d97

macchiati marked this pull request as ready for review August 14, 2023 23:51

macchiati added 2 commits August 15, 2023 08:22

CLDR-16982 Add test for unsupported language, fix spec

9fea717

CLDR-16982 minor spec change

2ba08db

FrankYFTang reviewed Aug 15, 2023

View reviewed changes

sffc mentioned this pull request Aug 15, 2023

Update likely subtags algorithm to have correct error case behavior unicode-org/icu4x#3874

Merged

CLDR-16982 clarify the error handling

12bf11e

FrankYFTang reviewed Aug 16, 2023

View reviewed changes

CLDR-16982 Handled the case where all the fields are present

43e7e98

macchiati requested a review from FrankYFTang August 16, 2023 21:26

CLDR-16982 Fixed typos

5becd27

CLDR-16982 Fix formatting noticed when reviewing formatted text.

53cd515

FrankYFTang approved these changes Aug 17, 2023

View reviewed changes

macchiati merged commit 7127ebe into unicode-org:main Aug 17, 2023
7 checks passed

macchiati deleted the CLDR-16982-Fix-the-likely-test-generation-to-be-script-first branch August 17, 2023 13:59

molllyn1 mentioned this pull request Dec 20, 2023

[Snyk] Security upgrade jsdom from 16.5.2 to 16.6.0 molllyn1/cldr#95

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLDR-16982 Fix the likely test generation to be script-first #3176

CLDR-16982 Fix the likely test generation to be script-first #3176

macchiati commented Aug 11, 2023 •

edited

Loading

macchiati commented Aug 11, 2023

FrankYFTang Aug 15, 2023 •

edited

Loading

sffc commented Aug 15, 2023

FrankYFTang Aug 16, 2023 •

edited

Loading

FrankYFTang Aug 16, 2023 •

edited

Loading

FrankYFTang Aug 16, 2023

macchiati commented Aug 16, 2023 via email

macchiati commented Aug 16, 2023

macchiati commented Aug 16, 2023

macchiati commented Aug 16, 2023 via email

macchiati commented Aug 16, 2023

		@@ -2235,19 +2231,21 @@ This allows for implementations that use those denormalized subtags to use the d
		The reverse operation removes fields that would be added by the first operation.

		1. First get max = AddLikelySubtags(inputLocale). If an error is signaled, return it.

CLDR-16982 Fix the likely test generation to be script-first #3176

CLDR-16982 Fix the likely test generation to be script-first #3176

Conversation

macchiati commented Aug 11, 2023 • edited Loading

macchiati commented Aug 11, 2023

FrankYFTang Aug 15, 2023 • edited Loading

Choose a reason for hiding this comment

sffc commented Aug 15, 2023

FrankYFTang Aug 16, 2023 • edited Loading

Choose a reason for hiding this comment

FrankYFTang Aug 16, 2023 • edited Loading

Choose a reason for hiding this comment

FrankYFTang Aug 16, 2023

Choose a reason for hiding this comment

macchiati commented Aug 16, 2023 via email

macchiati commented Aug 16, 2023

macchiati commented Aug 16, 2023

macchiati commented Aug 16, 2023 via email

macchiati commented Aug 16, 2023

macchiati commented Aug 11, 2023 •

edited

Loading

FrankYFTang Aug 15, 2023 •

edited

Loading

FrankYFTang Aug 16, 2023 •

edited

Loading

FrankYFTang Aug 16, 2023 •

edited

Loading