Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLDR-16982 Fix the likely test generation to be script-first #3176

Conversation

macchiati
Copy link
Member

@macchiati macchiati commented Aug 11, 2023

CLDR-16982

This arose as the result of ICU-20777 Merge likelysubtag implementation unicode-org/icu#2538.

The original impetus was to correct problems where und-Adlm-BF mapped to fr_Adlm_BF instead of ff_Adlm_BF. The problem is that if the data didn't have both the script and region, it would try the region first. It would get fr as the language — which is clearly wrong, because fr is never written with Adlm. Clearly the script is a much stronger signal.

So it looks like the best way to solve that is to check for script first (try und-Adlm before und_BF).

common/supplemental/likelySubtags.xml

  • Added extra information for und_..., which were needed because of the script change. Note: I realized would be relatively simple to check for superfluous data.
  • fixed some cases that were mapping to unknown language. That is not going to be helpful in practice.

common/testData/localeIdentifiers/likelySubtags.txt

  • Many fixes to the test file resulted

common/validity/script.xml

  • Zanb was wrongly marked as 'special' (caught by a test)

docs/ldml/tr35.md

  • Modified the spec to:
    • return if all fields set
    • test for script before region
    • clean up error returns
    • clear up typos and clarify
    • provide heading for Remove Likely Subtags, favoring script, plus better example.

tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelyTestData.java

  • add various checks
  • add example (qaa) of language without data in LikelySubtags.xml

tools/cldr-code/src/main/java/org/unicode/cldr/tool/LikelySubtags.java

  • adjusted to match spec changes, eg to check for script first
  • added code under flag to check und handling (normally off)

tools/cldr-code/src/main/java/org/unicode/cldr/util/LanguageTagCanonicalizer.java
tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestPersonNameFormatter.java

  • Small change to catch when maximize fails

tools/cldr-code/src/main/java/org/unicode/cldr/util/SupplementalDataInfo.java

  • added an origin field for the attribute (missed that last release)
  • added check for when there is no languageToPopulationMap for a given language

tools/cldr-code/src/test/java/org/unicode/cldr/unittest/LikelySubtagsTest.java

  • added testUndAllScriptsAndRegions to check for the expected relations between the und_ prefixed likely data.
  • added testToAttributeValidityStatus to check for unexpected values. One of these (Zanb) needed a data fix (above)
  • Other results I turned into warnings for now, because it would take more time for a cleanup. Note: There are other tickets for removing the ZZ and 001 lines.
    • Bad status=macroregion for [001, 143, 419, EZ]
    • Bad status=deprecated for [adp, ajt, blg, daf, drh, dud, ggn, izi, jar, ktr, kwq, kxe, kxl, kzh, kzj, kzt, ppa, rna, smd, snb, swc, tdu, tsf, uok]
    • Bad status=unknown for [ZZ]

tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestPersonNameFormatter.java

  • Fix Zanb (mistakenly grouped with Z... special scripts).

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

@macchiati
Copy link
Member Author

This is a draft PR so that Frank can check out the change to the file in ICU

@macchiati macchiati marked this pull request as ready for review August 14, 2023 23:51
@@ -2235,19 +2231,21 @@ This allows for implementations that use those denormalized subtags to use the d
The reverse operation removes fields that would be added by the first operation.

1. First get max = AddLikelySubtags(inputLocale). If an error is signaled, return it.
Copy link
Contributor

@FrankYFTang FrankYFTang Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also clarify what is the "it" in "return it" here? Is the "it" referring to "error" or referring to the inputLocale? For example, if the inputLocale is "qaa-CH-u-nu-latn", what should the algorithm return ?
A. "qaa-CH-u-nu-latn" (the inputLocale)
B. "und" (the return value of AddLikelySubtags(inputLocale))
C. an error (the other possible return value of AddLikelySubtags(inputLocale))

@sffc
Copy link
Member

sffc commented Aug 15, 2023

ICU4X sample impl: unicode-org/icu4x#3874

1. If there is no match, either return
1. an error value, or
2. the match for "und" (in APIs where a valid language tag is required).
1. If there is no match, signal an error and stop.
Copy link
Contributor

@FrankYFTang FrankYFTang Aug 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I find another bug in this algorithm. This algorithm does not return earlier if we already have LSR and do not need to perform lookup to maximize. Because of that , if there are no match following the lookup, we will still return error. For example,

qaa-Cyrl-CH ; FAIL ; ;

This is what the algorithm requested (because the spec currently does not have an early return for LSR)
but in ICU and ICU4X implementation, we currently just reutrn. But why should we returan error for this case? How about if we have qaa-Cyrl-CH-u-nu-thai? we should just return qaa-Cyrl-CH-u-nu-thai right?

3. Get the components of the max (_languagemax_, _scriptmax_, _regionmax_).
4. Then for _trial_ in {_languagemax_, _languagemax_regionmax_, _languagemax_scriptmax_}
* If AddLikelySubtags(_trial_) = max, then return _trial_ + variants.
5. If you do not get a match, return max + variants.
* If AddLikelySubtags(_trial_) = max, then return _trial_ + variants + extensions.
Copy link
Contributor

@FrankYFTang FrankYFTang Aug 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since any call to AddLikelySubtags(trial) may get an error to indicate no match inside the AddLikelySubtags(trial) (which is different than the "no match of " "== max" here), we need to define what the algorithm should do to handle that case first before handle the comparison against max (the == max part). Without specifying that may imply the error should be propagated up and terminate the lookup loop. We should explicitly mention the error from the calling of AddLikelySubtags(trial) should not be propogate up but ignored and consider no matching only and continue the next look up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, consider the case of
Remove Likely Subtags ("qaa-Cyrl-CH-u-nu-thai") here, what will happen?

@macchiati
Copy link
Member Author

macchiati commented Aug 16, 2023 via email

@macchiati
Copy link
Member Author

Fixed the problem that Frank found. So if the tests pass I think we are good to go.

@macchiati
Copy link
Member Author

Changes in 43e7e98

@macchiati
Copy link
Member Author

macchiati commented Aug 16, 2023 via email

@macchiati
Copy link
Member Author

I'd like to go ahead and merge this, then make any necessary changes in a follow-on PR.

@macchiati macchiati merged commit 7127ebe into unicode-org:main Aug 17, 2023
7 checks passed
@macchiati macchiati deleted the CLDR-16982-Fix-the-likely-test-generation-to-be-script-first branch August 17, 2023 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants