CLDR-15034 DTD @MATCH: add validity/bcp47 and semver match types #3109

Merged
srl295 merged 4 commits into unicode-org:main from the kbd/semver2 branch
Aug 3, 2023

Conversation

srl295
Member

@srl295 srl295 commented Jul 21, 2023

For the Keyboard spec, we needed to add two @MATCH types for use in the DTD.

  • @MATCH:semver matches a semantic version (semver.org) such as 1.0.0 or 1.2.3-BETA - this is used for the keyboard version
  • @MATCH:validity/bcp47 matches any valid bcp47 id, such as nod-Lana or de-CH-t-k0-windows-extended-var - this is used for the keyboard locale id
  • plus tests

(cherry picked from commit 632c286d8abd7c20a082e749cf67c3b0b264816a)

CLDR-15034

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • tools/cldr-code/src/main/java/org/unicode/cldr/util/MatchValue.java is different
  • tools/cldr-code/src/test/java/org/unicode/cldr/util/TestMatchValue.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@@ -13,6 +13,9 @@
import com.ibm.icu.text.UnicodeSet.SpanCondition;
import com.ibm.icu.util.ULocale;
import com.ibm.icu.util.VersionInfo;
import com.vdurmont.semver4j.Semver;
Contributor

Hmm, any license issues for this code?

Member Author

no, it's MIT licensed, and pulled in as a dependency.

pedberg-icu
pedberg-icu previously approved these changes Jul 22, 2023
@pedberg-icu
Contributor

Looks good pending any license issues with the "vdurmont" code which I do not remember being used before in CLDR

Member

@macchiati macchiati left a comment

The description for the PR needs much more information about what is being done and why.

@miloush
Contributor

miloush commented Jul 22, 2023

The bottom of https://semver.org/ suggests a regular expression for checking the validity of these version numbers. Wouldn't it be better to use the suggested regex rather than introducing a new dependency that relies on throwing exceptions?

@srl295
Member Author

srl295 commented Jul 22, 2023

The description for the PR needs much more information about what is being done and why.

Sorry. I will work more on these PR descriptions.

@srl295
Member Author

srl295 commented Jul 22, 2023 via email

@srl295 srl295 requested a review from macchiati July 25, 2023 17:31
@srl295
Member Author

srl295 commented Jul 25, 2023

The description for the PR needs much more information about what is being done and why.

Done

@srl295
Member Author

srl295 commented Jul 25, 2023

The bottom of https://semver.org/ suggests a regular expression for checking the validity of these version numbers. Wouldn't it be better to use the suggested regex rather than introducing a new dependency that relies on throwing exceptions?

  • The @MATCH check is only used during unit tests of the file.
  • We might want to parse the version, using the parser
  • We could use the regex instead.
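
For reference, the regex alternative discussed above could look roughly like the sketch below. It is not the code merged in this PR (which uses the semver4j dependency); the class and method names are hypothetical, and the pattern is the numbered-group regular expression suggested in the semver.org FAQ, transcribed into a Java string.

```java
import java.util.regex.Pattern;

final class SemverRegexSketch {
    // Regular expression suggested in the semver.org FAQ (numbered capture groups).
    private static final Pattern SEMVER = Pattern.compile(
            "^(0|[1-9]\\d*)\\.(0|[1-9]\\d*)\\.(0|[1-9]\\d*)"
                    + "(?:-((?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)"
                    + "(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
                    + "(?:\\+([0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$");

    /** Hypothetical helper: true if the string is a syntactically valid semantic version. */
    public static boolean isSemver(String s) {
        return s != null && SEMVER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Version strings taken from this conversation.
        String[] samples = {"1.0.0", "1.2.3-BETA", "10.0.14393.1000", "10.0.14393-BUILD1000"};
        for (String v : samples) {
            System.out.println(v + " -> " + isSemver(v));
        }
    }
}
```

A pure regex check answers only "is this a semver?"; it does not give access to the parsed parts, which is one reason given above for keeping a parser.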

@srl295
Member Author

srl295 commented Jul 31, 2023

@macchiati has this addressed the request?

@srl295
Member Author

srl295 commented Aug 1, 2023

@macchiati is the PR clear now?

Member

@macchiati macchiati left a comment

  1. Can you explain what semver is doing?
  2. The bcp47 test does not even check well-formedness. We have code in CLDR to check for validity of bcp47 language tags, and should use that instead.

@srl295
Member Author

srl295 commented Aug 2, 2023

  • Can you explain what semver is doing?

Semantic Versioning ( https://semver.org ) is a commonly used standard for version numbers. It is used for the version number of keyboards.

  • The bcp47 test does not even check well-formedness. We have code in CLDR to check for validity of bcp47 language tags, and should use that instead.

OK. I will look for that, thanks.

@miloush
Contributor

miloush commented Aug 2, 2023

One of the things I wanted to point out is that Semantic Versioning does not allow traditional 4-part version/build numbers, so things like 10.0.14393.1000 are invalid.

@srl295
Member Author

srl295 commented Aug 2, 2023

One of the things I wanted to point out is that Semantic Versioning does not allow traditional 4-part version/build numbers, so things like 10.0.14393.1000 are invalid.

Right, although you could use 10.0.14393-BUILD1000 … it is a well-known system, though, which seems to make sense for files that will be interchanged.

@miloush
Contributor

miloush commented Aug 2, 2023

Interchange is not a convincing reason to me. You can interchange other syntaxes too.

A better reason is that we want an implementation to be able to pick from or sort a list of versions for the user in order from newest to oldest, and semver provides a defined order widely used in industry.

Not that this wouldn't be true for the 4-part numbers, but I think any time we are trying to restrict syntax on something, we should have a good reason to create such complexity/annoyance and be able to demonstrate that the chosen solution is the least disruptive one to achieve the goal. This includes restricting NMTOKENs to the ASCII range and other potential restrictions we currently have in the draft.

In the case of versioning, the way I see it is that there are two reasonably established systems in different parts of the industry, and whichever we pick will inconvenience one or the other. So provided we actually need a system that e.g. provides a well-defined order, I don't mind which one is picked.
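
To illustrate the "defined order" argument, here is a minimal sketch of semver precedence as described on semver.org: numeric comparison of major, minor, and patch; a release ranks above any of its pre-releases; pre-release identifiers are compared field by field; build metadata is ignored. The class and helper names are hypothetical, the input is assumed to be already well-formed, and this is not how the semver4j library used in the PR is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class SemverOrderSketch {
    /** Orders well-formed semantic versions from lowest to highest precedence. */
    public static final Comparator<String> PRECEDENCE = SemverOrderSketch::compare;

    public static int compare(String a, String b) {
        String[] pa = splitCoreAndPreRelease(a);
        String[] pb = splitCoreAndPreRelease(b);
        // Major, minor, and patch are compared numerically, left to right.
        String[] ca = pa[0].split("\\.");
        String[] cb = pb[0].split("\\.");
        for (int i = 0; i < 3; i++) {
            int c = Long.compare(Long.parseLong(ca[i]), Long.parseLong(cb[i]));
            if (c != 0) return c;
        }
        // A version without a pre-release tag ranks above one that has a tag.
        if (pa[1].isEmpty() || pb[1].isEmpty()) {
            return Boolean.compare(pa[1].isEmpty(), pb[1].isEmpty());
        }
        String[] ia = pa[1].split("\\.");
        String[] ib = pb[1].split("\\.");
        for (int i = 0; i < Math.min(ia.length, ib.length); i++) {
            boolean na = ia[i].matches("\\d+");
            boolean nb = ib[i].matches("\\d+");
            int c;
            if (na && nb) {
                c = Long.compare(Long.parseLong(ia[i]), Long.parseLong(ib[i]));
            } else if (na != nb) {
                c = na ? -1 : 1; // numeric identifiers rank below alphanumeric ones
            } else {
                c = ia[i].compareTo(ib[i]); // ASCII order for alphanumeric identifiers
            }
            if (c != 0) return c;
        }
        // All shared fields are equal: the longer pre-release list ranks higher.
        return Integer.compare(ia.length, ib.length);
    }

    /** Returns {core, preRelease}; build metadata after '+' is dropped. */
    private static String[] splitCoreAndPreRelease(String v) {
        int plus = v.indexOf('+');
        if (plus >= 0) v = v.substring(0, plus);
        int dash = v.indexOf('-');
        return dash < 0
                ? new String[] {v, ""}
                : new String[] {v.substring(0, dash), v.substring(dash + 1)};
    }

    public static void main(String[] args) {
        List<String> versions = new ArrayList<>(
                Arrays.asList("1.0.0", "1.0.0-beta.11", "1.0.0-alpha", "1.0.0-rc.1", "1.0.0-beta.2"));
        versions.sort(PRECEDENCE.reversed()); // newest first, as a user-facing list would show
        System.out.println(versions);
    }
}
```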

@srl295
Member Author

srl295 commented Aug 2, 2023 via email

@srl295
Member Author

srl295 commented Aug 2, 2023

  • The bcp47 test does not even check well-formedness. We have code in CLDR to check for validity of bcp47 language tags, and should use that instead.

@macchiati I'm not finding such code, unless it is LanguageTagParser? That seems to accept non-bcp47 ids as well (en_US_POSIX).

assertAll(
        "is=true",
        not_semver.stream()
                .map(v -> () -> assertFalse(m.is(v), v + ": Should NOT be a semver")));
Member

This works, but forEach should be used where you aren't using the results of the map.

Member Author

actually the map is to an assertion (an Executable), and they are all rolled up into the assertAll call. This is very handy, because if two out of five fail, it will say so rather than stopping at the first failure.

Member Author

org.opentest4j.MultipleFailuresError: is=true (2 failures)
        org.opentest4j.AssertionFailedError: mt-Latn: Should be bcp47 ==> expected: <true> but was: <false>
        org.opentest4j.AssertionFailedError: und-US-u-rg-ustx-tz-uschi: Should be bcp47 ==> expected: <true> but was: <false>
     …
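
The "run every check, then report all failures together" behaviour described above can be sketched without JUnit as follows. This is only a plain-Java stand-in for the idea: runAll and expectTrue are hypothetical helpers, not JUnit's assertAll or assertTrue, and the three-part version check is a made-up example condition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class CollectAllFailuresSketch {
    /** Hypothetical stand-in for assertAll: runs every check and returns all failure messages. */
    public static List<String> runAll(List<Runnable> checks) {
        List<String> failures = new ArrayList<>();
        for (Runnable check : checks) {
            try {
                check.run();
            } catch (AssertionError e) {
                failures.add(e.getMessage()); // keep going instead of stopping at the first failure
            }
        }
        return failures;
    }

    /** Hypothetical stand-in for assertTrue. */
    public static void expectTrue(boolean condition, String message) {
        if (!condition) throw new AssertionError(message);
    }

    public static void main(String[] args) {
        List<String> inputs = List.of("1.0.0", "1.0", "2.3.4", "x");
        // Map each input to a deferred check, much as the test above maps each value to an Executable.
        List<Runnable> checks = inputs.stream()
                .map(v -> (Runnable) () ->
                        expectTrue(v.matches("\\d+\\.\\d+\\.\\d+"), v + ": Should be three-part"))
                .collect(Collectors.toList());
        System.out.println(runAll(checks));
    }
}
```

With forEach, the first failing assertion would throw immediately and the remaining inputs would never be checked; deferring the checks is what allows all failures to be reported in one run.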

@@ -108,6 +114,37 @@ public static MatchValue of(String command) {
}
}

public static class BCP47LocaleMatchValue extends MatchValue {
Member

You should at least check for validity of the LSRV components using the same calls as LocaleMatchValue; see

        lang = new ValidityMatchValue(LstrType.language, statuses, false);

and following.

For the extensions, I checked, and it turns out that we would need to do a bit of work to pull some pieces together to do a full validity check. So that can be a follow-on.

Member

I'll file a ticket for the extension validity checking

Member Author

@macchiati this checks for well-formed bcp47. If we check for valid bcp47, it would not be forwards-compatible with codes assigned that this version of CLDR doesn't know about.

Perhaps in the future we could add more parameters to switch validity on as well?

Member

We update each version of CLDR, and the chances of someone wanting to submit a keyboard for a language between the time that ISO approves a language code and CLDR releases is very small. And a validity check will help prevent malformed locale IDs, a recurrent problem.

Member Author

@srl295 srl295 Aug 2, 2023

That's true, but tooling from CLDR version, say, v44 could be used to validate a keyboard for a language code not assigned until 2025. So the DTD should not be defined in terms of "bcp47 codes valid at the time of CLDR release".

These keyboard files, unlike locale data, are used outside of the CLDR repository.
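
To make the well-formed versus valid distinction concrete, here is a minimal sketch of a syntax-only check. It uses the JDK's java.util.Locale.Builder as a stand-in, on the assumption that its setLanguageTag call rejects ill-formed tags without consulting any registry of assigned codes; it is not the CLDR LanguageTagParser-based check in this PR, and the class and method names are hypothetical.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

final class Bcp47WellFormedSketch {
    /**
     * Hypothetical helper: true if the tag parses as a well-formed BCP 47 language tag.
     * Only syntax is checked, so a well-formed tag using a code that is not (yet)
     * assigned still passes.
     */
    public static boolean isWellFormed(String tag) {
        if (tag == null || tag.isEmpty()) {
            return false; // the builder treats null/empty as "reset", so reject these up front
        }
        try {
            new Locale.Builder().setLanguageTag(tag);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Tags taken from this conversation.
        String[] tags = {
            "nod-Lana", "de-CH-t-k0-windows-extended-var", "en_US_POSIX", "aa-BB-CCC-DDDD-EEEEE-u-u"
        };
        for (String tag : tags) {
            System.out.println(tag + " -> " + isWellFormed(tag));
        }
    }
}
```

A validity check would additionally look up each subtag against the codes known to a particular CLDR release, which is the forward-compatibility concern raised above.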

Member Author

Perhaps what we should have is a separate check, that verifies that all keyboards in CLDR have valid codes. That would be separate from passing a DTD @MATCH test

Member Author

@macchiati I've added a subtask, https://unicode-org.atlassian.net/browse/CLDR-16950, to enforce valid locale IDs for keyboards which are (or are contemplating being!) included in CLDR's common data.

Can we merge this or perhaps rename to bcp47-wellformed or something?

Member

That works.

Member Author

updated to @MATCH validity/bcp47-wellformed

@srl295 srl295 requested a review from macchiati August 2, 2023 19:49
- @match:semver matches a semantic version (semver.org)
- @match:validity/bcp47 matches any valid bcp47 id

- plus tests

(cherry picked from commit 632c286d8abd7c20a082e749cf67c3b0b264816a)
- use LanguageTagParser to check for well-formed bcp47
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • tools/cldr-code/pom.xml is different
  • tools/pom.xml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

- this only checks wellformedness, not validity
- CLDR-16950 is for the Keyboard tests that in-repo keyboards use only valid ids
Member

@macchiati macchiati left a comment

Other than that, lgtm

public static class BCP47LocaleMatchValue extends MatchValue {
    static final UnicodeSet basechars = new UnicodeSet("[A-Za-z0-9_]");

    public BCP47LocaleMatchValue() {}
Member

Let's rename this also, for clarity

Member Author

@macchiati done

@srl295
Member Author

srl295 commented Aug 3, 2023

@macchiati (and all) PTAL. I've also verified that the matcher works within DTDs, evidence below. ( The actual keyboard DTD update will follow #3114)

      Error: (TestPaths.java:608) osx/ar-t-k0-osx.xml/keyboard@locale, expected match to: ⟪validity/bcp47-wellformed⟫ actual: «aa-BB-CCC-DDDD-EEEEE-u-u»

given a temporary patch below:

diff --git a/keyboards/dtd/ldmlKeyboard.dtd b/keyboards/dtd/ldmlKeyboard.dtd
index 847885d76f..47827a0c28 100644
--- a/keyboards/dtd/ldmlKeyboard.dtd
+++ b/keyboards/dtd/ldmlKeyboard.dtd
@@ -14,7 +14,7 @@ Please see CLDR-15034 for the latest information. -->
 
 <!ELEMENT keyboard ( version, generation?, info?, names, settings?, import*, keyMap+, displayMap?, layer*, vkeys*, transforms*, reorders?, backspaces? ) >
 <!ATTLIST keyboard locale CDATA #REQUIRED >
-    <!--@MATCH:any/TODO-->
+    <!--@MATCH:validity/bcp47-wellformed-->
 
 <!ELEMENT version EMPTY >
 <!ATTLIST version platform CDATA #REQUIRED >
diff --git a/keyboards/osx/ar-t-k0-osx.xml b/keyboards/osx/ar-t-k0-osx.xml
index 5ace031716..7af2ede731 100644
--- a/keyboards/osx/ar-t-k0-osx.xml
+++ b/keyboards/osx/ar-t-k0-osx.xml
@@ -1,6 +1,6 @@
 <?xml version="1.0" encoding="UTF-8" ?>
 <!DOCTYPE keyboard SYSTEM "../dtd/ldmlKeyboard.dtd">
-<keyboard locale="ar-t-k0-osx">
+<keyboard locale="aa-BB-CCC-DDDD-EEEEE-u-u">
        <version platform="10.9" number="$Revision$"/>
        <names>
                <name value="Arabic"/>
@@ -363,4 +363,3 @@
                <map iso="A03" to=" "/> <!-- space -->
        </keyMap>
 </keyboard>

@srl295
Member Author

srl295 commented Aug 3, 2023

@macchiati thanks

@srl295 srl295 merged commit fb4e2c9 into unicode-org:main Aug 3, 2023
7 checks passed
@srl295 srl295 deleted the kbd/semver2 branch August 3, 2023 16:15
srl295 added a commit to srl295/cldr that referenced this pull request Oct 24, 2023
4 participants