CLDR-15034 DTD @MATCH: add validity/bcp47 and semver match types #3109

srl295 · 2023-07-21T22:28:12Z

For the Keyboard spec, we needed to add two @MATCH types to the DTD.

- @MATCH:semver matches a semantic version (semver.org) such as 1.0.0 or 1.2.3-BETA; this is used for the keyboard version
- @MATCH:validity/bcp47 matches any valid bcp47 id, such as nod-Lana or de-CH-t-k0-windows-extended-var; this is used for the keyboard locale id
- plus tests

(cherry picked from commit 632c286d8abd7c20a082e749cf67c3b0b264816a)

CLDR-15034

This PR completes the ticket.

ALLOW_MANY_COMMITS=true

jira-pull-request-webhook · 2023-07-21T22:43:04Z

Notice: the branch changed across the force-push!

tools/cldr-code/src/main/java/org/unicode/cldr/util/MatchValue.java is different
tools/cldr-code/src/test/java/org/unicode/cldr/util/TestMatchValue.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

pedberg-icu · 2023-07-22T04:31:52Z

tools/cldr-code/src/main/java/org/unicode/cldr/util/MatchValue.java

@@ -13,6 +13,9 @@
 import com.ibm.icu.text.UnicodeSet.SpanCondition;
 import com.ibm.icu.util.ULocale;
 import com.ibm.icu.util.VersionInfo;
+import com.vdurmont.semver4j.Semver;


Hmm, any license issues for this code?

no, it's MIT licensed, and pulled in as a dependency.

pedberg-icu · 2023-07-22T04:35:21Z

Looks good, pending any license issues with the "vdurmont" code, which I do not remember being used before in CLDR.

macchiati

The description for the PR needs much more information about what is being done and why.

miloush · 2023-07-22T13:20:00Z

The https://semver.org/ page suggests, at the bottom, a regular expression to check the validity of these version numbers. Wouldn't it be better to use the suggested regex rather than introducing a new dependency that relies on throwing exceptions?

srl295 · 2023-07-22T13:44:39Z

The description for the PR needs much more information about what is being done and why.

Sorry. I will work more on these PR descriptions.

srl295 · 2023-07-22T14:51:11Z

MIT license

srl295 · 2023-07-25T17:32:06Z

The description for the PR needs much more information about what is being done and why.

Done

srl295 · 2023-07-25T17:35:45Z

The https://semver.org/ page suggests, at the bottom, a regular expression to check the validity of these version numbers. Wouldn't it be better to use the suggested regex rather than introducing a new dependency that relies on throwing exceptions?

The @MATCH check is only used during unit tests of the file.
We might want to parse the version, using the parser.
We could use the regex instead.
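
The regex alternative raised here can be sketched in plain Java. This is a minimal sketch, not what the PR merged (the PR imports `com.vdurmont.semver4j.Semver`). The pattern is the numbered-group form of the expression suggested on semver.org, transcribed and escaped for Java, so it is worth checking against the site before relying on it. The names `SemverRegexCheck` and `isSemver` are hypothetical.

```java
import java.util.regex.Pattern;

final class SemverRegexCheck {
    // Suggested semver.org pattern (numbered groups), with backslashes escaped for Java.
    static final Pattern SEMVER = Pattern.compile(
            "^(0|[1-9]\\d*)\\.(0|[1-9]\\d*)\\.(0|[1-9]\\d*)"
                    + "(?:-((?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*)"
                    + "(?:\\.(?:0|[1-9]\\d*|\\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
                    + "(?:\\+([0-9a-zA-Z-]+(?:\\.[0-9a-zA-Z-]+)*))?$");

    /** Returns true when the whole string is a syntactically valid semantic version. */
    static boolean isSemver(String s) {
        return s != null && SEMVER.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.0.0", "1.2.3-BETA", "10.0.14393.1000"}) {
            System.out.println(v + " -> " + isSemver(v));
        }
    }
}
```

A boolean check like this avoids exceptions for control flow, but it does not give access to the parsed components the way a parser would.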

srl295 · 2023-07-31T17:56:26Z

@macchiati has this addressed the request?

srl295 · 2023-08-01T19:08:25Z

@macchiati is the PR clear now?

macchiati

Can you explain what semver is doing?
The test on bcp47 is not even well-formedness. We have code in CLDR to check for validity of bcp47 language tags, and should use that instead.

srl295 · 2023-08-02T12:52:47Z

Can you explain what semver is doing?

Semantic Versioning ( https://semver.org ) is a commonly used standard for version numbers. It is used for the version number of keyboards.

The test on bcp47 is not even well-formedness. We have code in CLDR to check for validity of bcp47 language tags, and should use that instead.

OK. I will look for that, thanks.

miloush · 2023-08-02T12:58:52Z

One of the things I wanted to point out is that Semantic Version does not allow traditional 4-part version/build numbers, so things like 10.0.14393.1000 are invalid.

srl295 · 2023-08-02T13:03:06Z

One of the things I wanted to point out is that Semantic Version does not allow traditional 4-part version/build numbers, so things like 10.0.14393.1000 are invalid.

right, although you could use 10.0.14393-BUILD1000 … it is a well known system though, which seems to make sense for files that will be interchanged.

miloush · 2023-08-02T13:45:44Z

Interchange is not a convincing reason to me. You can interchange other syntaxes too.

A better reason is that we want an implementation to be able to pick from or sort a list of versions for the user in order from newest to oldest, and semver provides a defined order widely used in industry.

Not that this wouldn't be true for the 4-part numbers, but I think any time we are trying to restrict syntax on something, we should have a good reason to create such complexity/annoyance and be able to demonstrate that the chosen solution is the least disruptive one to achieve the goal. This includes restricting NMTOKENs to the ASCII range and other potential restrictions we currently have in the draft.

In the case of versioning, the way I see it is that there are two reasonably established systems in different parts of the industry, and whatever we pick will inconvenience one or the other. So provided we actually need a system that e.g. provides a well-defined order, I don't mind which one is picked.

srl295 · 2023-08-02T14:56:35Z

Semver defines not just ordering but also assertions, such as greater than or equal to, and so on. I agree that both are well established; in fact, ICU and CLDR use the 4-part number as well (four bytes), but there are many variations there. Most newer software components that I see are using semver, which has worked out quite well, especially in package managers and such (I include Maven, which is almost-semver).
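
To make the ordering point concrete, here is a minimal sketch of semver precedence (section 11 of the spec at semver.org), assuming the inputs are already well-formed. `SemverPrecedence`, `compare`, and `NEWEST_FIRST` are hypothetical names for illustration, not part of semver4j or CLDR.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SemverPrecedence {
    /** Sorts newest first, as a version picker might want. */
    static final Comparator<String> NEWEST_FIRST = (a, b) -> compare(b, a);

    /** Splits into {core, prerelease}; build metadata after '+' is ignored for precedence. */
    static String[] split(String v) {
        int plus = v.indexOf('+');
        if (plus >= 0) v = v.substring(0, plus);
        int dash = v.indexOf('-');
        return dash < 0
                ? new String[] {v, ""}
                : new String[] {v.substring(0, dash), v.substring(dash + 1)};
    }

    /** Negative, zero, or positive as a has lower, equal, or higher precedence than b. */
    static int compare(String a, String b) {
        String[] pa = split(a), pb = split(b);
        String[] ca = pa[0].split("\\."), cb = pb[0].split("\\.");
        for (int i = 0; i < 3; i++) { // major, minor, patch compared numerically
            int c = Long.compare(Long.parseLong(ca[i]), Long.parseLong(cb[i]));
            if (c != 0) return c;
        }
        // A release outranks any prerelease of the same core version.
        if (pa[1].isEmpty() || pb[1].isEmpty()) {
            return Boolean.compare(pa[1].isEmpty(), pb[1].isEmpty());
        }
        String[] ia = pa[1].split("\\."), ib = pb[1].split("\\.");
        for (int i = 0; i < Math.min(ia.length, ib.length); i++) {
            boolean na = ia[i].matches("\\d+"), nb = ib[i].matches("\\d+");
            int c;
            if (na && nb) c = Long.compare(Long.parseLong(ia[i]), Long.parseLong(ib[i]));
            else if (na != nb) c = na ? -1 : 1; // numeric identifiers rank below alphanumeric
            else c = ia[i].compareTo(ib[i]); // ASCII lexical order
            if (c != 0) return c;
        }
        return Integer.compare(ia.length, ib.length); // more identifiers rank higher
    }

    public static void main(String[] args) {
        List<String> versions =
                new ArrayList<>(List.of("1.0.0", "1.2.3-BETA", "1.2.3", "10.0.14393-BUILD1000"));
        versions.sort(NEWEST_FIRST);
        System.out.println(versions);
    }
}
```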

srl295 · 2023-08-02T15:43:56Z

The test on bcp47 is not even well-formedness. We have code in CLDR to check for validity of bcp47 language tags, and should use that instead.

@macchiati I'm not finding such code unless it is LanguageTagParser? That seems like it accepts non-bcp47 ids as well (en_US_POSIX)

macchiati · 2023-08-02T18:48:03Z

tools/cldr-code/src/test/java/org/unicode/cldr/util/TestMatchValue.java

+        assertAll(
+                "is=true",
+                not_semver.stream()
+                        .map(v -> () -> assertFalse(m.is(v), v + ": Should NOT be a semver")));


this works, but forEach should be used where you aren't using the results of the match.

actually the map is to an assertion (an Executable), and they are all rolled up into the assertAll call. This is very handy, because if two out of five fail, it will say so rather than stopping at the first failure.

org.opentest4j.MultipleFailuresError: is=true (2 failures)
org.opentest4j.AssertionFailedError: mt-Latn: Should be bcp47 ==> expected: <true> but was: <false>
org.opentest4j.AssertionFailedError: und-US-u-rg-ustx-tz-uschi: Should be bcp47 ==> expected: <true> but was: <false>
…
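
The roll-up behavior described here can be imitated without JUnit. The following is a plain-Java sketch of the same idea, not JUnit's implementation: each value is mapped to a deferred check, all checks run, and failures are reported together. `runAll` and `shouldBeBcp47` are hypothetical helpers, and the underscore test is only a toy stand-in for `m.is(v)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

final class RollUpChecks {
    /** Runs every check, then throws a single error listing all failures (if any). */
    static void runAll(String heading, List<Runnable> checks) {
        List<String> failures = new ArrayList<>();
        for (Runnable check : checks) {
            try {
                check.run();
            } catch (AssertionError e) {
                failures.add(e.getMessage()); // keep going instead of stopping here
            }
        }
        if (!failures.isEmpty()) {
            throw new AssertionError(heading + " (" + failures.size() + " failures) " + failures);
        }
    }

    /** Maps each value to a deferred check; nothing runs until runAll is called. */
    static List<Runnable> shouldBeBcp47(List<String> values) {
        return values.stream()
                .<Runnable>map(v -> () -> {
                    // Toy stand-in for the real matcher: reject underscore-separated ids.
                    if (v.contains("_")) throw new AssertionError(v + ": Should be bcp47");
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        try {
            runAll("is=true", shouldBeBcp47(List.of("mt-Latn", "en_US", "de_CH")));
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This is why `map` rather than `forEach` fits the test: the stream produces check objects that are consumed by the aggregating call.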

macchiati · 2023-08-02T18:51:56Z

tools/cldr-code/src/main/java/org/unicode/cldr/util/MatchValue.java

@@ -108,6 +114,37 @@ public static MatchValue of(String command) {
        }
    }

+    public static class BCP47LocaleMatchValue extends MatchValue {


You should at least check for validity of the LSRV components using the same calls as LocaleMatchValue; see

lang = new ValidityMatchValue(LstrType.language, statuses, false);

and following.

For the extensions, I checked, and it turns out that we would need to do a bit of work to pull some pieces together to do a full validity check. So that can be a follow-on.

I'll file a ticket for the extension validity checking

@macchiati this checks for well-formed bcp47. If we check for valid bcp47, it would not be forwards-compatible with codes assigned that this version of CLDR doesn't know about.

Perhaps in the future we could add more parameters to switch validity on as well?

We update each version of CLDR, and the chances of someone wanting to submit a keyboard for a language between the time that ISO approves a language code and CLDR releases is very small. And a validity check will help prevent malformed locale IDs, a recurrent problem.

That's true, but tooling from CLDR version, say, v44 could be used to validate a keyboard for a language code not assigned until 2025. So the DTD should not be defined in terms of "bcp47 codes valid at the time of CLDR release".

These keyboard files, unlike locale data, are used outside of the CLDR repository.

Perhaps what we should have is a separate check, that verifies that all keyboards in CLDR have valid codes. That would be separate from passing a DTD @MATCH test

@macchiati I've added a subtask, https://unicode-org.atlassian.net/browse/CLDR-16950, to enforce valid locale IDs for keyboards which are (or are contemplating being!) included in CLDR's common data.

Can we merge this or perhaps rename to bcp47-wellformed or something?

That works.

updated to @MATCH validity/bcp47-wellformed
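
The well-formed versus valid distinction debated in this thread can be sketched with the JDK alone. This is only an illustration under stated assumptions: the PR itself uses CLDR's LanguageTagParser, whereas the sketch uses `java.util.Locale.Builder` as a stand-in syntax check, and `KNOWN_LANGUAGES` is a hypothetical snapshot of codes known to one tool release, not CLDR's validity data.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;
import java.util.Set;

final class Bcp47Checks {
    /** Well-formedness only: BCP 47 syntax, with no lookup of whether subtags are assigned. */
    static boolean isWellFormed(String tag) {
        if (tag == null || tag.isEmpty()) return false;
        try {
            new Locale.Builder().setLanguageTag(tag);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }

    // Hypothetical snapshot: the language codes one release of a tool happens to know.
    static final Set<String> KNOWN_LANGUAGES = Set.of("ar", "de", "mt", "nod");

    /** Validity against the snapshot: well-formed AND the language subtag is known. */
    static boolean isValidAgainstSnapshot(String tag) {
        return isWellFormed(tag)
                && KNOWN_LANGUAGES.contains(Locale.forLanguageTag(tag).getLanguage());
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"nod-Lana", "zzx-Latn", "en_US_POSIX"}) {
            System.out.println(tag + " wellFormed=" + isWellFormed(tag)
                    + " validInSnapshot=" + isValidAgainstSnapshot(tag));
        }
    }
}
```

A tag such as `zzx-Latn` passes the syntax check but fails the snapshot check, which is the forward-compatibility concern: an older tool would reject a keyboard for a code assigned after its release.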

jira-pull-request-webhook · 2023-08-02T19:50:19Z

Notice: the branch changed across the force-push!

tools/cldr-code/pom.xml is different
tools/pom.xml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

macchiati

Other than that, lgtm

macchiati · 2023-08-03T04:19:09Z

tools/cldr-code/src/main/java/org/unicode/cldr/util/MatchValue.java

+    public static class BCP47LocaleMatchValue extends MatchValue {
+        static final UnicodeSet basechars = new UnicodeSet("[A-Za-z0-9_]");
+
+        public BCP47LocaleMatchValue() {}


Let's rename this also, for clarity

@macchiati done

srl295 · 2023-08-03T14:11:54Z

@macchiati (and all) PTAL. I've also verified that the matcher works within DTDs; evidence below. (The actual keyboard DTD update will follow in #3114.)

      Error: (TestPaths.java:608) osx/ar-t-k0-osx.xml/keyboard@locale, expected match to: ⟪validity/bcp47-wellformed⟫ actual: «aa-BB-CCC-DDDD-EEEEE-u-u»

given a temporary patch below:

diff --git a/keyboards/dtd/ldmlKeyboard.dtd b/keyboards/dtd/ldmlKeyboard.dtd
index 847885d76f..47827a0c28 100644
--- a/keyboards/dtd/ldmlKeyboard.dtd
+++ b/keyboards/dtd/ldmlKeyboard.dtd
@@ -14,7 +14,7 @@ Please see CLDR-15034 for the latest information. -->
 
 <!ELEMENT keyboard ( version, generation?, info?, names, settings?, import*, keyMap+, displayMap?, layer*, vkeys*, transforms*, reorders?, backspaces? ) >
 <!ATTLIST keyboard locale CDATA #REQUIRED >
-    <!--@MATCH:any/TODO-->
+    <!--@MATCH:validity/bcp47-wellformed-->
 
 <!ELEMENT version EMPTY >
 <!ATTLIST version platform CDATA #REQUIRED >
diff --git a/keyboards/osx/ar-t-k0-osx.xml b/keyboards/osx/ar-t-k0-osx.xml
index 5ace031716..7af2ede731 100644
--- a/keyboards/osx/ar-t-k0-osx.xml
+++ b/keyboards/osx/ar-t-k0-osx.xml
@@ -1,6 +1,6 @@
 <?xml version="1.0" encoding="UTF-8" ?>
 <!DOCTYPE keyboard SYSTEM "../dtd/ldmlKeyboard.dtd">
-<keyboard locale="ar-t-k0-osx">
+<keyboard locale="aa-BB-CCC-DDDD-EEEEE-u-u">
        <version platform="10.9" number="$Revision$"/>
        <names>
                <name value="Arabic"/>
@@ -363,4 +363,3 @@
                <map iso="A03" to=" "/> <!-- space -->
        </keyMap>
 </keyboard>

srl295 · 2023-08-03T14:12:22Z

@macchiati thanks

srl295 requested review from macchiati, miloush, pedberg-icu and DraganBesevic July 21, 2023 22:28

srl295 self-assigned this Jul 21, 2023

srl295 force-pushed the kbd/semver2 branch from 68c63be to 3b90c77 Compare July 21, 2023 22:43

pedberg-icu reviewed Jul 22, 2023

pedberg-icu previously approved these changes Jul 22, 2023

macchiati requested changes Jul 22, 2023

srl295 requested a review from macchiati July 25, 2023 17:31

srl295 mentioned this pull request Jul 25, 2023

CLDR-15034 kbd: Keyboard 3.0 DTD and Data Files #3114

Merged

1 task

srl295 added the keyboard label Jul 31, 2023

macchiati requested changes Aug 2, 2023

srl295 dismissed pedberg-icu’s stale review via 820a763 August 2, 2023 19:49

srl295 requested a review from macchiati August 2, 2023 19:49

srl295 added 2 commits August 2, 2023 14:49

CLDR-15034 DTD @match: add validity/bcp47 and semver match types

2725487

- @match:semver matches a semantic version (semver.org)
- @match:validity/bcp47 matches any valid bcp47 id
- plus tests

(cherry picked from commit 632c286d8abd7c20a082e749cf67c3b0b264816a)

CLDR-15034 DTD @match: update validity/bcp47

78597b2

- use LanguageTagParser to check for well-formed bcp47

srl295 force-pushed the kbd/semver2 branch from 820a763 to 78597b2 Compare August 2, 2023 19:50

CLDR-15034 DTD @match: rename matcher to validity/bcp47-wellformed

cd6f80d

- this only checks wellformedness, not validity
- CLDR-16950 is for the Keyboard tests that in-repo keyboards use only valid ids

macchiati requested changes Aug 3, 2023

CLDR-15034 DTD @match: rename matcher class as well

4328c15

srl295 requested review from macchiati and pedberg-icu August 3, 2023 14:01

macchiati approved these changes Aug 3, 2023

srl295 merged commit fb4e2c9 into unicode-org:main Aug 3, 2023
7 checks passed

srl295 deleted the kbd/semver2 branch August 3, 2023 16:15

srl295 added a commit to srl295/cldr that referenced this pull request Oct 24, 2023

CLDR-17188 Revert "CLDR-15034 DTD @match: add validity/bcp47 and semv…

8bc6a76

…er match types (unicode-org#3109)" This reverts commit fb4e2c9.
