-
Notifications
You must be signed in to change notification settings - Fork 380
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLDR-15034 DTD @MATCH: add validity/bcp47 and semver match types #3109
Conversation
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
@@ -13,6 +13,9 @@ | |||
import com.ibm.icu.text.UnicodeSet.SpanCondition; | |||
import com.ibm.icu.util.ULocale; | |||
import com.ibm.icu.util.VersionInfo; | |||
import com.vdurmont.semver4j.Semver; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, any license issues for this code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, it's MIT licensed, and pulled in as a dependency.
Looks good pending any license issues with the "vdurmont" code which I do not remember being used before in CLDR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The description for the PR needs much more information about what is being done and why.
The https://semver.org/ at the bottom suggests regular expression to check the validity of these version numbers. Wouldn't be better to use the suggested regex rather than introducing a new dependency that relies on throwing exceptions? |
Sorry. I will work more on these PR descriptions. |
MIT license
|
Done |
|
@macchiati has this addressed the request? |
@macchiati is the PR clear now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can you explain what semver is doing?
- The test on bcp47 is not even well-formedness. We have code in CLDR to check for validity of bcp47 language tags, and should use that instead.
Semantic Versioning ( https://semver.org ) is a commonly used standard for version numbers. It is used for the version number of keyboards.
OK. I will look for that, thanks. |
One of the things I wanted to point out is that Semantic Version does not allow traditional 4-part version/build numbers, so things like 10.0.14393.1000 are invalid. |
right, although you could use 10.0.14393-BUILD1000 … it is a well known system though, which seems to make sense for files that will be interchanged. |
Interchange is not a convincing reason to me. You can interchange other syntaxes too. A better reason is we want an implementation to be able to pick from or sort a list of versions for the user in an order from newest to latest, and symver provides a defined order widely used in industry. Not that this wouldn't be true for the 4 part numbers, but I think anytime we are trying to restrict syntax on something, we should have a good a reason to create such complexity/annoyance and be able to demonstrate that the chosen solution is the least disruptive one to achieve the goal. This includes restricting NMTOKENs to ASCII range and other potential restrictions we currently have in the draft. In the case of versioning, the way I see it is that there are two reasonably established systems in different parts of the industry and whatever we pick will inconvenient one or the other. So provided we actually need a system that e.g. provides well-defined order, I don't mind which one is picked. |
Semver also defines not just ordering, but assertions, such as greater than or equal to, and so on.
I agree that these are both well-established, in fact, ICU and CLDR use the 4-part number as well (four bytes), but there are many variations there.
Most newer software components that I see are using semver, which has worked out quite well especially in package managers and such (i include maven which is almost-semver).
|
@macchiati I'm not finding such code unless it is LanguageTagParser? That seems like it accepts non-bcp47 ids as well ( |
assertAll( | ||
"is=true", | ||
not_semver.stream() | ||
.map(v -> () -> assertFalse(m.is(v), v + ": Should NOT be a semver"))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this works, but forEach should be used where you aren't using the results of the match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually the map is to an assertion (an Executable), and they are all rolled up into the assertAll
call. This is very handy, because if two out of five fail, it will say so rather than stopping at the first failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
org.opentest4j.MultipleFailuresError: is=true (2 failures)
org.opentest4j.AssertionFailedError: mt-Latn: Should be bcp47 ==> expected: <true> but was: <false>
org.opentest4j.AssertionFailedError: und-US-u-rg-ustx-tz-uschi: Should be bcp47 ==> expected: <true> but was: <false>
…
@@ -108,6 +114,37 @@ public static MatchValue of(String command) { | |||
} | |||
} | |||
|
|||
public static class BCP47LocaleMatchValue extends MatchValue { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should at least check for validity of the LSRV components using the calls as LocaleMatchValue; see
lang = new ValidityMatchValue(LstrType.language, statuses, false);
and following.
For the extensions, I checked, and it turns out that we would need to do a bit of work to pull some pieces together to do a full validity check. So that can be a follow-on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll file a ticket for the extension validity checking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@macchiati this checks for well-formed bcp47. If we check for valid bcp47, it would not be forwards-compatible with codes assigned that this version of CLDR doesn't know about.
Perhaps in the future we could add more parameters to switch validity on as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We update each version of CLDR, and the chances of someone wanting to submit a keyboard for a language between the time that ISO approves a language code and CLDR releases is very small. And a validity check will help prevent malformed locale IDs, a recurrent problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, but tooling from CLDR version, say, v44 could be used to validate a keyboard for a language code not assigned until 2025. So the DTD should not be defined in terms of "bcp47 codes valid at the time of CLDR release".
These keyboard files, unlike locale data, are used outside of the CLDR repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps what we should have is a separate check, that verifies that all keyboards in CLDR have valid codes. That would be separate from passing a DTD @MATCH
test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@macchiati I've added a subtask, https://unicode-org.atlassian.net/browse/CLDR-16950 to enforce valid locale IDs or keyboards which are (or are contemplating being!) included in CLDR's common data.
Can we merge this or perhaps rename to bcp47-wellformed or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated to @MATCH validity/bcp47-wellformed
- use LanguageTagParser to check for well-formed bcp47
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
- this only checks wellformedness, not validity - CLDR-16950 is for the Keyboard tests that in-repo keyboards use only valid ids
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than that, lgtm
public static class BCP47LocaleMatchValue extends MatchValue { | ||
static final UnicodeSet basechars = new UnicodeSet("[A-Za-z0-9_]"); | ||
|
||
public BCP47LocaleMatchValue() {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename this also, for clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@macchiati done
@macchiati (and all) PTAL. I've also verified that the matcher works within DTDs, evidence below. ( The actual keyboard DTD update will follow #3114)
given a temporary patch below: diff --git a/keyboards/dtd/ldmlKeyboard.dtd b/keyboards/dtd/ldmlKeyboard.dtd
index 847885d76f..47827a0c28 100644
--- a/keyboards/dtd/ldmlKeyboard.dtd
+++ b/keyboards/dtd/ldmlKeyboard.dtd
@@ -14,7 +14,7 @@ Please see CLDR-15034 for the latest information. -->
<!ELEMENT keyboard ( version, generation?, info?, names, settings?, import*, keyMap+, displayMap?, layer*, vkeys*, transforms*, reorders?, backspaces? ) >
<!ATTLIST keyboard locale CDATA #REQUIRED >
- <!--@MATCH:any/TODO-->
+ <!--@MATCH:validity/bcp47-wellformed-->
<!ELEMENT version EMPTY >
<!ATTLIST version platform CDATA #REQUIRED >
diff --git a/keyboards/osx/ar-t-k0-osx.xml b/keyboards/osx/ar-t-k0-osx.xml
index 5ace031716..7af2ede731 100644
--- a/keyboards/osx/ar-t-k0-osx.xml
+++ b/keyboards/osx/ar-t-k0-osx.xml
@@ -1,6 +1,6 @@
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE keyboard SYSTEM "../dtd/ldmlKeyboard.dtd">
-<keyboard locale="ar-t-k0-osx">
+<keyboard locale="aa-BB-CCC-DDDD-EEEEE-u-u">
<version platform="10.9" number="$Revision$"/>
<names>
<name value="Arabic"/>
@@ -363,4 +363,3 @@
<map iso="A03" to=" "/> <!-- space -->
</keyMap>
</keyboard> |
@macchiati thanks |
…er match types (unicode-org#3109)" This reverts commit fb4e2c9.
For the Keyboard spec, we needed to match two additional
@MATCH
types to the DTD.@MATCH:semver
matches a semantic version (semver.org) such as1.0.0
or1.2.3-BETA
- this is used for the keyboard version@MATCH:validity/bcp47
matches any valid bcp47 id, such asnod-Lana
orde-CH-t-k0-windows-extended-var
- this is used for the keyboard locale id(cherry picked from commit 632c286d8abd7c20a082e749cf67c3b0b264816a)
CLDR-15034
ALLOW_MANY_COMMITS=true