-
-
Notifications
You must be signed in to change notification settings - Fork 745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ICU-22898 MF2: fix various parser bugs and add more tests #3092
ICU-22898 MF2: fix various parser bugs and add more tests #3092
Conversation
Sorry, the sharing of the test files PR started when I was in vacation and somehow didn't get on my radar when I got back. But I just found that the change broke Eclipse, and that now the json test files are packaged in the release jar files of ICU. I general having Maven consume files from outside it's own folder tree is hacky and error prone. We currently have that problem with the LICENSE file (I can't import icu4j in Eclipse anymore) The other problem with sharing is that it forces the c/c++ and Java implementations to be 100% in sync at all times. But becomes a PITA when something changes and we need to update the code. This PR being an example. TLDR: I am tempted to keep two copies of the test data. These json files are probably in the same bucket: they live in CLDR, but must be tested against in ICU. But the idea is that the ant task can copy the files in two places, icu4c and icu4j. I don't know if that is a good idea or not. Here is a fragment of the ant script: <fileset id="cldrTestData" dir="${cldrDir}/common/testData">
<!-- Add directories here to control which test data is installed. -->
<include name="localeIdentifiers/**"/> <!-- ... -->
<include name="personNameTest/**"/> <!-- Used in ExhaustivePersonNameTest -->
<include name="units/**"/> <!-- Used in UnitsTest tests -->
</fileset>
<copy todir="${testDataDir4C}">
<fileset refid="cldrTestData"/>
</copy>
<copy todir="${testDataDir4J}">
<fileset refid="cldrTestData"/>
</copy> We copy the same test files in |
Try:
There are 3 sub-folders there: |
@@ -62,10 +62,22 @@ int readCodePoint() { | |||
return c; | |||
} | |||
|
|||
// Backup a number of characters. | |||
// Backup a number of code points. | |||
void backup(int amount) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method works on code units (Java char
), not code points.
That's because int readCodePoint()
also works on code units.
It returns a code point, but the current offset in the input (cursor
) is expresses in code units.
So when it finds a code point above BMP it advances with 2
Reasons:
- When one wants to parse something from an offset, it is faster to express the offset in the same kind of units we use for storage. We store using
char
, offset in code units. Otherwise we must iterate all the way to the offset in code points. Our input buffer useschar
- Easier to deal with not-properly paired surrogates. The spec does not try to enforce the surrogate correctness, and I don't think it is the job of the MF2 to do that. If I pass
".....\uDC00...{$user}!
there is no reason to not format to".....\uDC00...John!
Even if there is disagreement about the reasons, and we decide to express offsets in code points, this fix is not sufficient.
We would also have to change getPosition()
, skip(int amount)
, gotoPosition(int position)
Because we either work on code points, or on code units, we should not mix and match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation. I see that the issue isn't really with backup()
, but how it's used in conjunction with readCodePoint()
(reading a code point that happens to be a wide char, and then calling backup(1)
to "push it back", is a bug).
In eee547c I reverted the change to backup()
and fixed the three places I could find where readCodePoint()
and backup()
were being used in that way.
} | ||
} | ||
|
||
void backupOneCodePoint() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if it is an unpaired low surrogate?
There is nothing to prevent such a thing from showing up in a message.
True, it is incorrect.
But we are not trying to reject that, we accept it.
And this code would misbehave.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like we need a test for this! (I'll add one.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, thank you very much for taking a stab at updating the Java implementation to the latest spec.
I was planning to do that, but it looks like trying to share the test files precipitated things somewhat :-)
I added some comments.
But if there is a decisions to treat the mf2 test files the same way we treat other CLDR test files (units, people names, locale ids) and separate the C++ / Java tests, I can take over the Java part. And in a different PR.
Or you can continue it, as you already did a big chunk of it (all?)
Your call.
Thanks again,
M
if (nr != null) { | ||
return nr; | ||
} | ||
// "The resolution of a text or literal MUST resolve to a string." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this change breaks selection as it is currently implemented.
Because it means that "1.0" does not match "1.00"
Might be desired, to not match, I don't know.
But any change here should be accompanied by a change in the plural matcher.
One might say that this is desired.
But the format is locale sensitive.
For example formatting 1 dollar I might get "1.00 dollars", and does not match |1|.
But formatting 1 Japanese yen will result in "1 ...", and that matches |1|.
So the developer using |1|
in their message creates something that behaves differently on different locales.
I proposed to interpret |1|
as a string, and 1
as a number, and it is not resolved:
unicode-org/message-format-wg#712
And Eemeli argued for it recently:
unicode-org/message-format-wg#842
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the plural matcher has to be changed? This change is to address situations where a number literal isn't annotated, like:
{
"src": "Format {123456789.9876} number",
"locale": "en-IN",
"exp": "Format 123456789.9876 number",
"comment": "Number literals are not formatted as numbers by default"
}
This change doesn't change the behavior for .match
on something with a :number
annotation, or at least, there aren't any tests that show that.
icu4j/main/core/src/main/java/com/ibm/icu/message2/MFParser.java
Outdated
Show resolved
Hide resolved
@@ -712,7 +721,7 @@ private MFDataModel.Declaration getDeclaration() throws MFParseException { | |||
MFDataModel.Expression expression; | |||
switch (declName) { | |||
case "input": | |||
skipMandatoryWhitespaces(); | |||
skipOptionalWhitespaces(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rule is input-declaration = input [s] variable-expression
So the space is mandatory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so? If it was mandatory, it would say input-declaration = input s variable-expression
(without the square brackets.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not delete these extra tests.
They cover functionality not covered by the tests in the spec.
And that can't be covered by the spec.
They test that the implementation works end-to-end.
Some might be moved to the standard (CLDR).
But some can't, because they test that custom functions work, or that icu:skeletons
work (namespaced),
or that various types work (Date
, Calendar
, java.time
, or even long with :dateformat
.
And that the type "magically selects" the proper function (For example "...{$exp}..." formats as a date if the exp
is a date-like type (Date
/ Calendar
/ java.time
)
This comment also (or mostly?) apply to the other deleted files (FunctionsTests
, IcuFunctionsTest
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be less confusing to look at #3063 first. This PR includes that PR, plus additional changes.
In the description for #3063 I explained that I didn't delete any tests (unless they were duplicates), but rather, moved all the JSON filenames being read into a single file, because now the schema is the same for all the tests, so the same code can be re-used to read all of the files.
@@ -29,16 +29,5 @@ public void testNullInput() throws Exception { | |||
MFParser.parse(null); | |||
} | |||
|
|||
@Test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is not much / anything left if we remove this.
So the same comment that I had for deleted files applies: do we really want to delete extra tests?
I think that the more tests we have the better.
Even some don't come from the official CLDR spec.
We can separate these tests in a different folder for cleanliness.
But not delete them, unless they are redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #3092 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment about deleting tests.
Of course; is there a way to add automated testing that changes don't break IDEs?
I don't want to unilaterally say yes or no to this; is it something to discuss in the TC meeting, maybe? (Note: I'll be on vacation from early next week until September 2, so I won't be at the next few meetings, but I can go back and read minutes later.) |
Right, first it was precipitated by sharing test files; then I wrote a random test generator and decided it was only fair to run ICU4J on some of the generated tests as well as ICU4C, so that's where some of the other changes came from.
Certainly (like I said in another comment, I think that has to be discussed with the TC). I think I've addressed all your comments, but let me know if I missed anything! |
eee547c
to
2b5e263
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
I've rebased this PR, so it's now ready for review.
|
/cc @echeran |
2b5e263
to
bf9d4c5
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
bf9d4c5
to
8c51816
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you,
Mihai
ICU4C: Escape curly braces when serializing and normalizing ICU4C: Escape '|' in patterns ICU4C: When normalizing input, escape optionally-escaped characters in patterns ICU4C/ICU4J: Allow trailing whitespace after a match ICU4C: Fix parser to iterate over code points, not code units Add tests with old reserved syntax as syntax-error tests
8c51816
to
4b376f1
Compare
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
Checklist