[FEEDBACK] Message Format Unquoted Literals #724

macchiati · 2024-03-13T04:19:44Z

Summary

Consider relaxing constraints on literals, after v45

Background

Right now, unquoted literals are fairly narrowly constrained by message.abnf; here are the relevant lines:

unquoted = name / number-literal

; number-literal matches JSON number (https://www.rfc-editor.org/rfc/rfc8259#section-6)

number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "+"] 1*DIGIT]

; name matches https://www.w3.org/TR/REC-xml-names/#NT-NCName

name = name-start *name-char

name-start = ALPHA / "_"
           / %xC0-D6 / %xD8-F6 / %xF8-2FF
           / %x370-37D / %x37F-1FFF / %x200C-200D
           / %x2070-218F / %x2C00-2FEF / %x3001-D7FF
           / %xF900-FDCF / %xFDF0-FFFC / %x10000-EFFFF

name-char = name-start / DIGIT / "-" / "."
          / %xB7 / %x300-36F / %x203F-2040
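To make the current constraint concrete, here is a rough Python sketch that translates the two productions above into regular expressions. The helper name `is_unquoted` and the regex translation are illustrative assumptions, not part of message.abnf; the code point ranges are copied from the lines quoted above.

```python
import re

# name-start ranges copied from the ABNF above (ALPHA and "_" first).
NAME_START = (
    r"A-Za-z_"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    r"\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFC\U00010000-\U000EFFFF"
)
# name-char = name-start / DIGIT / "-" / "." / %xB7 / %x300-36F / %x203F-2040
NAME_CHAR = NAME_START + r"0-9\-.\u00B7\u0300-\u036F\u203F-\u2040"

NAME_RE = f"[{NAME_START}][{NAME_CHAR}]*"
# JSON-style number; [eE] stands in for the case-insensitive %i"e".
NUMBER_RE = r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?"

UNQUOTED = re.compile(f"(?:{NAME_RE}|{NUMBER_RE})")


def is_unquoted(text: str) -> bool:
    """Return True if text matches the current `unquoted` production."""
    return UNQUOTED.fullmatch(text) is not None
```

Under this sketch, strings such as `one` or `1.5e-3` are accepted, while an interval-style string such as `[0,1)` is rejected and would have to be quoted.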

Reason for reconsidering

However, for functions outside of the standard registry, this forces
many natural literals to use quotes. Here is an example from a function
that would handle MF1’s choice format:

[0,1) {{{$count} is zero or fraction}}

The natural literals to use would be intervals, which use the characters [, (, ), and ] for ranges. (The choice format would require some recasting because it depends on the ordering of variants; it currently uses >.) Under the current rules, that would require quoting:

|[0,1)| {{{$count} is zero or fraction}}

Many Unicode symbols are included by XML’s NT-NCName (about 6,000 currently), while many are excluded (about 2,600 currently). But these are literals, not identifiers, and identifiers are what name is intended for. Expanding beyond identifier usage would let functions avoid requiring quoting in many cases. It would also let us dispense with the special formulation for number-literal.

The literals for number, date, etc. could be specified elsewhere; they wouldn’t have to be in the ABNF.

That would allow various registries to have more sophisticated literals without requiring quoting, and without privileging the structured literals that we know about now.

Requirements

So, what restrictions on characters for a broadened definition of
unquoted literals would be required by a revised ABNF?

1. No ‘}’, because it would make .local $x = {literal} fail.
2. No ‘|’, because an initial one would conflict with quoting, and it is best to just forbid it anywhere in an unquoted literal to prevent confusion.
3. No ‘{’. Not strictly required, but for clarity wherever used.
4. None of the big blocks of ‘strange’ code points that XML forbids: controls, surrogates, private-use, noncharacters.
   1. These are all immutable (Unicode Character Encoding Stability).
   2. This also disallows the noncharacters that XML didn’t know about yet, before the noncharacter property was made immutable.
5. No whitespace, since a variant uses it to separate keys.
   1. This could be done by just disallowing the “s” production characters, but that could be very confusing. {a b} looks too much like two items (the space is a U+00A0 NO-BREAK SPACE). So it should be broadened to the Unicode Whitespace characters.
   2. Unicode Whitespace is not guaranteed immutable, but has not changed for over a decade. Anyway, we would derive the code points as of now, so everything would be stable into the future.
6. (Any others?)

Not coincidentally, 2-3 are the characters in the reserved-escape
production.
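The requirements above can be sketched as a per-character predicate. This is only an illustration under stated assumptions: the function names are hypothetical, and it leans on Python's `unicodedata` general categories (Cc, Cs, Co, and Zs/Zl/Zp, which together with the controls cover the Unicode White_Space set) as a stand-in for a frozen code point list. The issue proposes deriving the whitespace code points once and fixing them, whereas `unicodedata` follows whatever Unicode version the Python runtime ships.

```python
import unicodedata

# Characters excluded to avoid syntax conflicts.
SYNTAX_CHARS = frozenset("{|}")

# Controls, surrogates, private use, and the separator categories
# (assumption: Zs/Zl/Zp plus Cc cover Unicode White_Space).
EXCLUDED_CATEGORIES = frozenset({"Cc", "Cs", "Co", "Zs", "Zl", "Zp"})


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_literal_char(ch: str) -> bool:
    """Hypothetical check for the broadened literal-char set."""
    if ch in SYNTAX_CHARS:
        return False
    if unicodedata.category(ch) in EXCLUDED_CATEGORIES:
        return False
    return not is_noncharacter(ord(ch))


def is_proposed_unquoted(text: str) -> bool:
    """One or more literal-char, mirroring `literal-char+`."""
    return len(text) > 0 and all(is_literal_char(ch) for ch in text)
```

With this predicate, `[0,1)` would pass unquoted, while anything containing a space, a no-break space, or one of `{`, `|`, `}` would still need quoting.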

Detailed Proposal

This would result in the following change:

OLD

unquoted = name / number-literal

; number-literal matches JSON number (https://www.rfc-editor.org/rfc/rfc8259#section-6)

number-literal = ["-"] (%x30 / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["-" / "+"] 1*DIGIT]

// The characters include the following (though name-char and number-literal additions are positional):

// name-start is [\: A-Z _ a-z \x{C0}-\x{D6} \x{D8}-\x{F6} \x{F8}-\x{2FF} \x{370}-\x{37D} \x{37F}-\x{1FFF} \x{200C}-\x{200D} \x{2070}-\x{218F} \x{2C00}-\x{2FEF} \x{3001}-\x{D7FF} \x{F900}-\x{FDCF} \x{FDF0}-\x{FFFD} \x{10000}-\x{EFFFF}]

// name-char adds [\- . 0-9 \x{B7} \x{0300}-\x{036F} \x{203F}-\x{2040}]

// number-literal adds [+ e]

NEW

Unquoted = literal-char+

// Then down in ; Restrictions on characters in various contexts

literal-char = _all but following list; simpler to leave in this format
until after feedback._

Needed to avoid syntax conflicts

U+007B LEFT CURLY BRACKET
U+007C VERTICAL LINE
U+007D RIGHT CURLY BRACKET

Whitespace

U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+1680 OGHAM SPACE MARK
U+2000 - U+200A EN QUAD .. HAIR SPACE
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
U+202F NARROW NO-BREAK SPACE
U+205F MEDIUM MATHEMATICAL SPACE
U+3000 IDEOGRAPHIC SPACE

Controls

U+0000 - U+001F
U+007F - U+009F

Surrogates

U+D800 - U+DFFF

Private Use

U+E000 - U+F8FF
U+F0000 - U+FFFFD
U+100000 - U+10FFFD

Noncharacters

U+FDD0 - U+FDEF
U+FFFE U+FFFF U+1FFFE U+1FFFF U+2FFFE U+2FFFF U+3FFFE U+3FFFF
U+4FFFE U+4FFFF U+5FFFE U+5FFFF U+6FFFE U+6FFFF U+7FFFE U+7FFFF U+8FFFE
U+8FFFF U+9FFFE U+9FFFF U+AFFFE U+AFFFF U+BFFFE U+BFFFF U+CFFFE U+CFFFF
U+DFFFE U+DFFFF U+EFFFE U+EFFFF U+FFFFE U+FFFFF U+10FFFE U+10FFFF
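Since literal-char is described here as "all but the following list", one way to turn the exclusion list into explicit ABNF ranges is to compute its complement over U+0000..U+10FFFF. The Python below is a sketch only: the helper names are made up for illustration, and the excluded ranges are transcribed from the list above, with the noncharacters written as U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```python
MAX_CP = 0x10FFFF

# Exclusions transcribed from the proposal, as inclusive (low, high) pairs.
EXCLUDED = [
    (0x7B, 0x7D),                                    # { | }
    (0x20, 0x20), (0xA0, 0xA0), (0x1680, 0x1680),    # whitespace
    (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
    (0x00, 0x1F), (0x7F, 0x9F),                      # controls
    (0xD800, 0xDFFF),                                # surrogates
    (0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD),  # private use
    (0xFDD0, 0xFDEF),                                # noncharacters (BMP block)
]
# Noncharacters at the end of every plane: U+nFFFE and U+nFFFF.
EXCLUDED += [(p * 0x10000 + 0xFFFE, p * 0x10000 + 0xFFFF) for p in range(17)]


def allowed_ranges(excluded):
    """Return the sorted complement of the excluded ranges within 0..MAX_CP."""
    allowed = []
    next_free = 0
    for lo, hi in sorted(excluded):
        if lo > next_free:
            allowed.append((next_free, lo - 1))
        next_free = max(next_free, hi + 1)
    if next_free <= MAX_CP:
        allowed.append((next_free, MAX_CP))
    return allowed


def to_abnf(ranges):
    """Render ranges as an ABNF alternation for a literal-char rule."""
    parts = [f"%x{lo:X}" if lo == hi else f"%x{lo:X}-{hi:X}" for lo, hi in ranges]
    return "literal-char = " + "\n             / ".join(parts)


if __name__ == "__main__":
    print(to_abnf(allowed_ranges(EXCLUDED)))
```

The first alternatives this produces are `%x21-7A` and `%x7E`, which makes it easy to see at a glance that all ASCII punctuation other than `{`, `|`, and `}` would be admitted by the proposal as written.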


aphillips · 2024-09-11T00:03:04Z

We should consider this in a severely timeboxed way. Bear in mind design, which is not directly "on the nose" to this request.

Note that unquoted literals appear in other places than in keys. We previously reserved a bunch of the ASCII punctuation (which is the main consideration here) for future use via reserved-statement. Removing that from the syntax does not mean that we should pilfer the box for more of these characters. Things that spoof sigils in appearance are probably a Bad Idea.

For example, one of the characters not listed above is :, which is the function introducer and namespace separator. It can't be in an unquoted. # and / probably need to be avoided because of markup. And @ because of attributes.

On the other hand, square brackets and parens seem potentially useful, as do some of the other junk.
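As a rough illustration of the concern in this comment, a checker could flag candidate unquoted literals that contain one of the sigil characters named above. This is a hypothetical sketch: the table holds only the four characters mentioned in this comment, and the function name is invented for the example.

```python
# Sigil characters named in the comment above, with the stated reason.
SIGILS = {
    ":": "function introducer and namespace separator",
    "#": "markup",
    "/": "markup",
    "@": "attributes",
}


def sigil_conflicts(literal: str) -> list[tuple[int, str, str]]:
    """Return (index, character, reason) for each sigil found in literal."""
    return [(i, ch, SIGILS[ch]) for i, ch in enumerate(literal) if ch in SIGILS]
```

An interval such as `[0,1)` yields no conflicts under this check, whereas something like `a:b` would be reported at index 1.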

aphillips · 2024-11-09T00:50:29Z

This won't go in 46.1, so I'm going to change the labels. I am also adding resolve-candidate because I think we won't extend unquoted, but that's for the WG to decide.

macchiati · 2024-11-09T01:00:20Z

One comment; broadening can be done in future versions, since it would be backwards compatible.

aphillips · 2024-11-09T01:06:27Z

Broadening can be done, so long as it is done in a backwards-compatible way. It's a little tricky here, because of the uses of literals in the syntax. I haven't carefully reviewed the proposal recently enough to say one way or the other if there are sticky bits that I'd object to. I won't say that we would never do an extension (never is a long time), but I think it unlikely in the 2.0 timeframe (e.g. 46.1/47).

macchiati added the Preview-Feedback Feedback gathered during the technical preview label Mar 13, 2024

aphillips added the syntax Issues related with MF Syntax label Mar 18, 2024

aphillips added the LDML46 LDML46 Release (Tech Preview - October 2024) label Sep 10, 2024

aphillips added LDML46.1 MF2.0 Draft Candidate and removed LDML46 LDML46 Release (Tech Preview - October 2024) labels Sep 16, 2024

aphillips added resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. Future Deferred for future standardization and removed LDML46.1 MF2.0 Draft Candidate labels Nov 9, 2024
