Well-formed vs valid #935

Open
macchiati opened this issue Nov 14, 2024 · 10 comments
Labels
formatting LDML46.1 MF2.0 Draft Candidate normative resolve-candidate This issue appears to have been answered or resolved, and may be closed soon.

Comments

@macchiati
Member

macchiati commented Nov 14, 2024

Added text 2024-11-24


I think we need to be careful about our usage of the terms 'well-formed' and 'valid'. The following is not fully fleshed out; it is more of a discussion of the issue and some ideas for the future.

We often reference other sources for identifiers, and want them to be interpreted according to that source. Sources that change over time should (and typically do) distinguish between well-formed and valid. For example, 'ge:manic' is not a well-formed locale identifier, and 'de-Flub' is not a valid locale identifier. However, 'de-Flub' could (conceivably) become valid in the future, if a script is given the code 'Flub'. Good sources also never remove identifiers, or make material changes in the meaning, but may deprecate them: those are still treated as valid.
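To make the distinction concrete, here is a minimal Python sketch. It is only an illustration under loud assumptions: the regular expression covers just the `language[-Script]` shape (a tiny subset of real locale identifier syntax), and `KNOWN_LANGUAGES` and `KNOWN_SCRIPTS` are hypothetical stand-ins for a snapshot of the real registries.

```python
import re

# Simplified assumption: only "language" or "language-Script" is considered.
# Real locale identifier syntax is much richer than this.
_SHAPE = re.compile(r"([A-Za-z]{2,3})(?:-([A-Za-z]{4}))?")

# Hypothetical registry snapshots, for illustration only.
KNOWN_LANGUAGES = {"de", "en", "fr"}
KNOWN_SCRIPTS = {"Latn", "Cyrl", "Grek"}


def classify(tag: str) -> str:
    """Return 'ill-formed', 'well-formed but invalid', or 'valid'."""
    m = _SHAPE.fullmatch(tag)
    if m is None:
        # The shape itself is wrong; no future registry change can fix it.
        return "ill-formed"
    language, script = m.group(1).lower(), m.group(2)
    if language not in KNOWN_LANGUAGES:
        return "well-formed but invalid"
    if script is not None and script.title() not in KNOWN_SCRIPTS:
        # Could become valid later if the script code is ever registered.
        return "well-formed but invalid"
    return "valid"
```

With this toy data, `classify("ge:manic")` is ill-formed, while `classify("de-Flub")` is well-formed but invalid and would flip to valid only if `Flub` were added to the script set.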

When we reference such sources in message format, such as with option values, we have a few goals.

  • Ideally, implementations would accept only well-formed and valid identifiers, and interpret them only according to the source semantics. For example, interpret 'de' as German and not as Dezfuli.
  • However, we don't want to force implementations to break if they don't support all the identifiers, nor if they don't support the latest version, or if they support an identifier that has become deprecated.

This is also true for our own enums. We have in registry.md:

Implementations MAY accept additional option values for options defined here. However, such values might become defined with a different meaning in the future, including with a different, incompatible name or using an incompatible value space. Supporting implementation-specific option values for standard or optional functions is NOT RECOMMENDED.

We also have BNF:

option = identifier o "=" o (literal / variable)

The implication is that a conformant implementation can interpret any of:

{$x :currency compactDisplay=short}
{$x :currency compactDisplay=medium}
{$x :currency compactDisplay=μικρός}
{$x :currency compactDisplay=|🐭|}
{$x :currency compactDisplay=$myDisplay}

It can also interpret:

{$x :currency currency=CAD}
{$x :currency currency=MyCurrency}
{$x :currency currency=δολάριοΚαναδά}
{$x :currency currency=|¥|}
{$x :currency currency=|🐭|}
{$x :currency currency=$myCurrency}

It could also interpret compactDisplay=short by formatting a long form, and compactDisplay=long by formatting a short form. Or a value of CAD as being GBP, etc.

This level of freedom seems counterproductive for interoperability.


So I propose that we have a general rule something like the following, for cases where option values are defined by reference to an external source:

  • An implementation MUST ignore any option with a literal option value that is ill-formed according to its external source, and signal that error. This allows linters and message builders to catch ill-formed values early.
    • [It must ignore the option locale=|ge:manic|]
  • An implementation MUST ignore any option with an option value that isn't valid according to any version of the external source.
    • [At the time of this writing, must ignore locale=|dab|]
  • An implementation SHOULD (but need not) ignore an option with an option value that is valid according to some version of the external source.
    • [An implementation might not support Dezfuli, and thus ignore locale=|def|; it may also ignore all deprecated language identifiers, and thus ignore locale=|daf|.]
  • If an implementation doesn't ignore an option, then it MUST interpret its option value in accordance with some version of the source.
    • [It must not interpret 'de' as Dezfuli, or 'def' as German.]

Ignore means that the expression is interpreted as if the option were not there. (I won't talk here about what signals to the caller are associated with that.)
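The rules above could be sketched roughly as follows. This is not spec text or any real implementation: `Status`, `resolve_options`, and the `check` callback are hypothetical names, and `toy_locale_check` hard-codes the examples from the bullets instead of consulting a real source.

```python
from enum import Enum


class Status(Enum):
    ILL_FORMED = 1   # wrong shape according to the external source
    INVALID = 2      # well-formed, but not valid in any version of the source
    UNSUPPORTED = 3  # valid in some version, but not handled here
    SUPPORTED = 4


def resolve_options(options, check, errors):
    """Return the options that survive; ignored options are simply dropped.

    `options` maps option names to literal values, and `check(name, value)`
    is a hypothetical callback that returns a Status.
    """
    resolved = {}
    for name, value in options.items():
        status = check(name, value)
        if status is Status.ILL_FORMED:
            # MUST ignore, and signal the error.
            errors.append(f"ill-formed value for {name}: {value!r}")
        elif status is Status.INVALID:
            continue  # MUST ignore
        elif status is Status.UNSUPPORTED:
            continue  # permitted to ignore
        else:
            resolved[name] = value  # must be interpreted per the source
    return resolved


def toy_locale_check(name, value):
    # Hypothetical data taken from the examples above, not a real lookup.
    if ":" in value:
        return Status.ILL_FORMED
    if value == "dab":
        return Status.INVALID
    if value == "def":
        return Status.UNSUPPORTED
    return Status.SUPPORTED
```

Because an ignored option is absent from the returned mapping, the expression is then processed as if the option had never been written.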


I think we could apply that to our standard enum option values, such as the following in https://github.com/unicode-org/message-format-wg/blob/main/spec/registry.md#options-1, so that |@!$| could be recognized as ill-formed.

  • useGrouping
    • auto (default)
    • always
    • never
    • min2

That is, perhaps we can have a rule in the registry for our functions, something like: the default well-formedness criteria for standard function option values match the constraints on function option identifiers in README.md. Thus |$abc| would be ill-formed for useGrouping. Any function option that had different criteria for well-formedness of its values would simply have an explicit well-formedness statement.
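A minimal sketch of what such a default check might look like for useGrouping. The pattern is an assumption standing in for the identifier constraints (the real ones live in the spec and are not reproduced here), and `check_use_grouping` is a hypothetical name.

```python
import re

# Assumption: a simplified stand-in for the identifier constraints.
_DEFAULT_VALUE_SHAPE = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")

# The standard values listed above for useGrouping.
USE_GROUPING_VALUES = {"auto", "always", "never", "min2"}


def check_use_grouping(value: str) -> str:
    """Return 'ill-formed', 'well-formed but invalid', or 'valid'."""
    if _DEFAULT_VALUE_SHAPE.fullmatch(value) is None:
        return "ill-formed"
    if value not in USE_GROUPING_VALUES:
        # Well-formed, but not one of the standard values.
        return "well-formed but invalid"
    return "valid"
```

Under this assumed pattern both `@!$` and `$abc` are ill-formed, `min2` is valid, and something like `sometimes` is well-formed but not a standard value.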


@aphillips
Member

A few notes:

  • be careful not to confuse "implementation" and "function handler". Many option values are handled by the function handler and not necessarily by the MF2 implementation itself. The MF2 implementation frequently does not validate the option values, even though the spec defines what a well-formed option contains.
  • we don't "MUST ignore" options whose values are ill-formed for some of the reference sources because we allow for implementation defined values. For example, MSFT might allow LCIDs as a value for u:locale=$lcid while ORCL might permit legacy ids like AMERICAN_AMERICA 😱 . Obviously, for non-implementation-defined extensions, the result SHOULD be ignored.

Note that we have text about option resolution in the spec which does indeed drop bad options on the floor. But for options whose interpretation is inside the function handler, the dropping-on-the-floor part is up to the function itself. This is why there is a resolved options section in each function: it defines which options are visible downstream (functions don't currently eat any of their options)

Are there specific changes you want in the spec? I'd advise a careful look at u-namespace.md and registry.md as well as option resolution in formatting.md.

@macchiati
Member Author

I was struck by the fact that we are requiring valid for some identifiers (e.g. time zones), but only well-formed for currencies. Those feel like very similar cases, so if well-formed is right for currency, that term should also be right for time zones (or the inverse).

we don't "MUST ignore" options whose values are ill-formed for some of the reference sources because we allow for implementation defined values

But a straightforward reading of registry.md means that we don't allow that in many cases: whenever we say well-formed (as for currencies) or valid, as in:

timeZone (default is system default time zone or UTC)

But that means I can't use implementation-defined identifiers like "$California Time"

@aphillips
Member

No, you're correct about this. We should be well-formed for acceptance but permit checking for validity. And we should fix values to permit implementation-specific gorp (mainly for platform-specific values that aren't the sanctioned identifiers)

@aphillips
Member

I think what we should do is: merge #911 and #922 and then do a cleanup edit on registry.md in a new PR

@macchiati
Member Author

makes perfect sense

@aphillips aphillips added the Agenda+ Requested for upcoming teleconference label Nov 16, 2024
@duerst

duerst commented Nov 18, 2024

Mark (@macchiati) wrote:

  • An implementation MUST ignore any option with an option value that is ill-formed according to its source.
    • [It must ignore the option locale=|ge:manic|]
  • An implementation MUST ignore any option with an option value that isn't valid according to any version of the source.
    • [At the time of this writing, must ignore locale=|dab|]
  • An implementation SHOULD (but need not) ignore an option with an option value that is valid according to some version of the source.
    • [An implementation might not support Dezfuli, and thus ignore locale=|def|; it may also ignore all deprecated language identifiers, and thus ignore locale=|daf|.]

I think the SHOULD in this paragraph should be a MAY, for obvious reasons.

@aphillips aphillips added resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. and removed Agenda+ Requested for upcoming teleconference labels Nov 20, 2024
@aphillips
Member

This was discussed in the 2024-11-18 call. We resolved to use valid in most cases, but with careful phrasing in the boilerplate. I believe this is now addressed?

@macchiati
Member Author

I elaborated a bit. I would like to discuss further, after 46.1

@aphillips
Member

I see your elaboration. One callout:

"implementation" has to be used carefully. In most cases in our spec it refers to the MessageFormat framework/executable/host environment itself, e.g. in ICU4J the actual MessageFormatter class. And it is true that the ABNF and well-formed/validity rules at the message level are quite permissive about option values.

At the function set level, there is a different layer of "implementation", specifically what we call the function handler. This is what a lot of the normative language in the current registry.md is about. In general, the function handler is some code that maps option values to local API-specific representations. So for "digit size options", it parses the option value. If it's a positive integer, great. Otherwise it's not valid.

We definitely want to impose standards on options and their values, to ensure interoperability. But the MF2-level implementation has no role in this (once the message is syntactically correct). Instead, it is the specific function handler, such as the one for :integer, that is involved. Thus the wording needs to be precise about where the "implementation" is taking place. And it needs to not impose such restrictions as would limit extensibility or prevent the correct level in the code from receiving the information.
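A rough sketch of that layering, for illustration only. `integer_handler` and `parse_digit_size` are hypothetical names, `minimumIntegerDigits` is assumed as the example option, and treating zero as out of range simply mirrors the "positive integer" wording above; the spec text is authoritative on the exact range.

```python
def parse_digit_size(value: str):
    """Map a digit size option value to an int, or None if it is not valid."""
    # str.isdigit() also accepts non-ASCII digits, so restrict explicitly.
    if not value or any(ch not in "0123456789" for ch in value):
        return None
    number = int(value)
    return number if number > 0 else None


def integer_handler(operand, options, errors):
    """Hypothetical function handler.

    The MF2-level code is assumed to pass option values through as strings;
    the handler is the layer that decides whether they are valid.
    """
    minimum_digits = 1
    raw = options.get("minimumIntegerDigits")
    if raw is not None:
        size = parse_digit_size(raw)
        if size is None:
            # Bad value: report it and fall back to the default.
            errors.append(f"bad option value: minimumIntegerDigits={raw!r}")
        else:
            minimum_digits = size
    return str(int(operand)).zfill(minimum_digits)
```

Here the message-level code never inspects the value; it is the handler that maps `"3"` to the local integer representation or reports that `"abc"` is not valid.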

@macchiati
Member Author

I agree that there are important distinctions to be made, and in any final text we should make it clear. What I'm specifically talking about are the implementations of the standard functions defined in the registry.md. Whatever we do, it should be clear what kinds of results we can expect to have, and what kinds of errors we can expect to see raised (which might be different for ill-formed vs well-formed+invalid vs well-formed+valid+unsupported vs well-formed+valid+supported).

Some of that could apply to implementation-defined functions, but I didn't want to talk about that in this issue.
