<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
<title>Strings on the Web: Language and Direction Metadata</title>
<meta charset="utf-8"/>
<script src="https://www.w3.org/Tools/respec/respec-w3c" class="remove"></script>
<script class="remove">
var respecConfig = {
// specification status (e.g. WD, LCWD, WG-NOTE, etc.). If in doubt use ED.
specStatus: "ED",
//publishDate: "2019-04-16",
//previousPublishDate: "2019-04-16",
//previousMaturity: "WD",
noRecTrack: true,
shortName: "string-meta",
copyrightStart: "2017",
edDraftURI: "https://w3c.github.io/string-meta/",
editors: [
{ name: "Richard Ishida", mailto: "[email protected]",
company: "W3C", w3cid: 3439 },
{ name: "Addison Phillips", mailto: "[email protected]",
company: "Invited Expert", w3cid: 33573 },
],
group: "i18n",
github: "w3c/string-meta",
xref: ["i18n-glossary"],
localBiblio: {
"LDML": {
title: "Unicode Technical Standard #35: Unicode Locale Data Markup Language (LDML)",
href: "https://unicode.org/reports/tr35/",
authors: [ "Mark Davis", "CLDR Contributors" ]
},
}
};
</script>
<link rel="stylesheet" href="local.css">
</head>
<body>
<div id="abstract">
<p>This document describes the best practices for identifying the language and direction for strings used on the Web.</p>
</div>
<div id="sotd">
<p>We welcome comments on this document, but to make it easier to track them, please raise separate issues for each comment, and point to the section you are commenting on using a URL.</p>
</div>
<section>
<h2 id="introduction">Introduction</h2>
<p>This document was developed as a result of observations by the Internationalization Working Group over a series of specification reviews related to formats based on JSON, WebIDL, and other non-markup data languages. Unlike markup formats, such as XML, these data languages generally do not provide extensible attributes and were not conceived with built-in language or direction metadata.</p>
<p>The concepts in this document are applicable any time strings are used on the Web, whether as part of a formalised data structure or where they simply originate from JavaScript scripting or any stored list of strings.</p>
<p><a>Natural language</a> information on the Web depends on and benefits from the presence of language and direction metadata. Along with support for Unicode, mechanisms for including and specifying the <a>block direction</a> and the <a>natural language</a> of spans of text are one of the key internationalization considerations when developing new formats and technologies for the Web.</p>
<p>Markup formats, such as HTML and XML, as well as related styling languages, such as CSS and XSL, are reasonably mature and provide support for the interchange and presentation of the world's languages via built-in features. Strings and string-based data formats need similar mechanisms in order to ensure complete and consistent support for the world's languages and cultures.</p>
<section id="conventions">
<h3>Document Conventions</h3>
<p>In this document [[RFC2119]] keywords in uppercase italics have their usual meaning. We also use these stylistic conventions:</p>
<p class="definition-example"><strong>Definitions</strong> appear with a different background color and decoration like this.</p>
<p class="advisement"><strong>Best practices</strong> appear with a different background color and decoration like this.</p>
<!-- Remove comment when adding an 'issue'
<p class="issue-example" id="issue-example"><strong>Recommendations</strong> for future work appear with a different background color and decoration like this.</p>
-->
</section>
<section id="terminology">
<h3>Terminology</h3>
<p>This section provides short definitions of key terminology necessary to understand the contents of this document. Most of the terms found here are taken from the [[I18N-GLOSSARY]]: they are repeated here for convenience.</p>
<p class="note">If you are unfamiliar with bidirectional or right-to-left text, there is a basic introduction <a href="https://www.w3.org/International/articles/inline-bidi-markup/uba-basics">here</a>. This will give you a basic grasp of how the <a>Unicode Bidirectional Algorithm</a> works and the interplay between it and the <a>block direction</a>, which will stand you in good stead for reading this document. Additional materials can be found in the Internationalization Working Group's <a href="https://www.w3.org/TR/international-specs/#text_direction">Best Practices for Spec Developers</a>.</p>
<p class="definition"><a>Metadata</a> is data <em>about</em> data: it is information included in a data structure that provides additional context, meaning, or presentation. In this document, the function of metadata is to express information about direction and language. [[I18N-GLOSSARY]]</p>
<p class="definition">A <a>producer</a> is any process where natural language string data is created for later storage, processing, or interchange. [[I18N-GLOSSARY]]</p>
<p class="definition">A <a>consumer</a> is any process that receives natural language strings, either for display or processing. [[I18N-GLOSSARY]]</p>
<p class="definition">A <a>serialization agreement</a> is the common understanding between a producer and consumer about the serialization of string metadata: how it is to be understood, serialized, read, transmitted, removed, etc. [[I18N-GLOSSARY]]</p>
<p>The <cite>Unicode Bidirectional Algorithm</cite> [[UAX9]], also known as <em>UBA</em>, defines the concept of a [=paragraph direction=]. This is the initial base direction of a "paragraph", and resolves to either <em>left-to-right</em> or <em>right-to-left</em>. The term "paragraph" has a specific meaning internal to UBA. In the context of this document, the term is misleading, because generally strings and other data on the Web are not "paragraphs of text" in some document format. In this document, we generally use the following two more specific terms:</p>
<p class="definition"><dfn>Block direction</dfn>. The initial base direction of a block of text, which resolves to either <em>left-to-right</em> or <em>right-to-left</em>. A block refers to a unit of text as a whole, such as a paragraph in a document or a string in a data file. The name "block" is chosen as a contrast to <em>inline direction</em>. Unicode calls this value the [=paragraph direction=]. [[I18N-GLOSSARY]]</p>
<p class="definition"><dfn>String direction</dfn>. The overall direction of a specific string, which indicates the presentation order of string-internal directional runs. Strings transmitted inside various data structures are often inserted into a block (such as a paragraph). In such a case, the string direction is needed as part of the [=bidi isolation=] of the string.</p>
<p>In this document we are concerned with identifying the <a>string direction</a> of a whole string and how to transmit and apply the string direction when displaying strings in various contexts. We do not talk about how to determine the direction or display of runs of text within a string.</p>
<p>The <a>bidi algorithm</a> is primarily focused on arranging adjacent characters, based on character properties. The <a>block direction</a> dictates (a) the visual order and direction in which runs of strongly-typed <a>LTR</a> and <a>RTL</a> characters are displayed, and (b) where there are weakly-directional or neutral characters, such as punctuation, the placement of those items relative to the other content.</p>
</section>
<section id="producers_consumers">
<h3>The String Lifecycle</h3>
<p>It's not possible to consider alternatives for handling string metadata in a vacuum: we need to establish a framework for talking about string handling and data formats.</p>
<section id="producers">
<h4>Producers</h4>
<p>A string can be created in a number of ways, including a content author typing strings into a plain text editor, text message, or editing tool; or a script scraping text from web pages; or acquisition of an existing set of strings from another application or repository. In the data formats under consideration in this document, many strings come from back end data repositories or databases of various kinds. Sources of strings may provide an interface, API, or metadata that includes information about the <a>string direction</a> and language of the data. Some also provide a suitable default for when the direction or language is not provided or specified. In this document, the <b class="newterm">producer</b> of a string is the source, be it a human or a mechanism, that creates or provides a string for storage or transmission.</p>
<p>When a string is created, it's necessary to (a) detect or capture the appropriate language and <a>string direction</a> to be associated with the string, and (b) take steps, where needed, to set the string up in a way that stores and communicates the language and <a>string direction</a>.</p>
<p>For example, in the case of a string that is extracted from an HTML form, the <a>string direction</a> can be detected from the computed value of the form's field. Such a value could be inherited from an earlier element, such as the <code class="kw" translate="no">html</code> element, or set using markup or styling on the <code class="kw" translate="no">input</code> element itself. The user could also set the direction of the text by <a href="https://www.w3.org/International/questions/qa-html-dir#userexplicit">using keyboard shortcut keys</a> to change the direction of the form field. The <code class="kw" translate="no">dirname</code> attribute provides a way of automatically communicating that value with a form submission.</p>
<p>Similarly, language information in an HTML form would typically be inherited from the <code class="kw" translate="no">lang</code> attribute on the <code class="kw" translate="no">html</code> tag, or an ancestor element in the tree with a <code class="kw" translate="no">lang</code> attribute.</p>
<p>If the producer of the string is receiving the string from a location where it was stored by another producer, and where the <a>string direction</a> and language have already been established, the producer needs to understand that the language and string direction have already been set, and understand how to convert or encode that information for its consumers.</p>
</section>
<section id="consumers">
<h4>Consumers</h4>
<p>A <b class="newterm">consumer</b> is an application or process that receives a string for processing and possibly places it into a context where it will be exposed to a user. For display purposes, it must ensure that the <a>block direction</a> and language of the string are correctly applied to the string in that context. For processing purposes, it must at least persist the language and direction and may need to use the language and direction data in order to perform language-specific operations.</p>
<p>Proper display of the string involves supplying the <a>string direction</a> and language to the rendering document or process by applying additional markup, adding control codes, or setting display properties. This indicates to rendering software the <a>string direction</a> or language that should be applied to the string in this display context to get the string to appear correctly. For both language and direction, it must make clear the boundaries for the range of text to which the metadata applies. For text direction, it must also isolate embedded strings from the surrounding text to avoid spill-over effects of the bidi algorithm [[UAX9]].</p>
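<aside class="example" title="Sketch: a consumer applying language and direction as HTML markup">
<p>As an illustration of the "applying additional markup" option described above, the following is a minimal sketch of a consumer that inserts a string into an HTML context. The function names <code>escapeHtml</code> and <code>toIsolatedHtml</code> are hypothetical and not defined by any specification. The sketch assumes the string arrives as an object with <code>value</code>, <code>lang</code>, and <code>dir</code> fields, and it relies on the HTML <code>bdi</code> element to isolate the string from the surrounding text.</p>

```javascript
// Hypothetical helper: escape characters that are special in HTML text and attributes.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: wrap a string in a <bdi> element that carries its metadata.
// `lang` is optional; when `dir` is missing, this sketch falls back to "auto".
function toIsolatedHtml({ value, lang, dir }) {
  const attrs = [];
  if (lang) attrs.push(`lang="${escapeHtml(lang)}"`);
  attrs.push(`dir="${escapeHtml(dir || "auto")}"`);
  return `<bdi ${attrs.join(" ")}>${escapeHtml(value)}</bdi>`;
}

// The element boundaries mark the range of text the metadata applies to.
console.log(toIsolatedHtml({ value: "a < b", lang: "en", dir: "ltr" }));
```

<p>Other display contexts would need a different mechanism, such as bidi control characters in plain text or display properties in a native user interface toolkit.</p>
</aside>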
<p>Note that a consumer of one document format might be a <a>producer</a> of another document format.</p>
</section>
<section id="agreements">
<h4>Serialization Agreements</h4>
<p>Between any <a>producer</a> and <a>consumer</a>, there needs to be an <dfn data-lt="agreement|serialization agreement">agreement</dfn> about what the document format contains and what the data in each field or attribute means. Any time a producer of a string takes special steps to collect and communicate information about the <a>string direction</a> or language of that string, it must do so with the expectation that the consumer of the string will understand how the producer encoded this information. </p>
<p>If no action is taken by the producer, the consumer must still decide what rules to follow in order to decide on the appropriate <a>string direction</a> and language, even if it is only to provide some form of default value.</p>
<p>In some systems or document formats, the necessary behaviour of the producers and consumers of a string is fully specified. In others, such agreements are not available; it is up to users to provide an agreement for how to encode, transmit, and later decode the necessary language or direction information. Low level specifications, such as JSON, do not provide a string metadata structure by default, so any document formats based on these need to provide the "agreement" themselves.</p>
</section>
</section>
<section id="syntactic-content">
<h3>Strings that are not <a>localizable text</a></h3>
<p>The Web uses strings and character sequences to encode most data. Leaving aside different data types (such as numbers, time values, or binary data serializations such as <code>base64</code>), there are still values that are defined as using a string data type but which are not intended for use as <a>natural language</a> data values. For example, the <a>syntactic content</a> defined by a specification, such as the reserved keywords in CSS or the names of the various definitions in a WebIDL document, are not part of the <a>localizable text</a> of their respective document formats or protocols.</p>
<p>Many specifications also allow users to provide <a>user-supplied values</a> inside of a given namespace or document format. For example, SSIDs on a Wi-Fi network are user-defined. So too are class names in a CSS stylesheet. Most specifications allow (and are encouraged to allow) a wide range of Unicode characters in these names. Most users choose values that are recognizable as words in one or another natural language, as doing so makes the values easier to work with. However, even though these strings consist of words in a natural language, these types of strings are not considered <a>localizable text</a> and do not need to be encumbered with additional metadata related to language or <a>string direction</a>. Usually they are merely identifiers that enable a computer to match the values.</p>
<p>A sometimes-useful test is that if replacing the identifier with an arbitrary string such as <code>tK0001.37B</code> would still be allowed, functional, and "normal", then it's not <a>localizable text</a>.</p>
<p>For example, in the <a href="#base_example">base example</a> below, all of the keys in the JSON document (<code>id</code>, <code>title</code>, <code>authors</code>, <code>language</code>, <code>publisher</code>, and so on) are syntactic content. The data values, such as the ISBN, the language tag, and the publication date are also syntactic content. Only the actual book title, the author's name, and the publisher's name are natural language data values and thus <a>localizable text</a>.</p>
</section>
</section> <!-- Introduction -->
<section>
<h2 id="bp_and-reco">Best Practices, Recommendations, and Gaps</h2>
<p>This section consists of the Internationalization (I18N) Working Group's set of best practices for identifying language and <a>string direction</a> in data formats on the Web. In some cases, there are gaps in existing standards, where the recommendations of the I18N WG require additional standardization or there might be barriers to full adoption.</p>
<p>The main issue is how to establish a common <a>serialization agreement</a> between producers and consumers of data values so that each knows how to encode, find, and interpret the language and <a>string direction</a> of each data field. The use of metadata for supplying both the language and <a>string direction</a> of natural language string fields ensures that the necessary information is present, can be supplied and extracted with the minimal amount of processing, and does not require producers or consumers to scan or alter the data.</p>
<p>The most basic best practice, which the Internationalization Working Group looks for in every specification, is:</p>
<div class="req" id="bp-determine">
<p class="advisement">For any string field containing natural language text, it MUST be possible to determine the language and <a>string direction</a> of that specific string. Such determination SHOULD use metadata at the string or document level and SHOULD NOT depend on heuristics.</p>
</div>
<section>
<h3 id="serialization-best-practices">Recommended Serializations</h3>
<p>This section describes four approaches to serialization for string values. Specifications are intended to use these together to form a complete solution to managing language and direction metadata in document formats and protocols.</p>
<section id="non-linguistic">
<h4 id="string-specific-language">Non-Linguistic Fields</h4>
<p>Avoid assigning or requiring language or direction metadata for <a>non-linguistic fields</a> (that is, strings that contain data that is not human language). Note that this includes <a data-cite="international-specs/#application_internal">application-internal data values</a> [[INTERNATIONAL-SPECS]].</p>
<p>While the value of a <a>syntactic content</a> item or <a>user-supplied value</a> will often use a word-like token that conveys meaning to humans (as an aid in debugging, for example), the values need to consistently be wrapped with localizable display strings for presentation to the user.</p>
<div class="req" id="bp-do_not_use_language_non_data">
<p class="advisement">Specifications SHOULD NOT specify or require the use of language metadata for <a>syntactic content</a> or for the value of fields that cannot contain natural language text.</p>
</div>
<aside class="note">
<p>Error messages are not <a>syntactic content</a>. They consist of and need to be treated as <a>localizable text</a>.</p>
</aside>
<div class="req" id="bp-non-linguistic-defaults">
<p class="advisement">If a <a>consumer</a> is required to assign a language tag to some non-linguistic data, the language tag <code>zxx</code> (Non-Linguistic) SHOULD be used. If a <a>consumer</a> is required to assign a <a>string direction</a> to such data, the value <code>auto</code> SHOULD be used.</p>
</div>
<aside class="example" title="Examples of non-linguistic string values">
<pre class="javascript">
"isbn": "978-123456789-X",
"part-number": "§ABC-123-0094"
</pre>
<p>Note that non-linguistic values are sometimes <em>localized</em>, even though they are not <em>translated</em>.</p>
<pre class="javascript">
"gaining-value-color": "green", // perhaps "red" in another locale?
"background-color": "#ffebdd", // or something else?
"default-level": "medium", // perhaps "large" or "small" in another locale?
"help-file-url": "https://example.org/en-US/help.html"
</pre>
</aside>
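<aside class="example" title="Sketch: labelling non-linguistic data when a label is required">
<p>The defaulting rule above can be sketched as follows for a consumer whose internal interfaces insist on a language and direction for every string. The function name <code>labelNonLinguistic</code> and the returned object shape are assumptions made for illustration only.</p>

```javascript
// Hypothetical helper for a consumer that must attach a language tag and a
// direction to every string, including non-linguistic ones such as ISBNs.
function labelNonLinguistic(value) {
  return {
    value,
    lang: "zxx", // "zxx" is the language tag for non-linguistic content
    dir: "auto", // let the display layer work out the direction
  };
}

console.log(labelNonLinguistic("978-123456789-X"));
```
</aside>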
<div class="req" id="bp_separate_localizable">
<p class="advisement">Specifications SHOULD be careful to distinguish <a>syntactic content</a>, including <a>user-supplied values</a>, from <a>localizable text</a>.</p>
</div>
<div class="req" id="bp_non_displayable_syntactic">
<p class="advisement">Specifications MUST NOT treat <a>syntactic content</a> values as "displayable".</p>
</div>
</section>
<section id="single-linguistic-field">
<h4>Single-Language Localizable Text Field</h4>
<div class="req" id="bp_lang_field_based_metadata">
<p class="advisement">Use field-based metadata or string datatypes to indicate the language and the <a>string direction</a> for individual <a>localizable text</a> values.</p>
</div>
<p>For <a>localizable text</a> fields that appear in a single language, use a data structure to represent the value. The recommended representation is an object with three fields. The field <code>value</code> contains the actual string. The field <code>lang</code> contains a <a>valid</a> [[BCP47]] language tag. The field <code>dir</code> contains the string's <a>string direction</a> (one of the values <code>ltr</code>, <code>rtl</code>, and <code>auto</code>).</p>
<aside class="example" title="Example of a localizable text field">
<pre class="json" id="localizable-text-field">
"field-name-goes-here": {
"value": "This is the string value",
"lang": "en-US",
"dir": "ltr"
}
</pre>
</aside>
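<aside class="example" title="Sketch: checking a localizable text object">
<p>A consumer might sanity-check such an object before using it. The following is a rough sketch, not a normative algorithm. The name <code>checkLocalizable</code> is hypothetical, and the sketch assumes that <code>lang</code> and <code>dir</code> may be omitted (for example, when defaults are supplied elsewhere). It uses <code>Intl.getCanonicalLocales</code> as an approximate structural check of the language tag, which might not accept every tag that [[BCP47]] considers well-formed.</p>

```javascript
const DIRECTIONS = new Set(["ltr", "rtl", "auto"]);

// Hypothetical check: returns a list of problems; an empty list means the
// object looks usable as a localizable text value.
function checkLocalizable(obj) {
  if (obj === null || typeof obj !== "object") {
    return ["not an object"];
  }
  const problems = [];
  if (typeof obj.value !== "string") {
    problems.push("value must be a string");
  }
  if (obj.lang !== undefined) {
    if (typeof obj.lang !== "string") {
      problems.push("lang must be a string");
    } else {
      try {
        // Throws a RangeError for structurally invalid tags such as "en_US".
        Intl.getCanonicalLocales(obj.lang);
      } catch (e) {
        problems.push("lang is not a structurally valid language tag");
      }
    }
  }
  if (obj.dir !== undefined && !DIRECTIONS.has(obj.dir)) {
    problems.push("dir must be ltr, rtl, or auto");
  }
  return problems;
}

console.log(checkLocalizable({ value: "This is the string value", lang: "en-US", dir: "ltr" }));
console.log(checkLocalizable({ value: "Broken", lang: "en_US", dir: "up" }));
```
</aside>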
<p>Use of heuristics to determine language or <a>string direction</a> will always fail for certain cases, and there needs to be a way to provide the correct outcome for those strings. Assignment of <a href="#metadata">metadata</a> (either as a resource-wide default, or in a string-specific label) is an intentional act that removes the need to guess the outcome by applying heuristics.</p>
<p>The use of <a href="#metadata">metadata</a> for indicating <a>block direction</a> is preferred because it avoids requiring the consumer to interpolate the direction using methods such as <a href="#firststrong">first strong</a> or use of methods which require modification of the data itself (such as the <a href="#rlm">insertion of RLM/LRM markers</a> or <a href="#paired">bidirectional controls</a>).</p>
<aside class="note">
<p>Some schema languages, such as the RDF suite of specifications, have no built-in mechanism for associating [=string direction=] metadata with natural language string values. It is up to specifications that use these specifications to define structures and adopt best practices that result in clean interchange of language and direction metadata.</p>
<p>For example, [[JSON-LD]] provides a document-level [=block direction=] using the <code class="kw" translate="no">@context</code> mechanism and defines the <code class="kw" translate="no">i18n</code> namespace as an extension of existing RDF datatypes which can be used to set the language, [=string direction=], or both of string values.</p>
</aside>
<div class="req" id="bp_localizable">
<p class="advisement">For [[WebIDL]]-defined data structures, define each <a>localizable text</a> (natural language text) field as a <q><a>Localizable</a></q>.</p>
</div>
<p> This combines both language and direction metadata and, if consistently adopted, makes interchange between different formats easier. Consistency between different specifications and document formats allows for the easy interchange of string data. By naming field attributes in the same way and adopting the same semantics, different specifications can more easily extract values from or add values into resources from other data sources.</p>
</section>
<section id="language-defaults">
<h4 id="resource_wide_default">Resource-wide Defaults</h4>
<p>When a resource contains a number of natural language strings (and particularly if those strings are all in the same language), using the localized string representation described above can become inefficient. To reduce the complexity of encoding these strings, specifications can establish a resource-level default for language and [=string direction=]. These are separate values, as language does not imply direction. There should still be the ability to override either language or direction on any given string value by using the representation found <a href="#single-linguistic-field">above</a>.</p>
<p class="definition">A <dfn data-lt="resource-wide default|document-level default">resource-wide default</dfn> is a value that is specified at the resource or document-level and can be applied to any unlabeled value contained by that resource.</p>
<div class="req" id="bp_default_setting">
<p class="advisement">Specifications MAY define a mechanism to provide the default language and the default [=string direction=] for all strings in a given resource. However, specifications MUST NOT assume that a resource-wide default is sufficient. Even if a resource-wide setting is available, it must be possible to use string-specific metadata to override that default.</p>
</div>
<p>If your specification defines its own document level defaults, provide two optional fields:</p>
<div class="req" id="bp-document-language-default">
<p class="advisement">A resource-wide default language field SHOULD be called <code>language</code> and SHOULD be specified to contain a <a>valid</a> [[BCP47]] language tag. Specifications SHOULD specify that implementations are only required to check whether a [[BCP47]] language tag is <a>well-formed</a>.</p>
</div>
<div class="req" id="bp-document-direction-default">
<p class="advisement">A resource-wide default <a>block direction</a> field SHOULD be called <code>direction</code> and support the values <code>ltr</code>, <code>rtl</code>, or <code>auto</code>.</p>
</div>
<aside class="example" title="Example of document level defaults">
<pre class="json">
"language": "en-US",
"direction": "ltr",
//...
"some-field-name": "This string is in U.S. English, thanks to the default",
// the following field overrides the language but not the direction:
"another-field-name": {
"value": "Diese Zeichenfolge ist auf Deutsch",
"lang": "de"
}
</pre>
</aside>
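<aside class="example" title="Sketch: resolving a field against document level defaults">
<p>The interaction between the defaults and a per-string override can be sketched as below. The function name <code>resolveField</code> is hypothetical. The sketch assumes a field is either a bare string (which takes both defaults) or an object with <code>value</code> and optional <code>lang</code> and <code>dir</code> (which takes a default only for the metadata it omits).</p>

```javascript
// Hypothetical resolver: combine resource-wide defaults with per-string overrides.
function resolveField(resource, fieldName) {
  const field = resource[fieldName];
  const defaults = { lang: resource.language, dir: resource.direction };
  if (typeof field === "string") {
    // Unlabeled string: both defaults apply.
    return { value: field, lang: defaults.lang, dir: defaults.dir };
  }
  // Labeled string: each piece of metadata is overridden independently.
  return {
    value: field.value,
    lang: field.lang ?? defaults.lang,
    dir: field.dir ?? defaults.dir,
  };
}

const resource = {
  language: "en-US",
  direction: "ltr",
  "some-field-name": "This string is in U.S. English, thanks to the default",
  "another-field-name": { value: "Diese Zeichenfolge ist auf Deutsch", lang: "de" },
};

console.log(resolveField(resource, "some-field-name"));
console.log(resolveField(resource, "another-field-name"));
```

<p>If the resource supplies no defaults, the resolved <code>lang</code> and <code>dir</code> in this sketch are <code>undefined</code>, which a consumer would treat as unknown.</p>
</aside>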
<p>Exceptions to the default are always a possibility, so it needs to be possible for users to override the default on a string-by-string basis.</p>
<p>First-strong heuristics are not applied to strings when the direction has been set externally using metadata. Even if a strongly directional character, such as <span class="codepoint" translate="no"><img alt="RLM" src="images/200F.png"><code class="uname">U+200F RIGHT-TO-LEFT MARK</code></span>, has been prepended to a string, resource-wide default metadata can override the presentation of the string in ways that result in <a>spillover</a> effects. Therefore content needs to be able to provide string-level metadata to override the default for strings whose <a>string direction</a> does not match the resource-wide default.</p>
<p>For specifications that can make use of the [[JSON-LD]] <code>@context</code> mechanism, use the <code>@language</code> and <code>@direction</code> fields to supply the document level defaults.</p>
<aside class="note">
<p><strong>Document level defaults are not a complete solution on their own.</strong> Specifications can define a document-level defaulting mechanism to assist users who often create monolingual documents. However, specifications should only use document-level defaults to augment the ability for users to provide string-specific metadata.</p>
</aside>
</section>
<section id="language-maps">
<h4>Language Maps</h4>
<div class="req" id="bp-lang-maps">
<p class="advisement">Use a language map to store multiple language versions of a single field inside of a document. For [[WebIDL]]-defined data structures, use <a href="#language-map-idl"><code>LanguageMap</code></a> to define the field.</p>
</div>
<p>The world is not monolingual. Having documents that contain only a single language would mean providing many iterations of the document, one for each language, in order to localize the content. This also might require language negotiation when requesting the content.</p>
<p>One way to address this is to allow multilingual values for each <a>localizable text</a> field inside the document.</p>
<p>Language selection is not merely the exact matching of language tag string values to the user's preferred locale. The usual <a href="#localizable-text-field">object representation</a> of a <a>localizable text</a> field requires that the object be deserialized in order to discover the language tag associated with the value. This can be inefficient when there are many values in a given file. In these cases, the best practice is to use a language map to organize <a>localizable text</a> values. Such a map exposes the language tag for the purposes of selection, but still uses an object representation on the value side of the map, since both language and direction might need to be overridden for a given string value.</p>
<aside class="note">
<p>The language map structure presented in this section is not the same as the language maps found in <a data-cite="JSON-LD#string-internationalization">JSON-LD String Internationalization</a> [[JSON-LD]].</p>
</aside>
<aside class="example" title="A localizable language map">
<pre class="json">
"field-name-goes-here": {
"en": {"value": "This is English"},
"en-GB": {"value": "This is UK English", "dir": "ltr"},
"fr": {"value": "C'est français", "lang": "fr-CA", "dir": "ltr"},
"ar": {"value": "هذه عربية", "dir": "rtl"}
}
</pre>
</aside>
<aside class="note">
<p>This structure permits a language tag as the key and also as part of the value. These two values do not have to match. The value of the key indicates the <a>locale</a> of the intended audience, while the language tag of the value represents information about the actual language of the string.</p>
<p>These values might not match in cases where additional specificity is needed to get the correct rendering or processing of the value or, occasionally, when a foreign language value <strong><em>is</em></strong> the intended localization (and the language needs to be overridden, such as to select voices or dictionaries in assistive technology):</p>
<pre class="json">
"extra-rendering-help": {
"zh": {"value": "你好世界!", "lang": "zh-Hans"}
},
"hello-in-french": {
"en": {"value": "Bonjour!", "lang": "fr"},
"de": {"value": "Bonjour!", "lang": "fr"}
}
</pre>
</aside>
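<aside class="example" title="Sketch: selecting an entry from a language map">
<p>The following is a minimal sketch, not a normative algorithm, of how a consumer might select an entry from a language map. It assumes a simplified form of the truncating "lookup" scheme from BCP 47 (progressively removing trailing subtags from the requested tag). The names <code>select_from_language_map</code> and <code>FIELD</code> are hypothetical and exist only for illustration.</p>

```python
# Example data mirroring the language map shown earlier in this section.
FIELD = {
    "en": {"value": "This is English"},
    "en-GB": {"value": "This is UK English", "dir": "ltr"},
    "fr": {"value": "C'est français", "lang": "fr-CA", "dir": "ltr"},
    "ar": {"value": "هذه عربية", "dir": "rtl"},
}


def select_from_language_map(language_map, requested_tag):
    """Return the best matching entry as a dict, or None if nothing matches."""
    # Language tags are not case-sensitive, so index the keys in lowercase.
    keys_by_lower = {key.lower(): key for key in language_map}
    subtags = requested_tag.lower().split("-")
    # Progressively truncate: "en-GB-oxendict" -> "en-GB" -> "en".
    while subtags:
        candidate = "-".join(subtags)
        if candidate in keys_by_lower:
            key = keys_by_lower[candidate]
            entry = language_map[key]
            return {
                "key": key,
                "value": entry["value"],
                # A "lang" on the value side overrides the key when present.
                "lang": entry.get("lang", key),
                # A missing "dir" means the direction is unknown, not "ltr".
                "dir": entry.get("dir"),
            }
        subtags.pop()
        # Do not leave a dangling single-letter (extension) subtag at the end.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return None
```

<p>Under these assumptions, a request for <code>fr-FR</code> selects the <code>fr</code> entry but reports <code>fr-CA</code> as the language of the string, which illustrates why the key and the value-side language tag are kept separate.</p>
</aside>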
</section>
</section>
<section>
<h3 id="bp-lang-dir-unknown">When Language and Direction are Unknown</h3>
<div class="req" id="bp_default_fallback">
<p class="advisement">Specify that, in the absence of other information, the default direction and default language are unknown.</p>
</div>
<p>Explicit metadata, if available, trumps the need for heuristics to be applied. This is logical, since the heuristic method cannot reliably deduce the necessary direction on its own, and if metadata has been explicitly provided there is an indication that it is intended to be authoritative.
</p>
<p>A consumer needs to know that language and direction are unknown quantities in order to know when to apply fallback strategies to the data (such as language detection, or first-strong heuristics for direction). In particular, the default direction should not be set to LTR, since that would suppress first-strong detection, which is more appropriate for strings written in an RTL script.</p>
</section>
<div class="req" id="bp_use_heuristics_1">
<p class="advisement">For the case where the [=string direction=] is not known, specify that consumers should use first-strong heuristics to identify the [=string direction=] of each string.</p>
</div>
<p>If metadata is not available, consumers of strings should use heuristics, preferably based on the Unicode Standard's first-strong detection algorithm, to detect the base direction of a string.</p>
<p>The <a href="#firststrong">first-strong algorithm</a> looks for the first strongly-directional character in a string (skipping certain preliminary substrings), and assumes that it represents the [=string direction=] of the string as a whole. However, the first strong directional character doesn't always coincide with the actual or desired [=string direction=] of the string as a whole, so it should be possible to provide metadata, where needed, to address this problem.</p>
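<aside class="example" title="Sketch: first-strong detection">
<p>A minimal sketch of first-strong detection, using the bidirectional character classes exposed by Python's standard <code>unicodedata</code> module. It treats the whole string as a single paragraph and skips text enclosed in directional isolates; it is an approximation for illustration, not a complete implementation of [[UAX9]]. The function name <code>first_strong_direction</code> is hypothetical.</p>

```python
import unicodedata


def first_strong_direction(text):
    """Return "ltr", "rtl", or None if the string has no strong character."""
    isolate_depth = 0
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in ("LRI", "RLI", "FSI"):
            # Characters inside an isolate do not count toward detection.
            isolate_depth += 1
        elif bidi_class == "PDI":
            if isolate_depth > 0:
                isolate_depth -= 1
        elif isolate_depth == 0:
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    # Only digits, punctuation, spaces, or other neutral characters were found.
    return None
```

<p>With this sketch, a string such as <code>"HTML و CSS"</code> is detected as <code>ltr</code> even though its intended direction is right-to-left, which is the kind of case where explicit metadata is needed.</p>
</aside>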
<div class="req" id="bp_using_rlm_lrm">
<p class="advisement">If relying on first-strong heuristics, allow content developers to use RLM/LRM at the beginning of a string where it is necessary to force a particular base direction, but do not prepend one of these characters to existing strings.</p>
</div>
<div class="req" id="bp_rlm_lrm_availability">
<p class="advisement">Do not rely on the availability of RLM/LRM formatting characters in most cases.</p>
</div>
<p>If string data is being provided by users or content developers in web forms or other simple environments, users may not be able to enter these formatting characters. In fact, most users will probably be unaware that such characters exist, or how to use them. A web form can render their use unnecessary for immediate inspection if it sets the <a>block direction</a> for the input (which it should).</p>
<div class="req" id="bp_inferring_from_language">
<p class="advisement">Specifications SHOULD NOT allow a <a>string direction</a> to be <a href="#script_subtag">interpolated from available language metadata</a> unless direction metadata is not available and cannot otherwise be provided.</p>
</div>
<p>Not all resources make use of the available metadata mechanisms. The script subtag of a language tag (or the "likely" script subtag based on [[BCP47]] and [[LDML]]) can sometimes be used to infer a [=block direction=] or [=string direction=] when other data is not available. Using language information is a "last resort" and specifications SHOULD NOT use it as the primary way of indicating [=block direction=]: make the effort to provide for metadata.</p>
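<aside class="example" title="Sketch: inferring direction from a language tag as a last resort">
<p>The last-resort inference described above might be sketched as follows. The two tables are small, hand-written subsets chosen for illustration (a real implementation would take this data from [[LDML]]), and the function names are hypothetical. Treating every script outside the right-to-left table as left-to-right is a simplifying assumption of this sketch.</p>

```python
# Illustrative subset of right-to-left scripts; not a complete list.
RTL_SCRIPTS = {"Arab", "Hebr", "Thaa", "Syrc", "Nkoo", "Adlm"}

# Illustrative subset of "likely" scripts for languages without a script subtag.
LIKELY_SCRIPTS = {
    "ar": "Arab", "fa": "Arab", "ur": "Arab", "he": "Hebr",
    "en": "Latn", "fr": "Latn",
}


def script_of(tag):
    """Return the explicit or likely script subtag for a tag, or None."""
    subtags = tag.split("-")
    for subtag in subtags[1:]:
        if len(subtag) == 1:
            break  # extensions and private use follow; no script beyond here
        if len(subtag) == 4 and subtag.isascii() and subtag.isalpha():
            return subtag.title()
    return LIKELY_SCRIPTS.get(subtags[0].lower())


def direction_from_language(tag):
    """Return "rtl", "ltr", or None when the script cannot be determined."""
    script = script_of(tag)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"
```

<p>A caller would consult this only when neither string-specific nor resource-wide direction metadata is present, and would fall back to first-strong detection when the result is <code>None</code>.</p>
</aside>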
<section>
<h2 id="defining_bidi_keywords">Defining Bidirectional Keywords in Specifications</h2>
<p>A specification for a document format or protocol that includes natural language text values will need to define a data field or attribute to store the <a>block direction</a> for each natural language content value. These definitions need to be consistent across the Web in order to ensure interoperability, because <a>consumers</a> of one document format will need to map the <a>block direction</a> for values they receive to fields that they produce or will need to control the <a>string direction</a> of each string when displaying the content. This section describes how to provide such a definition along with the specific content to use.</p>
<p>There are two common use cases for defining content direction: (i) defining a <a>directional metadata field</a> for storing and transmitting the <a>string direction</a> as a field in a data structure or (ii) defining a <a>direction attribute</a> to associate a <a>block direction</a> with a given piece of natural language content.</p>
<p class="definition"><dfn data-lt="directional metadata field|direction field">Directional metadata field</dfn>. A directional metadata field (or <strong>direction field</strong> for short) is a field in a data structure used to associate a [=string direction=] with a given natural language string field or data value.</p>
<aside class="example" id="example-direction-metadata">
<p><strong>Example of a <a>direction field</a>.</strong> In this JSON fragment, the <code>title</code> structure has a field <code>direction</code> which represents the <a>string direction</a> to use for the field <code>value</code>.</p>
<pre class="json">"title": {
"value": "HTML و CSS: تصميم و إنشاء مواقع الويب",
"direction": "rtl",
"language": "ar"
}</pre>
</aside>
<p class="definition"><dfn data-lt="direction attribute">Direction attribute</dfn>. A direction attribute is a field or value, usually represented by an attribute in markup languages, that provides the [=string direction=] of the associated natural language string content.</p>
<aside class="example">
<p><strong>Example of a <a>direction attribute</a>.</strong> If the JSON in the <a href="#example-direction-metadata">above example</a> of a <a>directional metadata field</a> were received by a process that was assembling a Web page for display, it might fill in a template similar to the top line in this example to produce markup like the second line. Here the <code class="kw" translate="no">dir</code> attribute from [[HTML]] is an example of a <a>direction attribute</a>.</p>
<pre class="html">
<p dir="{$title.direction}">{$title.value}</p>
<p dir="rtl">HTML و CSS: تصميم و إنشاء مواقع الويب</p>
</pre>
</aside>
<div class="req" id="bp-define-field-direction-value">
<p class="advisement">Use the field name <code class="kw" translate="no">direction</code> when defining a <a>directional metadata field</a> in a data structure or protocol.</p>
</div>
<p>The name <code class="kw" translate="no">direction</code> is preferred for data values. The name <code class="kw" translate="no">dir</code> is an acceptable alternative.</p>
<div class="req" id="bp-define-display-dir-attribute">
<p class="advisement">Use the field name <code class="kw" translate="no">dir</code> when defining a <a>direction attribute</a>.</p>
</div>
<p>The name <code class="kw" translate="no">dir</code> is preferred for an attribute, such as in markup languages. Using <code class="kw" translate="no">direction</code> for an attribute is not recommended, since it is long and relatively uncommon for this use case. Note that both [[HTML]] and [[XML10]] have a built-in <code class="kw" translate="no">dir</code> attribute. A <code class="kw" translate="no">dir</code> attribute should have scope within a document and should be defined to provide bidi isolation.</p>
<div class="req" id="bp-define-direction-values">
<p class="advisement">Define the values of a <a>directional metadata field</a> or a <a>direction attribute</a> to include and be limited to the values <code class="kw" translate="no">ltr</code>, <code class="kw" translate="no">rtl</code>, and <code class="kw" translate="no">auto</code>.</p>
</div>
<p>The value <code class="kw" translate="no">ltr</code> indicates a direction of left-to-right, in exactly the same manner indicated by <a href="https://www.w3.org/TR/css-writing-modes/#direction">CSS writing modes</a> [[CSS-WRITING-MODES-4]].</p>
<p>The value <code class="kw" translate="no">rtl</code> indicates a direction of right-to-left, in exactly the same manner indicated by <a href="https://www.w3.org/TR/css-writing-modes/#direction">CSS writing modes</a> [[CSS-WRITING-MODES-4]].</p>
<p>The value <code class="kw" translate="no">auto</code> indicates that the user agent uses the <a href="https://html.spec.whatwg.org/multipage/dom.html#the-dir-attribute">algorithm</a> for <code class="kw" translate="no">auto</code> defined by [[HTML]] to determine the [=block direction=] ("[=paragraph direction=]"). This heuristic looks for the first character with a strong directionality, in a manner analogous to the Paragraph Level determination in the bidirectional algorithm [[UAX9]].</p>
<p>When <code class="kw" translate="no">auto</code> is applied to multiple fields or to a document as a whole, it means that the direction should be individually derived for each field (with string-specific metadata providing an override for cases that cannot be determined automatically). It can be useful for labelling a group of mixed direction strings, when the <a>string direction</a> of most strings can be reliably determined using the first-strong heuristics. Whenever possible, the actual <a>string direction</a> (<code class="kw" translate="no">ltr</code> or <code class="kw" translate="no">rtl</code>) of individual strings should be stored or exchanged instead of <code class="kw" translate="no">auto</code>. Omitting the <a>direction field</a> is preferable when the value is truly unknown.</p>
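<aside class="example" title="Sketch: validating a direction value">
<p>A minimal sketch of how a consumer might validate a direction value against the three permitted keywords. Matching case-insensitively and treating an unrecognized value as absent are assumptions of this sketch rather than requirements of this document; both function names are hypothetical.</p>

```python
ALLOWED_DIRECTIONS = ("ltr", "rtl", "auto")


def parse_direction(raw):
    """Return "ltr", "rtl", "auto", or None if the value is absent or invalid."""
    if not isinstance(raw, str):
        return None
    value = raw.strip().lower()
    return value if value in ALLOWED_DIRECTIONS else None


def html_dir_attribute(raw):
    """Map a direction field to a value for the HTML dir attribute.

    An unknown direction becomes "auto" so that first-strong detection applies.
    """
    return parse_direction(raw) or "auto"
```

<p>Keeping "absent" (<code>None</code>) distinct from <code>auto</code> in the parsed value lets a process pass the original metadata downstream unchanged, while still choosing <code>auto</code> at display time.</p>
</aside>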
</section>
<section id="writing-spec-examples">
<h3 id="bp-writing-spec-examples">Writing Examples in Specifications</h3>
<p>Specifications for document formats or protocols typically include examples. Examples necessarily include natural language text fields. </p>
<div class="req" id="bp-use-serializations-in-examples">
<p class="advisement">When creating examples in a specification, always use the serializations and best practices found in this document for fields that contain [=natural language=] text. If the format or protocol supports a <a href="#resource_wide_default">resource-wide default</a>, show setting the default in the example. If the format or protocol does not support a document-level default or showing the default would be inconvenient, use a <a href="#single-linguistic-field">Single-Language Localizable Text Field</a> or <a href="#language-maps">Language Map</a> in the example.</p>
</div>
<aside class="example" title="Examples of using best practices">
<p>Here is an example using a document-level default. It also shows overriding the default, in case you need to demonstrate that. Specifications do not need to include examples of such overrides when they are demonstrating specific features of the document format or protocol.</p>
<pre class="json">
"@context": {
"@language": "en",
"@direction": "ltr"
},
"name": "Example University",
"description": "The examples 'name' and 'description' use the document-level default.",
"french-field": {
"value": "Cet exemple est en français",
"lang": "fr"
},
"arabic-field": {
"value": "هذا المثال باللغة العربية",
"lang": "ar",
"dir": "rtl"
}
</pre>
</aside>
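<aside class="example" title="Sketch: resolving document-level defaults">
<p>A minimal sketch of how a consumer might resolve the effective language and direction of a field in a document like the one above. It assumes that <code>@context</code> has been parsed as an object and that a field is either a bare string or an object with <code>value</code>, <code>lang</code>, and <code>dir</code> members. The names <code>DOCUMENT</code> and <code>effective_metadata</code> are hypothetical.</p>

```python
DOCUMENT = {
    "@context": {"@language": "en", "@direction": "ltr"},
    "name": "Example University",
    "french-field": {"value": "Cet exemple est en français", "lang": "fr"},
    "arabic-field": {
        "value": "هذا المثال باللغة العربية",
        "lang": "ar",
        "dir": "rtl",
    },
}


def effective_metadata(document, field_name):
    """Return the value of a field with its effective language and direction."""
    context = document.get("@context", {})
    default_lang = context.get("@language")
    default_dir = context.get("@direction")
    field = document[field_name]
    if isinstance(field, str):
        # A bare string takes both values from the document-level default.
        return {"value": field, "lang": default_lang, "dir": default_dir}
    # String-specific metadata overrides the default, member by member.
    return {
        "value": field["value"],
        "lang": field.get("lang", default_lang),
        "dir": field.get("dir", default_dir),
    }
```

<p>When no default is declared and the field carries no metadata, both results are <code>None</code>, which signals that the language and direction are unknown.</p>
</aside>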
</section>
<section id="guidance-for-producers">
<h3 id="bp-producers">Guidance for [=Producers=]</h3>
<p>Content [=producers=], including implementers of specifications that provide the various language and direction metadata mechanisms described in this document, have some discretion about how to implement the best practices found here. For example, if a document format provides both a resource-wide default and single-language localizable text fields, which should a user prefer?</p>
<div class="req" id="bp-producer-resource-wide-lang">
<p class="advisement">If a [=resource-wide default=] for language is provided by a document format or protocol, it SHOULD always be set to the language most appropriate for the contents of the document. Often this is the [=locale=] of the generating user.</p>
</div>
<div class="req" id="bp-producer-resource-wide-dir">
<p class="advisement">If a [=resource-wide default=] for direction is provided by a document format or protocol, it SHOULD always be set to the direction most associated with the content of the document. Usually this direction is consistent with the document-level language default, if provided.</p>
</div>
<p>For example, if the resource-wide language of a document is <code class="kw" translate="no">en-US</code> (English, United States), then the resource-wide direction of the document should probably be <code class="kw" translate="no">[=LTR=]</code>, because left-to-right is the direction associated with that language.</p>
<div class="req" id="bp-producer-omit-lang-dir">
<p class="advisement">[=Producers=] SHOULD NOT include string-specific language or direction metadata if a [=resource-wide default=] is provided and the string-specific value is consistent with that default.</p>
</div>
<div class="req" id="bp-producer-more-specific-lang">
<p class="advisement">[=Producers=] SHOULD include string-specific language metadata if the value for a given string is <em>more specific</em> or entirely different from that of the [=resource-wide default=].</p>
</div>
<p>For example, if the [=resource-wide default=] value were <code class="kw" translate="no">fr</code> (French) and the string's associated language were <code class="kw" translate="no">fr-FR</code> (French, France), the [=producer=] ought to generate string-specific metadata with the more specific <code class="kw" translate="no">fr-FR</code> tag. Similarly, the [=producer=] ought to generate string-specific metadata if the language were entirely different, such as <code class="kw" translate="no">de</code> (German).</p>
<p>A [=language tag=] is more-specific if it contains more subtags.</p>
<aside class="note">
<p>Supporting the encoding of more-specific language tags is one reason why the <a href="#language-maps">language maps</a> structure includes the option of overriding the language tag in the value portion of each entry. The more-specific tags can assist with processing, such as selecting the right voice in a text-to-speech system or the right dictionary when checking spelling.</p>
</aside>
<div class="req" id="bp-producer-include-rtl-dir">
<p class="advisement">[=Producers=] SHOULD include string-specific direction metadata for any content whose [=string direction=] is opposite that of a provided [=resource-wide default=], even if the string itself is unambiguous.</p>
</div>
<p>Many strings consist solely of strongly directional characters that are consistent with the overall [=string direction=]. When this direction does not agree with the [=resource-wide default=] (and the default is present), the [=string direction=] needs to be included so that [=consumers=] do not need to introspect the string to determine direction and so that processes (such as filtering and selection) do not mistake the content's direction.</p>
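<aside class="example" title="Sketch: deciding which string-specific metadata to emit">
<p>The producer guidance in this section might be sketched as follows. The sketch assumes a format in which a field can be serialized either as a bare string or as an object with <code>value</code>, <code>lang</code>, and <code>dir</code> members. It emits the language whenever it is not identical to the default, which covers both the "more specific" and the "entirely different" cases; the function name <code>serialize_string</code> is hypothetical.</p>

```python
def serialize_string(value, lang, direction, default_lang=None, default_dir=None):
    """Return a bare string or an object carrying only the needed metadata."""
    field = {"value": value}
    # Emit the language unless it matches the resource-wide default exactly
    # (language tags are compared case-insensitively).
    if lang is not None:
        if default_lang is None or lang.lower() != default_lang.lower():
            field["lang"] = lang
    # Emit the direction whenever it differs from the default, even if the
    # string itself would be unambiguous under first-strong detection.
    if direction is not None and direction != default_dir:
        field["dir"] = direction
    if len(field) == 1:
        return value  # consistent with the defaults: no metadata needed
    return field
```

<p>For a document whose defaults are <code>fr</code> and <code>ltr</code>, a string tagged <code>fr-FR</code> is serialized with its more specific language tag and without a direction.</p>
</aside>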
</section>
<section id="guidance-for-consumers">
<h3 id="bp-consumers">Guidance for [=Consumers=]</h3>
<p>The purpose of collecting, serializing, and transmitting language and [=string direction=] metadata is so that [=consumers=] can use it to display and process string data correctly.</p>
<div class="req" id="bp-consumer-lang-assumption">
<p class="advisement">Consumers SHOULD employ any language metadata provided by document formats or protocols when processing or displaying the associated string value.</p>
</div>
<div class="req" id="bp-consumer-dir-assumption">
<p class="advisement">Consumers SHOULD employ any [=string direction=] metadata provided by document formats or protocols when processing or displaying the associated string value.</p>
</div>
<div class="req" id="bp-consumer-isolation">
<p class="advisement">When a string is displayed in or inserted into a document, consumers SHOULD isolate it directionally from any surrounding text.</p>
</div>
<div class="req" id="bp-consumer-direction">
<p class="advisement">Consumers SHOULD apply direction metadata to a string when it is inserted into a document. If this [=string direction=] is provided by the metadata associated with a string, consumers SHOULD use that. If such metadata is not available, first-strong heuristics SHOULD be used to assign the direction.</p>
</div>
<p>Wrapping an inserted string value in bidirectional isolation never causes a problem, and doing so prevents [=spillover effects=], which produces the best result.</p>
<aside class="note">
<p class="links_title">For more information on how to do this in various programming languages, see:</p>
<ul>
<li class="w3"><a href="https://www.w3.org/International/questions/qa-direction-native">How can I use direction metadata in native APIs?</a>
</li>
<li><a href="https://www.w3.org/International/articles/inline-bidi-markup/index.en.html#whattodo">Inline markup and bidirectional text in HTML</a></li>
<li><a href="https://www.w3.org/International/questions/qa-html-dir.en.html#insertedtext">Structural markup and right-to-left text in HTML</a></li>
</ul>
</aside>
<div class="req" id="bp-consumer-lang-display">
<p class="advisement">Consumers SHOULD apply language metadata to a string when it is inserted into a document. Use relevant document attributes or APIs to apply any available language metadata to the string.</p>
</div>
<p>To get the best results in presentation (such as font selection) or text processing (such as hyphenation), the language of inserted text should be set in the document or in APIs that handle the text. In [[HTML]] this is done by setting the <code>lang</code> attribute. In [[XML]] this is done by setting the <code>xml:lang</code> attribute.</p>
<div class="req" id="bp-consumer-metadata-normalization">
<p class="advisement">Consumers MAY normalize language tags to help ensure interoperability.</p>
</div>
<p>For example, many implementations will use the normalization found in <a href="http://www.unicode.org/reports/tr35/#BCP_47_Language_Tag_Conversion">Language Tag Conversion</a> in [[CLDR]]. This normalization, among other things, replaces obsolete subtags and alphabetizes variants.</p>
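<aside class="example" title="Sketch: a small subset of language tag normalization">
<p>The following sketch illustrates only a small part of what a normalizer does: it applies the conventional casing of subtags described in [[BCP47]] and replaces three well-known deprecated language subtags. It is not the full conversion referenced above (for example, it does not reorder variants), and the replacement table is a hand-written subset for illustration. The name <code>normalize_language_tag</code> is hypothetical.</p>

```python
# Illustrative subset of deprecated language subtags and their replacements.
DEPRECATED_LANGUAGES = {"iw": "he", "in": "id", "ji": "yi"}


def normalize_language_tag(tag):
    """Apply conventional casing and replace a few deprecated language subtags."""
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        lowered = subtag.lower()
        if index == 0:
            result.append(DEPRECATED_LANGUAGES.get(lowered, lowered))
            after_singleton = len(lowered) == 1  # e.g. a private use tag "x-..."
        elif after_singleton:
            result.append(lowered)  # extension and private use subtags stay lowercase
        elif len(lowered) == 1:
            after_singleton = True
            result.append(lowered)
        elif len(lowered) == 4 and lowered.isalpha():
            result.append(lowered.title())  # script subtag, such as "Hans"
        elif len(lowered) == 2 and lowered.isalpha():
            result.append(lowered.upper())  # region subtag, such as "CN"
        else:
            result.append(lowered)
    return "-".join(result)
```

<p>Because language tags are not case-sensitive, the casing step changes only the presentation of a tag; the subtag replacement is what affects matching.</p>
</aside>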
<div class="req" id="bp-consumer-passthru">
<p class="advisement">Consumers that are also producers SHOULD take care to pass language and direction metadata to their downstream consumers.</p>
</div>
<hr>
<aside class="example" id="consumer-isolation-example" title="Employing language and direction metadata">
<p>This example pertains to all of the above best practices. Suppose your application expects to receive information about an e-book to insert into the user interface. This data includes the language and string direction metadata discussed in this document and might look like:</p>
<pre class="json">"title": {
"value": "⁧HTML و CSS: تصميم و إنشاء مواقع الويب⁩",
"direction": "rtl",
"language": "ar"
}</pre>
<p>The application might wish to display the information to a user, perhaps by inserting the data into a larger message or into a document.</p>
<p>For example, in a plain text application it might use a <a href="https://www.unicode.org/reports/tr35/tr35-messageFormat.html">MessageFormat</a> pattern [[LDML]] such as:</p>
<pre>You are currently reading {$title}</pre>
<p>The application can provide Unicode isolating bidirectional controls around the inserted string. In this case, since the direction is <code translate="no">rtl</code> (<em>right-to-left</em>), the <span class="codepoint" translate="no"><img alt="RLI" src="images/2067.png"><code class="uname">U+2067 RIGHT-TO-LEFT ISOLATE</code></span> is inserted before the title and <span class="codepoint" translate="no"><img alt="PDI" src="images/2069.png"><code class="uname">U+2069 POP DIRECTIONAL ISOLATE</code></span> after it:</p>
<pre class="text">
You are currently reading \u2067⁧HTML و CSS: تصميم و إنشاء مواقع الويب⁩\u2069
</pre>
<p>If instead the application wanted to insert the formatted message into an [[HTML]] document, the author will want to use markup around the inserted string. If no existing markup is available, use the inline element <code>bdi</code>, which isolates the text, to carry the <code>lang</code> and <code>dir</code> attributes. For example:</p>
<pre class="html">
<p>You are currently reading
<bdi lang="ar" dir="rtl">⁧HTML و CSS: تصميم و إنشاء مواقع الويب⁩</bdi></p>
</pre>
<p>If there is an existing element that tightly wraps the inserted text, set the <code>lang</code> and <code>dir</code> attributes of that element to match the metadata provided. For example:</p>
<pre class="html">
<p>You are currently reading
<cite lang="ar" dir="rtl">⁧HTML و CSS: تصميم و إنشاء مواقع الويب⁩</cite></p>
</pre>
<p>Further examples can be found in the articles <a href="https://www.w3.org/International/articles/inline-bidi-markup/index.en.html#whattodo">Inline markup and bidirectional text in HTML</a> and <a href="https://www.w3.org/International/questions/qa-html-dir.en.html#insertedtext">Structural markup and right-to-left text in HTML</a>.</p>
<p>When [=string direction=] metadata is not available, the string should still be isolated and, unless the [=consumer=] knows better, first-strong detection should be applied. In [[HTML]] this means using the value <code>auto</code> for the attribute <code>dir</code>. In plain text, this might mean wrapping the string with a <span class="codepoint" translate="no"><img alt="FSI" src="images/2068.png"><code class="uname">U+2068 FIRST STRONG ISOLATE</code></span> character at the start, paired with a closing <span class="codepoint" translate="no"><img alt="PDI" src="images/2069.png"><code class="uname">U+2069 POP DIRECTIONAL ISOLATE</code></span> character at the end of the string.</p>
<p class="note">Notice that, because it starts with a strong LTR character, the example string displays incorrectly with "first strong" detection: <code translate="no" lang="ar">⁨HTML و CSS: تصميم و إنشاء مواقع الويب⁩</code></p>
</aside>
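<aside class="example" title="Sketch: isolating an inserted string in plain text">
<p>The plain text case in the example above might be sketched as follows. The sketch selects the opening isolate character from the direction metadata and falls back to <code>U+2068 FIRST STRONG ISOLATE</code> when the direction is unknown. The function names and the simple placeholder substitution are hypothetical and stand in for whatever message formatting library an application actually uses.</p>

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, direction=None):
    """Wrap text in the isolate controls matching its direction metadata."""
    # Unknown or "auto" directions fall back to first-strong isolation.
    opener = {"ltr": LRI, "rtl": RLI}.get(direction, FSI)
    return opener + text + PDI


def format_message(pattern, placeholder, field):
    """Insert a localizable text field into a plain text pattern, isolated."""
    isolated = isolate(field["value"], field.get("direction"))
    return pattern.replace(placeholder, isolated)
```

<p>The controls are added only to the displayed message. The stored or forwarded value of the field is left unmodified, so that downstream consumers receive the original string and its metadata.</p>
</aside>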
</section>
<section id="technology_specific_solutions">
<h3 id="bp-json-ld">Using JSON-LD</h3>
<div class="req" id="bp_use_jsonld_language_context">
<p class="advisement">Use of [[JSON-LD]] <code class="kw" translate="no">@context</code> and the built-in <code class="kw" translate="no">@language</code> attribute is RECOMMENDED as a document level default.</p>
</div>
<p>For document formats that use it, [[JSON-LD]] includes some data structures that are helpful in assigning language (but not paragraph direction) metadata to collections of strings (including entire resources). Notably, it defines what it calls "string internationalization" in the form of a context-scoped <code class="kw" translate="no">@language</code> value which can be associated with blocks of JSON or within individual objects. There is no definition for base direction, so the <code class="kw" translate="no">@context</code> mechanism does not currently address all concerns raised by this document.</p>
<div class="req" id="bp-use_jsonld_i18n_namespace">
<p class="advisement">Specifications SHOULD use the <code class="kw" translate="no">i18n</code> Namespace feature for RDF literals, as defined in [[JSON-LD]] 1.1.</p>
</div>
<div class="req" id="bp_use_jsonld_atsign">
<p class="advisement">Where the <code class="kw" translate="no">i18n</code> Namespace is not available or is inappropriate to use, specifications SHOULD require [[JSON-LD]] plain string literals for natural language values to provide string-specific language information.</p>
</div>
<p>Some datatypes, such as [[RDF-PLAIN-LITERAL]], already exist that allow for <em>language</em> metadata to be serialized as part of a string value.</p>
<aside class="example" title="Examples of RDF plain literals with language tags">
<pre>
"title": "تصميم و إنشاء مواقع الويب@ar",
"tags": [ "HTML@en", "CSS@en", "تصميم المواقع@ar" ],
"id": "978-111887164-5@und"
</pre>
</aside>
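<aside class="example" title="Sketch: splitting a plain literal into text and language">
<p>A consumer of values like those above might separate the text from the language tag as sketched here. The sketch splits at the last <code>@</code> in the string, so that an <code>@</code> inside the text itself is preserved; the function name <code>split_plain_literal</code> is hypothetical.</p>

```python
def split_plain_literal(literal):
    """Return (text, language tag or None) for a "text@lang" plain literal."""
    text, separator, lang = literal.rpartition("@")
    if not separator:
        # No "@" at all: treat the whole string as text with unknown language.
        return literal, None
    # An empty tag after the final "@" also means no language was given.
    return text, lang or None
```

<p>Note that this only works when every value in the field is known to be serialized this way. Applied to an ordinary string that happens to contain <code>@</code>, such as an email address, the split would produce a bogus language tag.</p>
</aside>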
</section>
<section id="protocol-strings">
<h3>Strings that are part of a legacy protocol or format</h3>
<div class="req" id="bp_legacy_fmt_dir">
<p class="advisement">For strings that cannot specify direction due to legacy format reasons, specifications SHOULD specify that the <a>string direction</a> of each string depends on first-strong heuristics.</p>
</div>
<div class="req" id="bp_legacy_fmt_nonlang">
<p class="advisement">For string values and string fields that are <em>not</em> <a>localizable text</a>, specifications SHOULD specify that the field is non-linguistic in nature and recommend the language tag <code class="kw" translate="no">zxx</code> ("No linguistic content") be associated with each string value.</p>
</div>
<div class="req" id="bp_legacy_fmt_lang_unknown">
<p class="advisement">For string values and string fields that are known to contain <a>localizable text</a> but for which there is no possibility of language metadata from the underlying format, specifications SHOULD specify that the language of the content is unknown and recommend the language tag <code class="kw" translate="no">und</code> ("Undetermined") be associated with each string. Specifications MAY allow the use of heuristics or the inference of the language from other field values where appropriate and as a last resort.</p>
</div>
<div class="note" id="language-like-tokens">
<p>Many protocols or formats make use of values that are meant to be human-decipherable tokens, while not being intended as natural language text. This allows people to make use of the value, such as using it for debugging. These can include common protocol elements that humans expect to view and interact with.</p>
<p>Common examples of these include domain names and email addresses. With greater availability of Unicode in these sorts of value spaces, display of these values might vary between systems and environments. For example, font selection, which can vary depending on language, might be different on systems with different default locales.</p>
</div>
<p>Some specifications interact with string values defined by existing protocols or formats. Often these strings are not associated with or do not provide language or direction metadata. For example, many HTTP headers define their contents as if their contents were not <a>localizable text</a>, even when those contents are expected to be natural language text. Specifications that act as <a>consumers</a> or <a>producers</a> of these string values have no way to discover what the language or direction metadata is, nor will they have a mechanism to attach such metadata.</p>
<aside class="example" title="A legacy field that cannot encode language or direction">
<p>The following <code>dictionary</code> defined in the [[Webtransport]] specification depends on a data structure defined in [[RFC9000]].</p>
<pre>dictionary WebTransportCloseInfo {
unsigned long closeCode = 0;
DOMString reason = "";
};</pre>
<p>Although the field <samp translate="no">reason</samp> is expected to contain a descriptive string, it is defined as an array of bytes (with an expected encoding of UTF-8). Since the underlying protocol does not provide fields for language or direction metadata, it is not possible to accurately derive the values when reading data from the wire and any values generated by producers implementing WebTransport would be dropped when the structure is eventually serialized to the wire format. As a result, consumers cannot know the direction of the string. Using the first-strong heuristic (such as assigning the value <code class="kw" translate="no">auto</code> to an HTML <code class="kw" translate="no">dir</code> attribute when adding the reason message to a text element) is preferred to assigning an arbitrarily computed value. In addition, while the actual language of the field <samp translate="no">reason</samp> cannot be known from the data structure, <a>consumers</a> should assign a value of <code class="kw" translate="no">und</code> (or the empty value) as the language of the field for display and processing purposes.</p>
</aside>
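<aside class="example" title="Sketch: displaying a legacy string without metadata">
<p>A minimal sketch of how a consumer might insert a string such as <samp translate="no">reason</samp> into an HTML page when no language or direction metadata can be recovered. It marks the language as <code class="kw" translate="no">und</code> and relies on first-strong detection through <code class="kw" translate="no">dir="auto"</code>. The function name <code>render_legacy_string</code> is hypothetical.</p>

```python
from html import escape


def render_legacy_string(text):
    """Return an isolated HTML fragment for a string with unknown metadata."""
    # The text comes from the wire, so escape it before placing it in markup.
    return '<bdi lang="und" dir="auto">' + escape(text) + "</bdi>"
```

<p>The <code>bdi</code> element isolates the inserted text from its surroundings, so a right-to-left reason message does not disturb the layout of the sentence that contains it.</p>
</aside>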
</section>
<section>
<h3 id="other_approaches">Additional Best Practices</h3>
<div class="req" id="bp_unicode_tag_chars_nonuse">
<p class="advisement">Specifications SHOULD NOT use the Unicode "language tag" characters (code points <code>U+E0000</code> to <code>U+E007F</code>) for language identification.</p>
</div>
<p>[[Unicode]] says that the <q>... use of tag characters to convey language tags is strongly discouraged</q> and that the use of the character <span class="uname">U+E0001 LANGUAGE TAG</span> is <em>strongly discouraged</em>.</p>
<div class="req" id="bp_no_paired_bidi">
<p class="advisement">Specifications MUST NOT require the production or use of <a href="#paired">paired bidi controls</a>.</p>
</div>
<p>Another way to say this is: <strong><em>do not require implementations to modify data passing through them</em></strong>. Unicode bidi control characters might be found in a particular piece of string content, where the producer or data source has used them to make the text display properly. That is, they might already be part of the data. Implementations should not disturb any controls that they find—but they shouldn't be required to produce additional controls on their own.</p>
<div class="req" id="bp_language_indexing">
<p class="advisement">Specifications SHOULD recommend the use of <a>language indexing</a> when <a>Localizable</a> strings can be supplied in multiple languages for the same value.</p>
</div>
<p><a>Producers</a> sometimes need to supply multiple language values (see <a href="#localization-considerations">Localization Considerations</a>) for the same content item or data record. One use for this is <a>language negotiation</a> by the <a>consumer</a>.</p>
<aside class="note">
<p>[[JSON-LD]] <a>language indexing</a> does not support the use of <a>Localizable</a> values or identification of language metadata, such as using <code class="kw" translate="no">i18n</code> namespace additions to values.</p>
</aside>
<aside class="example">
<p>Here is the record used in the <a href="#base_example">original example</a> with a record-level default language and default [=block direction=] added. It also shows the use of a Localizable string to override the document-level defaults for the <code class="kw">authors</code> field. Note that this "worked example" is not valid.</p>
<pre>
{
"@context": {
"@language": "ar",
"@direction": "rtl"
},
"id": {"978-111887164-5"},
"title": "<span dir="rtl">HTML و CSS: تصميم و إنشاء مواقع الويب</span>",
"authors": [ {"value": "Jon Duckett", "lang": "en", "dir": "ltr"} ],
"pubDate": "2008-01-01",
"publisher": "مكتبة",
"language": "ar", // recall that this is data about the language of the book's content!
"coverImage": "https://example.com/images/html_and_css_cover.jpg",
// etc.
},
</pre>
<p>Here's a different rendition using [[JSON-LD]]'s <code class="kw" translate="no">i18n</code> Namespace:</p>
<pre>
{
"@context": {
"@language": "ar",
"@direction": "rtl"
},
"id": {"978-111887164-5"},
"title": "<span dir="rtl">HTML و CSS: تصميم و إنشاء مواقع الويب</span>"^^i18n:ar_rtl,
"authors": [ "Jon Duckett"^^i18n:en-US_ltr ],
"pubDate": "2008-01-01",
"publisher": "مكتبة"^^i18n:ar-eg_rtl,
"language": "ar", // recall that this is data about the book content's language!
"coverImage": "https://example.com/images/html_and_css_cover.jpg",
// etc.
},
</pre>
</aside>
</section>
</section> <!-- best practices -->
<section>
<h2 id="use_cases">Requirements and Use Cases</h2>
<div class="note" title="Start Here">
<p>Please read the article <a href="https://www.w3.org/International/articles/lang-bidi-use-cases/"><strong>Use cases for bidi and language metadata on the Web</strong></a> for detailed use cases, including a clear illustration of issues such as <a>spillover</a> or locale-based rendering. This section summarises some key points from that document related to the need for language and direction metadata.</p>
</div>
<section>
<h3 id="problem_statement">Why is this important?</h3>
<p>Information about the language of content is important when processing and presenting <a>localizable text</a> for a variety of reasons. When language information is not present, the resulting degradation in appearance or functionality can frustrate users, render the content unintelligible, or disable important features. Some of the affected processes include:</p>
<ul>
<li>Selection of fonts and configuration of rendering options to enable the proper display of different languages. This includes
prevention of problems such as: <ul>
<li>"ransom noting" (showing text using multiple different fonts)</li>
<li>language-specific glyph selection, especially the selection of the correct Chinese/Japanese/Korean font due to important presentational variations for the same characters in these languages</li>
<li>displaying blanks, spaces, question marks, or other disappearance of characters due to the lack of glyphs in the selected font</li>
</ul></li>
<li>Spell checking and other content processing (such as abuse detection, hyphenation, line-breaking, case conversion, etc.) </li>
<li>Indexing, search, and other natural language text operations </li>
<li>Filtering according to intended audience and language negotiation </li>
<li>Selection of a text-to-speech voice and processor, such as used for accessibility or in a voice-based interface</li>
</ul>
<p>Similarly, direction metadata is important to the Web. When a string contains text in a script that runs right-to-left (RTL), it must be possible to eventually display that string correctly when it reaches an end user. For that to happen, it is necessary to establish what <a>string direction</a> needs to be applied to the string as a whole. The appropriate [=string direction=] cannot always be deduced by simply looking at the string; even where it is possible, the producer and consumer of the string need to use the same heuristics to interpret the direction.</p>
<p>Static content, such as the body of a Web page or the contents of an
e-book, often has language or direction information provided by the document format
or as part of the content metadata. Data formats found on the Web
generally do not supply this metadata. Base specifications such as
Microformats, WebIDL, JSON, and more, have tended to store natural
language text in string objects, without additional metadata.</p>
<p>This places a burden on application authors and data format
designers to provide the metadata on their own initiative. When
standardized formats do not address the resulting issues, the result
can be that, while the data arrives intact, its processing or
presentation cannot be wholly recovered.</p>
<p>In a distributed Web, any <a>consumer</a> can also be a <a>producer</a> for some other process or system. Thus, a given consumer might need to pass language and direction metadata from one document format (and using one <a>serialization agreement</a>) to another consumer using a different document format. Lack of consistency in representing language and direction metadata in serialization agreements poses a threat to interoperability and a barrier to consistent implementation.</p>
</section>
<section>
<h3 id="base_example">An example</h3>
<p>Suppose that you are building a Web page to show a
customer's library of e-books. The e-books exist in a catalog of data
and consist of the usual data values. A JSON file for a single entry
might look something like:</p>
<!--
Title below is actually "HTML and CSS: Design and Build Websites"
ASIN: 1118871642
ISBN-13: 978-1118871645
ISBN-10: 1118871642
-->
<pre id="example1Data">
{
"id": "978-111887164-5",
"title": "<span dir=rtl>HTML و CSS: تصميم و إنشاء مواقع الويب</span>",
"authors": [ "Jon Duckett" ],
"language": "ar",
"pubDate": "2008-01-01",
"publisher": "مكتبة",
"coverImage": "https://example.com/images/html_and_css_cover.jpg",
// etc.
},
</pre>
<p>Each of the above is a data field in a database somewhere. There is even information about what language the book is in: (<samp>"language": "ar"</samp>).</p>
<p>A well-internationalized catalog would include metadata in addition to what is shown above. That is, for each of the fields containing <a>localizable text</a>, such as the <samp>title</samp> and <samp>authors</samp> fields, there should be language and <a>string direction</a> information stored as metadata. (There may be other values as well, such as pronunciation metadata for sorting East Asian language information.) These metadata values are used by consumers of the data to influence the processing and enable the display of the items in a variety of ways. As the JSON data structure
provides no place to store or exchange these values, it is more difficult to construct internationalized applications.</p>
<p>One work-around might be to encode the values using a mix of HTML and Unicode bidi controls, so that a data value might look like one of the following:</p>
<pre>
// following examples are NOT recommended
// contains HTML markup
"title": "<span lang='ar' dir='rtl'><span dir=rtl>HTML و CSS: تصميم و إنشاء مواقع الويب</span></span>",
// contains LRM as first character
"authors": [ "\u200eJon Duckett" ],
</pre>
<p>But JSON is a data interchange format: the content might not end up with the title field being displayed in an HTML context. The JSON above might very well be used to populate, say, a local data store which uses native controls to show the title, and these controls will treat the HTML as string contents. Producers and consumers of the data might not expect to introspect the data in order to supply or remove the extra data or to expose it as metadata. Most JSON libraries don't know anything about the structure of the content that they are serializing. Producers want to generate the JSON file directly from a local data store, such as a database. Consumers want to store or retrieve the value for use without additional consideration of the content of each string. In addition, either producers or consumers can have other considerations, such as field length restrictions, that are affected by the insertion of additional controls or markup. Each of these considerations places a special burden on implementers to create arbitrary means of serializing, deserializing, managing, and exchanging the necessary metadata, with interoperability as a casualty along the way.</p>
<p>(As an aside, note that the markup shown in the above example is actually needed to make the title as well as the inserted markup display correctly in the browser.)</p>
</section>
<section>
<h3 id="unicode_enough">Isn't Unicode enough?</h3>
<p>[[Unicode]] and its character encodings (such as UTF-8) are key elements of the Web and its formats. They provide the ability to encode and exchange text in any language consistently throughout the Internet. However, Unicode by itself does not guarantee perfect presentation and processing of <a>natural language</a> text, even though it does guarantee perfect interchange.</p>
<p>Several features of Unicode are sometimes suggested as part of the solution to providing language and direction metadata. Specifically, Unicode bidi controls are suggested for handling direction metadata. In addition, there are "tag" characters in the <code class="kw" translate="no">U+E0000</code> block of Unicode originally intended for use as language tags (although this use is now deprecated). </p>
<p>There are a variety of reasons why the addition of characters to
data in an interchange format is not a good idea. These include:</p>
<ul>
<li>Most of the data sources used to assemble the documents on the Web will not contain
these characters; producers, in the process of assembling or serializing the data,
will need to introspect and insert the characters as needed—changing the data from the original source. Consumers must then deserialize and introspect the information using an identical <a>serialization agreement</a>. The consumer has no way of knowing if the characters found in the data were inserted by the producer (and should be removed) or if the characters were part of the source data. Overzealous producers might introduce additional and unnecessary characters, for example adding an additional layer of bidi control codes to a string that would not otherwise require it. Equally, an overzealous consumer might remove characters that are needed by or intended for downstream processes.</li>
<li>Another challenge is that many applications that use these data formats have limitations on
content, such as length limits or character set restrictions. Inserting additional characters into
the data may violate these externally applied requirements, and interfere
with processing. In the worst case, portions (or all of) the data value itself might be rejected, corrupted,
or lost as a result.</li>
<li>Inserting additional characters changes the identity of the string. This may have important consequences in certain contexts.</li>
<li>Inserting and removing characters from the string is not a common operation for most data serialization libraries. Any processing that adds language or direction controls would need to introspect the string to see if these are already present or might need to do other processing to insert or modify the contents of the string as part of serializing the data.</li>
</ul>
<p class=note>This last consideration is important to call out: document formats are often built and serialized using several layers of code. Libraries, such as general purpose JSON libraries, are expected to store and retrieve faithfully the data that they are passed. Higher-level implementations also generally concern themselves with faithful serialization and de-serialization of the values that they are passed. Any process that alters the data itself introduces variability that is undesirable. For example, consider an application's unit test that checks if the string returned from the document is identical to the one in the data catalog used to generate the document. If bidi controls, HTML markup, or Unicode language tags have been inserted, removed, or changed, the strings might not compare as equal, even though they would be expected to be the same.</p>
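<p>The identity problem described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical producer that prepends <span class="unicode">U+200E LEFT-TO-RIGHT MARK</span> while serializing; the variable names are invented for the example.</p>

```python
import json

LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK

catalog_value = "Jon Duckett"

# A hypothetical producer prepends LRM while serializing the record.
document = json.dumps({"authors": [LRM + catalog_value]})

# The JSON library faithfully returns what it was given, marker included.
received = json.loads(document)["authors"][0]

# The round-tripped string no longer matches the catalog value.
print(received == catalog_value)
print(len(catalog_value), len(received))
```

<p>The comparison fails and the received string is one code point longer, even though both strings look the same when displayed. A length-limited field or an equality-based unit test would be affected in exactly this way.</p>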
</section>
<section>
<h3 id="what_consumers_do">What consumers need to do to support direction</h3>
<p>Given the <a href="">use cases</a> for bidirectional text, it will be clear that a consumer cannot simply insert a string into a target location without some additional work or preparation taking place, first to establish the appropriate <a>string direction</a> for the string being inserted, and secondly to apply bidi isolation around the string.</p>
<p>This requires the presence of markup or Unicode formatting controls around the string. If the string's actual direction is opposite that of the content into which it is being inserted, the markup or control codes need to tightly wrap the string. Strings that are inserted adjacent to each other all need to be individually wrapped in order to avoid the spillover issues we saw in the previous section.</p>
<p>[[HTML]] provides base direction controls and isolation for any inline element when the <code class="kw" translate="no">dir</code> attribute is used, or when the <code class="kw" translate="no">bdi</code> element is used. When inserting strings into plain text environments, isolating Unicode formatting characters need to be used. (Unfortunately, support for the isolating characters, which the Unicode Standard recommends as the default for plain text/non-markup applications, is still not universal.)</p>
<p>The trick is to ensure that the direction information provided by the markup or control characters reflects the <a>string direction</a> of the string.</p>
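<p>For a plain text target, the wrapping step might look like the following Python sketch. The helper name <code>isolate</code> and the choice to fall back to FSI when the direction is unknown are assumptions made for this example, not requirements of this document.</p>

```python
LRI = "\u2066"  # U+2066 LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # U+2067 RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # U+2068 FIRST STRONG ISOLATE
PDI = "\u2069"  # U+2069 POP DIRECTIONAL ISOLATE


def isolate(text, direction=None):
    """Tightly wrap text in isolating controls that match its string direction."""
    # Unknown direction: FSI asks the renderer to apply first-strong detection.
    opener = {"ltr": LRI, "rtl": RLI}.get(direction, FSI)
    return opener + text + PDI


title = "HTML و CSS: تصميم و إنشاء مواقع الويب"
author = "Jon Duckett"

# Adjacent strings are wrapped individually to avoid spillover between them.
line = isolate(title, "rtl") + " - " + isolate(author, "ltr")
print(line)
```

<p>Note that the controls are added by the consumer at display time, using direction metadata it already holds; the stored string values themselves are left unchanged.</p>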
</section>
</section>
<section>
<h2 id="bidi-approaches">Approaches Considered for Identifying the [=String Direction=]</h2>
<p>The fundamental problem for bidirectional text values is how a <a>consumer</a> of a string will know what [=string direction=] to use for that string when it is eventually displayed to a user. Note that some of these approaches for identifying or estimating the direction have utility in specific applications and are in use in different specifications such as [[HTML]]. The issue here is which are appropriate to adopt generally and specify for use as a best practice in document formats.</p>
<section id="firststrong">
<h3>First-strong property detection</h3>
<p><strong>This approach is NOT recommended when used alone, but IS recommended as a fallback in combination with other approaches.</strong></p>
<section>
<h4>How it works</h4>
<p>A producer doesn't need to do anything.</p>
<p>The string is stored as it is.</p>
<p>Consumers must look for the first character in the string with a strong Unicode directional property, and set the [=string direction=] to match it. They then take appropriate action to ensure that the string will be displayed as needed. This is not quite so simple as it may appear, for the following reasons:</p>
<ol>
<li>Characters at the start of a string without a strong direction (eg. punctuation, numbers, etc) and isolated sequences (ie. sequences of characters surrounded by RLI/LRI/FSI...PDI formatting characters) within a string must be skipped in order to find the first strong character.</li>
<li>The detection algorithm needs to be able to handle markup at the start of the string. It needs to be able to tell whether the markup is just string text, or whether the markup needs to be parsed in the target location – in which case it must understand the markup, and understand any direction-related information that is carried in the markup.</li>
</ol>
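<p>The first of the two points above can be sketched in Python using the standard <code>unicodedata</code> module. This is a simplified illustration, not a full implementation of the Unicode Bidirectional Algorithm: the function name <code>first_strong</code> is an assumption for this example, and the sketch makes no attempt to handle markup (the second point).</p>

```python
import unicodedata

ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def first_strong(text):
    """Return "ltr", "rtl", or None if no strong character is found."""
    depth = 0  # nesting level of the isolated sequences being skipped
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ISOLATE_INITIATORS:
            depth += 1
        elif bidi_class == "PDI":
            depth = max(depth - 1, 0)
        elif depth == 0:
            # Only characters outside isolated sequences are considered.
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    return None


print(first_strong("123 مكتبة"))        # digits and spaces are skipped
print(first_strong("+1 555 0100"))       # no strong character at all
print(first_strong("HTML و CSS: تصميم"))  # starts with Latin letters
```

<p>The last call reports a left-to-right direction for a string that is intended to be right-to-left, which is the kind of error that metadata is needed to correct.</p>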
<p>First-strong detection is only needed where the required [=string direction=] is not already known. If direction is indicated for a string by metadata, either string-specific or via a resource-wide declaration, then first-strong heuristics should not be invoked. For example, first-strong heuristics would produce the wrong result for a string such as "<span lang="ar" dir="rtl">HTML و CSS: تصميم و إنشاء مواقع الويب</span>". This can be corrected using metadata, the use of which signifies informed intention, and you would not need or want to apply heuristics that would then make the result incorrect.</p>
<p>However, if there is no mechanism for the application of metadata, or if there is such a mechanism but the content developer omitted to use it, then first-strong heuristics can be helpful to establish <a>base direction</a> in many, though not all, cases. The application of strongly-directional formatting characters can help produce correct results for plain text strings such as the example just quoted, but it is not always possible to apply those (see [[[#rlm]]]).</p>
</section>
<section>
<h4>Advantages</h4>
<p>Where it is reliable, information about direction can be obtained without any changes to the string, and without the agreements and structures that would be needed to support out-of-band metadata.</p>
</section>
<section>
<h4>Issues</h4>
<p>The main problem with this approach is that it produces the wrong result for </p>
<ol>
<li>strings that begin with a strong character with a different directionality than that needed for the string overall (eg. an Arabic tweet that starts with a hashtag)</li>
<li>strings that don't have a strong directional character (such as a telephone number), which are likely to be displayed incorrectly in a RTL context.</li>
<li>strings that begin with markup, such as <code class="kw" translate="no">span</code>, since the first strong character is always going to be LTR.</li>
</ol>
<p>In cases where the entire string starts and ends with RLI/LRI/FSI...PDI formatting characters, it is not possible to detect the first strong character by following the Unicode Bidirectional Algorithm. This is because the algorithm requires that bidi-isolated text be excluded from the detection.</p>
<p>If no strong directional character is found in the string, the direction should probably be assumed to be LTR, and the consumer should act on that basis. This has not been tested fully, however.</p>
<p>If a string contains markup that will be parsed by the consumer as markup, there are additional problems. Any such markup at the start of the string must also be skipped when searching for the first strong directional character. </p>
<p>If <em>parseable</em> markup in the string contains information about the intended direction of the string (for example, a <span class="kw" translate="no"><code class="kw" translate="no">dir</code></span> attribute with the value <span class="kw" translate="no"><code class="kw" translate="no">rtl</code></span> in HTML), that information should be used rather than relying on first-strong heuristics. This is problematic in a couple of ways: (a) it assumes that the consumer of the string understands the semantics of the markup, which may be ok if there is an agreement between all parties to use, say, HTML markup only, but would be problematic, for example, when dealing with random XML vocabularies, and (b) the consumer must be able to recognise and handle a situation where only the initial part of the string has markup, ie. the markup applies to an inline span of text rather than the string as a whole.</p>
<p class=issue>It's not clear where the example with the broken link in the following paragraph is or used to be.</p>
<p>If, however, there is angle bracket content that is intended to be an <em>example</em> of markup, rather than actual markup, the markup must not be skipped – trying to display markup source code in a RTL context yields very confusing results! It isn't clear, however, how a consumer of the string would always know the difference between examples and parseable strings.</p>
</section>
<section>
<h4>Additional notes</h4>
<p>Although first-strong detection is outlined in the Unicode Bidirectional Algorithm (UBA) [[UAX9]], it is not the only possible higher-level protocol mentioned for estimating string direction. For example, X (formerly known as Twitter) and Facebook currently use different default heuristics for guessing the base direction of text; neither uses simple first-strong detection alone, and one uses a completely different method.</p>
</section>
</section>
<section id="metadata">
<h3> Metadata</h3>
<p><strong>This approach is recommended.</strong></p>
<p>By 'metadata' we mean field-based information associated with a specific string or a set of strings in a data format, or information built into a string datatype (see also [[[#dir-approach-new-datatype]]]).</p>
<p>An example would be:</p>
<pre id="example1Data2">
{
"title": "<span dir=rtl>HTML و CSS: تصميم و إنشاء مواقع الويب</span>",
"direction": "rtl",
"language": "ar",
},
</pre>
<p>Metadata indicating the default direction for all the strings in a resource could also be set using an appropriate field.</p>
<section>
<h4>How it works</h4>
<p>A producer ascertains the [=string direction=] of the string and adds that to a metadata field that accompanies the string when it is stored or transmitted.</p>
<p>There are several approaches to using metadata:</p>
<ol>
<li>Label every string with a <a>string direction</a>.</li>
<li>Provide a document-level default for <a>block direction</a> and only include metadata for strings whose value is different. The value <code>auto</code> is used when the direction of a string is not known.</li>
<li>Rely on the consumer to do <a href="#firststrong">first-strong detection</a>, and label only those strings which would produce the wrong result (that is, a right-to-left string that starts with left-to-right strong characters).</li>
</ol>
<p>If storing or transmitting a set of strings at a time, it helps to have a field for the resource as a whole that sets a global, default <a>string direction</a> which can be inherited by all strings in the resource. Note that in addition to a global field, you still need the possibility of attaching string-specific metadata fields in cases where a string's <a>string direction</a> is not the same as the default value. The [=string direction=] set on an individual string must always override the default.</p>
<p>Consumers would need to understand how to read the metadata sent with a string, and would need to apply first-strong heuristics in the absence of metadata.</p>
<p>The use of the <a href="#use-the-localizable-data-structure">Localizable</a> dictionary structure is RECOMMENDED for individual values in JSON-based document formats, as it combines both language and direction metadata and, if consistently adopted, makes interchange between different formats easier.</p>
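<p>The precedence described above (string-specific metadata first, then the resource default, then first-strong heuristics) could be sketched as follows in Python. The function names, the treatment of <code>auto</code>, and the final fallback to <code>ltr</code> are assumptions made for this example; the <code>value</code> and <code>dir</code> keys follow the Localizable example shown earlier in this document.</p>

```python
import unicodedata


def first_strong(text):
    """Simplified fallback: direction of the first strong character, if any."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def resolve_direction(item, default_dir=None):
    """item is a plain string or a dict with "value" and an optional "dir"."""
    if isinstance(item, dict):
        text, explicit = item["value"], item.get("dir")
    else:
        text, explicit = item, None
    # String-specific metadata always overrides the resource-level default.
    direction = explicit if explicit is not None else default_dir
    if direction in ("ltr", "rtl"):
        return direction
    # "auto" or no metadata at all: fall back to heuristics (assumed LTR last).
    return first_strong(text) or "ltr"


print(resolve_direction({"value": "Jon Duckett", "dir": "ltr"}, "rtl"))
print(resolve_direction("HTML و CSS: تصميم", "rtl"))
print(resolve_direction("HTML و CSS: تصميم"))
```

<p>The second call obtains the intended right-to-left direction from the resource default, while the third, with no metadata available, falls back to heuristics and gets the wrong answer for this string.</p>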
<p class=note>As noted <a href="#localizable-dictionary">here</a>, [[JSON-LD]] includes some data structures that are helpful in assigning language (but not direction) metadata to collections of strings (including entire resources). These gaps in support for pre-built metadata at the resource or item level are one of the key reasons for this document's development.</p>
</section>
<section>
<h4>Advantages</h4>
<p>Passing metadata as a separate data value from the string provides a simple, effective and efficient method of communicating the intended [=string direction=] without affecting the actual content of the string.</p>
<p>If every string is labelled for direction, or the direction for all strings can be ascertained by applying the global setting and any string-specific deviations, it avoids the need to inspect and run heuristics to determine each separate string's [=string direction=].</p>
</section>
<section>
<h4>Issues</h4>
<p>Out-of-band information needs to be associated with and kept with strings. This may be problematic for some sets of string data which are not part of a defined framework.</p>
<p>In particular, JSON-LD doesn't allow direction to be associated with individual strings in the same way as it works for language.</p>
</section>
</section>
<section id="rlm">
<h3>Augmenting <q>first-strong</q> by inserting RLM/LRM markers</h3>
<p><strong>This approach is NOT workable for all situations.</strong></p>
<section>
<h4>How it works</h4>
<p>A producer ascertains the [=string direction=] of the string and adds a marker character (either <span class="unicode">U+200F RIGHT-TO-LEFT MARK</span> (RLM) or <span class="unicode">U+200E LEFT-TO-RIGHT MARK</span> (LRM)) to the beginning of the string. The marker is not functional: it does not automatically apply a base direction to the string that can be used by the consumer. It is simply a marker.</p>
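<p>As a rough Python sketch of the producer step, and of how a consumer's first-strong detection would then see the marker: the helper names <code>add_marker</code> and <code>first_strong</code> are assumptions for this example, and the detection shown is deliberately simplified.</p>

```python
import unicodedata

RLM = "\u200f"  # U+200F RIGHT-TO-LEFT MARK
LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK


def add_marker(text, direction):
    """Producer side: prepend a marker matching the known string direction."""
    return (RLM if direction == "rtl" else LRM) + text


def first_strong(text):
    """Consumer side: simplified first-strong detection."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


title = "HTML و CSS: تصميم و إنشاء مواقع الويب"
marked = add_marker(title, "rtl")

print(first_strong(title))   # heuristics alone see the leading Latin letters
print(first_strong(marked))  # the RLM is now the first strong character
print(marked == title)       # the stored string has been changed
```

<p>The marker only changes what the heuristic detects. The consumer still has to run the detection and apply the resulting direction itself, and the string content is no longer identical to the source data.</p>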