CLDR-15391 Update Canadian census and Fix Cans Script Match #4208

conradarcturus · 2024-11-20T04:36:21Z

The original purpose of this change is to update the default language for Canadian Aboriginal syllabics [Cans] from Inuktitut [iu] to Cree [cr], since Cree has a larger population. Understandably, both of these languages are macrolanguages with many variations -- so it's funny to include both the Cree macrolanguage and its constituents. But I think it's better to cover both groupings because, depending on the consumer, they may want [cr] data or constituent data.

While I was doing this I updated all of the Canadian locale data to the 2021 Census. I also added a few missing aboriginal Canadian languages: Woods Cree [cwd] and Western Ojibway [ojw].

See the 2021 Census table here: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021601

CLDR-15391

This PR completes the ticket.

ALLOW_MANY_COMMITS=true

macchiati · 2024-11-20T22:14:32Z

Understandably, both of these languages are macrolanguages with many variations -- so it's funny to include both the Cree macrolanguage and its constituents.

Important: CLDR treats macrolanguage codes as regular languages, identifying each one with its most common encompassed language. For example, zh is interpreted as identical to 'cmn' (and is the preferred code: cmn aliases to zh). So it should be expected that cr (=cwd) appears along with other languages that ISO considers encompassed by cr (e.g. crj).

We test in other cases, and should test here, that we don't have any aliased language codes.

macchiati · 2024-11-20T22:16:19Z

tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv

+Canada	CA	"36,328,480"	99%	"1,774,000,000,000"		Mi'kmaq	mic	0.0254%			https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021601 Canada 2021 Census language "Knowledge of Language"; official status from Wikipedia Languages_of_Canada
+Canada	CA	"36,328,480"	99%	"1,774,000,000,000"	recognized	Atikamekw	atj	0.0187%			https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021601 Canada 2021 Census language "Knowledge of Language"; official status from Wikipedia Languages_of_Canada
+Canada	CA	"36,328,480"	99%	"1,774,000,000,000"		Siksika	bla	0.0183%			https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021601 Canada 2021 Census language "Knowledge of Language"; official status from Wikipedia Languages_of_Canada
+Canada	CA	"36,328,480"	99%	"1,774,000,000,000"		Woods Cree	cwd	0.0140%			https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021601 Canada 2021 Census language "Knowledge of Language"; official status from Wikipedia Languages_of_Canada


The unit tests should prevent aliased language codes (like cwd). We need to fix that, but in the meantime you should use cr instead of cwd.
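A check along these lines could catch such rows before review. This is only a minimal sketch, not CLDR's actual unit test: the `ALIASES` table is a hand-written assumption holding just the two aliases mentioned in this thread (cmn to zh, cwd to cr), `find_aliased_codes` is a hypothetical helper, and the position of the language code (the eighth tab-separated field) is read off the rows quoted above.

```python
import csv
import io

# Assumption: a tiny hand-written alias table for illustration only.
# It holds just the two aliases discussed in this thread.
ALIASES = {"cmn": "zh", "cwd": "cr"}

# Zero-based index of the language code in the rows quoted above.
CODE_COLUMN = 7


def find_aliased_codes(tsv_text, aliases=ALIASES):
    """Return (line_number, code, replacement) for each row using an aliased code."""
    problems = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        if len(row) <= CODE_COLUMN:
            continue  # skip blank or short lines
        code = row[CODE_COLUMN]
        if code in aliases:
            problems.append((line_number, code, aliases[code]))
    return problems


# Shortened copies of two rows from the diff (trailing columns omitted).
SAMPLE = (
    'Canada\tCA\t"36,328,480"\t99%\t"1,774,000,000,000"\t\tSiksika\tbla\t0.0183%\n'
    'Canada\tCA\t"36,328,480"\t99%\t"1,774,000,000,000"\t\tWoods Cree\tcwd\t0.0140%\n'
)

for line_number, code, replacement in find_aliased_codes(SAMPLE):
    print(f"line {line_number}: use {replacement} instead of {code}")
```

With the sample rows, only the Woods Cree row is flagged. A real test would presumably load the alias list from CLDR's own data rather than from a hardcoded dictionary.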

Indeed, this change needs work.

I'd like to understand the macrolanguage better. As I've learned, Woods Cree cwd is aliased to Cree cr. However, Woods Cree (5,110 speakers according to the CA 2021 Census) isn't even the biggest Cree dialect by reported count -- Plains Cree crk is (12,005 speakers). That said, the vast majority of Cree speakers aren't assigned to a dialect: there are 87,875 Cree speakers in total, and 61,000 of them are not matched to a constituent dialect. So perhaps Woods Cree is actually the biggest.
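To make the arithmetic in the comment above explicit, here is a small sketch that uses only the figures quoted there. The 61,000 figure is the rounded value given above, so the derived numbers are approximate, and the variable names are purely illustrative.

```python
# Figures quoted in the comment above (Canada 2021 Census, as cited there).
total_cree = 87_875   # all Cree speakers
unmatched = 61_000    # rounded: not matched to a constituent dialect
plains_cree = 12_005  # crk
woods_cree = 5_110    # cwd

# Speakers who were assigned to some named dialect.
matched = total_cree - unmatched
# Named dialects other than Plains Cree and Woods Cree.
other_dialects = matched - plains_cree - woods_cree

# Among matched speakers, Plains Cree is larger than Woods Cree...
assert plains_cree > woods_cree

# ...but the unmatched pool is larger than all matched speakers combined,
# so these counts alone cannot settle which dialect is really the largest.
woods_cree_upper_bound = woods_cree + unmatched

print(f"matched to a dialect: {matched}")
print(f"unmatched share:      {unmatched / total_cree:.1%}")
print(f"Woods Cree range:     {woods_cree} to {woods_cree_upper_bound}")
```

The upper bound assumes every unmatched speaker is a Woods Cree speaker, which is an extreme case used only to show how wide the uncertainty is.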

So back to aliasing -- personally I'm not a big fan of it because it groups oranges with apples: yes, they are all fruits, but they are different groups. Do you have any design docs about aliasing so I can understand the background?

For another example of aliasing losing data precision, when I worked on CLDR-10478, the Macao census had different estimates for Traditional Chinese (written, any spoken dialect zh_Hant, at 98%), Simplified Chinese (written, any spoken dialect zh_Hans, at 5%) as well as Spoken Cantonese yue @ 86.2% and Spoken Mandarin cmn @ ~40%. I ended up just ignoring the Mandarin category because it was being aliased back to zh.

Nonetheless, I have faith this has been thought through a lot already, so maybe I just need to catch up on the design choices.

common/supplemental/supplementalData.xml

conradarcturus · 2024-11-20T22:23:57Z

@srl295 ah this was a larger can of worms than I anticipated.

So it looks like cr currently defaults to Woods Cree/cwd. But, really, cr is the macrolanguage for all Cree variations: https://iso639-3.sil.org/code/cwd

The originally reported bug is that "und_Cans" matches "iu_Cans" rather than "cr_Cans", even though the Cree community is bigger than the Inuktitut community. However, I need to figure out the macrolanguage matching to see how to move forward.

Happy to punt this and get back to this once we are all back in late November.

macchiati · 2024-11-20T23:55:56Z

The handling of the 'macrolanguage' concept was introduced in BCP47 for backwards compatibility, but causes its own — more severe — compatibility problems. If 'zh' truly means 'any Chinese language', then it would be perfectly fine for an implementation to request 'zh' and for us to serve up 'yue' content in LDML. So we have a longstanding policy for macro/encompassed languages that I outlined. See also https://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code.

There are times where we will adjust the aliasing, where there is a strong shift one way or another. But for small changes we favor stability. The key is, just treat 'cr' exactly as if it were 'cwd', and don't use 'cwd'.

conradarcturus · 2024-11-21T23:05:38Z

Ooof, that makes sense, and I don't want this ticket to generate any more work than it already has, so I am inclined to punt on the macrolanguage conversation. I'll return to the narrow origin of the ticket: making und_Cans match cr_Cans instead of iu_Cans. I'll still keep the official language and population updates, though.

jira-pull-request-webhook · 2024-11-21T23:14:28Z

Notice: the branch changed across the force-push!

common/supplemental/likelySubtags.xml is different
common/supplemental/supplementalData.xml is different
common/supplemental/supplementalMetadata.xml is no longer changed in the branch
common/testData/localeIdentifiers/likelySubtags.txt is different
common/testData/localeIdentifiers/localeCanonicalization.txt is no longer changed in the branch
tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java is now changed in the branch
tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different
tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/language_script.tsv is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

conradarcturus · 2024-11-21T23:23:27Z

common/supplemental/supplementalData.xml

 		<language type="crg" scripts="Latn"/>
 		<language type="crh" scripts="Cyrl"/>
 		<language type="crj" scripts="Cans"/>
 		<language type="crj" scripts="Latn" alt="secondary"/>
 		<language type="crk" scripts="Cans"/>
+		<language type="crk" territories="CA" alt="secondary"/>


This change happened because Woods Cree is not an official language of a region of Canada.

Rather, Plains Cree is. Cree is only official in the Northwest Territories (NT). Unfortunately, the NT law does not specify which Cree variation. The only Cree language present in NT is Plains Cree [crk], so I infer that is the correct match.

Sounds good

The main purpose of this change is to update the default language for Canadian Aboriginal syllabics [Cans] from Inuktitut [iu] to Cree [cr], since Cree has a larger population. Understandably, both of these languages are macrolanguages with many variations -- so it's funny to include both the Cree macrolanguage and its constituents. But I think it's better to cover both groupings because, depending on the consumer, they may want [cr] data or constituent data. While I was doing this I updated all of the Canadian locale data to the 2021 Census. I also added a few missing aboriginal Canadian languages: Woods Cree [cwd] and Western Ojibway [ojw]. See the 2021 Census table here: https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=9810021601

I'll note that Cree is only official in the Northwest Territories (NT). However, [the NT law](https://web.archive.org/web/20090324202430/http://www.justice.gov.nt.ca/PDF/ACTS/Official_Languages.pdf) does not specify which Cree variation -- the only Cree language present in NT is Plains Cree [crk], so I infer that is the correct match.

Removed hardcoded _CA entries because they aren't necessary since the likely subtags can be derived from the population data. Also edited the cr_Cans_CA comment because it was overflowing the line and to add more context.
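As a rough illustration of what deriving a script's default language from population data could look like, instead of hardcoding it: this sketch is not the logic in GenerateLikelySubtags.java, `likely_language_for_script` is a hypothetical helper, and the Inuktitut count is a placeholder chosen only to be smaller than the Cree total quoted earlier in this thread (87,875).

```python
def likely_language_for_script(records):
    """Map each script to the language with the largest speaker count.

    `records` is an iterable of (language, script, speakers) tuples.
    """
    best = {}
    for language, script, speakers in records:
        current = best.get(script)
        if current is None or speakers > current[1]:
            best[script] = (language, speakers)
    return {script: language for script, (language, _) in best.items()}


records = [
    ("cr", "Cans", 87_875),  # Cree total quoted earlier in this thread
    ("iu", "Cans", 40_000),  # placeholder, NOT a census figure
]

defaults = likely_language_for_script(records)
print(f"und_Cans -> {defaults['Cans']}_Cans")
```

Under these assumed inputs the script default falls out of the population ordering, which is the reason a hardcoded entry would be redundant.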

jira-pull-request-webhook · 2024-11-24T17:25:31Z

Notice: the branch changed across the force-push!

tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

conradarcturus requested review from macchiati, pedberg-icu and btangmu November 20, 2024 04:36

github-actions bot assigned conradarcturus Nov 20, 2024

conradarcturus requested a review from srl295 November 20, 2024 04:36

conradarcturus marked this pull request as draft November 20, 2024 21:03

macchiati requested changes Nov 20, 2024

conradarcturus force-pushed the CLDR-15391-Fix-Cans-Default branch from 71db1c0 to 1f00d07 Compare November 21, 2024 23:14

conradarcturus commented Nov 21, 2024

conradarcturus requested a review from macchiati November 21, 2024 23:23

conradarcturus added 4 commits November 24, 2024 09:16

CLDR-15391 Undo Cree canonicalization of cwd -> cr

743d13a

CLDR-15931 GenerateLikelySubtags cleanup

d00b78d

conradarcturus force-pushed the CLDR-15391-Fix-Cans-Default branch from 1f00d07 to d00b78d Compare November 24, 2024 17:25

conradarcturus commented Nov 20, 2024

macchiati commented Nov 20, 2024

macchiati Nov 20, 2024

conradarcturus Nov 20, 2024

conradarcturus commented Nov 20, 2024

macchiati commented Nov 20, 2024

conradarcturus commented Nov 21, 2024

jira-pull-request-webhook bot commented Nov 21, 2024

conradarcturus Nov 21, 2024

macchiati Nov 24, 2024

jira-pull-request-webhook bot commented Nov 24, 2024
