Outdated encoding for Malayalam sample #3

Open
asmusf opened this issue May 20, 2018 · 3 comments
asmusf commented May 20, 2018

https://github.com/unicode-org/unilex/blob/master/data/frequency/ml.txt

This file uses the Unicode 5.0 and earlier encoding for chillu characters rather than the atomic chillu code points introduced in Unicode 5.1 (see Chapter 12 of Unicode 10.0).

behnam (Member) commented May 21, 2018

IIUC, a normalization table can be used, as shown in Table 12-37, "Atomic Encoding of Malayalam Chillus": https://www.unicode.org/versions/Unicode10.0.0/ch12.pdf#page=65.
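
For illustration, a minimal Python sketch of such a table-driven normalization. It assumes the legacy form of each chillu is base consonant + virama (U+0D4D) + ZWJ (U+200D) and covers only the five common chillus; the mapping should be checked against Table 12-37 before use, and the name `normalize_chillus` is made up for this example.

```python
# Sketch only: rewrite legacy chillu sequences as atomic chillu code points.
# Assumption: legacy form is <base consonant, VIRAMA, ZWJ>; verify the
# mapping against Table 12-37 of the Unicode Standard before relying on it.

VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# base consonant -> atomic chillu
LEGACY_BASE_TO_ATOMIC = {
    "\u0D23": "\u0D7A",  # NNA -> CHILLU NN
    "\u0D28": "\u0D7B",  # NA  -> CHILLU N
    "\u0D30": "\u0D7C",  # RA  -> CHILLU RR
    "\u0D32": "\u0D7D",  # LA  -> CHILLU L
    "\u0D33": "\u0D7E",  # LLA -> CHILLU LL
}


def normalize_chillus(text: str) -> str:
    """Replace each legacy <consonant, virama, ZWJ> sequence with its atomic chillu."""
    for base, atomic in LEGACY_BASE_TO_ATOMIC.items():
        text = text.replace(base + VIRAMA + ZWJ, atomic)
    return text
```

Consonant + virama without a following ZWJ is left untouched, since that sequence is an ordinary conjunct-forming virama and not a chillu.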

asmusf (Author) commented May 21, 2018

I think the file should be replaced with one that uses the atomic encoding. Another issue is the use of U+0D4C, which I understand is considered outdated. (Other corpora I've encountered recently do not have the latter issue.)
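
To gauge how widespread both issues are before replacing the file, something like the following Python sketch could count them. The function name `audit_malayalam` and the counter keys are hypothetical, the layout of ml.txt is not assumed (any iterable of text lines works), and the legacy chillu pattern is the same consonant + virama + ZWJ assumption as above.

```python
import re
from collections import Counter

# Assumption: a legacy chillu is one of these five consonants followed by
# VIRAMA (U+0D4D) and ZWJ (U+200D).
LEGACY_CHILLU = re.compile("[\u0D23\u0D28\u0D30\u0D32\u0D33]\u0D4D\u200D")
VOWEL_SIGN_AU = "\u0D4C"  # MALAYALAM VOWEL SIGN AU


def audit_malayalam(lines):
    """Count legacy chillu sequences and U+0D4C occurrences in an iterable of lines."""
    counts = Counter(legacy_chillu=0, vowel_sign_au=0)
    for line in lines:
        counts["legacy_chillu"] += len(LEGACY_CHILLU.findall(line))
        counts["vowel_sign_au"] += line.count(VOWEL_SIGN_AU)
    return counts
```

For example, `audit_malayalam(open("ml.txt", encoding="utf-8"))` would report the totals for a local copy of the file.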

brawer (Collaborator) commented Aug 31, 2018

If you don’t mind, could you send a pull request to fix the problem?
