Unilex Properties

All files in data are tab-separated plaintext files in UTF-8 encoding. The file names are Unicode locale identifiers in IETF BCP47 syntax; for example, vec-u-sd-itpd.txt contains data for the Venetian language as used in the Padua subdivision of Italy. If there are multiple data files for the same locale in the same folder, they are distinguished by a private-use subtag such as en-CA-x-foobar.
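
As a rough illustration of the naming convention, the sketch below splits a data file name into its locale identifier and an optional private-use part. The helper name `split_data_filename` is hypothetical and not part of Unilex; it only looks for an `x` subtag and does not validate the locale identifier.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple


def split_data_filename(filename: str) -> Tuple[str, Optional[str]]:
    """Split 'en-CA-x-foobar.txt' into ('en-CA', 'foobar').

    Hypothetical helper: it does not check that the result is a
    well-formed Unicode locale identifier.
    """
    stem = PurePosixPath(filename).stem          # drop the ".txt" suffix
    subtags = stem.split("-")
    lowered = [s.lower() for s in subtags]
    # A private-use sequence starts at an "x" subtag after the language.
    if "x" in lowered[1:]:
        i = lowered.index("x", 1)
        return "-".join(subtags[:i]), "-".join(subtags[i + 1:])
    return stem, None
```

For the two file names mentioned above, this yields `("vec-u-sd-itpd", None)` and `("en-CA", "foobar")`.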

If you’d like to contribute data, please tell us by filing a GitHub issue. You’re also welcome to simply send pull requests via GitHub.

For more background, see the main Unilex description.

Word frequency

In frequency, we collect data on how often each word form appears in a language corpus.

File format: The columns are identified by their headers in the TSV file. Additional columns may be added over time.

  • Form is the surface form being counted.

  • Frequency tells how often this form appears per billion tokens. Our corpora are actually much smaller, but their size varies a lot depending on the language; so we scale the numbers to a hypothetical total of one billion tokens per language.
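
The scaling and the header-based column lookup described above could be sketched as follows. This is a minimal illustration, not the code used to build the data: the function names are made up, and parsing `Frequency` as a number is an assumption about the file contents.

```python
import csv
from typing import Dict, TextIO

TOKENS_PER_BILLION = 1_000_000_000


def scale_to_billion(counts: Dict[str, int]) -> Dict[str, int]:
    """Rescale raw corpus counts to a hypothetical one-billion-token corpus."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {form: round(n * TOKENS_PER_BILLION / total)
            for form, n in counts.items()}


def read_frequencies(stream: TextIO) -> Dict[str, float]:
    """Read a frequency TSV, locating columns by header name.

    Unknown extra columns are ignored, since more may be added over time.
    Assumption: the Frequency column holds a plain number.
    """
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row["Form"]: float(row["Frequency"]) for row in reader}
```

Looking columns up by header rather than by position keeps a reader working if further columns are appended later.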

Source: Currently, we use Google’s Corpus Crawler project to build language corpora. We’ve computed our word frequencies on these crawled corpora, but we’re open to accepting other contributions.

Noise: We are just getting started, so there will be some noise in the word frequency data. For example, there may be odd words resulting from quoting a foreign language, or words representing model numbers (“A3”). Please keep this in mind when using the data. That said, it should be usable enough to improve quality for languages that otherwise have little available data.

Segmentation: To segment the crawled data, we typically used the word-break algorithm of the ICU library. For most languages, this corresponds to what people think of as “space-delimited” words. For languages that don’t typically use spaces, an extended algorithm was used.

N-Grams: Currently, we have no n-gram data available for our datasets. However, you can run Corpus Crawler to crawl plaintext corpora yourself, and extract n-grams from the crawled content.
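
If you do crawl a corpus yourself, counting n-grams over the segmented tokens is straightforward. The sketch below assumes the text has already been split into a list of tokens; `count_ngrams` is an illustrative name, and how you segment (for example with ICU word breaking) is left out.

```python
from collections import Counter
from typing import List


def count_ngrams(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))
```

For instance, `count_ngrams("a b a b c".split(), 2)` counts the bigram `("a", "b")` twice. Whitespace splitting is used here only as a stand-in for real segmentation.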

Adding more data: Adding more data for a language requires modifying the Corpus Crawler code for that language. The changes fetch additional URLs and extract text from the crawled content. For more information, see Corpus Crawler.

Pronunciation

In pronunciation, we collect phonemic transcriptions of word forms in the International Phonetic Alphabet.

File format: The columns are identified by their headers in the TSV file. Additional columns may be added over time.

  • Form is the surface form to be pronounced. There may be multiple rows for the same form in case it varies by part of speech or grammatical features.

  • Pronunciation is a phonemic transcription in IPA.

  • PartOfSpeech and Features are optional fields, used to distinguish cases where the same form has multiple pronunciations. Currently, this is only used for Bangla pronunciations, but we anticipate that other languages will need the same. As identifiers, we use the part-of-speech tags and lexical features from the Universal Dependencies Project. When there’s no information available, we use *.
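
One way a consumer might read these columns and treat * as “no information” is sketched below. The names `Entry`, `read_pronunciations` and `pronounce` are hypothetical, the matching rule is only one possible interpretation, and Features is stored but not used for matching.

```python
import csv
from typing import Dict, List, NamedTuple, TextIO


class Entry(NamedTuple):
    pronunciation: str
    part_of_speech: str   # Universal Dependencies tag, or "*"
    features: str         # Universal Dependencies features, or "*"


def read_pronunciations(stream: TextIO) -> Dict[str, List[Entry]]:
    """Group rows by Form; a form may have several rows."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    table: Dict[str, List[Entry]] = {}
    for row in reader:
        entry = Entry(row["Pronunciation"],
                      row.get("PartOfSpeech") or "*",   # optional column
                      row.get("Features") or "*")       # optional column
        table.setdefault(row["Form"], []).append(entry)
    return table


def pronounce(table: Dict[str, List[Entry]], form: str,
              part_of_speech: str = "*") -> List[str]:
    """Return pronunciations whose part of speech is compatible.

    Assumed rule: "*" on either side matches anything.
    """
    return [e.pronunciation for e in table.get(form, [])
            if "*" in (e.part_of_speech, part_of_speech)
            or e.part_of_speech == part_of_speech]
```

Calling `pronounce(table, form)` without a part of speech returns every listed pronunciation, while passing a tag such as `"VERB"` narrows the result to rows tagged `VERB` or `*`.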

Adding more data: We’re soliciting contributions. Please file an issue to improve the current data, or to add additional data sets. You’re also welcome to simply send pull requests via GitHub.

Hyphenation

In hyphenation, we collect hyphenated words.

File format: The columns are identified by their headers in the TSV file. Additional columns may be added over time.

  • Form is the surface form to be hyphenated.

  • Hyphenation is a marked-up version of the form where hyphenation points are indicated by circled digits with priorities. For example, an entry uit➊spra➋ken means that uitspraken can be hyphenated in two places, but that it’s better to write uit-spraken than uitspra-ken. There may be ties such as aan➊de➊len.
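
The markup can be decoded with a few lines of code. The sketch below is an illustration under two assumptions: that the marks are the consecutive characters ➊ through ➒, and that a lower digit means a preferred break, as in the uit➊spra➋ken example. The function names are made up.

```python
from typing import List, Tuple

FIRST_MARK = "➊"     # priority 1; assumed to be followed by ➋ … ➒
MAX_PRIORITY = 9


def parse_hyphenation(marked: str) -> Tuple[str, List[Tuple[int, int]]]:
    """Return the plain form and a list of (offset, priority) break points.

    Offsets count code points in the plain form.
    """
    letters: List[str] = []
    points: List[Tuple[int, int]] = []
    for ch in marked:
        priority = ord(ch) - ord(FIRST_MARK) + 1
        if 1 <= priority <= MAX_PRIORITY:
            points.append((len(letters), priority))
        else:
            letters.append(ch)
    return "".join(letters), points


def best_breaks(marked: str) -> List[str]:
    """Render the form once per break point of the best priority."""
    form, points = parse_hyphenation(marked)
    if not points:
        return []
    best = min(priority for _, priority in points)
    return [form[:i] + "-" + form[i:] for i, p in points if p == best]
```

With the entries above, `best_breaks("uit➊spra➋ken")` gives `["uit-spraken"]`, and the tie in `aan➊de➊len` gives both `"aan-delen"` and `"aande-len"`.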

At some point, we’ll need to model hyphenations that modify the letter sequence, but our current data doesn’t yet need the additional structure. For example, in German traditional orthography (de-1901), the word Beckenbruch is hyphenated as Bek-ken-bruch. One way we’ve considered expressing this is as Be⟨ck|k➋k⟩en➊bruch.

Adding more data: We’re soliciting contributions. Again, please file an issue to improve the current data, or to add additional data sets. You’re also welcome to simply send pull requests via GitHub.

Experimental

Morphology and Grammar

In the long term, it would be good to model morphological and grammatical features. Currently, however, we’re not sure how to do this. The current data in stems is highly experimental and particularly likely to change. Before settling on a format, we should first model a set of languages with more challenging morphologies.