All files in data are tab-separated plaintext files in UTF-8 encoding. The file names are Unicode locale identifiers in IETF BCP47 syntax; for example, vec-u-sd-itpd.txt contains data for the Venetian language as used in the Padua subdivision of Italy. If there are multiple data files for the same locale in the same folder, they are distinguished by a private-use subtag such as en-CA-x-foobar.
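As a rough illustration of the naming scheme, the sketch below splits a data file name into a few locale parts. It is not a full BCP47 parser: the function name parse_data_filename and the dictionary keys are assumptions made for this example, and it only handles a language, an optional two-letter region, a u-sd subdivision, and a private-use x suffix.

```python
import os

def parse_data_filename(filename):
    """Split a name like 'vec-u-sd-itpd.txt' into a few locale parts.

    Simplified sketch (hypothetical helper): handles the language, an
    optional 2-letter region, the 'u-sd-<subdivision>' extension and an
    'x-<private>' suffix. Other BCP47 subtags are skipped.
    """
    tag, _ext = os.path.splitext(os.path.basename(filename))
    subtags = tag.lower().split("-")
    result = {"language": subtags[0], "region": None,
              "subdivision": None, "private_use": None}
    i = 1
    # Optional region subtag directly after the language (e.g. 'ca').
    if i < len(subtags) and len(subtags[i]) == 2 and subtags[i].isalpha():
        result["region"] = subtags[i].upper()
        i += 1
    while i < len(subtags):
        key = subtags[i]
        if key == "x":
            # Everything after 'x' is private use.
            result["private_use"] = "-".join(subtags[i + 1:])
            break
        if key == "u" and subtags[i + 1:i + 2] == ["sd"] and i + 2 < len(subtags):
            result["subdivision"] = subtags[i + 2]
            i += 3
            continue
        i += 1
    return result

venetian = parse_data_filename("vec-u-sd-itpd.txt")
canadian = parse_data_filename("en-CA-x-foobar.txt")
```

Here venetian carries the language vec and the subdivision itpd, while canadian carries the region CA and the private-use part foobar.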
If you’d like to contribute data, please tell us by filing a GitHub issue. You’re also welcome to simply send pull requests via GitHub.
For more background, see the main Unilex description.
In frequency, we collect data on how often each word form appears in a language corpus.
File format: The columns are identified by their headers in the TSV file. Additional columns may be added over time.
- Form is the surface form being counted.
- Frequency tells how often this form appears per billion tokens. Our corpora are actually much smaller, but their size varies a lot depending on the language; so we scale the numbers to a hypothetical total of one billion tokens per language.
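The scaling can be sketched as count × 1,000,000,000 / corpus size. The example below is a minimal sketch under assumptions, not the project's actual tooling: the helper names, the toy counts, and the sample file content are invented, and it assumes that the Frequency column holds integers. Columns are looked up by header name, so extra columns do not break the reader.

```python
import csv
import io

TOKENS_PER_BILLION = 1_000_000_000

def scale_to_billion(raw_counts):
    """Scale raw corpus counts to a hypothetical one-billion-token corpus."""
    total = sum(raw_counts.values())
    if total == 0:
        return {}
    return {form: round(count * TOKENS_PER_BILLION / total)
            for form, count in raw_counts.items()}

def read_frequencies(tsv_file):
    """Read Form and Frequency by header name, ignoring any extra columns."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Assumption: frequencies are stored as integers.
    return {row["Form"]: int(row["Frequency"]) for row in reader}

# Invented toy counts, for illustration only: a corpus of 4 tokens.
scaled = scale_to_billion({"the": 3, "cat": 1})

# Invented sample file content with an extra, unknown column.
sample = io.StringIO(
    "Form\tFrequency\tExtra\n"
    "the\t750000000\tx\n"
    "cat\t250000000\ty\n"
)
frequencies = read_frequencies(sample)
```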
Source: Currently, we use Google’s Corpus Crawler project to build language corpora. We’ve computed our word frequencies on these crawled corpora, but we’re open to accepting other contributions.
Noise: We are just getting started, so there will be some noise in the word frequency data. For example, there may be odd words resulting from quoting a foreign language, or words representing model numbers (“A3”). Please use the data with that in mind. That said, it should be usable enough to advance the quality for languages that otherwise have little available data.
Segmentation: To segment the crawled data, we typically used the word-break algorithm of the ICU library. For most languages, this corresponds to what people think of as “space-delimited” words. For languages that don’t typically use spaces, an extended algorithm was used.
N-Grams: Currently, we have no n-gram data available for our datasets. However, you can run Corpus Crawler to crawl plaintext corpora yourself, and extract n-grams from the crawled content.
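If you do crawl a plaintext corpus yourself, n-gram extraction could look roughly like the sketch below. The function name count_ngrams and the sample lines are invented for illustration, and whitespace splitting is only a stand-in for a real word-break algorithm such as the one in ICU.

```python
from collections import Counter

def count_ngrams(lines, n=2):
    """Count word n-grams per line; n-grams never span line boundaries."""
    counts = Counter()
    for line in lines:
        tokens = line.split()  # stand-in for a real word-break algorithm
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Invented sample text, for illustration only.
bigrams = count_ngrams(["the cat sat", "the cat ran"], n=2)
```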
Adding more data: Adding more data for a language requires modifying the code of Corpus Crawler for that language. The changes need to fetch additional URLs and extract text from the crawled content. For more information, see Corpus Crawler.
In pronunciation, we collect phonemic transcriptions of word forms in the International Phonetic Alphabet (IPA).
File format: The columns are identified by their headers in the TSV file. Additional columns may be added over time.
- Form is the surface form to be pronounced. There may be multiple rows for the same form in case it varies by part of speech or grammatical features.
- Pronunciation is a phonemic transcription in IPA.
- PartOfSpeech and Features are optional fields, used to distinguish cases where the same form has multiple pronunciations. Currently, this is only used for Bangla pronunciations but we anticipate that other languages will need the same. As identifiers, we use the part of speech tags and lexical features from the Universal Dependencies Project. When there’s no information available, we use *.
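One possible way to consume these columns is sketched below. The helper names are hypothetical, the sample rows and their transcriptions are invented for illustration (they are not taken from the data files), and treating * as "matches any part of speech" is an assumption of this example.

```python
import csv
import io

def load_pronunciations(tsv_file):
    """Group rows by Form; missing optional columns default to '*'."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    table = {}
    for row in reader:
        entry = (row["Pronunciation"],
                 row.get("PartOfSpeech") or "*",
                 row.get("Features") or "*")
        table.setdefault(row["Form"], []).append(entry)
    return table

def lookup(table, form, pos=None):
    """Return pronunciations of form, optionally filtered by part of speech.

    Assumption: a row whose PartOfSpeech is '*' matches any requested tag.
    """
    results = []
    for pronunciation, row_pos, _features in table.get(form, []):
        if pos is None or row_pos in ("*", pos):
            results.append(pronunciation)
    return results

# Invented sample rows, for illustration only.
sample = io.StringIO(
    "Form\tPronunciation\tPartOfSpeech\tFeatures\n"
    "record\tˈɹɛkɚd\tNOUN\t*\n"
    "record\tɹɪˈkɔɹd\tVERB\t*\n"
    "water\tˈwɔtɚ\t*\t*\n"
)
table = load_pronunciations(sample)
verb_forms = lookup(table, "record", pos="VERB")
```

With this sample, lookup(table, "record") returns both transcriptions, while passing pos="VERB" narrows the result to the second row.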
Adding more data: We’re soliciting contributions. Please file an issue to improve the current data, or to add additional data sets. You’re also welcome to simply send pull requests via GitHub.
In hyphenation, we collect hyphenated words.
File format: The columns are identified by their headers in the TSV file. Additional columns may be added over time.
- Form is the surface form to be hyphenated.
- Hyphenation is a marked-up version of the form where hyphenation points are indicated by circled digits with priorities. For example, an entry uit➊spra➋ken means that uitspraken can be hyphenated in two places, but that it’s better to write uit-spraken than uitspra-ken. There may be ties such as aan➊de➊len.
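The markup can be decoded with a few lines of code. The sketch below is one possible reading of the format, with hypothetical function names: it assumes that the priority marks are the nine consecutive Unicode characters starting at ➊, and that a lower digit marks a preferred break, as in the uitspraken example.

```python
def parse_hyphenation(marked):
    """Return (plain_form, points); points are (offset, priority) pairs.

    An offset is the number of letters that precede the break point.
    """
    letters = []
    points = []
    for ch in marked:
        # Assumption: marks for 1..9 are consecutive code points from '➊'.
        priority = ord(ch) - ord("➊") + 1
        if 1 <= priority <= 9:
            points.append((len(letters), priority))
        else:
            letters.append(ch)
    return "".join(letters), points

def best_break(marked):
    """Hyphenate at the preferred point; ties go to the leftmost one."""
    form, points = parse_hyphenation(marked)
    if not points:
        return form
    offset, _priority = min(points, key=lambda point: point[1])
    return form[:offset] + "-" + form[offset:]

form, points = parse_hyphenation("uit➊spra➋ken")
preferred = best_break("uit➊spra➋ken")
```

Breaking ties in favor of the leftmost point is a choice made for this sketch only; the data format itself does not rank tied points.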
At some point, we’ll need to model hyphenations that modify the letter sequence, but our current data doesn’t yet need the additional structure. For example, in German traditional orthography (de-1901), the word Beckenbruch is hyphenated as Bek-ken-bruch. One way we’ve considered expressing this is as Be⟨ck|k➋k⟩en➊bruch.
Adding more data: We’re soliciting contributions. Again, please file an issue to improve the current data, or to add additional data sets. You’re also welcome to simply send pull requests via GitHub.
In the long term, it would be good to model morphological and grammatical features. Currently, however, we’re not sure how to do this. The current data in stems is highly experimental and particularly likely to change. Before settling on a format, we should model a set of languages with more challenging morphologies.