Train a language identifier model that works well on ingredient lists #349

Open
raphael0202 opened this issue Sep 19, 2024 · 26 comments

@raphael0202
Contributor

Problem

We're currently using fasttext for language identification.
This is especially useful for detecting the language of an ingredient list extracted automatically by an ML model, or added by a contributor.

However, fasttext was trained on data that is quite different from ingredient lists (Wikipedia, Tatoeba and SETimes).

Sometimes the model fails on obvious cases, such as this one (a French ingredient list):

text: fraise (12%), framboise (10%)

predictions:
en, confidence=0.4291181
it, confidence=0.13040087
fr, confidence=0.0435654
ro, confidence=0.026255628
no, confidence=0.019594753
de, confidence=0.017750196
es, confidence=0.01671417
tr, confidence=0.015862297
sco, confidence=0.01577331
ms, confidence=0.015433003

This behaviour mostly occurs with short ingredient lists.
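For reference, these predictions can be reproduced with the off-the-shelf fasttext LID model, roughly as follows (a minimal sketch; it assumes the public lid.176.bin weights are downloaded locally and is not necessarily the exact Robotoff setup):

```python
import fasttext

# lid.176.bin is the public fasttext language-identification model
# (https://fasttext.cc/docs/en/language-identification.html)
model = fasttext.load_model("lid.176.bin")

labels, probs = model.predict("fraise (12%), framboise (10%)", k=10)
for label, prob in zip(labels, probs):
    print(label.replace("__label__", ""), f"confidence={prob:.4f}")
```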

We should explore training a new model for language identification using Open Food Facts data (especially ingredient lists).

Requirements

Using fasttext is not a requirement. We can either train a new fasttext model, or train a model with PyTorch/TensorFlow and export it to ONNX format.
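For the PyTorch/ONNX route, the export step itself is simple; here is a minimal sketch with a toy classifier (the architecture, sizes and file names are placeholders, not a proposed design):

```python
import torch
import torch.nn as nn

class TinyLangID(nn.Module):
    """Deliberately tiny character-level classifier, only to illustrate the PyTorch -> ONNX path."""

    def __init__(self, vocab_size: int = 512, embed_dim: int = 32, n_langs: int = 50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, n_langs)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # mean-pool the character embeddings, then classify
        return self.fc(self.embed(char_ids).mean(dim=1))

model = TinyLangID().eval()
dummy = torch.randint(0, 512, (1, 64))  # batch of one 64-character sequence
torch.onnx.export(
    model, dummy, "langid.onnx",
    input_names=["char_ids"], output_names=["logits"],
    dynamic_axes={"char_ids": {0: "batch", 1: "seq"}},
)
```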

@korablique

What I'm going to try (following the discussion on Slack):

  • select products from OFF with >80% of recognized ingredients
  • measure the existing fasttext model's quality on this data
  • fine-tune the fasttext model (see the sketch below)
  • compare the results
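Sketch for the fine-tuning step: as far as I know fasttext has no built-in way to continue training lid.176.bin, so one option is to train a fresh supervised model on OFF data formatted with __label__ prefixes (file name and hyperparameters below are placeholders):

```python
import fasttext

# train.txt: one example per line, e.g.
#   __label__fr fraise (12%), framboise (10%)
#   __label__en sugar, salt, cocoa butter
model = fasttext.train_supervised(
    input="train.txt",
    epoch=10,
    lr=0.5,
    wordNgrams=2,    # include word bigrams
    minn=2, maxn=4,  # character n-grams help on short, noisy text
)
model.save_model("langid_off.bin")
print(model.predict("fraise (12%), framboise (10%)", k=3))
```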

@raphael0202 If that's OK with you, could you please assign the issue to me?

@raphael0202
Contributor Author

Yes, that's a good plan to start 👍
I've assigned you to this issue.

@korablique

korablique commented Oct 10, 2024

Here is the number of texts for each language
en    422020
fr    299681
de     89880
es     46255
it     31801
nl     19983
pl      8401
pt      8119
sv      6128
bg      4453
ro      3771
fi      3726
ru      3610
nb      3591
cs      3500
th      3157
da      2021
hr      2015
hu      1962
ar      1104
el       943
ja       912
ca       824
sr       735
sl       727
sk       606
tr       506
lt       453
zh       436
et       370
lv       333
xx       318
no       315
uk       274
id       262
he       209
vi       121
is       113
la        89
in        72
ko        71
sq        70
iw        59
ka        54
ms        52
bs        37
fa        35
bn        33
gl        32
kk        25
mk        23
nn        18
hi        18
aa        17
uz        17
so        15
af        12
eu        11
az         8
be         7
cy         7
hy         7
tt         6
ku         5
km         4
te         4
ky         4
ur         4
mg         3
ty         3
ta         3
tg         3
my         3
tl         3
mo         2
sc         2
ir         2
ne         2
tk         2
am         2
mn         2
co         2
se         2
si         2
fj         1
ch         1
ug         1
yi         1
to         1
fo         1
mt         1
ht         1
ak         1
jp         1
oc         1
lb         1
mi         1
as         1
yo         1
ga         1
gd         1
ba         1
zu         1
mr         1

@baslia
Collaborator

baslia commented Oct 11, 2024

Hey, is it a requirement that the model needs to run locally (not on a server)?
It seems that LLMs are good at detecting languages.

@korablique

korablique commented Oct 14, 2024

I tried using clustering to fix mislabeled data.

I took the languages for which there are at least 100 texts (37 languages), then took 100 texts per language and used them as a training dataset (the plan was to then get predictions for the entire dataset).

The texts were converted to embeddings using fasttext (the get_sentence_vector method), and the dimensionality was reduced from 256 to 66 with PCA to preserve 95% of the variance.
I tried two methods: Gaussian mixture and HDBSCAN.
The Gaussian mixture divides the data into only 3 clusters, and HDBSCAN classifies all new data as noise. The picture below shows the result of HDBSCAN clustering on the training data; the clusters are difficult to separate.

Either clustering is not suitable for this task, or I am doing something wrong.

Next I will try another language identification model, lingua (https://github.com/pemistahl/lingua-py), to compare the predictions and confidence of the two models. Then I'll take the data where the two models' predictions coincide and both are confident, and fine-tune one of them on it.

[Image: HDBSCAN clustering of the training-data embeddings]
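For reference, a rough sketch of the clustering pipeline described above (the embedding model path, the number of GMM components and the HDBSCAN parameters are assumptions; HDBSCAN here comes from scikit-learn ≥ 1.3, the standalone hdbscan package works similarly):

```python
import fasttext
import numpy as np
from sklearn.cluster import HDBSCAN
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# texts: list[str] of ingredient lists, 100 per language (placeholder variable)
ft = fasttext.load_model("embedding_model.bin")  # placeholder path, not necessarily the model used
X = np.stack([ft.get_sentence_vector(t.replace("\n", " ")) for t in texts])

# keep enough principal components to explain 95% of the variance
X_red = PCA(n_components=0.95).fit_transform(X)

gmm_labels = GaussianMixture(n_components=37, random_state=0).fit_predict(X_red)
hdb_labels = HDBSCAN(min_cluster_size=15).fit_predict(X_red)  # label -1 means noise
print(len(set(gmm_labels)), (hdb_labels == -1).mean())
```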

@teolemon teolemon removed the ✨ enhancement New feature or request label Oct 19, 2024
@jeremyarancio
Collaborator

jeremyarancio commented Oct 22, 2024

Here's a really nice article summarizing different approaches to language detection, from statistical to deep learning:
https://medium.com/besedo-engineering/language-identification-for-very-short-texts-a-review-c9f2756773ad

Would be great to have a validation dataset to estimate the performance of any solution.
This dataset can be manually annotated using https://languagetool.org/

@korablique

How I got the distribution of text languages:

  1. selected the ingredients_text_{LANG} field names from MongoDB:
```bash
docker exec -i mongodb-container mongo mydatabase --quiet --eval '
var cursor = db.mycollection.aggregate([
  { "$project": {
      "fields": { "$objectToArray": "$$ROOT" }
  }},
  { "$unwind": "$fields" },
  { "$match": { "fields.k": /^ingredients_text_/ }},
  { "$group": {
      "_id": null,
      "all_fields": { "$addToSet": "$fields.k" }
  }},
  { "$limit": 20 }
]);
if (cursor.hasNext()) {
  printjson(cursor.next().all_fields);
}' > field_names.json
```
  2. then selected the field values:
```bash
FIELDS=$(jq -r '.[]' field_names.json | paste -sd "," -)
docker exec -i mongodb-container mongo mydatabase --quiet --eval '
var fields = "'$FIELDS'".split(",");
var projection = {};
fields.forEach(function(field) { projection[field] = 1; });

var cursor = db.mycollection.find({}, projection).forEach(function(doc) {
    var cleanedDoc = {};
    fields.forEach(function(field) {
        if (doc[field] && doc[field] !== "") {
            cleanedDoc[field] = doc[field];
        }
    });
    if (Object.keys(cleanedDoc).length > 0) {
        printjson(cleanedDoc);
    }
});' > filtered_extracted_values.json
```

(but after that there are still some extra fields left, e.g. ingredients_text_with_allergens)

  3. then I made a dictionary in which the text is the key and the language is the value:
```python
import os

import ijson

ingredients_text_lang_dct = dict()

with open(os.path.join(data_dir, 'filtered_extracted_values.json'), 'r') as data_file:
    for dct in ijson.items(data_file, 'item'):
        for k, v in dct.items():
            if k == 'ingredients_text_with_allergens':
                continue
            lang = k[k.rfind('_') + 1:]

            # if the field is `ingredients_text_{LANG}_imported`
            if lang == 'imported':
                start = k[:k.rfind('_')].rfind('_') + 1
                end = k.rfind('_')
                lang = k[start:end]
            ingredients_text_lang_dct.update({v: lang})
```
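The per-language counts shown above then follow directly from this dictionary, e.g.:

```python
from collections import Counter

# count how many texts were collected per language code
lang_counts = Counter(ingredients_text_lang_dct.values())
for lang, n in lang_counts.most_common():
    print(f"{lang}\t{n}")
```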

@raphael0202

@korablique

Would be great to have a validation dataset to estimate the performance of any solution. This dataset can be manually annotated using https://languagetool.org/

How many samples should it contain? Should I select an equal number of samples for each language or just random?
@jeremyarancio

@jeremyarancio
Collaborator

Roughly 30 labels per language to start with, I would say.
It's just to get an idea of the performance.


@baslia
Collaborator

baslia commented Oct 26, 2024

Here is the number of texts for each language

[…]

Would it be possible to share the link to this original dataset? I'm curious to have a look at it as well.
Thanks!

@korablique

Would it be possible to share the link to this original dataset? I'm curious to have a look at it as well.
Thanks!

I used the MongoDB dump; I described above how I retrieved the data from it. However, there might be an error in my script, because some languages have fewer texts than expected (e.g. I got 912 samples of Japanese texts, but https://jp-en.openfoodfacts.org/ lists around 16,000 products).

Please keep me posted if you're planning to work on this task, as I'm actively working on it. You can find me on the OFF Slack (Yulia Zhilyaeva).

@jeremyarancio
Collaborator

If this can help, there's now a Parquet dump on Hugging Face, which is the JSONL dump processed and cleaned of irrelevant features:

https://huggingface.co/datasets/openfoodfacts/product-database

@korablique

korablique commented Oct 31, 2024

If this can help, there's now a Parquet dump on Hugging Face, which is the JSONL dump processed and cleaned of irrelevant features:

https://huggingface.co/datasets/openfoodfacts/product-database

I tried to retrieve the data from the Hugging Face dataset, but I still get ~900 samples of Japanese texts and ~996,000 texts in total. Am I doing something wrong? Or is it because, at the moment, the HF dataset stores text only in the original language?

My code here

```python
import os

import pandas as pd
from datasets import load_dataset
from tqdm import tqdm

other_lang_columns = [
    'ingredients_text_fr',
    'ingredients_text_en',
    ...
]
dataset_file = os.path.join(data_dir, 'data_from_hf.csv')

for start, stop in tqdm(zip(range(0, 91, 10), range(10, 101, 10))):
    # read 10% of the dataset
    hf_dataset = load_dataset('openfoodfacts/product-database', split=f'main[{start}%:{stop}%]')

    # retrieve ingredients_text and lang
    ingredients_texts = hf_dataset['ingredients_text']
    langs = hf_dataset['lang']

    df = pd.DataFrame({'ingredients_text': ingredients_texts, 'lang': langs})
    df.dropna(inplace=True)

    # retrieve ingredients_text_{LANG}
    for other_lang_col in other_lang_columns:
        lang = other_lang_col[-2:]
        other_lang_texts = hf_dataset[other_lang_col]
        other_lang_texts = [text for text in other_lang_texts if text is not None and len(text) > 0]

        new_rows = pd.DataFrame({'ingredients_text': other_lang_texts, 'lang': [lang] * len(other_lang_texts)})
        df = pd.concat((df, new_rows), ignore_index=True)

    # save
    df.to_csv(dataset_file, mode='a', header=start == 0, index=False)
```

@jeremyarancio

@jeremyarancio
Collaborator

I tried to retrieve the data from the Hugging Face dataset, but I still get ~900 samples of Japanese texts and ~996,000 texts in total. […]

The Parquet contains the same information as the JSONL file, so it's not surprising.
You also have the text in all languages as ingredients_text and ingredients_text_{lang}

@korablique

korablique commented Oct 31, 2024

The Parquet contains the same information as the JSONL file, so it's not surprising.
You also have the text in all languages as ingredients_text and ingredients_text_{lang}

I see. What I mean is that I don't understand why https://jp-en.openfoodfacts.org/ shows 16,000 products while I only have 900.
@jeremyarancio

@korablique

Oh, it seems that just not all of them have an ingredient list in Japanese.

@korablique

I created a validation dataset from OFF texts, off_validation_dataset.csv (42 languages, 15-30 texts per language), and evaluated the FastText and lingua models on it.

I took 30 random texts in each language and obtained language predictions using the DeepL API and two other models (this and this). For languages they don't support, I used Google Translate and ChatGPT for verification. (As a result, after correcting the labels, some languages have fewer than 30 texts.)

Accuracy of the models:
fasttext: 92.94%
lingua: 93.79%
(I used only these models because, according to some articles comparing language identification models (this and this), there's almost nothing better than fasttext.)
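For reproducibility, the accuracy comparison can be sketched roughly like this (the column names of off_validation_dataset.csv are assumptions):

```python
import fasttext
import pandas as pd
from lingua import LanguageDetectorBuilder

df = pd.read_csv("off_validation_dataset.csv")  # assumed columns: text, lang

ft = fasttext.load_model("lid.176.bin")
detector = LanguageDetectorBuilder.from_all_languages().build()

def ft_predict(text: str) -> str:
    labels, _ = ft.predict(text.replace("\n", " "), k=1)
    return labels[0].replace("__label__", "")

def lingua_predict(text: str) -> str:
    lang = detector.detect_language_of(text)
    return lang.iso_code_639_1.name.lower() if lang else "unknown"

df["fasttext_pred"] = df["text"].map(ft_predict)
df["lingua_pred"] = df["text"].map(lingua_predict)
print("fasttext accuracy:", (df["fasttext_pred"] == df["lang"]).mean())
print("lingua accuracy:  ", (df["lingua_pred"] == df["lang"]).mean())
```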

Should I compare their accuracy on only short texts, or should I try to retrain fasttext?
@raphael0202 @jeremyarancio

@raphael0202
Contributor Author

Hello @korablique, thank you for the analysis!

So if I understood correctly, the lang field was obtained by querying Deepl and two other models, or checking manually?

And can you provide the metrics for each language?

For reference, using duckdb, I computed the number of items for each language:
┌─────────┬───────┐
│ lang │ count │
│ varchar │ int64 │
├─────────┼───────┤
│ fi │ 30 │
│ nl │ 30 │
│ pl │ 30 │
│ hr │ 30 │
│ pt │ 30 │
│ es │ 30 │
│ en │ 30 │
│ de │ 30 │
│ fr │ 30 │
│ it │ 30 │
│ cs │ 30 │
│ sv │ 29 │
│ da │ 29 │
│ he │ 29 │
│ nb │ 29 │
│ sl │ 28 │
│ et │ 28 │
│ lv │ 28 │
│ bg │ 28 │
│ ja │ 28 │
│ tr │ 27 │
│ hu │ 27 │
│ ru │ 26 │
│ vi │ 26 │
│ zh │ 25 │
│ is │ 25 │
│ th │ 24 │
│ no │ 24 │
│ ro │ 24 │
│ sr │ 24 │
│ uk │ 23 │
│ ko │ 22 │
│ ar │ 22 │
│ sk │ 22 │
│ lt │ 21 │
│ ka │ 17 │
│ el │ 17 │
│ bn │ 17 │
│ ca │ 17 │
│ bs │ 16 │
│ sq │ 15 │
│ id │ 15 │
├─────────┴───────┤
│ 42 rows │
└─────────────────┘
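(For reference, roughly the kind of duckdb query used for these counts; the CSV path and column name are assumptions.)

```python
import duckdb

# count validation samples per language, most frequent first
duckdb.sql("""
    SELECT lang, COUNT(*) AS count
    FROM 'off_validation_dataset.csv'
    GROUP BY lang
    ORDER BY count DESC
""").show()
```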

@raphael0202
Contributor Author

raphael0202 commented Nov 6, 2024

I've just added a new method to the Python SDK to analyze the ingredients in a given language:
https://openfoodfacts.github.io/openfoodfacts-python/usage/#perform-ingredient-analysis

Using the is_in_taxonomy field for each detected ingredient, you can easily compute the number of ingredients that are recognized or not, and spot ingredient lists that are not in the right language. This can help you detect errors in your validation set or increase its size.

edit: you need the latest version of the SDK for it to work, openfoodfacts==2.1.0
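The recognized-ingredient ratio is then easy to derive from the parsed result; a small sketch, where parsed stands for the list of ingredient dicts returned by the SDK method linked above (each carrying an is_in_taxonomy flag):

```python
def recognized_ratio(parsed: list[dict]) -> float:
    """Fraction of detected ingredients that are present in the taxonomy."""
    if not parsed:
        return 0.0
    return sum(bool(ing.get("is_in_taxonomy")) for ing in parsed) / len(parsed)

# e.g. keep only ingredient lists where at least 80% of ingredients are recognized,
# echoing the selection criterion proposed earlier in this thread:
# good_lists = [text for text, parsed in parsed_by_text.items() if recognized_ratio(parsed) >= 0.8]
```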

@jeremyarancio
Collaborator

Good job @korablique!
Since the distribution is not uniform, it would be preferable to compute the precision and recall for each language, to better understand which languages the models struggle with.
Also, based on the initial issue description, it seems the language prediction is often wrong when the text is quite short. Having precision and recall broken down by text length (<10 words, 10-20 words, >20 words, for example) could be insightful.
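One possible way to compute this, assuming a CSV with (hypothetical) columns text, lang and fasttext_pred:

```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

df = pd.read_csv("off_validation_dataset.csv")  # hypothetical columns: text, lang, fasttext_pred
df["n_words"] = df["text"].str.split().str.len()
df["bucket"] = pd.cut(df["n_words"], bins=[0, 10, 20, float("inf")], labels=["<=10", "11-20", ">20"])

for bucket, grp in df.groupby("bucket", observed=True):
    langs = sorted(grp["lang"].unique())
    p, r, f1, support = precision_recall_fscore_support(
        grp["lang"], grp["fasttext_pred"], labels=langs, zero_division=0
    )
    report = pd.DataFrame({"lang": langs, "precision": p, "recall": r, "f1": f1, "support": support})
    print(f"--- texts with {bucket} words ---")
    print(report.to_string(index=False))
```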

@korablique

So if I understood correctly, the lang field was obtained by querying Deepl and two other models, or checking manually?

Yes

And can you provide the metrics for each language?


| lang | count | fasttext_precision | fasttext_recall | lingua_precision | lingua_recall |
|------|-------|--------------------|-----------------|------------------|---------------|
| no | 53 | 0.980392 | 0.943396 | 0.961538 | 0.943396 |
| en | 30 | 0.933333 | 0.933333 | 0.947368 | 0.600000 |
| nl | 30 | 0.937500 | 1.000000 | 0.937500 | 1.000000 |
| pl | 30 | 1.000000 | 1.000000 | 0.967742 | 1.000000 |
| it | 30 | 0.966667 | 0.966667 | 1.000000 | 0.966667 |
| pt | 30 | 1.000000 | 0.900000 | 1.000000 | 0.900000 |
| hr | 30 | 0.689655 | 0.666667 | 0.531915 | 0.833333 |
| fr | 30 | 1.000000 | 0.933333 | 1.000000 | 0.966667 |
| es | 30 | 0.931034 | 0.900000 | 1.000000 | 0.900000 |
| fi | 30 | 0.931034 | 0.900000 | 1.000000 | 0.933333 |
| de | 30 | 1.000000 | 0.966667 | 0.937500 | 1.000000 |
| cs | 30 | 0.964286 | 0.900000 | 0.937500 | 1.000000 |
| he | 29 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| da | 29 | 0.965517 | 0.965517 | 0.928571 | 0.896552 |
| sv | 29 | 0.966667 | 1.000000 | 0.966667 | 1.000000 |
| sl | 28 | 0.931034 | 0.964286 | 0.928571 | 0.928571 |
| et | 28 | 0.965517 | 1.000000 | 0.965517 | 1.000000 |
| lv | 28 | 1.000000 | 0.928571 | 1.000000 | 0.892857 |
| bg | 28 | 1.000000 | 0.892857 | 1.000000 | 0.964286 |
| ja | 28 | 0.833333 | 0.357143 | 1.000000 | 0.892857 |
| hu | 27 | 0.928571 | 0.962963 | 1.000000 | 0.962963 |
| tr | 27 | 0.962963 | 0.962963 | 1.000000 | 0.925926 |
| ru | 26 | 1.000000 | 0.961538 | 0.962963 | 1.000000 |
| vi | 26 | 0.928571 | 1.000000 | 1.000000 | 0.923077 |
| zh | 25 | 0.517241 | 0.600000 | 0.892857 | 1.000000 |
| is | 25 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| ro | 24 | 1.000000 | 0.916667 | 1.000000 | 0.833333 |
| sr | 24 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| th | 24 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| uk | 23 | 1.000000 | 1.000000 | 1.000000 | 0.913043 |
| ko | 22 | 1.000000 | 0.909091 | 1.000000 | 1.000000 |
| sk | 22 | 0.846154 | 1.000000 | 1.000000 | 0.863636 |
| ar | 22 | 0.916667 | 1.000000 | 1.000000 | 0.954545 |
| lt | 21 | 1.000000 | 1.000000 | 0.913043 | 1.000000 |
| ka | 17 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| el | 17 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| ca | 17 | 0.882353 | 0.882353 | 0.937500 | 0.882353 |
| bn | 17 | 1.000000 | 0.764706 | 1.000000 | 1.000000 |
| bs | 16 | 0.300000 | 0.750000 | 0.235294 | 0.250000 |
| id | 15 | 1.000000 | 0.933333 | 1.000000 | 0.800000 |
| sq | 15 | 1.000000 | 0.866667 | 0.933333 | 0.933333 |

Serbian (sr), Bosnian (bs) and Croatian (hr) are very similar, so the models confuse them. I talked to a friend from Serbia, and he said that they are basically the same language with only tiny variations.

Also, I considered the variants of Norwegian as one language.

Sorry, I didn't think to filter for short texts only from the beginning. I'll recalculate the metrics after I improve the dataset.

@baslia
Collaborator

baslia commented Nov 11, 2024

These seem like good results, congrats!
If I can suggest some ways of improvement:

  • You could compute the ROC AUC; there are different ways to do it since this is a multi-class classification problem, but it is relevant for seeing how sensitive the model is to the confidence threshold (a sketch follows this list).
  • You could weight the loss function to give more importance to certain languages, or to balance minority classes.
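A minimal sketch of the first point with scikit-learn, using dummy placeholder data in place of the real validation labels and per-language probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder data: replace with the validation set's true labels and the model's
# per-language probabilities (e.g. fasttext's predict(..., k=-1) output re-ordered by class).
classes = np.array(["de", "en", "fr"])
y_true = np.array(["fr", "en", "fr", "de", "en"])
rng = np.random.default_rng(0)
y_score = rng.dirichlet(np.ones(len(classes)), size=len(y_true))  # each row sums to 1

# macro-averaged one-vs-rest ROC AUC over all languages
auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro", labels=classes)
print(f"macro one-vs-rest ROC AUC: {auc:.3f}")
```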

@raphael0202
Contributor Author

I would suggest also adding f1-score as a metric!

@korablique

I recalculated the metrics on short texts only (no more than 10 words), 30 texts per language.

| lang | fasttext_precision | lingua_precision | fasttext_recall | lingua_recall | fasttext_f1 | lingua_f1 |
|------|--------------------|------------------|-----------------|---------------|-------------|-----------|
| ar | 0.964286 | 1.000000 | 0.900000 | 0.866667 | 0.931035 | 0.928572 |
| bg | 1.000000 | 0.965517 | 0.633333 | 0.933333 | 0.775510 | 0.949152 |
| ca | 0.769231 | 0.913043 | 0.666667 | 0.700000 | 0.714286 | 0.792453 |
| cs | 0.941176 | 1.000000 | 0.533333 | 0.833333 | 0.680851 | 0.909091 |
| da | 0.800000 | 0.818182 | 0.800000 | 0.900000 | 0.800000 | 0.857143 |
| de | 0.717949 | 0.906250 | 0.933333 | 0.966667 | 0.811594 | 0.935484 |
| en | 0.571429 | 0.896552 | 0.800000 | 0.866667 | 0.666667 | 0.881356 |
| es | 0.807692 | 0.941176 | 0.700000 | 0.533333 | 0.750000 | 0.680851 |
| fi | 0.903226 | 0.933333 | 0.933333 | 0.933333 | 0.918033 | 0.933333 |
| fr | 0.842105 | 0.888889 | 0.533333 | 0.800000 | 0.653061 | 0.842105 |
| hr | 1.000000 | 0.952381 | 0.400000 | 0.666667 | 0.571429 | 0.784314 |
| hu | 0.964286 | 1.000000 | 0.900000 | 0.866667 | 0.931035 | 0.928572 |
| it | 1.000000 | 0.960000 | 0.900000 | 0.800000 | 0.947368 | 0.872727 |
| ja | 1.000000 | 1.000000 | 0.233333 | 0.700000 | 0.378378 | 0.823529 |
| lv | 1.000000 | 1.000000 | 0.766667 | 0.866667 | 0.867925 | 0.928572 |
| no | 0.720000 | 0.800000 | 0.600000 | 0.666667 | 0.654545 | 0.727273 |
| nl | 0.880000 | 0.833333 | 0.733333 | 0.666667 | 0.800000 | 0.740741 |
| pl | 0.966667 | 1.000000 | 0.966667 | 1.000000 | 0.966667 | 1.000000 |
| pt | 0.944444 | 0.696970 | 0.566667 | 0.766667 | 0.708333 | 0.730159 |
| ro | 0.956522 | 0.961538 | 0.733333 | 0.833333 | 0.830189 | 0.892857 |
| ru | 0.961538 | 1.000000 | 0.833333 | 0.833333 | 0.892857 | 0.909091 |
| sv | 0.958333 | 0.961538 | 0.766667 | 0.833333 | 0.851852 | 0.892857 |

@raphael0202
Contributor Author

@korablique Can you publish the source code and your results in this repo, in a new langid folder?

@korablique

@korablique Can you publish the source code and your results in this repo, in a new langid folder?

Yes, I remember. I'm preparing the code; I haven't published it yet because of the problem with the Hugging Face dataset. I plan to publish the code this week.
