Extract information from receipt using LayoutLM-like models #364

raphael0202 · 2024-11-24T16:52:03Z

On Open Prices, users upload images of receipts to serve as "proof" of the price of the products they bought.
Currently, after uploading the proof image, users enter the price of each product one at a time, identifying the product either by scanning its barcode with the app (web or mobile) or by indicating its category (for raw products, such as vegetables or fruits).

An example of such a receipt can be found here: https://prices.openfoodfacts.org/img/0019/B5RGQnlCPI.webp. For more receipts, please look at the Open Prices dataset on Hugging Face.

We wish to automate the extraction of information from receipts, using a Document AI model such as LayoutLM.

The information we wish to extract is the following:

- the date/hour of purchase
- the name of the shop
- the address of the shop
- the items bought. For each item, we wish to have the following information:
  - the quantity of items (for quantifiable products)
  - the price per item (for quantifiable products)
  - the total price paid (after discount)
  - the price per kg (or equivalent unit, for products sold per weight)
  - the label (=name) of the product on the receipt
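
The fields listed above could be represented as a small data structure. The following is only a minimal sketch: the class and attribute names (`Receipt`, `ReceiptItem`, `purchased_at`, etc.) are hypothetical and not part of Open Prices, and every field except the label and total price is optional because it only applies to some products.

```python
from dataclasses import dataclass, field
from datetime import datetime
from decimal import Decimal
from typing import Optional


@dataclass
class ReceiptItem:
    """One line item on a receipt (hypothetical structure)."""

    label: str                              # product name as printed on the receipt
    total_price: Decimal                    # total price paid, after discount
    quantity: Optional[int] = None          # only for quantifiable products
    unit_price: Optional[Decimal] = None    # price per item, quantifiable products
    price_per_kg: Optional[Decimal] = None  # or equivalent unit, for products sold per weight


@dataclass
class Receipt:
    """Information to extract from one receipt image (hypothetical structure)."""

    purchased_at: Optional[datetime] = None
    shop_name: Optional[str] = None
    shop_address: Optional[str] = None
    items: list[ReceiptItem] = field(default_factory=list)


# Illustrative values only, not taken from a real receipt.
receipt = Receipt(
    purchased_at=datetime(2024, 11, 24, 16, 52),
    shop_name="Example Market",
    items=[
        ReceiptItem(label="YAOURT NATURE", total_price=Decimal("3.00"),
                    quantity=2, unit_price=Decimal("1.50")),
        ReceiptItem(label="BANANES", total_price=Decimal("1.19"),
                    price_per_kg=Decimal("1.99")),
    ],
)
```

Prices are stored as `Decimal` rather than `float` to avoid rounding surprises when comparing extracted values with manually entered ones.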

A reference dataset exists for extraction from receipt images: https://github.com/clovaai/cord.
However, this dataset mainly contains receipts from Indonesia, so we should investigate whether models trained on it work well on Open Prices data.
Note that OCR data is now available for all images in Open Prices: replace the file extension in the image URL (e.g. .webp) with .json.gz. To parse these Google Cloud Vision OCR files, see the openfoodfacts-python library: https://github.com/openfoodfacts/openfoodfacts-python/blob/develop/openfoodfacts/ocr.py#L295.
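
The extension swap described above can be sketched with the standard library only. The function names `ocr_url_from_image_url` and `parse_ocr_payload` are hypothetical helpers, not part of openfoodfacts-python, and the sketch assumes the downloaded file is a gzip-compressed JSON document (some HTTP clients may already have decompressed it, which the second helper tolerates).

```python
import gzip
import json
from pathlib import PurePosixPath
from urllib.parse import urlsplit, urlunsplit


def ocr_url_from_image_url(image_url: str) -> str:
    """Replace the image file extension (e.g. .webp) with .json.gz."""
    parts = urlsplit(image_url)
    path = PurePosixPath(parts.path)
    if not path.suffix:
        raise ValueError(f"no file extension in URL path: {parts.path!r}")
    ocr_path = path.with_suffix(".json.gz")
    return urlunsplit(parts._replace(path=str(ocr_path)))


def parse_ocr_payload(raw: bytes) -> dict:
    """Decode the bytes of an OCR file into a Python dict.

    Assumption: the payload is gzip-compressed JSON. If the gzip magic
    number is absent, the bytes are treated as plain JSON.
    """
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        raw = gzip.decompress(raw)
    return json.loads(raw)


image_url = "https://prices.openfoodfacts.org/img/0019/B5RGQnlCPI.webp"
ocr_url = ocr_url_from_image_url(image_url)
```

Downloading `ocr_url` with any HTTP client and passing the response bytes to `parse_ocr_payload` would then give the raw OCR result, which the openfoodfacts-python module linked above is meant to interpret.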

The first task is to investigate whether models trained on the CORD dataset (such as LayoutLMv3) work well on Open Prices receipt images.

baslia · 2024-11-25T06:39:09Z

These are the planned tasks; let me know if I am missing something:

- Test the fine-tuned CORD models and see how they perform on Open Prices data
- Test how LayoutLMv3 performs out of the box
- Set up Label Studio with pre-annotations coming from the best model
- Manually label some data
- Fine-tune a model on Open Prices data
- Test performance and iterate by changing the model or labeling more data
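
For the "test performance" steps, one possible way to compare model output with manual labels is a micro-averaged precision/recall/F1 over exact-match (field, value) pairs. This is only a sketch of one metric choice, not something decided in this issue; the function name `field_f1` and the field names in the example are hypothetical.

```python
from collections import Counter


def field_f1(predicted, expected):
    """Micro-averaged precision, recall and F1 over (field, value) pairs.

    A prediction counts as correct only if both the field name and the
    value match a manually labeled pair exactly. Counter intersection
    handles pairs that legitimately appear more than once.
    """
    pred = Counter(predicted)
    gold = Counter(expected)
    true_positives = sum((pred & gold).values())
    precision = true_positives / sum(pred.values()) if pred else 0.0
    recall = true_positives / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative values only.
predicted = [("shop_name", "Example Market"), ("total_price", "3.50"), ("label", "BANANES")]
expected = [("shop_name", "Example Market"), ("total_price", "3.50"),
            ("label", "BANANES BIO"), ("date", "2024-11-24")]
precision, recall, f1 = field_f1(predicted, expected)
```

Exact matching is strict for free-text fields such as product labels; a more lenient comparison (for example after normalizing case and whitespace) may be preferable there.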

raphael0202 added prices Open Prices labels Nov 24, 2024

teolemon added this to 🤖 Artificial Intelligence @ Open Food Facts Nov 24, 2024

github-project-automation bot moved this to Todo in 🤖 Artificial Intelligence @ Open Food Facts Nov 24, 2024

raphael0202 assigned baslia Nov 24, 2024
