Extract information from receipt using LayoutLM-like models #364

Open
raphael0202 opened this issue Nov 24, 2024 · 1 comment

raphael0202 commented Nov 24, 2024

On Open Prices, users upload images of receipts to serve as "proof" of the prices of the products they bought.
Currently, after uploading the proof image, users enter the price of each product one at a time, either by scanning the barcode using the app (web or mobile) or by indicating the category of the product (for raw products, such as vegetables or fruits).

An example of such a receipt can be found here: https://prices.openfoodfacts.org/img/0019/B5RGQnlCPI.webp. For more receipts, please look at the Open Prices dataset on Hugging Face.

We wish to automate the task of extracting information from receipts, using a Document AI model such as LayoutLM.

We wish to extract the following information:

  • the date and time of purchase
  • the name of the shop
  • the address of the shop
  • the items bought. For each item, we wish to have the following information:
    • the quantity of items (for quantifiable products)
    • the price per item (for quantifiable products)
    • the total price paid (after discount)
    • the price per kg (or equivalent unit, for products sold per weight)
    • the label (=name) of the product on the receipt
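
As an illustration, the fields listed above could be held in a small schema like the one below. This is only a sketch: the class and field names (`Receipt`, `ReceiptItem`, `unit_price`, and so on) are hypothetical and do not come from Open Prices. Every field except the label is optional, on the assumption that a given receipt may not print it.

```python
from dataclasses import dataclass, field
from datetime import datetime
from decimal import Decimal
from typing import List, Optional


@dataclass
class ReceiptItem:
    """One line item on a receipt (hypothetical schema)."""

    label: str  # product name as printed on the receipt
    quantity: Optional[int] = None  # only for quantifiable products
    unit_price: Optional[Decimal] = None  # price per item
    total_price: Optional[Decimal] = None  # total paid, after discount
    price_per_kg: Optional[Decimal] = None  # or equivalent unit, for products sold by weight


@dataclass
class Receipt:
    """Receipt-level information plus the list of items bought."""

    purchased_at: Optional[datetime] = None
    shop_name: Optional[str] = None
    shop_address: Optional[str] = None
    items: List[ReceiptItem] = field(default_factory=list)
```

Prices are stored as `Decimal` rather than `float` so that amounts such as 0.10 are kept exactly as printed.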

A reference dataset exists for extraction from receipt images: https://github.com/clovaai/cord.
This dataset, however, mainly contains receipts from Indonesia, so we should investigate whether models trained on it work well with Open Prices data.
Note that we now have OCR data for all images in Open Prices: to get it, replace the file extension in the image URL (e.g. .webp) with .json.gz. To parse these Google Cloud Vision OCR files, see the openfoodfacts-python library: https://github.com/openfoodfacts/openfoodfacts-python/blob/develop/openfoodfacts/ocr.py#L295.
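
The extension swap described above can be sketched with the standard library alone. The helper names `ocr_url_from_image_url` and `parse_ocr_payload` are hypothetical, and the second function assumes the downloaded body is still raw gzip bytes (some HTTP clients may decompress it for you).

```python
import gzip
import json
import posixpath
from urllib.parse import urlsplit, urlunsplit


def ocr_url_from_image_url(image_url: str) -> str:
    """Replace the image file extension (e.g. .webp) with .json.gz."""
    parts = urlsplit(image_url)
    stem, _extension = posixpath.splitext(parts.path)
    return urlunsplit(parts._replace(path=stem + ".json.gz"))


def parse_ocr_payload(payload: bytes) -> dict:
    """Decode a gzip-compressed JSON body into a Python dict."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))
```

For the example receipt, `ocr_url_from_image_url("https://prices.openfoodfacts.org/img/0019/B5RGQnlCPI.webp")` returns `https://prices.openfoodfacts.org/img/0019/B5RGQnlCPI.json.gz`.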

The first task is to investigate whether models trained on the CORD dataset (such as LayoutLMv3) work well on Open Prices receipt images.
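
To feed existing OCR output to a LayoutLM-family model, each word needs a bounding box given as integers on a 0 to 1000 scale relative to the page size. Here is a minimal sketch of that conversion, assuming the OCR JSON follows the usual Google Cloud Vision layout (`responses[0].textAnnotations`, where the first entry is the full text and the rest are words). The function names are hypothetical, and the image width and height must be supplied by the caller.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) on a 0-1000 scale


def normalize_box(vertices: List[Dict[str, int]], width: int, height: int) -> Box:
    """Turn polygon vertices in pixels into an axis-aligned 0-1000 box."""
    # Cloud Vision JSON may omit a coordinate when its value is 0.
    xs = [vertex.get("x", 0) for vertex in vertices]
    ys = [vertex.get("y", 0) for vertex in vertices]

    def scale(value: int, size: int) -> int:
        # Clamp, since OCR boxes can slightly overflow the image.
        return max(0, min(1000, int(1000 * value / size)))

    return (
        scale(min(xs), width),
        scale(min(ys), height),
        scale(max(xs), width),
        scale(max(ys), height),
    )


def words_and_boxes(ocr: dict, width: int, height: int) -> Tuple[List[str], List[Box]]:
    """Extract word strings and normalized boxes from an OCR result dict."""
    annotations = ocr["responses"][0].get("textAnnotations", [])
    words: List[str] = []
    boxes: List[Box] = []
    for annotation in annotations[1:]:  # skip the full-text entry
        words.append(annotation["description"])
        boxes.append(normalize_box(annotation["boundingPoly"]["vertices"], width, height))
    return words, boxes
```

The resulting `words` and `boxes` lists are the kind of input a LayoutLMv3 processor accepts when its built-in OCR is disabled. This should be checked against the actual Open Prices OCR files before relying on it.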


baslia commented Nov 25, 2024

These are the planned tasks; let me know if I am missing something:

  • Test the fine-tuned CORD models and see how they perform on Open Prices data
  • Test how LayoutLMv3 performs out of the box
  • Set up Label Studio with pre-annotations coming from the best model
  • Manually label some data
  • Fine-tune a model on Open Prices data
  • Test performance and iterate by changing the model or labeling more data
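
One possible way to compare models during the testing steps above is exact-match precision, recall and F1 over extracted `(field, value)` pairs. The sketch below assumes predictions and manual labels have both been flattened into such pairs. The function name `field_prf` and the field names in the usage note are hypothetical, and a real evaluation may need more lenient matching (for example, normalizing prices or dates first).

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

FieldValue = Tuple[str, str]  # e.g. ("shop_name", "Example shop")


def field_prf(predicted: Iterable[FieldValue], expected: Iterable[FieldValue]) -> Dict[str, float]:
    """Exact-match precision, recall and F1 over (field, value) pairs."""
    predicted_counts = Counter(predicted)
    expected_counts = Counter(expected)
    # Multiset intersection: a pair counts once per matching occurrence.
    true_positives = sum((predicted_counts & expected_counts).values())
    n_predicted = sum(predicted_counts.values())
    n_expected = sum(expected_counts.values())
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_expected if n_expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Calling `field_prf` once per receipt and averaging the scores, or once over all pairs pooled together, gives a single number to track while changing the model or adding labeled data.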
