Missing text with OCR-D #217

Open
jbarth-ubhd opened this issue Jul 10, 2024 · 10 comments


@jbarth-ubhd

After updating to ocrd/all:maximum (2024-07-10 15:00 CEST),

when OCR'ing https://digi.ub.uni-heidelberg.de/diglitData/v/faber1566_-_0075r.tif

Preview: [image]

with this workflow:

ocrd workspace init
ocrd workspace add -g P_00001 -G OCR-D-IMG -i OCR-D-IMG_00001 -m image/tiff faber1566_-_0075r.tif 
ocrd-olena-binarize -P impl wolf -P k 0.10 -I OCR-D-IMG -O OCR-D-002
ocrd-tesserocr-crop -I OCR-D-002 -O OCR-D-003
ocrd-olena-binarize -P impl wolf -P k 0.10 -I OCR-D-003 -O OCR-D-004
ocrd-cis-ocropy-deskew -P level-of-operation page -I OCR-D-004 -O OCR-D-005
ocrd-tesserocr-recognize -P find_tables true -P segmentation_level region -P textequiv_level word -P model frak2021 -I OCR-D-005 -O OCR-D-OCR

I get this text:

15 1 4 x 8
— — * 5
ö S *
*
*
— * E
—
2


365 achie würſt nu
* 41 vnd


wo voiſ deinc Bict ö ne O odt wo iſ
+ ‚ G5 w


ſte tatt te 400 voꝛ de m hralen wege ge⸗
warnet / vnnd jmmer auff dem en⸗
gen pfade ö da zu einer ſeiten Waſſer /



When running tesseract -l frak2021 OCR-D-005/OCR-D-005_00001.IMG-DESKEW.png pure-tesseract (i.e. on the image after deskewing), I get the text shown below.
Image preview: [image]

geäc achie ö 6 wuſt nu
vnd ꝛder C Galle gezelet / vnd
wuſt den n Gorioſen geſang 8
Die feind * aant 0 0 +

And when running tesseract -l frak2021 faber1566_-_0075r.tif pure-tesseract-from-original, I get:

Jetzt wirſtu ruhe finden dei⸗
ner Seele. Jetzt wird dir ein Tiſch

gedecket / an welchem CH RIſtus
ſelbs der Haußknecht ſein wil. Der⸗
halben verſuch alles / wag alles / ver⸗
dag nicht / kempff hindurch / Laß die

erſchlagenen hinder dir liegen / denn

hie werden deine ihrenen alle ab⸗

getruͤcknet / deine arbeit belohnet / der
du zuuor / ein weil ein

*
*

vnder die kinder Goͤttes gezelet / vnd

wirſt den Glorioſen geſang ſingen:

Die feind ſind vberwunden. O Hell
wo iſt deine Victori? O Codt wo iſt

dein Stachel / Ey jhr ſeid alle ver⸗

ſchlungen im ſieg / 7.
Vnnd auff daß diß alles alſo an
allen Himliſchen Landferern / er⸗

ſtattet / ſie voꝛ dem breiten wege ge⸗

warnet / vnnd jmmer auff dem en⸗
gen pfade / da zu einer ſeiten Waſſer /

Ketzer vñ Sab
bath der Welt geachtet biſt / wirſt nu

And this problem, a lot of text missing in the OCR-D output, occurs on approx. 70-80% of all pages (depending on the book, of course).

@jbarth-ubhd changed the title from "Difference in recognition with/without OCR-D" to "Missing text with OCR-D" on Jul 10, 2024
@stweil
Contributor

stweil commented Jul 10, 2024

@jbarth-ubhd
Author

Yeah, a Tesseract-only workflow (ocrd-tesserocr-recognize -P segmentation_level region -P model frak2021 -I OCR-D-IMG -O OCR-D-OCR3)

gives

Jetzt wirſtu ruhe finden dei⸗
ner Seele. Jetzt wird dir ein Tiſch


gedecket / an welchem CH RIſtus
ſelbs der Haußknecht ſein wil. Der⸗
halben verſuch alles / wag alles / ver⸗
dag nicht / kempff hindurch / Laß



erſchlagenen hinder dir liegen / denn

hie werden deine ihrenen alle ab⸗

getruͤcknet / deine arbeit belohnet / der
du zuuor / ein weil ein


vnder die kinder Goͤttes gezelet / vnd


wirſt den Glorioſen geſang ſingen:

Die feind ſind vberwunden. O Hell
wo iſt deine Victori? O Codt wo iſt

dein Stachel / Ey jhr ſeid alle ver⸗

ſchlungen im ſieg / 7.
Vnnd auff daß diß alles alſo an
allen Himliſchen Landferern / er⸗

ſtattet / ſie voꝛ dem breiten wege ge⸗

warnet / vnnd jmmer auff dem en⸗
gen pfade / da zu einer ſeiten Waſſer /


Ketzer vñ Sab
bath der Welt geachtet biſt / wirſt nu

But why is Tesseract so sensitive to cropping and minimal deskewing?

@stweil
Contributor

stweil commented Jul 10, 2024

Maybe it gets a wrong DPI value in your original workflow? Is the DPI value correct for the input image (which is the result of previous OCR-D processors)?

@jbarth-ubhd
Author

Too bad this wasn't my idea:

jb@nuc:~/faber1566$ find . \( -iname "*.png" -o -iname "*.tif" \) -printf "identify -format '%%x %%y %%U\\\n' %p\n"|bash -x
+ identify -format '%x %y %U\n' ./faber1566_-_0075r.tif
1225.29296875 1225.29296875 PixelsPerInch  # page has 8° format, dpi is +-20% correct
+ identify -format '%x %y %U\n' ./OCR-D-004/OCR-D-IMG_00001-BIN_wolf.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-OCR3/OCR-D-OCR3_00001.IMG-BIN.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-005/OCR-D-005_00001.IMG-DESKEW.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-OCR/OCR-D-OCR_00001.IMG-BIN.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-002/OCR-D-002_00001-BIN_wolf.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-OCR2/OCR-D-OCR2_00001.IMG-BIN.png
72 72 Undefined
+ identify -format '%x %y %U\n' ./OCR-D-003/OCR-D-003_00001.IMG-CROP.png
72 72 Undefined
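
The `72 72 Undefined` lines above most likely mean that the file carries no usable resolution metadata. For PNG, resolution is stored in the optional pHYs chunk (pixels per unit, normally per metre), so one way to double-check without ImageMagick is to look for that chunk directly. A minimal sketch using only the Python standard library; the function name `png_dpi` is made up for illustration and is not part of OCR-D or Tesseract:

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
METRES_PER_INCH = 0.0254


def png_dpi(data: bytes):
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if absent.

    None is also returned when the chunk exists but its unit is
    "unknown" (only an aspect ratio is stored in that case).
    """
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        # Each chunk: 4-byte length, 4-byte type, body, 4-byte CRC.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"pHYs":
            x_ppu, y_ppu, unit = struct.unpack(">IIB", body)
            if unit != 1:  # 1 means "pixels per metre"
                return None
            return (x_ppu * METRES_PER_INCH, y_ppu * METRES_PER_INCH)
        if chunk_type in (b"IDAT", b"IEND"):
            return None  # pHYs has to appear before the image data
        pos += 12 + length
    return None
```

For a file on disk this would be called as `png_dpi(open(path, "rb").read())`; a result of `None` for every derived image would match the listing above.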

@stweil
Contributor

stweil commented Jul 10, 2024

So the image resolution gets lost early in the OCR-D workflow in olena-binarize? I think this looks like a bug. And the following processors might have the same issue.

I wonder why nobody noticed this up to now. But maybe high resolution images like in your case are rare, and for 300 dpi images the damage is less severe.

@jbarth-ubhd
Author

I think relying on DPI without a "reading distance" is not sufficient in 100% of cases (though it is in 99% of the "usual" ones): a microfilm scan might have 2540 dpi, while a poster might have been scanned at 300 dpi but is typically read from meters away.
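
To make this concrete: what a recognizer ends up seeing is the letter height in pixels, and that depends on the physical letter size as well as on the DPI. A small arithmetic sketch; the letter heights and the 20x microfilm reduction ratio are illustrative assumptions, not measurements from this thread:

```python
MM_PER_INCH = 25.4


def glyph_height_px(letter_height_mm: float, dpi: float) -> float:
    """Height in pixels of a letter of the given physical size at the given DPI."""
    return letter_height_mm / MM_PER_INCH * dpi


# All sizes below are assumed values chosen only for illustration.
examples = {
    "book page, 3 mm letters, 1225 dpi": glyph_height_px(3.0, 1225.0),
    "poster, 100 mm letters, 300 dpi": glyph_height_px(100.0, 300.0),
    # On film the letter is 20x smaller than on the original page.
    "microfilm, 3 mm letters reduced 20x, 2540 dpi": glyph_height_px(3.0 / 20, 2540.0),
}

for label, px in examples.items():
    print(f"{label}: about {px:.0f} px")
```

Under these assumed numbers the letter heights range from about 15 px to over 1000 px, and the scan with the highest DPI has the smallest letters, which is why DPI alone does not say how large the text is.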

@stweil
Contributor

stweil commented Jul 10, 2024

I agree. My example is text with huge letters written on a wall. Ideally Tesseract should not depend on DPI values.

@jbarth-ubhd
Author

At least OCR-D could try to keep the resolution information. Otherwise I'll have to write a workaround, perhaps with exiftool.

@jbarth-ubhd
Author

jbarth-ubhd commented Jul 16, 2024

I tried the workaround exiftool -tagsFromFile OCR-D-IMG/00001.tif OCR-D-.../*.png after each step, but the result is bad:

ocrdcluster/finished/faber1566/run11/0075r> ocrd-show-text OCR-D-OCR/*.xml|egrep -v '^$'
5 Die je feind ſindv pberwu unden. O 5 l
. wo in St Siachſ 0/ Ey r 7 *0 Lans
ſtatet fv voꝛ dem brahen wege ge⸗
warn 6t/ vnnd eer ün dem en⸗

For the complete workspace, see https://digi.ub.uni-heidelberg.de/diglitData/faber1566_-_0075r.tar

Image file resolutions:

dwork@pers109:/mnt/sds/sd22d001/ocrdcluster/finished/faber1566/run11/0075r$ identify -format "%d/%f %x %y %U\n" */*.tif */*.png*
OCR-D-IMG/00001.tif 1225.2930908203125 1225.2930908203125 PixelsPerInch
OCR-D-001/OCR-D-001_00001-BIN_wolf.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-001/OCR-D-001_00001-BIN_wolf.png_original 72 72 Undefined
OCR-D-002/OCR-D-002_00001.IMG-CROP.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-002/OCR-D-002_00001.IMG-CROP.png_original 72 72 Undefined
OCR-D-003/OCR-D-IMG_00001-BIN_wolf.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-003/OCR-D-IMG_00001-BIN_wolf.png_original 72 72 Undefined
OCR-D-004/OCR-D-004_00001.IMG-DESKEW.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-004/OCR-D-004_00001.IMG-DESKEW.png_original 72 72 Undefined
OCR-D-OCR/OCR-D-OCR_00001.IMG-BIN.png 72 72 Undefined
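
A listing like the one above can be checked mechanically. The sketch below parses output in the `%d/%f %x %y %U` format used here and returns the files whose resolution unit is `Undefined`. The function name and the option to skip exiftool's `_original` backup copies are assumptions for illustration, and paths containing spaces are not handled:

```python
def files_without_resolution(identify_output: str, skip_backups: bool = True):
    """Return paths whose resolution unit is reported as 'Undefined'.

    Expects one 'path x_res y_res unit' record per line, as produced by
    identify -format "%d/%f %x %y %U\\n".
    """
    missing = []
    for line in identify_output.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # blank or unexpected line
        path, _x_res, _y_res, unit = parts
        if skip_backups and path.endswith("_original"):
            continue  # backup copy left behind by exiftool
        if unit == "Undefined":
            missing.append(path)
    return missing


listing = """\
OCR-D-IMG/00001.tif 1225.2930908203125 1225.2930908203125 PixelsPerInch
OCR-D-004/OCR-D-004_00001.IMG-DESKEW.png 1225.2931228861330055 1225.2931228861330055 PixelsPerInch
OCR-D-004/OCR-D-004_00001.IMG-DESKEW.png_original 72 72 Undefined
OCR-D-OCR/OCR-D-OCR_00001.IMG-BIN.png 72 72 Undefined
"""
print(files_without_resolution(listing))
```

On these four lines from the listing, only the binarized image written into OCR-D-OCR is reported once the backups are skipped.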

What is really astonishing is that Tesseract does pick up the correct DPI:

...
11:17:50.743 INFO processor.TesserocrCrop - INPUT FILE 0 / P_00001
11:17:51.072 INFO processor.TesserocrCrop - Page 'P_00001' images will use 1225 DPI from image meta-data
11:17:51.072 INFO processor.TesserocrCrop - Cropping with Tesseract
11:17:53.757 INFO processor.TesserocrCrop - Ignoring region 'region0000' because its width is too small (43)
11:17:53.758 INFO processor.TesserocrCrop - Ignoring region 'region0001' because its width is too small (35)
...

@jbarth-ubhd
Author

jbarth-ubhd commented Jul 16, 2024

OK... I changed my workflow to "you can have any resolution, as long as it's 300 dpi" (convert ... -resample 300 ...). That helps.
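
For reference, the arithmetic behind resampling to 300 dpi: the pixel dimensions shrink by the factor 300 / source DPI. A rough sketch of that calculation; the 4000 x 6000 px page size is a made-up example, and the rounding may differ slightly from what ImageMagick's `-resample` actually does:

```python
def resampled_size(width_px: int, height_px: int, src_dpi: float, dst_dpi: float = 300.0):
    """Pixel dimensions after resampling from src_dpi to dst_dpi."""
    scale = dst_dpi / src_dpi
    # Round to the nearest pixel, never below 1 px.
    new_width = max(1, int(width_px * scale + 0.5))
    new_height = max(1, int(height_px * scale + 0.5))
    return new_width, new_height


src_dpi = 1225.29296875  # value identify reported for the original TIFF
width, height = resampled_size(4000, 6000, src_dpi)  # hypothetical page size
print(f"scale {300.0 / src_dpi:.3f}: 4000x6000 px -> {width}x{height} px")
```

This only gives the intended result if the DPI stored in the input file is the true one, so it would have to be applied to an image that still carries its resolution metadata.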
