Missing text with OCR-D #217
Comments
Does this workflow work better: https://github.com/slub/ocrd_manager/blob/main/workflows/ocr-workflow-default.sh?
Yeah, a pure-tesseract-only-workflow gives
But why is tesseract so sensitive to cropping and minimal deskewing?
Maybe it gets a wrong DPI value in your original workflow? Is the DPI value correct for the input image (which is the result of previous OCR-D processors)?
Too bad this wasn't my idea:
So the image resolution gets lost early in the OCR-D workflow, in olena-binarize? This looks like a bug to me, and the following processors might have the same issue. I wonder why nobody has noticed this until now. But maybe high-resolution images like yours are rare, and for 300 dpi images the damage is less severe.
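One way to check whether an intermediate PNG still carries its resolution is to look for the `pHYs` chunk, which is where PNG stores pixels per metre. The following is only a minimal sketch using the Python standard library; the function name `png_dpi` is made up for illustration and is not part of OCR-D or Tesseract.

```python
import struct
from typing import Optional, Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dpi(data: bytes) -> Optional[Tuple[float, float]]:
    """Return (x_dpi, y_dpi) from a PNG's pHYs chunk, or None if it is absent.

    Hypothetical helper for illustration only.
    """
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte length, 4-byte type, data, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"pHYs" and length == 9:
            ppu_x, ppu_y, unit = struct.unpack(">IIB", body)
            if unit != 1:
                # Unit 0 means "unknown": only the aspect ratio is defined.
                return None
            # Unit 1 means pixels per metre; one inch is 0.0254 m.
            return ppu_x * 0.0254, ppu_y * 0.0254
        if ctype in (b"IDAT", b"IEND"):
            break  # pHYs has to appear before the first IDAT chunk
        pos += 12 + length
    return None
```

For example, `png_dpi(open("OCR-D-005/OCR-D-005_00001.IMG-DESKEW.png", "rb").read())` would return `None` if the resolution was dropped somewhere upstream.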
I think relying on DPI without "reading distance" is not sufficient for 100% of all cases (though it covers 99% of the "usual" ones): a microfilm scan might have 2540 dpi, while a poster might have been scanned at 300 dpi but is typically read from several meters away.
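To put numbers on the point about DPI, here is a small arithmetic sketch. It only uses the definition of a typographic point as 1/72 inch; the 12 pt size is an arbitrary assumption, and how Tesseract actually uses the DPI value internally is not covered here.

```python
def glyph_height_px(point_size: float, dpi: float) -> float:
    """Pixel height of a font body of `point_size` points scanned at `dpi`."""
    return point_size / 72.0 * dpi


def apparent_point_size(height_px: float, assumed_dpi: float) -> float:
    """Point size a tool would infer from a pixel height and an assumed DPI."""
    return height_px / assumed_dpi * 72.0


for dpi in (300, 2540):
    print(f"{dpi:>5} dpi: 12 pt type is about {glyph_height_px(12, dpi):.0f} px tall")

# If the 2540 dpi scan loses its metadata and 300 dpi is assumed instead,
# the same glyphs look like very large type:
print(f"{apparent_point_size(glyph_height_px(12, 2540), 300):.1f} pt")
```

The same pixel height can therefore mean ordinary body text or poster-sized letters, depending on which resolution is assumed.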
I agree. My example is text with huge letters written on a wall. Ideally Tesseract should not depend on DPI values.
At least OCR-D could try to keep resolution information. Or I'll have to write a workaround, perhaps with exiftool.
tried workaround
For the complete workspace, see https://digi.ub.uni-heidelberg.de/diglitData/faber1566_-_0075r.tar
Image file resolutions:
What is really astonishing is that tesseract detects the correct DPI:
OK... changed my workflow to "you can have any resolution, as long as it's 300 dpi" (
After updating to ocrd/all:maximum (2024-07-10 15:00 CEST),
when OCR'ing https://digi.ub.uni-heidelberg.de/diglitData/v/faber1566_-_0075r.tif
Preview:
with this workflow:
I'll get this text:
When running
tesseract -l frak2021 OCR-D-005/OCR-D-005_00001.IMG-DESKEW.png pure-tesseract
(with the image after deskew), I'll get this text (see below):
Image preview:
And when running
tesseract -l frak2021 faber1566_-_0075r.tif pure-tesseract-from-original
I'll get
This problem, missing a lot of text in OCR-D, occurs on approx. 70-80% of all pages (depending on the book, of course).