segment-region: crop_polygons creates invalid coordinates #98

Open
bertsky opened this issue Dec 4, 2019 · 7 comments

@bertsky
Collaborator

bertsky commented Dec 4, 2019

When using ocrd-tesserocr-segment-region with crop_polygons=True, one frequently gets coordinates that extend beyond the segment's bounding box and can easily end up negative (which is syntactically forbidden in PAGE-XML).

So maybe Tesseract's BlockPolygon must be clipped just like its BoundingBox is clipped?
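To illustrate what such clipping could look like, here is a minimal pure-Python sketch that clips a polygon to an axis-aligned rectangle with the Sutherland-Hodgman algorithm. The function name `clip_polygon_to_rect` and the plain list-of-tuples representation are assumptions made for illustration; this is not how ocrd_tesserocr or Tesseract handle it.

```python
def clip_polygon_to_rect(polygon, xmin, ymin, xmax, ymax):
    """Clip a polygon (list of (x, y) tuples) to an axis-aligned rectangle.

    Hypothetical helper using Sutherland-Hodgman clipping; assumes the
    rectangle bounds are integers.
    """

    def clip_edge(points, inside, intersect):
        # Keep the part of the polygon on the inner side of one rectangle edge.
        result = []
        for i, cur in enumerate(points):
            prev = points[i - 1]
            if inside(cur):
                if not inside(prev):
                    result.append(intersect(prev, cur))
                result.append(cur)
            elif inside(prev):
                result.append(intersect(prev, cur))
        return result

    def cross_x(x):
        # Intersection of segment p-q with the vertical line at x.
        def intersect(p, q):
            t = (x - p[0]) / (q[0] - p[0])
            return (x, p[1] + t * (q[1] - p[1]))
        return intersect

    def cross_y(y):
        # Intersection of segment p-q with the horizontal line at y.
        def intersect(p, q):
            t = (y - p[1]) / (q[1] - p[1])
            return (p[0] + t * (q[0] - p[0]), y)
        return intersect

    edges = [
        (lambda p: p[0] >= xmin, cross_x(xmin)),
        (lambda p: p[0] <= xmax, cross_x(xmax)),
        (lambda p: p[1] >= ymin, cross_y(ymin)),
        (lambda p: p[1] <= ymax, cross_y(ymax)),
    ]
    points = list(polygon)
    for inside, intersect in edges:
        if not points:
            break
        points = clip_edge(points, inside, intersect)
    # Round to integer pixel coordinates for serialization.
    return [(int(round(x)), int(round(y))) for x, y in points]


# A polygon that pokes out of a 100x100 region on the left and at the top.
outline = [(-10, 20), (50, -5), (120, 40), (60, 110)]
clipped = clip_polygon_to_rect(outline, 0, 0, 100, 100)
```

With the bounds set to the region's bounding box (or to the image rectangle), no resulting point can be negative or exceed the box.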

@bertsky bertsky self-assigned this Dec 4, 2019
@bertsky
Collaborator Author

bertsky commented Dec 4, 2019

Also, this parameter should be called just polygons (because it is now independent of how cropping is done).

@wrznr
Contributor

wrznr commented Dec 4, 2019

I'd even say the parameter should be called bboxes as soon as this issue is fixed. Polygons should be the default.

@bertsky
Collaborator Author

bertsky commented Dec 4, 2019

> Polygons should be the default.

I agree, but we still have the issue of Tesseract internally generating invalid (self-intersecting) polygon paths, which show up in very strange ways on the consumer side (depending on how the coordinates are processed: numpy, skimage, cv2, etc.). But maybe it's enough to check for that as well, using Shapely, and as a workaround take the exterior or the self-union...
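As a dependency-free illustration of the kind of check meant here, the following sketch tests whether a polygon outline intersects itself by comparing every pair of non-adjacent edges. It is a simplified stand-in for a validity check such as the one Shapely offers, not Shapely's implementation; the name `is_simple` and the helper functions are hypothetical.

```python
def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    val = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (val > 0) - (val < 0)


def _on_segment(a, b, p):
    """True if p, already known to be collinear with a-b, lies within a-b."""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def _segments_intersect(p1, p2, q1, q2):
    """True if the closed segments p1-p2 and q1-q2 share at least one point."""
    d1, d2 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    d3, d4 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    if d1 != d2 and d3 != d4:
        return True
    # Collinear cases: an endpoint of one segment lies on the other.
    return ((d1 == 0 and _on_segment(q1, q2, p1))
            or (d2 == 0 and _on_segment(q1, q2, p2))
            or (d3 == 0 and _on_segment(p1, p2, q1))
            or (d4 == 0 and _on_segment(p1, p2, q2)))


def is_simple(polygon):
    """Return False if any two non-adjacent edges of the outline intersect.

    The polygon is a list of (x, y) tuples, implicitly closed.
    Runs in O(n^2), which is acceptable for typical region outlines.
    """
    n = len(polygon)
    edges = [(polygon[i], polygon[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Adjacent edges share a vertex by construction, so skip them.
            if j == i + 1 or (i == 0 and j == n - 1):
                continue
            if _segments_intersect(*edges[i], *edges[j]):
                return False
    return True


square = [(0, 0), (10, 0), (10, 10), (0, 10)]
bowtie = [(0, 0), (10, 10), (10, 0), (0, 10)]  # edges cross at (5, 5)
print(is_simple(square), is_simple(bowtie))
```

A region that fails such a check could then be repaired or replaced before its coordinates are written out.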

@stweil
Contributor

stweil commented Dec 4, 2019

The tesseract command line executable also has an issue with an endless loop when doing segmentation for certain images.

I did not test whether this affects ocrd_tesserocr, too.

@bertsky
Collaborator Author

bertsky commented Dec 16, 2019

This appears to affect all kinds of regions, but only when they have been rotated internally. Anyway, this is not about clipping to the image/rectangle.

@bertsky
Collaborator Author

bertsky commented Dec 19, 2019

We now have a partial solution in Tesseract itself, but on top of that I still hesitate to make a PR for the convex_hull workaround here...
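For readers unfamiliar with the idea, a convex-hull workaround replaces a possibly self-intersecting outline with the convex hull of its points. The sketch below uses Andrew's monotone chain algorithm in pure Python; the function name and the list-of-tuples input are assumptions for illustration and do not reflect code from this repository.

```python
def convex_hull(points):
    """Convex hull of a list of (x, y) tuples via Andrew's monotone chain.

    Returns the hull vertices in counter-clockwise order (in a y-up
    coordinate system), without repeating the first point.
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Cross product of vectors o->a and o->b; positive for a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        # Drop points that would make a right turn or lie on a straight line.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# A self-intersecting "bowtie" outline plus one interior point.
hull = convex_hull([(0, 0), (10, 10), (10, 0), (0, 10), (5, 5)])
print(hull)
```

The hull is always a simple polygon, but it can cover more area than the original outline, so it trades precision for validity.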

@bertsky
Collaborator Author

bertsky commented May 12, 2020

What if instead of trying to find the bug deep inside Tesseract's polyblk generator we take the liberty of annotating text regions along with text lines in one pass? (Perhaps even with #127 ...)
