segment-region: crop_polygons creates invalid coordinates #98
Comments
Also, this parameter should be called just |
I'd even say the parameter should be called bboxes as soon as this issue is fixed. Polygons should be the default. |
I agree, but we still have the issue of Tesseract generating invalid (self-intersecting) polygon paths internally, which end up in very strange ways on the consumer side (depending on how the coordinates are being processed, with numpy / skimage / cv2 etc). But maybe it's enough to check against that as well – using |
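A check of the kind mentioned here could look roughly like the following. This is a minimal, dependency-free sketch and not code from ocrd_tesserocr or Tesseract; the function names are hypothetical. It tests every pair of non-adjacent edges for an intersection, which is quadratic in the number of points but adequate for region outlines. It does not detect degenerate spikes where two adjacent edges fold back onto each other.

```python
from itertools import combinations


def _orient(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def _within_extent(a, b, p):
    """True if p (already known to be collinear with a-b) lies on segment a-b."""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def segments_intersect(p1, p2, p3, p4):
    """True if segment p1-p2 and segment p3-p4 share at least one point."""
    o1, o2 = _orient(p1, p2, p3), _orient(p1, p2, p4)
    o3, o4 = _orient(p3, p4, p1), _orient(p3, p4, p2)
    if o1 != o2 and o3 != o4:
        return True
    # Collinear cases: an endpoint of one segment lies on the other segment.
    return ((o1 == 0 and _within_extent(p1, p2, p3))
            or (o2 == 0 and _within_extent(p1, p2, p4))
            or (o3 == 0 and _within_extent(p3, p4, p1))
            or (o4 == 0 and _within_extent(p3, p4, p2)))


def is_self_intersecting(points):
    """True if any two non-adjacent edges of the closed polygon intersect."""
    n = len(points)
    edges = [(points[i], points[(i + 1) % n]) for i in range(n)]
    for i, j in combinations(range(n), 2):
        # Adjacent edges share an endpoint by construction, so skip them.
        if j == i + 1 or (i == 0 and j == n - 1):
            continue
        if segments_intersect(*edges[i], *edges[j]):
            return True
    return False
```

A "bowtie" outline such as `[(0, 0), (2, 2), (2, 0), (0, 2)]` is reported as self-intersecting, while the same four corners in ring order are not.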
The I did not test whether this affects ocrd_tesserocr, too. |
This appears to affect all kinds of regions, but only when they have been rotated internally. Anyway, this is not about clipping to the image/rectangle. |
We now have a partial solution in Tesseract itself, but on top of that I still hesitate to make a PR for the convex_hull workaround here... |
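For illustration, a convex-hull fallback might be sketched as below. This is an assumption about what such a workaround could look like, not the code that was actually considered for the PR; it uses Andrew's monotone chain algorithm in pure Python, and `convex_hull` is a hypothetical helper name. The hull is always a simple polygon, but it discards every concavity of the original outline, so it will usually cover more area than the region itself.

```python
def convex_hull(points):
    """Return the convex hull of 2D points as a list of (x, y) tuples.

    Uses Andrew's monotone chain; collinear points on the hull boundary
    are dropped. The result starts at the lexicographically smallest point.
    """
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o)
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]
```

Applied to a self-intersecting "bowtie" path, this yields the enclosing square, which any consumer can process safely at the cost of a looser outline.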
What if instead of trying to find the bug deep inside Tesseract's polyblk generator we take the liberty of annotating text regions along with text lines in one pass? (Perhaps even with #127 ...) |
When using ocrd-tesserocr-segment-region with crop_polygons=True, one will frequently get coordinates extending beyond the segment bbox, which could easily end up as negative coordinates (which is forbidden syntactically in PAGE-XML). So maybe Tesseract's BlockPolygon must be clipped just like its BoundingBox is clipped?
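To make the clipping idea concrete, here is a minimal sketch using the Sutherland-Hodgman algorithm against an axis-aligned rectangle. It is pure Python and purely illustrative: `clip_polygon_to_bbox` is a hypothetical helper, not part of ocrd_tesserocr or the Tesseract API, and how the polygon points are obtained from Tesseract is left out. The algorithm assumes a simple (non-self-intersecting) input polygon; for concave inputs it can produce overlapping edges along the rectangle border.

```python
def clip_polygon_to_bbox(points, x0, y0, x1, y1):
    """Clip a polygon (list of (x, y)) to the rectangle [x0, x1] x [y0, y1].

    Returns the clipped polygon as a list of (x, y) tuples, or an empty
    list if the polygon lies completely outside the rectangle.
    """
    def clip_against(poly, inside, intersect):
        out = []
        for i, cur in enumerate(poly):
            prev = poly[i - 1]  # wraps around to the last point for i == 0
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    def at_x(x):
        # Intersection of segment p-q with the vertical line at x.
        # Only called when p and q lie on different sides, so q[0] != p[0].
        def f(p, q):
            t = (x - p[0]) / (q[0] - p[0])
            return (x, p[1] + t * (q[1] - p[1]))
        return f

    def at_y(y):
        # Intersection of segment p-q with the horizontal line at y.
        def f(p, q):
            t = (y - p[1]) / (q[1] - p[1])
            return (p[0] + t * (q[0] - p[0]), y)
        return f

    poly = [(float(x), float(y)) for x, y in points]
    boundaries = [
        (lambda p: p[0] >= x0, at_x(x0)),  # left
        (lambda p: p[0] <= x1, at_x(x1)),  # right
        (lambda p: p[1] >= y0, at_y(y0)),  # top (image coordinates)
        (lambda p: p[1] <= y1, at_y(y1)),  # bottom
    ]
    for inside, intersect in boundaries:
        if not poly:
            break
        poly = clip_against(poly, inside, intersect)
    return poly
```

Clipping against the page rectangle, with `x0 = y0 = 0`, guarantees that no coordinate in the result is negative, which is the syntactic constraint PAGE-XML imposes.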