Memory usage explosion with very narrow images (e.g. book spine) #67

Open
mikegerber opened this issue Feb 15, 2022 · 6 comments
Labels
bug Something isn't working

Comments

@mikegerber
Member

mikegerber commented Feb 15, 2022

With this document (PPN894261851.zip) we experienced an out-of-memory (OOM) error. Further investigation revealed this memory usage (measured using procpath):

[Figure: memory usage of eynollah while processing the book spine (Buchrücken) image]

The culprit seems to be this "page" from the document - an image of a book spine:

[Image: FILE_0017_MAX.tif]

Relevant parts from the log output:

18:25:30.757 INFO eynollah - INPUT FILE PHYS_0017 (17/18)
18:25:30.780 INFO eynollah - resize and enhance image
18:25:30.780 INFO eynollah - Detected 25 DPI
18:25:40.756 INFO eynollah - Found 5 columns ([[4.1955504e-01 1.7818451e-13 2.7631987e-21 7.5972243e-22 5.8044493e-01
  0.0000000e+00]])
18:31:39.449 INFO eynollah - Image is enhanced
18:31:40.369 INFO eynollah - Enhancing took 369.5891568660736s
18:31:47.043 INFO eynollah - Image dimensions: 448x672
18:43:35.935 INFO eynollah - Image dimensions: 224x448
18:52:07.638 INFO eynollah - Image dimensions: 448x672
19:01:28.031 INFO eynollah - Textregion detection took 1787.6620445251465s
19:01:36.604 INFO eynollah - Graphics detection took 8.571088552474976s
19:01:36.604 INFO eynollah - cont_page [array([[  519,   445],
       [ 4404,   445],
       [ 4404, 27685],
       [  519, 27685]])]
19:01:41.160 INFO eynollah - Image dimensions: 448x672
19:08:15.645 INFO eynollah - textline detection took 399.04073786735535s
19:26:32.295 INFO eynollah - slope_deskew: -90.0
19:26:32.451 INFO eynollah - deskewing took 1096.8060252666473s
19:26:33.040 INFO eynollah - detection of marginals took 0.5885534286499023s
19:26:55.466 INFO eynollah - Image dimensions: 896x896
19:27:51.663 INFO eynollah - Image dimensions: 896x896
19:34:22.576 INFO eynollah - areas_cnt_text [1.60449940e-05 3.67248936e-05 4.69396395e-05 1.78734430e-05
 6.68446924e-05 1.59316018e-05 2.67794541e-05 3.35782605e-05
 2.04153178e-05 1.02601028e-04 1.49299709e-05 2.50974700e-05
 1.09640792e-04 4.56729543e-04 1.69521315e-05 7.82122588e-05
 9.06334276e-05 2.25603199e-04 1.58796304e-05 4.07455914e-05
 1.44858515e-05 1.97103964e-04 3.92242463e-05 2.14925435e-05
 2.01601854e-05 1.57520642e-05 1.14313495e-04 2.90331237e-05
 1.44291554e-04 2.15615238e-04 3.12064739e-05 4.46585667e-04
 2.03675986e-04 4.18700639e-05 2.75817038e-04 2.86669615e-04
 4.78515016e-05 1.76816212e-04 2.13172581e-04 2.02211337e-04
 3.27372684e-05 1.72403366e-05 1.62434303e-05 3.26522243e-05
 2.49226571e-05 1.41551243e-05 2.55297777e-04 2.39352001e-05
 1.48591008e-05 1.77080794e-05 1.41844173e-04 7.28828262e-05
 1.27079565e-04 1.09125803e-04 5.03886517e-05 1.61253135e-05
 2.59356273e-04 3.43578317e-05 1.49417826e-04 1.00711158e-04
 1.49819423e-05 5.42553252e-04 2.48706857e-05 2.26875554e-03
 4.71257916e-04 8.13966893e-05 7.39080805e-05 4.21195267e-04
 3.22033802e-05 2.35572262e-04 2.46580753e-05 2.20656465e-04
 2.95670119e-05 1.99759231e-05 4.83650737e-04 2.61520173e-04
 1.14686745e-04 5.78111151e-05 1.14729267e-04 1.89081467e-05
 1.68529133e-04 1.66998339e-04 1.72875834e-05 2.23552691e-04
 1.04831546e-03 6.28268293e-04 5.47693697e-04 1.98365452e-04
 2.78094331e-05 6.26397322e-05 5.01098959e-05 1.08133621e-04
 9.64258784e-05 5.27179162e-05 6.81203545e-05 1.25246392e-04
 7.48104933e-04 8.99908719e-05 6.32440181e-04 1.75379911e-05
 9.17437261e-05 3.56807405e-05 3.17781595e-05 2.56077349e-05
 1.14162306e-04 3.40275770e-04 1.91113077e-05 2.73133423e-05
 2.53143326e-04 4.32118714e-05 1.93848663e-04 3.59594963e-05
 1.95918070e-04 1.34687236e-03 1.60180634e-04 2.35761249e-05
 6.63717525e-04 4.14731913e-05 1.89790168e-05 1.82136195e-05
 1.86530142e-05 2.08773909e-04 2.22569958e-04 3.77780235e-04
 4.02589500e-05 5.98474497e-05 1.02081314e-04 3.75233635e-05
 4.72098908e-04 5.47306274e-05 1.23058868e-04 1.49281755e-03
 8.34802707e-05 1.13349662e-04 2.02093220e-04 2.57681376e-03
 2.15686108e-04 5.79150579e-05 4.43079958e-05 2.98197820e-04
 2.61132750e-05 8.44677276e-05 5.68189335e-05 3.62051794e-05
 7.14342410e-04 1.95589233e-03 1.87621542e-04 2.56549816e-05
 1.75568898e-05 1.43630100e-05 9.49763483e-04 5.73769175e-04
 3.36840932e-04 1.75474405e-05 1.04953916e-04 6.89329984e-05
 6.42224981e-05 2.66504705e-04 6.18412623e-05 5.68283828e-05
 2.05906032e-04 1.20568964e-04 2.07554943e-05 2.06421021e-05
 6.66509807e-05 3.16127959e-05 1.37913244e-05 7.39458779e-05
 3.90399840e-05 2.61038257e-05 2.60187815e-05 2.02953110e-04
 4.78609509e-05 1.26876404e-04 8.87908046e-05 4.99917791e-05
 2.68890665e-04 4.74404549e-05 1.45269562e-04 1.67092832e-04]
19:43:02.340 INFO eynollah - Job done in 4651.560835599899s

This log output is not from the run that hit the OOM error, but from another run I did on a different machine to investigate the problem. If I interpret the cont_page part correctly, the image is blown up to roughly 4404 x 27685 pixels, which would certainly explain the OOM error on the other machine.
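To put that size into perspective, here is a rough back-of-the-envelope estimate of what a single array of 4404 x 27685 pixels would occupy. The channel count and value types are assumptions for illustration only; the log does not say which buffers eynollah actually allocates at this size, or how many of them are alive at once.

```python
# Rough memory estimate for one image buffer of the size suggested by cont_page.
# Assumptions (not taken from the log): 3 channels, uint8 or float32 values.

def image_bytes(width, height, channels=3, bytes_per_value=1):
    """Return the size in bytes of a dense width x height x channels array."""
    return width * height * channels * bytes_per_value

WIDTH, HEIGHT = 4404, 27685  # far corner of the cont_page contour in the log

pixels = WIDTH * HEIGHT
uint8_rgb = image_bytes(WIDTH, HEIGHT)                       # 1 byte per value
float32_rgb = image_bytes(WIDTH, HEIGHT, bytes_per_value=4)  # 4 bytes per value

print(f"{pixels:,} pixels")
print(f"uint8 RGB:   {uint8_rgb / 1e6:.0f} MB")
print(f"float32 RGB: {float32_rgb / 1e9:.2f} GB")
```

Under these assumptions a single float32 copy is already well over a gigabyte, so a handful of intermediate copies (enhanced image, masks, per-class predictions) could plausibly exhaust memory on a smaller machine.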

Reproduce with `ocrd-eynollah-segment -I MAX -O TEST-SEGMENT -P models /path/to/models`.

mikegerber added the bug label Feb 15, 2022
@mikegerber
Member Author

mikegerber commented Feb 21, 2022

While eynollah should handle such images gracefully, we should also consider how to handle irrelevant images that are already marked as such in the METS structMap. In this case, those would likely be spine and colour_checker (which could also be SBB-defined types):

  <mets:structMap TYPE="LOGICAL">
    <mets:div ADMID="AMD" CONTENTIDS="http://resolver.staatsbibliothek-berlin.de/SBB000205BC00000000" DMDID="DMDLOG_0000" ID="LOG_0000" LABEL="Disputationum Medicarum Undecima, De Chirurgia" ORDERLABEL="Disputationum Medicarum Undecima, De Chirurgia" TYPE="monograph">
      <mets:div ID="LOG_0001" TYPE="binding">
        <mets:div ID="LOG_0002" TYPE="cover_front"/>
        <mets:div ID="LOG_0003" TYPE="paste_down"/>
        <mets:div ID="LOG_0004" TYPE="endsheet">
          <mets:div ID="LOG_0005" TYPE="contents"/>
        </mets:div>
      </mets:div>
      <mets:div ID="LOG_0006" TYPE="title_page"/>
      <mets:div DMDID="DMDLOG_0001" ID="LOG_0007" LABEL="Quaestio Prima. [bis] 44." TYPE="section"/>
      <mets:div ID="LOG_0008" TYPE="binding">
        <mets:div ID="LOG_0009" TYPE="endsheet"/>
        <mets:div ID="LOG_0010" TYPE="paste_down"/>
        <mets:div ID="LOG_0011" TYPE="cover_back"/>
        <mets:div ID="LOG_0012" TYPE="spine"/>
      </mets:div>
      <mets:div ID="LOG_0013" TYPE="colour_checker"/>
    </mets:div>

(Full document: PPN894261851.zip)
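As a minimal sketch of what "skipping pages marked as irrelevant" could look like, the snippet below collects the logical divs whose TYPE is in a skip list and resolves them to physical page IDs through mets:structLink. The embedded sample is an abbreviated, hypothetical document: the logical IDs follow the excerpt above, but the smLink entries and the PHYS_… targets are assumptions made for illustration, as are the function name `pages_to_skip` and the default skip list.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
NS = {"mets": METS_NS}

# Abbreviated, hypothetical METS; the smLink targets are illustrative only.
SAMPLE = f"""<mets:mets xmlns:mets="{METS_NS}" xmlns:xlink="{XLINK_NS}">
  <mets:structMap TYPE="LOGICAL">
    <mets:div ID="LOG_0000" TYPE="monograph">
      <mets:div ID="LOG_0006" TYPE="title_page"/>
      <mets:div ID="LOG_0008" TYPE="binding">
        <mets:div ID="LOG_0012" TYPE="spine"/>
      </mets:div>
      <mets:div ID="LOG_0013" TYPE="colour_checker"/>
    </mets:div>
  </mets:structMap>
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0006" xlink:to="PHYS_0005"/>
    <mets:smLink xlink:from="LOG_0012" xlink:to="PHYS_0017"/>
    <mets:smLink xlink:from="LOG_0013" xlink:to="PHYS_0018"/>
  </mets:structLink>
</mets:mets>"""


def pages_to_skip(mets_xml, skip_types=frozenset({"spine", "colour_checker"})):
    """Return sorted physical IDs linked to logical divs of a skipped TYPE."""
    root = ET.fromstring(mets_xml)
    logical = root.find("mets:structMap[@TYPE='LOGICAL']", NS)
    # Logical divs (at any depth) whose TYPE is in the skip list.
    skip_ids = {
        div.get("ID")
        for div in logical.iter(f"{{{METS_NS}}}div")
        if div.get("TYPE") in skip_types
    }
    # Follow structLink from those logical divs to physical pages.
    return sorted(
        link.get(f"{{{XLINK_NS}}}to")
        for link in root.iterfind("mets:structLink/mets:smLink", NS)
        if link.get(f"{{{XLINK_NS}}}from") in skip_ids
    )


print(pages_to_skip(SAMPLE))
```

Whether such a filter belongs in an individual processor or in the workflow layer is exactly the open question here; the sketch only shows that the information needed for the decision is already present in the METS.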

@bertsky @kba @cneud What are your thoughts on this?

@bertsky
Contributor

bertsky commented Feb 21, 2022

Yes, it should be possible to skip pages marked as certain types in the logical structmap – not just in any one processor, but as a general mechanism for workflows in OCR-D.

For the concrete set of supported page types, we should stick to the DFG Strukturdatenset (structural data set), which is strangely missing colour_checker.

This set is also partially supported by ocrd-anybaseocr-layout-analysis:

{'annotation': 0, 'binding': 1, 'chapter': 2, 'colour_checker': 3, 'contained_work': 4, 'contents': 5, 'cover': 6, 'edge': 7, 'endsheet': 8, 'epicedia': 9, 'illustration': 10, 'index': 11, 'musical_notation': 12, 'page': 13, 'paste_down': 14, 'preface': 15, 'provenance': 16, 'section': 17, 'sermon': 18, 'table': 19, 'title_page': 20}
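To make "partially supported" concrete, the following sketch compares the TYPE values that occur in the structMap excerpt earlier in this thread against the label dictionary above. The list `STRUCTMAP_TYPES` is transcribed by hand from that excerpt; the variable names are illustrative and not part of either tool.

```python
# Label dictionary as quoted above for ocrd-anybaseocr-layout-analysis.
ANYBASEOCR_LABELS = {
    'annotation': 0, 'binding': 1, 'chapter': 2, 'colour_checker': 3,
    'contained_work': 4, 'contents': 5, 'cover': 6, 'edge': 7, 'endsheet': 8,
    'epicedia': 9, 'illustration': 10, 'index': 11, 'musical_notation': 12,
    'page': 13, 'paste_down': 14, 'preface': 15, 'provenance': 16,
    'section': 17, 'sermon': 18, 'table': 19, 'title_page': 20,
}

# TYPE values from the logical structMap excerpt of PPN894261851 in this thread.
STRUCTMAP_TYPES = [
    'monograph', 'binding', 'cover_front', 'paste_down', 'endsheet',
    'contents', 'title_page', 'section', 'cover_back', 'spine',
    'colour_checker',
]

# Types used by the document that have no matching label.
unsupported = [t for t in STRUCTMAP_TYPES if t not in ANYBASEOCR_LABELS]
print(unsupported)  # ['monograph', 'cover_front', 'cover_back', 'spine']
```

Notably, colour_checker has a label here, while spine does not, so a skip mechanism could not rely on this label set alone for the book spine case.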

For the general mechanism, I suggest something along the lines of our --page-id CLI option's existing numerical range syntax, but more elaborate. For example, one could define filter operators that can look into the structmap, perhaps XPath expressions with predefined functions?

@mikegerber
Member Author

> Yes, it should be possible to skip pages marked as certain types in the logical structmap – not just in any one processor, but as a general mechanism for workflows in OCR-D.
>
> For the concrete set of supported page types, we should stick to the DFG Strukturdatenset, which is strangely missing colour_checker.

100% agree! Should we take this to an OCR-D core or spec issue? I have some additional thoughts to discuss (like: What happens with skipped pages in the output?)

@bertsky
Contributor

bertsky commented Feb 21, 2022

> Should we take this to an OCR-D core or spec issue?

Yes, we should elevate this to OCR-D/spec.

> I have some additional thoughts to discuss (like: What happens with skipped pages in the output?)

There is already some discussion on skip strategies for API changes in spec...

@cneud
Member

cneud commented Aug 17, 2023

With the current version including #67 I was able to

  • process FILE_0017_MAX.tif successfully without memory explosion
  • process the whole document PPN894261851 using the -di flag without running into memory issues

Is there anything relevant from here that is still needed for OCR-D/spec#172 (comment) or can we close this?

@mikegerber
Member Author

> With the current version including #67 I was able to
>
> * process `FILE_0017_MAX.tif` successfully without memory explosion
> * process the whole document PPN894261851 using the `-di` flag without running into memory issues
>
> Is there anything relevant from here that is still needed for OCR-D/spec#172 (comment) or can we close this?

I wouldn't know. The current version is not working for OCR-D, so I can't reproduce this until it's fixed. (Yes, there is an elaborate workaround, but I am not willing to invest the time to reproduce while a lengthy changeset (#86) is still missing.)
