OCR integration problem #1638

paksak · 2023-04-05T09:28:12Z

To read PDF files, I installed tesseract-ocr and enabled OCR in FSCrawler as described in the manual.
I ran FSCrawler on several hundred PDF files.
After the run, some PDF files were indexed in Workplace Search (WS), but only about one in five.
Even though I tried many times, the omitted PDFs were never indexed.
I wonder why OCR works for some files and not for others.
What is the reason for that? Is there something in tesseract-ocr?

dadoonet · 2023-04-05T09:36:25Z

Maintainer

What is the reason for that?

Hard to tell. First, could you share the files with us so we can test this?
Otherwise, you can start fscrawler with --trace to see if we can find some details.

1 reply

paksak Apr 5, 2023
Author

Thank you for the quick reply! I will do what you suggested.

paksak · 2023-04-05T11:13:04Z

Author

I think I found the reason.
The size of the PDF was the problem (error message below).
The Tesseract processing limit is 100000 characters, and that is why parsing stopped.
I still have to find another way to solve this problem, but I'm glad I found the reason why crawling stopped.
Thank you, David, as always. Have a good day.

19:58:29,231 WARN [f.p.e.c.f.t.TikaDocParser] Failed to extract [100000] characters of text for [/usr/local/lib/apache-tomcat-8.5.83/webapps/ROOT/doc/dONXwzf19R/MSE_and_Drilling Efficiency_1_1.pdf]: Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout

3 replies

dadoonet Apr 5, 2023
Maintainer

No. The reason is the timeout. There is no 100000 limit in Tesseract. That's just the default limit for the number of chars to be extracted by FSCrawler.

The interesting part is: TesseractOCRParser timeout.
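
To see at a glance which files failed because of the OCR timeout rather than the character limit, the WARN line quoted in this thread can be split into its parts. The following is only a minimal sketch: the regular expression is modelled on that single log line, and the names `WARN_RE`, `OcrFailure`, and `parse_warn_line` are hypothetical helpers, not part of FSCrawler. Other FSCrawler versions may format the message differently.

```python
import re
from typing import List, NamedTuple, Optional

# Assumption: the pattern is derived from the one WARN line quoted in this
# thread. The brackets around the path are treated as optional so that
# slightly different renderings of the same message still match.
WARN_RE = re.compile(
    r"Failed to extract \[(?P<limit>\d+)\]\s*characters of text for "
    r"\[?(?P<path>.+?)\]?\)?: (?P<cause>.+)$"
)


class OcrFailure(NamedTuple):
    """One failed extraction as reported in the FSCrawler log."""
    path: str
    char_limit: int      # the configured extraction limit, not a Tesseract limit
    causes: List[str]    # cause chain, outermost first

    @property
    def timed_out(self) -> bool:
        # True when any element of the cause chain mentions a timeout.
        return any("timeout" in cause.lower() for cause in self.causes)


def parse_warn_line(line: str) -> Optional[OcrFailure]:
    """Return the parsed failure, or None if the line is not such a warning."""
    match = WARN_RE.search(line.strip())
    if match is None:
        return None
    # The message chains causes with "->"; split them into a list.
    causes = [part.strip() for part in match.group("cause").split("->")]
    return OcrFailure(match.group("path"), int(match.group("limit")), causes)
```

Feeding every line of the log through `parse_warn_line` and keeping the results where `timed_out` is true gives the list of PDFs that the OCR step gave up on. The `char_limit` field is kept separate on purpose, since the bracketed number in the message is the extraction limit and not the cause of the failure.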

paksak Apr 5, 2023
Author

I see. I will check my server resources and Tesseract configuration.
Thanks, David.

paksak Apr 10, 2023
Author

I think it was a matter of the enterprise-search.yml settings.
After I changed workplace_search.content_source.document_size.limit from 100k to 10000k, the timeout disappeared.
In addition, I ran in trace mode to avoid session disconnects.
With these two options, I could upload thousands of PDF files without disconnection.
Thank you, David.
