
OCR integration problem #1638

Answered by paksak
paksak asked this question in Q&A
Apr 5, 2023 · 2 comments · 4 replies

I think I found the cause.
The size of the PDF was the problem (error message below).
Tesseract's processing limit is 100,000 characters, and that's why parsing stopped.
I still have to find another way to solve this problem, but I'm glad I found out why crawling stopped.
Thank you, David, as always. Have a good day.

19:58:29,231 WARN [f.p.e.c.ft.TikaDocParser] Failed to extract [100000] characters of text for [/usr/local/lib/apache-tomcat-8.5.83/webapps/ROOT/doc/dONXwzf19R/MSE_and_Drilling Efficiency_1_1.pdf]: Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout
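
For anyone who wants to find every file that failed this way, here is a minimal Python sketch that pulls the file path, the reported character limit, and the chain of causes out of a warning line like the one above. The regular expression is inferred only from this single log line, so treat the pattern as an assumption rather than the crawler's documented log format; `parse_failure` and `LINE_RE` are hypothetical names for illustration.

```python
import re

# Pattern inferred from the single warning line quoted in this thread.
# It tolerates either "(" or "[" around the logger name and the path,
# because pasted log lines are not always copied cleanly.
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"[\[(](?P<logger>[^\])]+)[\])]\s+"
    r"Failed to extract \[(?P<limit>\d+)\]\s*characters of text for\s+"
    r"\[?(?P<path>.+?)[\])]?:\s+"
    r"(?P<chain>.+)$"
)


def parse_failure(line):
    """Return details of an extraction failure, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # The message chains causes with "->"; the last one is the innermost cause.
    causes = [cause.strip() for cause in match.group("chain").split("->")]
    return {
        "path": match.group("path"),
        "limit": int(match.group("limit")),
        "causes": causes,
        "root_cause": causes[-1],
    }


if __name__ == "__main__":
    sample = (
        "19:58:29,231 WARN (f.p.e.c.ft.TikaDocParser] Failed to extract "
        "[100000]characters of text for /usr/local/lib/apache-tomcat-8.5.83/"
        "webapps/ROOT/doc/dONXwzf19R/MSE_and_Drilling Efficiency_1_1.pdf): "
        "Unable to extract PDF content -> Unable to end a page -> "
        "TesseractOCRParser timeout"
    )
    info = parse_failure(sample)
    print(info["path"])
    print(info["root_cause"])
```

Running this over a whole log file and grouping by `root_cause` should make it easier to tell files that hit the OCR timeout apart from files that failed for other reasons.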

Answer selected by paksak