documents.log is empty, but documents are getting sent to my index #1667
Comments
OK, that attachment was not inserted where I was expecting, sorry! Still a fairly relevant spot, at least :)
OK, the document White.Dwarf.Magazine.Issue.001.-.Jun.1977.UK.-001.pdf is in my index now that the job has completed (13,425 documents). That is odd: it has the oldest created date and the first name in alphabetical order of all the documents, so it should have been the first document in there, not one that arrived after some 7,000 others. documents.log is still empty.
$ cat logs/fscrawler.log
So what would be a good order, in your opinion? For some use cases, I have the feeling that the most recent documents are more relevant than the oldest. What do you think?
I think oldest file first, by the date and time it arrives in the scanning folder (last modified date, maybe?). Users will expect first in, first out for the index when they're using it. So if a set of files gets written into the monitored folders over the day, the user would expect the first ones that went in to appear in the index first. Those are my thoughts, anyway! I don't really mind what the order is, as long as it is documented somewhere I can find it. I can mess around with data prep to get the order I need if I have a particular requirement, as long as I know what I'm aiming for.
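To make the first-in, first-out suggestion above concrete, here is a minimal sketch that lists files oldest-modified first. It assumes last-modified time is a fair stand-in for arrival time in the scanned folder; `fifo_order` is a hypothetical helper name, and this is not how FSCrawler itself orders files.

```python
from pathlib import Path


def fifo_order(folder):
    """Return all files under `folder`, oldest last-modified time first.

    Assumption: mtime approximates when the file arrived in the folder.
    Ties on mtime are broken by file name so the order is deterministic.
    """
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    return sorted(files, key=lambda p: (p.stat().st_mtime, p.name))
```

Running this over the monitored folder before a crawl would at least show which order a "first in, first out" policy would produce, so it can be compared with what actually lands in the index.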
I am experiencing the same issue with documents.log using Docker. My documents.log file does record errors, but it isn't recording the documents that were indexed: 2023-08-02 08:38:31,003 [ERROR] [603.pdf][/23-90020/603.pdf] Unable to extract PDF content -> Unable to end a page -> TesseractOCRParser timeout
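If documents.log does contain error lines like the one quoted above, a small script could pull out the failing paths instead of checking documents by hand. This is only a sketch: the line layout (timestamp, level, `[file name][path]`, message) is inferred from that single quoted line, and `failed_documents` is a hypothetical helper, not part of FSCrawler.

```python
import re

# Layout assumed from the one error line quoted in this thread:
#   <date> <time>,<ms> [LEVEL] [file name][path] message
# File names containing "]" would not match this pattern.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<level>[A-Z]+)\s*\] "
    r"\[(?P<name>[^\]]*)\]\[(?P<path>[^\]]*)\] "
    r"(?P<message>.*)$"
)


def failed_documents(lines):
    """Return (path, message) pairs for every ERROR line that matches."""
    failures = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "ERROR":
            failures.append((match.group("path"), match.group("message")))
    return failures
```

For example, `failed_documents(open("logs/documents.log"))` would return one tuple per failing document, assuming the log lives at that path.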
Describe the bug
Running with docker-compose, with the logging directory set in the docker-compose file. fscrawler.log gets populated and rotated, but documents.log in that same folder does not.
Job Settings
$ cat config/whitedwarfscryer/_settings.yaml
$ cat docker-compose.yml
Logs
$ docker logs fscrawler
$ cat fscrawler.log
Expected behavior
I am expecting that, since fscrawler runs from inside this container, documents.log would be populated. It seems to have been created the first time I ran this container a week ago, but it has never had any info in it, despite my index being populated successfully and fscrawler.log getting populated and rotated. Assuming the documents are being scanned in alphabetical order (I could not find any info in the docs, but Bard said it was alphabetical first...?), White Dwarf Magazine Issue 001 - Jun 1977 (UK)-001.pdf should have been scanned first, yet not all documents are going into my index. I suspect some are erroring out, but I can't see which ones, and I don't want to manually check 13,000 documents. Hence looking for documents.log info.
Versions:
Attachment
Attempting to attach the document that should have been scanned first, but does not appear in my index.