Files are not deleted when deleting folder #1904

Open
sagentac opened this issue Jul 17, 2024 · 5 comments
Labels
check_for_bug Needs to be reproduced

Comments

@sagentac

sagentac commented Jul 17, 2024

Describe the bug

When deleting a folder the inner files are not deleted.
All nested folders are deleted correctly, and also the files are deleted correctly if they are removed directly.

Job Settings

---
name: "m9989"
fs:
  url: "\\\\domain.com\\global\\A\\Applications\\004\\006\\001\\M9988"
  update_rate: "1d"
  excludes:
  - "*\\*~*"
  - "*\\*History.xls"
  - "*\\*README.txt"
  - "*\\*.zip"
  - "*\\*.7z"
  - "*\\*.gz"
  - "*\\*.tgz"
  - "*\\*.rar"
  - "*\\*.iso"
  - "*\\*.001"
  - "*\\*.cab"
  - "*\\*.bz2"
  - "*\\*.rpm"
  - "*\\*.dmg"
  - "*\\*.arj"
  - "*\\*.z"
  - "*\\*.x_t"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  indexed_chars: "99%"
  attributes_support: false
  raw_metadata: false
  checksum: "MD5"
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  tika_config_path: "D:/wap/tika/config/tika-config.xml"
  ocr:
    enabled: false
  follow_symlinks: false
elasticsearch:
  index: "mdoc_m9989"
  index_folder: "mdoc_m9989_folder"
  pipeline: "external_doc_v2"
  nodes:
  - url: "https://server.domain.com:9200/"
  bulk_size: 400
  flush_interval: "60s"
  byte_size: "100mb"
  ssl_verification: false

Logs

12:51:52,786 TRACE [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Querying elasticsearch for files in dir [path.root:19db9e37d35ed2fbe9aed71b2271a5]
12:51:52,786 TRACE [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch query to run: {"size":10000,"stored_fields" : ["file.filename"],"query" : {"term": { "path.root": "19db9e37d35ed2fbe9aed71b2271a5"}}}
12:51:52,786 TRACE [f.p.e.c.f.c.ElasticsearchClient] Calling POST https://server.domain.com:9200//mdoc_m9989/_search with params [version=true]
12:51:52,786 TRACE [f.p.e.c.f.c.ElasticsearchClient] POST https://a1wapapp184.europe.prestagroup.com:9200//mdoc_m9989/_search gives {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":2,"relation":"eq"},"max_score":0.18232156,"hits":[{"_index":"mdoc_m9989","_type":"_doc","_id":"Example.pdf","_version":1,"_score":0.18232156,"fields":{"file.filename":["Example.pdf"]}},{"_index":"mdoc_m9989","_type":"_doc","_id":"file-example_PDF_1MB.pdf","_version":1,"_score":0.18232156,"fields":{"file.filename":["file-example_PDF_1MB.pdf"]}}]}}
12:51:52,786 TRACE [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] We found: [Example.pdf, file-example_PDF_1MB.pdf]
12:51:52,786 DEBUG [f.p.e.c.f.FsParserAbstract] Deleting mdoc_m9989/96cb945bf79a714ce1fef45f99043f8
12:51:52,786 DEBUG [f.p.e.c.f.FsParserAbstract] Deleting mdoc_m9989/f3894f59b290cce3a237e5b5b94b54c

This is just a snippet; here is the full log:
log_delete.log

It seems like fscrawler is finding the files and trying to delete them, but the ids (96cb945bf79a714ce1fef45f99043f8, f3894f59b290cce3a237e5b5b94b54c) are not matching.
The files have different ids:
image

Expected behavior

When a folder is deleted, all files inside this folder are also deleted.

Versions:

Best,
Philipp

@sagentac sagentac added the check_for_bug Needs to be reproduced label Jul 17, 2024
@dadoonet
Owner

Thanks! This indeed smells like a bug in the way we are computing the _id. I need to check this later.
Thanks for opening this and sharing the details!

@sagentac
Author

sagentac commented Jul 18, 2024

@dadoonet the issue seems to come from the url, since we are indexing from a network share.
If I map the network drive to a letter, K for example, the deletion works.

So I changed the url in the yaml from this:
\\\\domain.com\\global\\A\\Applications\\004\\006\\001\\M9988

to this:
K:/004/006/001/M9989

EDIT:
It's caused by the double \\ instead of a single /

So I guess this does not have high priority :)

Best,
Philipp
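
A minimal sketch of why the separator can matter, assuming the document _id is derived from a hash of the full path string (the thread does not show how FSCrawler actually computes it). `path_id` and `normalize` are hypothetical helpers, and MD5 is used only for illustration:

```python
import hashlib


def path_id(path: str) -> str:
    """Hypothetical id function: MD5 hex digest of the path string exactly as given."""
    return hashlib.md5(path.encode("utf-8")).hexdigest()


def normalize(path: str) -> str:
    """Replace backslashes with forward slashes so both spellings compare equal."""
    return path.replace("\\", "/")


# The same file, written once with backslashes and once with forward slashes.
unc_backslash = r"\\domain.com\global\A\Applications\004\006\001\M9988\Example.pdf"
unc_slash = "//domain.com/global/A/Applications/004/006/001/M9988/Example.pdf"

# Hashing the raw strings gives two different ids for one file.
ids_differ = path_id(unc_backslash) != path_id(unc_slash)

# Hashing after normalizing the separator gives the same id.
ids_match_after_normalize = (
    path_id(normalize(unc_backslash)) == path_id(normalize(unc_slash))
)

print(ids_differ, ids_match_after_normalize)
```

If the id computed at index time and the id computed at delete time see different spellings of the same path, the delete request would target a document that does not exist, which would be consistent with the behaviour reported here.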

@dadoonet
Owner

w00t! Nice finding! Thanks for debugging this.
I'll try to find a way to fix that, unless you have an idea yourself of how to fix it ;)

@dadoonet
Owner

dadoonet commented Aug 7, 2024

12:51:52,786 TRACE [f.p.e.c.f.c.ElasticsearchClient] POST https://a1wapapp184.europe.prestagroup.com:9200//mdoc_m9989/_search gives {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":2,"relation":"eq"},"max_score":0.18232156,"hits":[{"_index":"mdoc_m9989","_type":"_doc","_id":"Example.pdf","_version":1,"_score":0.18232156,"fields":{"file.filename":["Example.pdf"]}},{"_index":"mdoc_m9989","_type":"_doc","_id":"file-example_PDF_1MB.pdf","_version":1,"_score":0.18232156,"fields":{"file.filename":["file-example_PDF_1MB.pdf"]}}]}}

There's something I don't understand. In the index mdoc_m9989 you apparently have 2 docs.
The _id values of the 2 docs are: "_id":"file-example_PDF_1MB.pdf" and "_id":"Example.pdf".

Which does not make sense to me as you are using: filename_as_id: false.

How did you end up creating those documents?

Obviously, these cannot match the generated _id values:

12:51:52,786 DEBUG [f.p.e.c.f.FsParserAbstract] Deleting mdoc_m9989/96cb945bf79a714ce1fef45f99043f8
12:51:52,786 DEBUG [f.p.e.c.f.FsParserAbstract] Deleting mdoc_m9989/f3894f59b290cce3a237e5b5b94b54c

Do you have an idea of why this is happening?
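
The mismatch can be checked mechanically. The sketch below uses only values copied from the log lines in this thread (the search response is trimmed to the fields it needs) and compares the _id values stored in the index with the ids that the delete step targets:

```python
import json

# Search response copied from the TRACE line above, trimmed to the fields used here.
response = json.loads("""
{"hits": {"total": {"value": 2, "relation": "eq"}, "hits": [
  {"_index": "mdoc_m9989", "_id": "Example.pdf",
   "fields": {"file.filename": ["Example.pdf"]}},
  {"_index": "mdoc_m9989", "_id": "file-example_PDF_1MB.pdf",
   "fields": {"file.filename": ["file-example_PDF_1MB.pdf"]}}
]}}
""")

# Ids taken from the two "Deleting mdoc_m9989/..." DEBUG lines.
delete_ids = {
    "96cb945bf79a714ce1fef45f99043f8",
    "f3894f59b290cce3a237e5b5b94b54c",
}

# Ids of the documents that actually exist in the index.
indexed_ids = {hit["_id"] for hit in response["hits"]["hits"]}

# Documents that no delete request targets.
missed = indexed_ids - delete_ids
print(sorted(missed))
```

Both indexed documents end up in `missed`, so neither delete request can remove anything.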

@sagentac
Author

sagentac commented Aug 9, 2024

Hey, thanks for looking into this :)

Sorry, I think I messed up the index while trying to track down the issue.

I have now re-created the scenario (note the folder 1_neu, which contains the file 1_neu3.txt):

  1. Indexed it from scratch
  2. Deleted the folder 1_neu
  3. Indexed again with loop 1 and restart false

The document index looks the same before and after step 3:
image

The folder index had the folder after step 1:
image
The folder was correctly deleted after step 3.

Here is the new trace log (step 3):
log.log

It seems like fscrawler is not able to find any files by path.root:
09:02:55,722 DEBUG [f.p.e.c.f.FsParserAbstract] Delete folder [https://doku.europe.prestagroup.com/006_ADoku/001_P/M9989/1_neu]
09:02:55,722 TRACE [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] Querying elasticsearch for files in dir [path.root:827795e54272bd4aa6c6245f499bc76]
09:02:55,722 TRACE [f.p.e.c.f.c.ElasticsearchClient] Elasticsearch query to run: {"size":10000,"stored_fields" : ["file.filename"],"query" : {"term": { "path.root": "827795e54272bd4aa6c6245f499bc76"}}}
09:02:55,722 TRACE [f.p.e.c.f.c.ElasticsearchClient] Calling POST https://a1wapapp184.europe.prestagroup.com:9200//adoku_m9989/_search with params [version=true]
09:02:55,722 TRACE [f.p.e.c.f.c.ElasticsearchClient] POST https://a1wapapp184.europe.prestagroup.com:9200//adoku_m9989/_search gives {"took":0,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]}}
09:02:55,722 TRACE [f.p.e.c.f.s.FsCrawlerManagementServiceElasticsearchImpl] We found: []
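
For anyone reproducing this against their own index, here is a rough sketch of the lookup seen in the log. The query body and the response are copied from the TRACE lines; `build_dir_query` and `filenames_from_response` are hypothetical helper names, not FSCrawler functions:

```python
import json


def build_dir_query(root_hash: str, size: int = 10000) -> dict:
    """Build the term query on path.root shown in the TRACE log."""
    return {
        "size": size,
        "stored_fields": ["file.filename"],
        "query": {"term": {"path.root": root_hash}},
    }


def filenames_from_response(response: dict) -> list[str]:
    """Collect file.filename values from every hit in a search response."""
    names = []
    for hit in response["hits"]["hits"]:
        names.extend(hit.get("fields", {}).get("file.filename", []))
    return names


# Hash of the deleted folder, as logged for 1_neu.
query = build_dir_query("827795e54272bd4aa6c6245f499bc76")

# Response body copied from the log: zero hits.
empty_response = json.loads(
    '{"took":0,"timed_out":false,'
    '"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},'
    '"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]}}'
)

print(json.dumps(query))
print(filenames_from_response(empty_response))
```

An empty list here means the delete step has nothing to iterate over, so the file documents stay in the index. Running the same query by hand and comparing the logged hash with the path.root value stored on the indexed file documents would show whether the two sides disagree.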

I hope this helps
