Document gets indexed incorrectly in .doc #1602

Open
NitzaAg opened this issue Feb 8, 2023 · 0 comments
Labels
check_for_bug Needs to be reproduced

Comments

NitzaAg commented Feb 8, 2023

Describe the bug

We are indexing a .doc file that has page numbers in its headers. After indexing, a number appears at the beginning of the indexed content, and we presume it is a page number from the document's headers. No other header content gets indexed. With the job below, the indexed content of the .doc starts with a 4.

Is there a way to prevent headers from being indexed? For this document, saving the .doc as .docx avoided the problem.
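
As a stopgap while this is unresolved, the stray value could be detected and stripped after extraction. The following is only a minimal sketch, not an FSCrawler feature: the function names are hypothetical, and it assumes the stray header shows up as a bare integer of at most four digits at the very start of the text, followed by whitespace.

```python
import re

# Assumption: the stray header value is a bare integer (1-4 digits) at the
# very start of the extracted text, followed by whitespace and more text.
_LEADING_NUMBER = re.compile(r"\A\s*(\d{1,4})\s+(?=\S)")


def leading_page_number(content: str):
    """Return the leading bare integer if the text starts with one, else None."""
    match = _LEADING_NUMBER.match(content)
    return int(match.group(1)) if match else None


def strip_leading_page_number(content: str) -> str:
    """Remove a leading bare integer; leave all other text untouched."""
    return _LEADING_NUMBER.sub("", content, count=1)
```

Note that this heuristic cannot tell a page number from a document that legitimately begins with a number (for example a year), so it is better suited to flagging affected documents than to blind cleanup.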

Job Settings

---
name: "test_doc"
fs:
  url: "C:\\Users\\usr\\Documents\\Work\\nas\\download\\LPR\\doc"
  update_rate: "15m"
  excludes:
  - "*\\.docx"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - cloud_id: "XXXX"
  username: "XXXX"
  password: "XXXX"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
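
To check what was indexed, the extracted text can be fetched from the `test_doc` index. A minimal sketch of the search body follows, assuming the default FSCrawler mapping where the extracted text is stored in `content` and the file name in the keyword field `file.filename`. The helper names and the file name `example.doc` are hypothetical.

```python
import json


def build_content_query(filename: str) -> dict:
    """Search body that fetches the extracted text of a single file."""
    return {
        "size": 1,
        "_source": ["content", "file.filename"],
        # Assumes file.filename is mapped as a keyword (exact match).
        "query": {"term": {"file.filename": filename}},
    }


def content_preview(hit: dict, chars: int = 40) -> str:
    """Return the first characters of the indexed text of one search hit."""
    return hit["_source"].get("content", "")[:chars]


if __name__ == "__main__":
    # Body to send as POST <cluster>/test_doc/_search
    print(json.dumps(build_content_query("example.doc"), indent=2))
```

Applying `content_preview` to the returned hit makes it easy to see whether the text starts with the stray number.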

Logs

12:11:17,791 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [230.1mb/3.9gb=5.72%], RAM [5.8gb/15.7gb=37.31%], Swap [5.9gb/22.2gb=26.82%].
12:11:18,052 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
12:11:18,054 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
12:11:19,508 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.6
12:11:20,039 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.6
12:11:20,457 WARN  [o.e.c.RestClient] request [PUT https://eed5e799ef89480a99731a58e8d2ec8b.us-east-2.aws.elastic-cloud.com:443/test_doc?master_timeout=30s&timeout=30s] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 "Camel case format name dateOptionalTime is deprecated and will be removed in a future version. Use snake case name date_optional_time instead."]
12:11:20,818 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [test_doc] for [C:\Users\usr\Documents\Work\nas\download\LPR\doc] every [15m]
12:11:20,934 INFO  [f.p.e.c.f.t.TikaInstance] OCR is disabled.
12:11:21,839 WARN  [o.e.c.RestClient] request [POST https://eed5e799ef89480a99731a58e8d2ec8b.us-east-2.aws.elastic-cloud.com:443/test_doc/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
12:11:21,928 WARN  [o.e.c.RestClient] request [POST https://eed5e799ef89480a99731a58e8d2ec8b.us-east-2.aws.elastic-cloud.com:443/test_doc_folder/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]

Expected behavior

The document is indexed either without its headers or with all headers included in order.

Versions:

  • OS: Windows 11, version 22H2
  • FSCrawler Version 2.9
@NitzaAg NitzaAg added the check_for_bug Needs to be reproduced label Feb 8, 2023