Document gets indexed incorrectly in .doc #1602

NitzaAg · 2023-02-08T17:08:01Z

Describe the bug

We are indexing a .doc file that has page numbers in its headers. After indexing, a number appears at the beginning of the indexed content, which we presume is a page number from the document's headers. No other header content gets indexed. With the job below, the indexed document starts with a 4.

Is there a way to prevent headers from being indexed? For this document, we noticed that saving the .doc as .docx avoided the problem.
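
To check other documents for the same symptom, a small script along these lines might help. It is only a sketch: `leading_page_number` and `strip_leading_page_number` are hypothetical helpers, not part of FSCrawler, and they assume the stray number ends up alone on the first non-empty line of the extracted text, which may not hold for every document.

```python
import re
from typing import Optional

# Hypothetical helpers, not part of FSCrawler. Assumption: a page number
# leaking from a .doc header shows up as a digits-only first line.
_LEADING_NUMBER = re.compile(r"\A\s*(\d{1,4})\s*\n")


def leading_page_number(content: str) -> Optional[int]:
    """Return the number if the text opens with a line holding only digits."""
    match = _LEADING_NUMBER.match(content)
    return int(match.group(1)) if match else None


def strip_leading_page_number(content: str) -> str:
    """Drop a leading digits-only line; other text is returned unchanged."""
    return _LEADING_NUMBER.sub("", content, count=1)
```

Running `leading_page_number` over the extracted text of each indexed document would flag the affected ones. Stripping the number after the fact is only a workaround and does not address why the header is extracted in the first place.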

Job Settings

---
name: "test_doc"
fs:
  url: "C:\\Users\\usr\\Documents\\Work\\nas\\download\\LPR\\doc"
  update_rate: "15m"
  excludes:
  - "*\\.docx"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - cloud_id: "XXXX"
  username: "XXXX"
  password: "XXXX"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"

Logs

12:11:17,791 INFO  [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [230.1mb/3.9gb=5.72%], RAM [5.8gb/15.7gb=37.31%], Swap [5.9gb/22.2gb=26.82%].
12:11:18,052 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
12:11:18,054 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
12:11:19,508 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.6
12:11:20,039 INFO  [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.6
12:11:20,457 WARN  [o.e.c.RestClient] request [PUT https://eed5e799ef89480a99731a58e8d2ec8b.us-east-2.aws.elastic-cloud.com:443/test_doc?master_timeout=30s&timeout=30s] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 "Camel case format name dateOptionalTime is deprecated and will be removed in a future version. Use snake case name date_optional_time instead."]
12:11:20,818 INFO  [f.p.e.c.f.FsParserAbstract] FS crawler started for [test_doc] for [C:\Users\usr\Documents\Work\nas\download\LPR\doc] every [15m]
12:11:20,934 INFO  [f.p.e.c.f.t.TikaInstance] OCR is disabled.
12:11:21,839 WARN  [o.e.c.RestClient] request [POST https://eed5e799ef89480a99731a58e8d2ec8b.us-east-2.aws.elastic-cloud.com:443/test_doc/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
12:11:21,928 WARN  [o.e.c.RestClient] request [POST https://eed5e799ef89480a99731a58e8d2ec8b.us-east-2.aws.elastic-cloud.com:443/test_doc_folder/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]

Expected behavior

The document is indexed either without its headers or with all of its headers, in the correct order.

Versions:

OS: Windows 11
Version 22H2
FSCrawler Version 2.9

NitzaAg added the check_for_bug Needs to be reproduced label Feb 8, 2023
