Describe the bug
We are indexing a .doc file that has page numbers in its headers. After indexing, a number appears at the beginning of the indexed content, which we presume is a page number from the document's headers. No other header content gets indexed. With this job, the indexed .doc starts with a 4.
Is there a way to prevent headers from being indexed? For this document, saving the .doc as .docx made the problem go away.
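For anyone trying to reproduce this, here is a minimal sketch of how the symptom could be detected once the extracted text is available as a plain string (for example, copied from the indexed document's `content` field). The helper name `leading_stray_number` is hypothetical and not part of FSCrawler, and the check assumes the stray page number sits on a line of its own before the real text, which may not hold for every document.

```python
import re
from typing import Optional

# Hypothetical helper, not part of FSCrawler.
# Assumption: a leaked page number shows up as 1 to 4 digits alone on the
# first non-blank line of the extracted text.
_LEADING_NUMBER = re.compile(r"\s*(\d{1,4})[ \t]*\r?\n")


def leading_stray_number(content: str) -> Optional[int]:
    """Return the number if `content` starts with a number on its own line."""
    match = _LEADING_NUMBER.match(content)
    return int(match.group(1)) if match else None


print(leading_stray_number("4\nChapter one begins here."))
print(leading_stray_number("Chapter one begins here."))
print(leading_stray_number("2023 annual report\nSummary"))
```

Running the same check on the text extracted from the .doc and from the re-saved .docx would show whether only the .doc version carries the leading number.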
Logs
12:11:17,791 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [230.1mb/3.9gb=5.72%], RAM [5.8gb/15.7gb=37.31%], Swap [5.9gb/22.2gb=26.82%].
12:11:18,052 INFO [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
12:11:18,054 INFO [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
12:11:19,508 INFO [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.6
12:11:20,039 INFO [f.p.e.c.f.c.v.ElasticsearchClientV7] Elasticsearch Client for version 7.x connected to a node running version 7.17.6
12:11:20,457 WARN [o.e.c.RestClient] request [PUT https://eed5e799ef89480a99731a58e8d2ec8b.us-east-2.aws.elastic-cloud.com:443/test_doc?master_timeout=30s&timeout=30s] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 "Camel case format name dateOptionalTime is deprecated and will be removed in a future version. Use snake case name date_optional_time instead."]
12:11:20,818 INFO [f.p.e.c.f.FsParserAbstract] FS crawler started for [test_doc] for [C:\Users\usr\Documents\Work\nas\download\LPR\doc] every [15m]
12:11:20,934 INFO [f.p.e.c.f.t.TikaInstance] OCR is disabled.
12:11:21,839 WARN [o.e.c.RestClient] request [POST https://eed5e799ef89480a99731a58e8d2ec8b.us-east-2.aws.elastic-cloud.com:443/test_doc/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
12:11:21,928 WARN [o.e.c.RestClient] request [POST https://eed5e799ef89480a99731a58e8d2ec8b.us-east-2.aws.elastic-cloud.com:443/test_doc_folder/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 "[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices."]
Expected behavior
The document is indexed without its headers, or with all headers indexed in document order.
Versions:
OS: Windows 11, version 22H2
FSCrawler: 2.9