v0.12.0
We are pleased to announce version 0.12.0 of ACHE Crawler!
Following is a detailed log of the changes since the last version:
- Upgrade crawler-commons dependency to version 0.9
- Removed Elasticsearch transport-client-based repository
- Removed Elasticsearch 1.4.4 binaries dependency
- Added DumpDataFromElasticsearch tool for dumping documents from Elasticsearch repositories
- Added configuration for minimum relevance in link selectors
- Added configuration for selecting whether to re-crawl sitemaps and robots.txt links
- Added documentation about relevance_threshold parameters to the target page classifiers documentation page
- Added support for crawling via HTTP proxy in okhttp3 fetcher (by @maqzi)
- Added tracking of more HTTP error messages (301, 302, 3xx, 402) (by @maqzi)
- Upgrade crawler-commons library to version 1.0
- Upgrade commons-validator library to version 1.6
- Upgrade okhttp3 library to version 3.14.0
- Fix issue #177: Links from recent TLDs are considered invalid
- Upgrade RocksDB dependency (rocksdbjni) to version 6.2.2
- Added error code details to RocksDB exception logs
- Upgrade gradle-node-plugin to version 1.3.1
- Upgrade npm version to 6.10.2
- Upgrade ache-dashboard npm dependencies
- Upgrade gradle wrapper to version 5.6.1
- Update Dockerfile to use openjdk:11-jdk (Java 11)
- Added content_type field to RegexTargetClassifier
- Change default link classifier to LinkClassifierBreadthSearch
- Update io.airlift:airline dependency to version 0.8
- Update gradle build script to use new plugins DSL
- Update coveralls gradle plugin to version 2.9.0
- Update searchkit to version ^2.4.0