This is a demonstration project that uses the Solr Payload Component from https://github.com/o19s/payload-component, the Offset Highlighter Component from https://github.com/o19s/offset-hl-formatter, and pdf.js to make PDF documents searchable and to highlight matches in context on the PDF page itself.
Check out how amazing this is at http://pdf-discovery-demo.dev.o19s.com/ ;-)
Just run docker-compose up --build
and then browse to http://localhost:8080. You will need to wait till the init process finishes loading all of the Solr documents to use the website properly.
Solr is running on http://localhost:8983, and PDF images are served up on http://localhost:8443.
If you have already run the demo, you may first need to run docker-compose down -v
to remove the old containers and volumes.
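The wait for the init process can be scripted. The following is a minimal sketch, not part of the project: it assumes the Solr core is named `documents` (as in the select URL used elsewhere in this README) and treats a non-zero document count as "ready enough". The names `fetch_num_found` and `wait_for_documents` are hypothetical. Note that a non-zero count only shows that loading has started, so raise `minimum` if you know how many documents to expect.

```python
import json
import time
import urllib.request

# Assumption: the core is called "documents"; adjust if your setup differs.
SOLR_SELECT = "http://localhost:8983/solr/documents/select?q=*:*&rows=0&wt=json"


def fetch_num_found(url=SOLR_SELECT):
    """Return numFound for a match-all query, or 0 if Solr is not answering yet."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp)["response"]["numFound"]
    except (OSError, ValueError, KeyError):
        # Connection refused, HTTP error, or an unexpected body: treat as "not ready".
        return 0


def wait_for_documents(fetch=fetch_num_found, minimum=1, timeout=600.0,
                       interval=5.0, sleep=time.sleep):
    """Poll until at least `minimum` documents are indexed, then return the count."""
    waited = 0.0
    while True:
        count = fetch()
        if count >= minimum:
            return count
        if waited >= timeout:
            raise TimeoutError(f"only {count} documents indexed after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Calling `wait_for_documents()` after `docker-compose up --build` blocks until Solr reports at least one document, or raises `TimeoutError` after about ten minutes.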
There are several things that you can learn from this project. They are written up on the wiki:
- Using Tika/Tesseract standalone outside of Solr.
- Using Tika/Tesseract as an API exposed by Solr via ExtractingRequestHandler
- Parsing Tika/Tesseract output inside of Solr via the StatelessScriptUpdateProcessorFactory
- Tesseract 3 and Tika.
- Store binary data in Solr and serve it up like an object store!
cd into the pdf-viewer directory, then run:
npm install
npm run serve
To build, run the following from inside the pdf-viewer directory:
./build.sh
The script will build the viewer and copy the contents of the dist directory into /app/pdfviewer/
In the ./ocr/ directory, there are some PowerShell ( ;-) ) scripts to recreate the files if you want.
- cd ./ocr
- Make sure you have Tesseract installed: brew install tesseract on OSX. Alternatively, check that the script extract.ps1 isn't pointing at the hosted pdf-discovery-demo version of Tika ;-) Or, if it is, then that's okay.
- Check the ./tika-properties/.../TesseractOCRConfig.properties file and make sure it points to your Tesseract setup.
- Run the extraction process, creating the working docs in the ./ocr/extracts directory from the PDFs in ./ocr/files. We already have a pattern of paired directories: input filesN and output extractsN.
  pwsh extract-directory.ps1 ./files ./extracts
- Create Solr documents. The output will end up in a docs_for_solrN directory.
  pwsh create-solr-docs.ps1 ./extracts ./files ./docs_for_solr/
- Update the scripts for any new docs_for_solrN folder:
  - Add it to the ./ocr/init/Dockerfile COPY command.
  - You will also need to add it to the ./app/Dockerfile COPY command.
  - Update ./ocr/init/init.sh to load the files.
- Now stand up the app with docker-compose up --build
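When there are several numbered directory sets, the two PowerShell invocations above can be generated per suffix. This is only a sketch: it assumes the `filesN` / `extractsN` / `docs_for_solrN` names share the same suffix `N`, which the README implies but does not spell out, and the helper name `commands_for_suffix` is made up for illustration.

```python
def commands_for_suffix(suffix=""):
    """Build the extract and create-docs commands for one directory set.

    Assumption: input, extract, and Solr-doc directories share one suffix,
    e.g. files2, extracts2, docs_for_solr2. An empty suffix gives the
    default ./files, ./extracts, ./docs_for_solr/ used in the steps above.
    """
    files = f"./files{suffix}"
    extracts = f"./extracts{suffix}"
    docs = f"./docs_for_solr{suffix}/"
    return [
        ["pwsh", "extract-directory.ps1", files, extracts],
        ["pwsh", "create-solr-docs.ps1", extracts, files, docs],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True, cwd="./ocr")` so that a failed extraction stops the run before Solr documents are created.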
From the ./ocr/ directory, run:
curl -T files/bcreg20090424a1.pdf http://pdf-discovery-demo.dev.o19s.com:9998/rmeta --header "X-Tika-OCRLanguage: eng" --header "X-Tika-PDFOcrStrategy: ocr_and_text_extraction" --header "X-Tika-OCRoutputType: hocr"
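The same request can be made from a script. The sketch below mirrors the curl call with Python's standard library: `curl -T` uploads the file with HTTP PUT, and the three `X-Tika-*` headers are copied from the command above. The function names are hypothetical. One deliberate difference, labeled in the code: an explicit `Content-Type: application/pdf` header is set, because `urllib` would otherwise add a form-encoding content type that curl does not send.

```python
import urllib.request

# URL and headers taken from the curl command above.
TIKA_RMETA = "http://pdf-discovery-demo.dev.o19s.com:9998/rmeta"
OCR_HEADERS = {
    "X-Tika-OCRLanguage": "eng",
    "X-Tika-PDFOcrStrategy": "ocr_and_text_extraction",
    "X-Tika-OCRoutputType": "hocr",
    # Assumption: not in the curl call; added so urllib does not default
    # to application/x-www-form-urlencoded for the request body.
    "Content-Type": "application/pdf",
}


def build_tika_request(pdf_bytes, url=TIKA_RMETA):
    """Create the PUT request without sending it."""
    return urllib.request.Request(url, data=pdf_bytes, headers=OCR_HEADERS, method="PUT")


def extract(pdf_path, url=TIKA_RMETA):
    """Upload one PDF and return the response body as text."""
    with open(pdf_path, "rb") as fh:
        request = build_tika_request(fh.read(), url)
    with urllib.request.urlopen(request) as resp:
        return resp.read().decode("utf-8")
```

For example, `extract("files/bcreg20090424a1.pdf")` run from the ./ocr/ directory sends the same document as the curl command.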
Make sure your solrconfig.xml uses the class name com.o19s.hl.OffsetFormatter instead of the old com.o19s.labs.OffsetFormatter:
<formatter name="html"
default="true"
class="com.o19s.hl.OffsetFormatter">
</formatter>
Delete the old offset-hl-formatter-1.0.1-solr7.1.0-SNAPSHOT.jar and solr-payloads-1.0.3-solr7.1.0-SNAPSHOT.jar jars from the deployment process; we have nice shiny packages now!
Make sure Solr starts with packages enabled, which needs an extra startup parameter (let's verify the install script): -Denable.packages=true
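The formatter class rename is easy to miss, so a quick check of solrconfig.xml can help. This is a rough sketch using only the standard library. It assumes the file parses as plain XML (no unresolved entities or includes), and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

OLD_CLASS = "com.o19s.labs.OffsetFormatter"
NEW_CLASS = "com.o19s.hl.OffsetFormatter"


def formatter_classes(solrconfig_xml):
    """Return the class attribute of every <formatter> element in the config text."""
    root = ET.fromstring(solrconfig_xml)
    return [el.get("class") for el in root.iter("formatter")]


def needs_rename(solrconfig_xml):
    """True if any formatter still points at the old class name."""
    return OLD_CLASS in formatter_classes(solrconfig_xml)
```

Reading the file with `open(path, encoding="utf-8").read()` and passing the text to `needs_rename` tells you whether the old class name is still configured.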
docker-compose down -v
docker-compose build
docker-compose up
And then browse to http://localhost:8080/
To see payloads in action in Solr, open:
http://localhost:8983/solr/documents/select?fl=id,content,path,page_dimensions&hl=on&hl.snippets=10&hl.fl=content&indent=on&q=taxes&wt=json&pl=on&echoParams=all
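To try other search terms without hand-editing the URL, the query string can be assembled in code. The sketch below reuses exactly the parameters from the URL above; only the helper name `build_select_url` is invented.

```python
from urllib.parse import urlencode


def build_select_url(query, base="http://localhost:8983/solr/documents/select"):
    """Return the select URL with highlighting (hl) and payloads (pl) switched on."""
    params = {
        "q": query,
        "fl": "id,content,path,page_dimensions",
        "hl": "on",
        "hl.snippets": 10,
        "hl.fl": "content",
        "pl": "on",
        "indent": "on",
        "wt": "json",
        "echoParams": "all",
    }
    # urlencode escapes spaces and commas so multi-word queries stay valid.
    return f"{base}?{urlencode(params)}"
```

`build_select_url("taxes")` yields a URL with the same parameters as the one above, with the commas in `fl` percent-encoded.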
Build the docker images from scratch via:
docker-compose build
Deploy to our private Docker registry http://harbor.dev.o19s.com:
docker login harbor.dev.o19s.com
docker tag pdf-discovery-demo-app harbor.dev.o19s.com/pdf-discovery-demo/app
docker tag pdf-discovery-demo-solr harbor.dev.o19s.com/pdf-discovery-demo/solr
docker tag pdf-discovery-demo-init harbor.dev.o19s.com/pdf-discovery-demo/init
docker push harbor.dev.o19s.com/pdf-discovery-demo/app
docker push harbor.dev.o19s.com/pdf-discovery-demo/solr
docker push harbor.dev.o19s.com/pdf-discovery-demo/init
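The tag and push commands follow one pattern per image, so they can be generated rather than typed. A small sketch, with the registry, project, and image names taken from the commands above and a hypothetical helper name:

```python
REGISTRY = "harbor.dev.o19s.com"
PROJECT = "pdf-discovery-demo"
SERVICES = ("app", "solr", "init")


def deploy_commands(registry=REGISTRY, project=PROJECT, services=SERVICES):
    """Return the docker tag commands followed by the docker push commands."""
    tags = [
        ["docker", "tag", f"{project}-{name}", f"{registry}/{project}/{name}"]
        for name in services
    ]
    pushes = [["docker", "push", f"{registry}/{project}/{name}"] for name in services]
    return tags + pushes
```

Each entry can be run with `subprocess.run(cmd, check=True)` after `docker login harbor.dev.o19s.com`; the login step is left manual because it prompts for credentials.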