For text extraction, pdfboxing currently uses org.apache.pdfbox.text.PDFTextStripper, which works on the entire document. However, all document structure is lost during text extraction, so the more data the PDF contains, the harder it becomes to sort out.
As an alternative, there's also org.apache.pdfbox.text.PDFTextStripperByArea, which lets you specify a rectangle to extract text from. It gives pretty good results on PDF files with (visually) structured content.
I have prepared a rough prototype that seems to work:
(ns my-ns
  (:require [pdfboxing.common :as common])
  (:import (org.apache.pdfbox.text PDFTextStripperByArea)
           (java.awt Rectangle)))

(defn extract-by-area
  "get text from a specified area of a PDF document"
  [pdfdoc x y w h page]
  (with-open [doc (common/obtain-document pdfdoc)]
    (let [rectangle (Rectangle. x y w h)
          ;; note: PDDocument.getPage takes a zero-based index,
          ;; so (inc page) selects the page after index `page`
          pdpage (.getPage doc (inc page))
          pdftextstripper (doto (PDFTextStripperByArea.)
                            (.addRegion "region" rectangle)
                            (.extractRegions pdpage))]
      (.getTextForRegion pdftextstripper "region"))))
@dotemacs would you (or anyone else around, for that matter) be interested in this functionality?
If so let me know and I'll put some time into making a proper PR.
Note: the rectangle coordinates are measured in points (1 pt = 1/72 in, i.e. ~0.035 cm or ~0.0139 in).
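Since java.awt.Rectangle takes integer coordinates and the unit is the point, it may be handy to convert from millimetres or inches before building the rectangle. A minimal sketch follows; the names `mm->pt` and `in->pt` are made up for illustration and are not part of pdfboxing or PDFBox.

```clojure
;; Hypothetical helpers, not part of pdfboxing.
(def points-per-inch 72)   ; 1 pt = 1/72 in
(def mm-per-inch 25.4)

(defn in->pt
  "Convert inches to whole points, rounded to the nearest point."
  [in]
  (Math/round (double (* in points-per-inch))))

(defn mm->pt
  "Convert millimetres to whole points, rounded to the nearest point."
  [mm]
  (Math/round (double (* (/ mm mm-per-inch) points-per-inch))))
```

For example, `(mm->pt 210)` gives 595, which is the width of an A4 page in points.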
If you were to add it, instead of having positional arguments like you do in [pdfdoc x y w h page], maybe use destructuring and pass in a map, so that the arg vector looks something like [{:keys [pdfdoc x y w h page]}].
And maybe add this functionality to the pdfboxing.text namespace.
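Roughly, the map-based version could look like the sketch below. It is only an illustration of the destructuring idea, not the final API: the namespace name `my-ns` is kept from the prototype, the file name in the usage line is made up, and this sketch assumes `:page` is a zero-based index passed straight to `.getPage` (the prototype uses `(inc page)` instead).

```clojure
(ns my-ns
  (:require [pdfboxing.common :as common])
  (:import (org.apache.pdfbox.text PDFTextStripperByArea)
           (java.awt Rectangle)))

(defn extract-by-area
  "Get text from a rectangular area of one page of a PDF document.
  Takes a single map; x, y, w and h are in points."
  [{:keys [pdfdoc x y w h page]}]
  (with-open [doc (common/obtain-document pdfdoc)]
    (let [rectangle (Rectangle. x y w h)
          ;; assumption: page is zero-based, as PDDocument.getPage expects
          pdpage (.getPage doc page)
          stripper (doto (PDFTextStripperByArea.)
                     (.addRegion "region" rectangle)
                     (.extractRegions pdpage))]
      (.getTextForRegion stripper "region"))))
```

A call would then read `(extract-by-area {:pdfdoc "some.pdf" :x 50 :y 50 :w 200 :h 40 :page 0})`, so the caller doesn't have to remember the argument order.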