Extract text from pdf area #61

Open
PavlosMelissinos opened this issue Jan 5, 2021 · 1 comment

Comments


PavlosMelissinos commented Jan 5, 2021

For text extraction, pdfboxing currently uses org.apache.pdfbox.text.PDFTextStripper, which works on the entire document. However, all document structure is lost during text extraction, so the more data the PDF contains, the harder it becomes to pick out the part you need.

As an alternative, there's also org.apache.pdfbox.text.PDFTextStripperByArea, which lets you specify a rectangle to extract text from. It gives pretty good results on PDF files with (visually) structured content.

I have prepared a rough prototype that seems to work:

(ns my-ns
  (:require [pdfboxing.common :as common])
  (:import (org.apache.pdfbox.text PDFTextStripperByArea)
           (java.awt Rectangle)))

(defn extract-by-area
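  "Return the text inside the rectangle x, y, w, h (in points) on the
  given page of a PDF document. page is a zero-based index."
  [pdfdoc x y w h page]
  (with-open [doc (common/obtain-document pdfdoc)]
    (let [rectangle       (Rectangle. x y w h)
          ;; PDDocument.getPage takes a zero-based index, so pass page as-is
          pdpage          (.getPage doc page)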
          pdftextstripper (doto (PDFTextStripperByArea.)
                            (.addRegion "region" rectangle)
                            (.extractRegions pdpage))]
      (.getTextForRegion pdftextstripper "region"))))

@dotemacs would you (or anyone else around, for that matter) be interested in this functionality?

If so let me know and I'll put some time into making a proper PR.

Note: the rectangle coordinates are measured in points (1 pt = 1/72 in, i.e. ~0.035 cm or ~0.0139 in).
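
Since page layouts are often measured in centimetres or inches, a small conversion helper can make the rectangle arguments easier to work out. The following is a minimal sketch and not part of pdfboxing: the names cm->pt and in->pt are hypothetical, and rounding to whole points is an assumption made here because java.awt.Rectangle takes integer coordinates.

```clojure
;; Hypothetical helpers (not part of pdfboxing) for converting physical
;; measurements to PDF points. One point is 1/72 of an inch.
(def points-per-inch 72.0)
(def cm-per-inch 2.54)

(defn in->pt
  "Convert inches to points, rounded to the nearest whole point."
  [in]
  (Math/round (double (* in points-per-inch))))

(defn cm->pt
  "Convert centimetres to points, rounded to the nearest whole point."
  [cm]
  (Math/round (double (* (/ cm cm-per-inch) points-per-inch))))

;; Example: a region 5 cm wide and 1 in tall
(println {:w (cm->pt 5) :h (in->pt 1)})
```

The resulting values could then be passed as the x, y, w and h arguments of the prototype above. Rounding loses sub-point precision, which is probably acceptable for picking out a visual region but is worth keeping in mind for tightly packed layouts.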

dotemacs (Owner) commented Jan 5, 2021

Hello @PavlosMelissinos

This looks like a good change.

If you were to add it, instead of having positional arguments like you do in [pdfdoc x y w h page], maybe use destructuring and pass in a map, so that the arg vector looks something like [{:keys [pdfdoc x y w h page]}].
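
To illustrate the map-destructuring idea, here is a minimal sketch that only covers the argument handling and leaves out the PDFBox calls, so it runs on a plain JDK. The helper name area-spec and the default values in :or are assumptions for illustration; the suggestion above only specifies the shape of the arg vector.

```clojure
(import 'java.awt.Rectangle)

(defn area-spec
  "Hypothetical helper: destructure an options map shaped like
  {:pdfdoc ... :x ... :y ... :w ... :h ... :page ...} and return the
  pieces an area-based extraction would need. The defaults are assumed."
  [{:keys [pdfdoc x y w h page]
    :or   {x 0 y 0 w 0 h 0 page 0}}]
  {:pdfdoc    pdfdoc
   :page      page
   :rectangle (Rectangle. x y w h)})

;; Keys can be given in any order, and omitted keys fall back to defaults.
(println (area-spec {:pdfdoc "some.pdf" :w 200 :h 40 :x 50 :y 100}))
```

Compared with six positional arguments, the map makes call sites self-describing and lets new options be added later without breaking existing callers.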

And maybe add this functionality in the pdfboxing.text namespace.

And please add tests.

Thanks for your time
