Add functionality to extract PDF text from specific regions #62
I had incorrectly added my change at the beginning of the changelog. This commit fixes that mistake.
Thanks for your work on this @PavlosMelissinos! I'll try to merge soon. If you don't see any movement on this, do ping me to remind me. Thanks again!
I just pushed a tiny commit that updates the docstring of the function! I also realized that:
How do you feel about defaulting missing coordinates to 0?
{:w 280
 :h 100}
should give the same result as:
{:x 0
 :y 0
 :w 280
 :h 100
 :page-number 0}
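The defaulting proposed above could be sketched as a plain `merge` over a map of zeros. This is only an illustration: `default-area` and `with-default-coords` are hypothetical names, not part of pdfboxing, and the sketch assumes that only `:x`, `:y` and `:page-number` fall back to 0.

```clojure
;; Hypothetical helper, not the library's code: missing keys fall back to 0.
(def default-area
  {:x 0 :y 0 :page-number 0})

(defn with-default-coords
  "Returns `area` with any missing :x, :y or :page-number set to 0.
  Values already present in `area` win over the defaults."
  [area]
  (merge default-area area))

;; Equal to {:x 0 :y 0 :w 280 :h 100 :page-number 0}
(with-default-coords {:w 280 :h 100})
```

Because `merge` gives precedence to the right-hand map, an explicitly supplied coordinate is never overwritten by a default.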
* Missing coordinates are now assumed 0
* Added new test case with missing coords
I've been thinking about this for a while and, well, having
I think I like it better this way but let me know what you think and I'll revert if needed...
@dotemacs what do you say? 🙂
So...? 😄
Sorry for the delay. Looking at it quickly, again, it looks good. But I want to look at it properly and try it before merging. Thanks for your work :)
I'll merge this this weekend and I'll resolve the merge conflict in the CHANGELOG. Sorry for the delay
No pressure at all, I think we have enough stress in our lives already!
(defn extract-by-areas
  "get text from specified areas of a PDF document"
  [pdfdoc areas]
Can you tell me what your thinking was here?
Why is pdfdoc an argument on its own and areas is a map?
Why can't it all go into a map?
My thinking is that if you're passing a map around, where all the arguments are in the map, you don't have to think about the position of your arguments.
Thanks
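For comparison, the single-map style described above might look like the sketch below. `describe-extraction` is a hypothetical stand-in that only echoes its arguments: it does not read a PDF and is not how `extract-by-areas` is implemented. The file name is a placeholder.

```clojure
;; Hypothetical stand-in: every argument arrives in one map, so the caller
;; never has to remember argument order.
(defn describe-extraction
  [{:keys [pdfdoc areas]}]
  {:document   pdfdoc
   :area-count (count areas)})

;; The order of keys in the argument map is irrelevant.
(describe-extraction {:areas  [{:x 0 :y 0 :w 280 :h 100 :page-number 0}]
                      :pdfdoc "example.pdf"})
```

The trade-off is that the function signature no longer shows which argument is the document being operated on; that information moves into the destructuring form.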
I think it's clearer this way. extract-by-areas is an operation on a pdf document and the coordinates are just parameters. Sure, they're crucial, but they don't have the same weight as the actual document.
I don't have very strong feelings about this though, it's your library 🙂
I started off using mostly rest arguments for the functions in the library.
Then I accepted some PRs which used strict arity.
Let me think about this for a bit and see what option/approach to take, because once this is merged it'll be good to provide the least amount of surprise.
Thanks
Oh I see. Yeah I could make it variadic if you'd prefer that. That would be consistent with split-pdf and other functions!
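A variadic version in the keyword rest-argument style could look roughly like this. `describe-extraction-kw` and its `:input` and `:areas` keys are hypothetical names chosen for illustration, the function only echoes what it receives, and the file name is a placeholder.

```clojure
;; Hypothetical stand-in using keyword rest arguments: callers pass
;; alternating keys and values instead of positional arguments or one map.
(defn describe-extraction-kw
  [& {:keys [input areas]}]
  {:document   input
   :area-count (count areas)})

(describe-extraction-kw :input "example.pdf"
                        :areas [{:w 280 :h 100}])
```

Like the single-map style, this frees callers from argument order, at the cost of making the function slightly harder to use with `apply` or to compose when the arguments are already held in a map.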
Thanks for the kind words and for your work here :) I left some comments, let me know what you think. Thanks
Description of your pull request
(Feel free to squash & merge and use this as a commit message!)
Add functionality in pdfboxing.text to extract PDF text from specific regions
Large PDF documents can contain too much content to be properly parsed at once. It would often be preferable to locate the regions that contain the information and extract text from those instead, increasing parsing accuracy and retaining the semantics at the same time.
It's a rather small change that should not introduce a significant maintenance overhead.
Addresses #61
Pull request checklist
Before submitting the PR make sure the following things have been done
(and denote this by checking the relevant checkboxes):
clj-kondo --lint src
Thanks!