How to remove headers and footers permanently? #52

Open
Shohreh opened this issue May 12, 2023 · 7 comments

Comments

@Shohreh

Shohreh commented May 12, 2023

Hello,

I don't know much about PDF, and am confused about *box (mediabox, cropbox, etc.) and the units used in *box and pdfCropMargins (pt vs. %).

What would be the right way to permanently remove the headers and footers on most pages of a PDF, while leaving some pages untouched (e.g. the first page of each chapter)? By "permanently" I mean not just for viewing: the data must no longer be in the output file.

Thank you.


@abarker
Owner

abarker commented May 13, 2023

I'm hesitant to suggest a way to permanently remove margins, because if people want to use it for redaction they may end up being surprised. You mentioned mutool in your other issue, but I'm not certain how secure this removal is or exactly how it is implemented at the PDF level. It may be implemented the same way as pdfCropMargins and just modify the box data without changing the underlying PDF more than that.

@abarker
Owner

abarker commented May 13, 2023

Points are the standard unit in PDF files: 1 point = 1/72 inch. The percentage values take a percentage of the existing margins; for example, if the existing margin is 100 points, then 50% would reduce it to 50 points.
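To make the arithmetic concrete, here is a small sketch of the two conversions described above. The function names `points_to_inches` and `retained_margin` are made up for illustration and are not part of pdfCropMargins.

```python
POINTS_PER_INCH = 72  # 1 point = 1/72 inch


def points_to_inches(points):
    """Convert a length in PDF points to inches."""
    return points / POINTS_PER_INCH


def retained_margin(existing_margin_pt, percent_retained):
    """Margin (in points) left after keeping the given percentage of the existing margin."""
    return existing_margin_pt * percent_retained / 100


print(points_to_inches(72))      # 1.0
print(retained_margin(100, 50))  # 50.0
```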

@Shohreh
Author

Shohreh commented May 14, 2023

Thanks. I'll keep looking for a way to permanently remove the content I need gone, either by changing the mediabox or through redaction annotations.

@DestoGit

DestoGit commented Aug 8, 2023

Has a solution been implemented for this feature? It is badly needed. The current workaround I use is saving the PDFs as image-only, then performing OCR and saving them again with ABBYY.
Is there a way to do it without re-OCRing, and in batch over multiple PDFs at once?

Could any of these be used to auto-detect the headers and footers and serve as a reference for cropping?
pdf header and footer detector

pdfminer, Apache Tika

grobid

Excluding the Header and Footer Contents of a page of a PDF file while extracting text?

[D] Data cleaning techniques for PDF documents with semantically meaningful parts

Perhaps these too, for ideas:
How to extract and structure text from PDF files with Python and machine learning

Convert PDFs to Audiobooks with Machine Learning

How to convert PDFs to audiobooks with machine learning

pdf2audiobook

@Shohreh
Author

Shohreh commented Aug 8, 2023

The work-around I found is 1) finding the coords with SumatraPDF (hit the "m" key to see the coordinates), and 2) running a Python script to add and delete redaction annotations.
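For readers who want a starting point, below is a minimal sketch of what step 2 could look like with PyMuPDF. It is not the actual script mentioned above. The helper names (`band_rects`, `redact_headers_footers`), the band heights, and the `skip_pages` parameter are all hypothetical; the sketch assumes the header and footer occupy full-width bands at the top and bottom of the page, measured in points from the top-left corner. Applying redactions removes the text under the rectangles from the page content, but this should not be taken as a guarantee of secure redaction.

```python
def band_rects(page_width, page_height, header_pt, footer_pt):
    """Return (x0, y0, x1, y1) tuples for the header and footer bands.

    Coordinates are in points with the origin at the top-left of the page.
    A band with zero height is skipped.
    """
    rects = []
    if header_pt > 0:
        rects.append((0, 0, page_width, header_pt))
    if footer_pt > 0:
        rects.append((0, page_height - footer_pt, page_width, page_height))
    return rects


def redact_headers_footers(src, dst, header_pt, footer_pt, skip_pages=()):
    """Redact header/footer bands on every page except those in skip_pages (0-based)."""
    import fitz  # PyMuPDF, imported here so band_rects stays usable without it

    doc = fitz.open(src)
    for page in doc:
        if page.number in skip_pages:
            continue  # e.g. the first page of each chapter
        for rect in band_rects(page.rect.width, page.rect.height, header_pt, footer_pt):
            page.add_redact_annot(fitz.Rect(*rect))  # mark the area
        page.apply_redactions()  # remove the content under the marked areas
    doc.save(dst, garbage=4, deflate=True)  # drop unreferenced objects and compress
    doc.close()
```

A call such as `redact_headers_footers("in.pdf", "out.pdf", header_pt=50, footer_pt=40, skip_pages={0, 12})` would leave pages 1 and 13 untouched; the file names and numbers are placeholders.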

@abarker
Owner

abarker commented Aug 8, 2023

All the current processing of PDF files is done with the PyMuPDF library. If there is a way to do this with that library then I would consider adding an option.

I'm not entirely clear what your exact use-case is. You want to remove the actual PDF content that is rendered outside a selected box, without turning the document into a rendered-image or scanned-style document? Does this need to be secure data destruction, such as for legal documents, etc.?

@DestoGit

DestoGit commented Aug 10, 2023

> All the current processing of PDF files is done with the PyMuPDF library. If there is a way to do this with that library then I would consider adding an option.
>
> I'm not entirely clear what your exact use-case is. You want to remove the actual PDF content that is rendered outside a selected box, without turning the document into a rendered-image or scanned-style document? Does this need to be secure data destruction, such as for legal documents, etc.?

Thanks for the reply and sorry for the late response.

The use case is to process many different books, articles, plays, etc., with great variation in layout and in header and footer locations.

Ideally, this would be a batch process. For example, on a folder with, say, 1000 PDFs:

  1. Auto-detect the main body text block vs. the header and footer text blocks.
  2. Auto-crop to the main body text block only.
  3. Save the PDFs with the body only: no header or footer text layer,
    with the OCR content intact but trimmed of the header and footer OCR blocks.

The end use would be to process the result with text-to-speech or to port it to an audio format.
No secure data destruction is needed, just removal of the header and footer text blocks
so that they do not appear in the end-use output.
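As a rough illustration of the three steps above, here is one possible heuristic, offered as a sketch rather than a tested solution: treat a text block as a header or footer if it sits near the top or bottom edge and its text (with digits masked, so page numbers match) recurs on a sizeable share of pages. All names here (`normalize`, `find_repeated_blocks`, `strip_folder`) and the thresholds `edge_fraction` and `min_share` are assumptions for illustration. The sketch also assumes every page in a document has the same height, and it removes the detected blocks by redaction instead of cropping. Block tuples follow the `(x0, y0, x1, y1, text, ...)` layout returned by PyMuPDF's `page.get_text("blocks")`.

```python
import re
from collections import Counter


def normalize(text):
    """Mask digits and collapse whitespace so 'Page 12' and 'Page 13' compare equal."""
    return re.sub(r"\d+", "#", " ".join(text.split())).lower()


def find_repeated_blocks(pages, page_height, edge_fraction=0.12, min_share=0.3):
    """Return, per page, the rects of edge blocks whose text repeats across pages.

    pages: list (one entry per page) of lists of (x0, y0, x1, y1, text) tuples.
    """
    top = page_height * edge_fraction
    bottom = page_height * (1 - edge_fraction)

    def near_edge(y0, y1):
        return y1 <= top or y0 >= bottom

    counts = Counter()
    for blocks in pages:
        # Count each normalized text at most once per page.
        counts.update({normalize(t) for _, y0, _, y1, t in blocks if near_edge(y0, y1)})

    threshold = max(2, min_share * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}

    return [
        [(x0, y0, x1, y1) for x0, y0, x1, y1, t in blocks
         if near_edge(y0, y1) and normalize(t) in repeated]
        for blocks in pages
    ]


def strip_folder(src_dir, dst_dir):
    """Batch step: redact detected header/footer blocks in every PDF of a folder."""
    import fitz  # PyMuPDF, imported lazily so the detector works without it
    from pathlib import Path

    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        doc = fitz.open(str(pdf))
        # Keep text blocks only (block type 0) and their first five fields.
        pages = [[b[:5] for b in page.get_text("blocks") if b[6] == 0] for page in doc]
        rects = find_repeated_blocks(pages, doc[0].rect.height)
        for page, page_rects in zip(doc, rects):
            for r in page_rects:
                page.add_redact_annot(fitz.Rect(*r))
            if page_rects:
                page.apply_redactions()
        doc.save(str(out / pdf.name), garbage=4, deflate=True)
        doc.close()
```

Running `strip_folder("input_pdfs", "output_pdfs")` (placeholder folder names) would write one processed copy per input file. Headers that change on every page, such as running section titles, would not be caught by this heuristic and would need a different approach.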

My problems doing it with ABBYY are:

  1. The cropping needs a manual click and drag to select the body dimensions.
  2. Once cropped, the PDF output needs to be saved as an image, or else the header and footer
    text layer is still there during end-use processing.
  3. Once saved as an image, the PDF output needs to be re-OCRed, which takes time and is less
    accurate if the PDF was not a scanned one.
  4. Once re-OCRed, the output PDF needs to be saved as a searchable PDF.

Thanks again for your suggestions!
