Skip to content

Latest commit

 

History

History
509 lines (411 loc) · 13.7 KB

README.adoc

File metadata and controls

509 lines (411 loc) · 13.7 KB

pdftools

badge badge badge

A collection of PDF command line tools and wrappers for Linux written in Bash Shell script. These are generally speaking convenience tools so one does not have to remember very long and cryptic options and switches.

The heavy lifting is done by backend tools such as pdftk, ghostscript and the poppler utils are used.

Installation instructions

The scripts are meant to be installed in a users' home directory. To do this quickly the Makefile in the root of the repository has a target called user_install.

$ make user_install
Install pdftools under /home/<user>/bin
-> Installing bin/img2pdf to /home/<user>/bin/img2pdf
-> Installing bin/ocrpdf to /home/<user>/bin/ocrpdf
-> Installing bin/pdf2pdfa to /home/<user>/bin/pdf2pdfa
-> Installing bin/pdfcat to /home/<user>/bin/pdfcat
-> Installing bin/pdfmeta to /home/<user>/bin/pdfmeta
-> Installing bin/pdfresize to /home/<user>/bin/pdfresize
-> Installing bin/scan2jpg to /home/<user>/bin/scan2jpg
-> Installing bin/scan2pdf to /home/<user>/bin/scan2pdf
-> Installing bin/scan2png to /home/<user>/bin/scan2png

To uninstall everything the target user_uninstall can be used.

img2pdf

A script to convert PNGs, TIFFs or JPEGs to PDF files.

License

MIT

Requires

bash, pdfcat, ImageMagick, pdftk

Examples

Create single page PDFs from images
img2pdf first.png second.jpg
Create single page PDFs from images and then delete the original images.
img2pdf --delete first.png second.jpg
Rotate image 180 degrees prior to creating the PDF.
img2pdf --rotate 180 myimage.png
Rotate images by 180 degrees and concatenate everything into file output.pdf
img2pdf --output output.pdf --rotate 180 page1.png page2.jpg

Usage

  Usage:
    img2pdf <options> <img-file> [<img-file> ... ]

  Options:
    -h  | --help           This message
    -d  | --delete         Delete the images after creating the PDF file.
    -o  | --output <name>  Write the output to specified file <name>.
    -r  | --rotate <value> Rotate the image by <value>
                           Where value can be a positive or negative integer
                           between 0 and 360.
    -V  | --version        Display version and exit

ocrpdf

Runs PDFs through OCR and saves the output as a text searchable PDF with the same name.

ℹ️

Only works with PDFs comprised of a single JPEG, LZW or ZIP compressed image per page. LZW compressed images are being converted to ZIP compressed one during the OCR process.

License

MIT

Requires

bash, pdfcat, pdfimages (poppler-utils), pdftk, tesseract

Examples

Run OCR with all installed languages on a couple of PDFs
ocrpdf first.pdf second.pdf
Run OCR with German dictionary on a single PDF
ocrpdf --lang deu german.pdf
Run OCR with German, French and English dictionaries on multiple PDFs
ocrpdf --lang deu+fra+eng  scanned_*.pdf

Usage

  Usage:
    ocrpdf [options] <file> [<file> [,,]]

  Options:
    -h | --help         This message
    -q | --quiet        Don't send display processed file names
    -V | --version      Print version information and exit
    -l | --lang <lang>  Set the OCR languages to use.
                        For multiple languages concatenate with a '+'
                        E.g eng+deu for English and German
                        Default: deu+eng+fra+ita+jpn+osd

  Description:
    Runs PDFs through OCR and saves the output as a text searchable PDF
    with the same name.

  Disclaimer:
    Only works with PDFs comprised of a single JPEG, LZW or ZIP compressed
    image per page.
    LZW compressed images will be converted to ZIP compressed ones during
    the OCR process.

pdfcat

A quick hack to replace pdfunite as it destroys too much of the original’s meta data.

License

MIT

Requires

bash, pdftk >= 2.0

Examples

Merging two PDFs into a new one
pdfcat first.pdf  second.pdf > merged.pdf
Merging sequentially ordered PDFs into a single document
pdfcat myscan*.pdf > merged.pdf

Usage

  Usage:
    pdfcat [<options>] <pdf> <pdf> [..]

  Options:
    -h | --help    This message.
    -V | --version Print version and exit.

pdfmeta

A wrapper script around pdftk to manipulate a PDFs meta data

License

MIT

Requires

bash >= 4.0, pdftk >= 2.0

Examples

Modify keywords
pdfmeta --keywords "rainbow, magical, unicorn" unicorn.pdf rainbow.pdf
Modify creation date
pdfmeta --creation-date "2017-01-01 22:30:45" unicorn.pdf

Usage

  Usage:
    pdfmeta <options> <pdf> [[<pdf>] ..]

    Options:
      -h | --help               This message
      -k | --keywords           Comma separated list of keywords
      -s | --subject            Define the PDFs subject
      -t | --title              Define the PDFs title
      -c | --creator            Define the PDFs creator program or library
      -p | --producer           Define the PDFs producing program
      -C | --creation-date      Set the creation date of the PDF
      -M | --modification-date  Set the modification date of the PDF
      -V | --version            Display version and exit
ℹ️

On Ubuntu 18.04 (bionic) pdfmeta works only with the snap or with version 3.2.x of pdftk-java. With every other version of pdftk CreationDate and ModDate will not work when running the unit tests. The changed PDF has no problem but pdfinfo from the poppler-utils package can’t handle the changed entries and reports them as empty.

pdfresize

A wrapper around ghostscript to reduce the size of a scanned document

ℹ️

pdfresize is very likely not working with PDF documents containing JBIG2 images.

License

MIT

Requires

bash, ghostscript

Examples

Resize to default resolution
pdresize --input input.pdf --output output.pdf
Resize to screen resolution
pdfresize --quality screen --input input.pdf --output output.pdf

Usage

  Usage:
    pdfresize [-q pdfsettings] -i <input> -o <output>

  Options:
    -h | --help              This message
    -i | --input <input>     A PDF file preferably of high resolution
    -o | --output <output>   Name of the PDF file to save the result to
    -q | --quality <quality> Quality settings for output PDF.
                             See quality keywords for acceptable input.
    -V | --version           Print version and exit.

  Quality keywords:
    screen   - low-resolution; comparable to "Screen Optimized" in Acrobat Distiller
    ebook    - medium-resolution; comparable to "eBook" in Acrobat Distiller
    printer  - comparable to "Print Optimized" in Acrobat Distiller
    prepress - comparable to "Prepress Optimized" in Acrobat Distiller
    default  - intended to be useful across a wide variety of uses

pdf2pdfa

Small script to convert a PDF to PDF/A type.

ℹ️

This is early beta and all the meta data in the PDF will be lost!

Examples

Convert pdf file sample.pdf to a PDF/A-2 named sample_a.pdf
pdf2pdfa sample.pdf
Convert pdf file sample.pdf to a PDF/A-2 named sample_pdfa.pdf
pdf2pdfa --suffix _pdfa sample.pdf
Convert pdf file sample.pdf to a PDF/A-1 named sample_a.pdf
pdf2pdfa --level 1 sample.pdf
Convert pdf file sample.pdf to a PDF/A-3 exiting on errors.
pdf2pdfa --level 3 --strict sample.pdf
Convert pdf file sample.pdf to a PDF/A-2 with color model CMYK.
pdf2pdfa --color-model CMYK sample.pdf

Usage

  Usage:
    pdf2pdfa [<options>] <pdf_file> [<pdf_file> [..]]

  Options:
    -c | --color-model <model> Color model to use for the conversion.
                               Valid input is RGB or CMYK.
                               Default: RGB
    -h | --help                This message
    -l | --level <number>      PDF-A specification level to use.
                               Valid input is 1 (A-1), 2 (A-2) and 3 (A-3).
                               Default: 2
    -S | --strict              Exit if errors are encountered during conversion.
    -s | --suffix <suffix>     Append <suffix> to filename
                               Default '_a'
    -V | --version             Display version and exit.

scan2pdf

Is frontend for scanimage but has only been tested against the Canon LiDE 210 scanner.

Some but not all notable features are:

  • Can OCR scanned documents using tesseract.

  • Scan a few predefined sizes such as A4 and A5 among others.

  • Symlinked to scan2png produces PNG and symlinked to scan2jpg produces JPEG image output.

  • Has command line mode only for single page or interactive mode for multi page scans.

Examples

Simple single page document produces file scan_YYYY-MM-DD_hh-mm-ss.pdf
scan2pdf
Simple single page document wth OCR produces file
`scan_YYYY-MM-DD_hh-mm-ss.pdf`
scan2pdf --ocr
Multi page interactive mode wth OCR.
scan2pdf --interactive --ocr
Enter filename [scan_2022-01-26_23-15-30]: (1)
1) Scan document (2)
2) Finish scan (3)
3) Wrap up and quit (4)
Choose action > 1 (5)
Choose action > 1 (6)
Choose action > 3 (7)
  1. Provide file name or press enter to accept the default name.

  2. Menu option 1 scans a page then returns to the prompt.

  3. Menu option 2 writes all pages to a PDF file and prompts for a new name.

  4. Menu option 3 writes all pages to a PDF file and exists.

  5. Scan one page.

  6. Scan another page.

  7. Write PDF and exit.

Scan and save as JPEG with filename scan_YYYY-MM-DD_hh-mm-ss.jpg
scan2jpg

Usage

  Usage: scan2pdf <options>

    --interactive  -I  Interactive mode
    --type         -t  Document Type
                       Possible values are:
                         d[ocument]      for a text document
                         i[llustration]  for a drawing
                         ph[otograph]    for a photographic pictue
                         pr[int]         for a scan from a print e.g. newspaper
                         r[aw]           for not applying any post-processing
                         Default: document
    --resolution   -r  Resolution of scan
                         Possible values are 75, 150, 300, 600, 1200
                         Default: 300
    --page         -p  Page Size
                         Possible values are A4, A5, A6, Letter, CreditCard, CD-Cover
                         Default: A4
    --depth        -d  Color depth of scan
                         1 for LineArt (Black & White)
                         8 for Grayscale and Color
                         16 for Color
                         Default: 8
    --format       -f  PDF image compression
                         Possible values are jpeg, zip, lzw
                         Default: jpeg
    --quality      -q  Recommended for jpeg, zip, png
                         Values for jpeg from 0 to 100
                         Values for png and zip from 0 to 9
                         Default: 90
     --mode        -m  Color mode of scan
                         Possible values are Lineart, Gray, Color
                         Default: Color
     --ocr         -R  Run the scan through character recognition
                         Default: false
     --ocr-lang    -L  Set the language for the character recognition
                         Every language 'tesseract' supports
                         Default: deu+eng+fra+ita+jpn+osd
     --output      -o  Filename of PDF file
                         Default: scan_2022-01-26_23-10-20
     --orientation -O  Document orientation
                         Possible options p[ortrait], l[andscape]
                         Default: portrait
     --scanner     -s  Set the scanner to be used
                         E.g: gensys:libusb:001:005
     --help        -h  This message