
Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV


rostrovsky/pdf-table


PDF-table

What is PDF-table?

PDF-table is a Java utility library for parsing tabular data in PDF documents.
Core processing of PDF documents is performed using Apache PDFBox and OpenCV.

Prerequisites

JDK

Java 8 is required.

External dependencies

pdf-table requires a compiled build of OpenCV 3.4.2 to work properly:

  1. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2

  2. Unpack it and add the following directory to your system PATH:

    • Windows: <opencv dir>\build\java\x64

    • Linux: TODO

Installation

<dependency>
  <groupId>com.github.rostrovsky</groupId>
  <artifactId>pdf-table</artifactId>
  <version>1.0.0</version>
</dependency>

Usage

Parsing PDFs

When a PDF document page is parsed, the following operations are performed:

  1. Page is converted to grayscale image [OpenCV].

  2. Binary Inverted Threshold (BIT) is applied to the grayscale image [OpenCV].

  3. Contours are detected on BIT image and contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].

  4. Contour mask is XORed with BIT image [OpenCV].

  5. Contours are detected once again on XORed image (additional Canny filtering can be turned on in this step) [OpenCV].

  6. Final contours are drawn [OpenCV].

  7. Bounding rectangles are detected from final contours [OpenCV].

  8. The PDF is parsed region by region using the coordinates of the bounding rectangles [Apache PDFBox].

The above algorithm is mostly derived from http://stackoverflow.com/a/23106594.

For more information about the parsed output, see the Output format section.
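Steps 2 and 4 above can be illustrated without OpenCV. The following is a minimal pure-Java sketch, not the library's implementation (pdf-table delegates these operations to OpenCV). The class and method names are hypothetical, and the sketch assumes the contour mask is a filled outline of the table, which is one reading of the linked answer.

```java
import java.util.Arrays;

// Illustrative stand-in for steps 2 and 4; names here are hypothetical.
class ThresholdSketch {

    // Binary inverted threshold: pixels above `thresh` become 0, all others 255.
    static int[][] binaryInvertedThreshold(int[][] gray, int thresh) {
        int[][] out = new int[gray.length][];
        for (int y = 0; y < gray.length; y++) {
            out[y] = new int[gray[y].length];
            for (int x = 0; x < gray[y].length; x++) {
                out[y][x] = gray[y][x] > thresh ? 0 : 255;
            }
        }
        return out;
    }

    // Pixel-wise XOR of two binary images of equal size.
    static int[][] xor(int[][] a, int[][] b) {
        int[][] out = new int[a.length][];
        for (int y = 0; y < a.length; y++) {
            out[y] = new int[a[y].length];
            for (int x = 0; x < a[y].length; x++) {
                out[y][x] = a[y][x] ^ b[y][x];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny "page": dark table lines (10) around two light cells (250).
        int[][] gray = {
            {10, 10, 10, 10, 10},
            {10, 250, 10, 250, 10},
            {10, 10, 10, 10, 10}
        };
        int[][] bit = binaryInvertedThreshold(gray, 127);

        // Assumed contour mask: the table outline filled with 255.
        int[][] mask = new int[3][5];
        for (int[] row : mask) {
            Arrays.fill(row, 255);
        }

        // After XOR the lines are 0 and the cell interiors are 255.
        int[][] cells = xor(mask, bit);
        System.out.println(Arrays.toString(cells[1])); // prints [0, 255, 0, 255, 0]
    }
}
```

In this toy input the XOR turns each cell interior into a separate white blob, which is why a second contour pass can then find one contour per cell.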

single-threaded example

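import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
// plus PdfTableReader and ParsedTablePage from pdf-table

class SingleThreadParser {
    public static void main(String[] args) throws IOException {
        // try-with-resources closes the document when parsing is done
        try (PDDocument pdfDoc = PDDocument.load(new File("some.pdf"))) {
            PdfTableReader reader = new PdfTableReader();
            List<ParsedTablePage> parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
        }
    }
}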

multi-threaded example

class MultiThreadParser {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        // parse pages simultaneously
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<ParsedTablePage>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<ParsedTablePage> callable = () -> {
                ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);
                return page;
            };
            futures.add(executor.submit(callable));
        }

        // collect parsed pages
        List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
        try {
            for (Future<ParsedTablePage> f : futures) {
                ParsedTablePage page = f.get();
                unsortedParsedPages.add(page.getPageNum() - 1, page);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // release the worker threads so the JVM can exit
            executor.shutdown();
        }

        // sort pages by pageNum
        List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
                .sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum())).collect(Collectors.toList());
    }
}
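The submit-and-collect pattern used above can be tried without a PDF. The sketch below uses only the JDK; `parsePage` is a hypothetical stand-in for the per-page parse call and simply returns a label. It shows that reading the futures in submission order already yields results in page order, regardless of which task finishes first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OrderedFuturesSketch {

    // Hypothetical stand-in for parsing one page; returns a label for that page.
    static String parsePage(int pageNum) {
        return "page-" + pageNum;
    }

    // Submits one task per page and collects the results in submission (page) order.
    static List<String> parseAll(int pageCount, int threads) {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int pageNum = 1; pageNum <= pageCount; pageNum++) {
                final int n = pageNum;
                Callable<String> task = () -> parsePage(n);
                futures.add(executor.submit(task));
            }
            List<String> results = new ArrayList<>(pageCount);
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks until this page's task has finished
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown(); // let the worker threads terminate
        }
    }

    public static void main(String[] args) {
        System.out.println(parseAll(4, 2)); // prints [page-1, page-2, page-3, page-4]
    }
}
```

Because the futures list preserves submission order, an explicit sort by page number is only needed when results are collected in completion order instead.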

Saving PDF pages as PNG images

PDF-table provides methods for saving PDF pages as PNG images.
Rendering DPI can be modified in PdfTableSettings (see: Parsing settings).

single-threaded example

class SingleThreadPNGDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}

multi-threaded example

class MultiThreadPNGDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        try {
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // release the worker threads so the JVM can exit
            executor.shutdown();
        }
    }
}

Saving debug PNG images

When tables in a PDF document cannot be parsed correctly with the default settings, you can save debug images that show the page at various stages of processing.
Using these images, you can adjust PdfTableSettings to achieve the desired results (see: Parsing settings).

single-threaded example

class SingleThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}

multi-threaded example

class MultiThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        try {
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // release the worker threads so the JVM can exit
            executor.shutdown();
        }
    }
}

Parsing settings

PDF rendering and OpenCV filtering settings are stored in a PdfTableSettings object.

A custom settings instance can be passed to the PdfTableReader constructor when non-default values are needed:

(...)

// build settings object
PdfTableSettings settings = PdfTableSettings.getBuilder()
                .setCannyFiltering(true)
                .setCannyApertureSize(5)
                .setCannyThreshold1(40)
                .setCannyThreshold2(190.5)
                .setPdfRenderingDpi(160)
                .build();

// pass settings to reader
PdfTableReader reader = new PdfTableReader(settings);
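To get a feel for what the rendering DPI does to image size, the sketch below converts between PDF points (1/72 inch in default user space) and pixels. It is an illustration only: the class and method names are hypothetical, and whether pdf-table rounds in the same way is an assumption.

```java
class DpiSketch {
    // Default PDF user space uses 72 points per inch.
    static final double POINTS_PER_INCH = 72.0;

    // Pixels produced when a length in PDF points is rendered at the given DPI.
    static int pointsToPixels(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    // Inverse mapping: a pixel coordinate back to PDF points.
    static double pixelsToPoints(int pixels, int dpi) {
        return pixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        int dpi = 160;
        // A US Letter page measures 612 x 792 points.
        System.out.println(pointsToPixels(612, dpi) + " x " + pointsToPixels(792, dpi)); // prints 1360 x 1760
    }
}
```

A higher DPI gives the contour detection more pixels per table line at the cost of larger images to process; the inverse mapping is the kind of conversion needed to turn a bounding rectangle found in the image back into page coordinates.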

Output format

Each parsed PDF page is returned as a ParsedTablePage object:

(...)

PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();

// first page in document has index == 1, not 0 !
ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

// getting page number
assert firstPage.getPageNum() == 1;

// rows and cells are zero-indexed just like elements of the List
// getting first row
ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

// getting third cell in second row
String thirdCellContent = firstPage.getRow(1).getCell(2);

// cell content usually contains <CR><LF> characters,
// so it is recommended to trim them before processing
double thirdCellNumericValue = Double.parseDouble(thirdCellContent.trim());
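Not every cell holds a number, so a direct parse can throw NumberFormatException. The helper below is a hypothetical sketch, not part of pdf-table: it trims the cell text (including trailing CR/LF) and returns an empty OptionalDouble when the content is blank or not numeric.

```java
import java.util.OptionalDouble;

class CellValueSketch {

    // Trims surrounding whitespace (including CR/LF) and parses the rest as a double.
    // Returns an empty OptionalDouble for null, blank, or non-numeric content.
    static OptionalDouble parseNumericCell(String raw) {
        if (raw == null) {
            return OptionalDouble.empty();
        }
        String trimmed = raw.trim();
        if (trimmed.isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(trimmed));
        } catch (NumberFormatException e) {
            return OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseNumericCell("12.5\r\n").getAsDouble()); // prints 12.5
        System.out.println(parseNumericCell("n/a\r\n").isPresent());    // prints false
    }
}
```

The string returned by getCell can be passed straight to such a helper, which keeps header and label cells from aborting the processing of a whole row.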