File exception: 'UglyToad.PdfPig.Core.PdfDocumentFormatException' was thrown. #860

FinalFrontierPrototyping · 2024-06-27T08:14:16Z

Hello,

I found this really nice project because I need to read and process many PDF files.
(At the moment I am using V0.19-Alpha, but I also tested V0.18.)
The PDF file can be opened with Adobe; however, when I try to read it with PdfPig, an error is thrown.

Once in a while, the call var document = PdfDocument.Open(fileEntry); throws the following exception:

'Exception of type 'UglyToad.PdfPig.Core.PdfDocumentFormatException' was thrown.'

UglyToad.PdfPig.Core.PdfDocumentFormatException
HResult=0x80131500
Message=Exception of type 'UglyToad.PdfPig.Core.PdfDocumentFormatException' was thrown.
Source=UglyToad.PdfPig
StackTrace:
at UglyToad.PdfPig.Parser.FileStructure.CrossReferenceParser.Parse(IInputBytes bytes, Boolean isLenientParsing, Int64 crossReferenceLocation, Int64 offsetCorrection, IPdfTokenScanner pdfScanner, ISeekableTokenScanner tokenScanner)
at UglyToad.PdfPig.Parser.PdfDocumentFactory.OpenDocument(IInputBytes inputBytes, ISeekableTokenScanner scanner, InternalParsingOptions parsingOptions)

Since the PDF files are confidential, I cannot share them. What can be the cause?

Thanks.

FinalFrontierPrototyping · 2024-06-27T15:51:59Z

I noticed that when I open the file, add one character to a field, save it, and reprocess it, no error is thrown.

FinalFrontierPrototyping · 2024-07-01T13:09:55Z

Is there anything I can provide to support you as efficiently as possible?
This issue is making my current tool non-functional because 5% of the PDF files cannot be processed.
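
While the root cause is open, one way to keep a batch job running is to isolate the files that fail instead of letting one exception stop the run. The following is a minimal sketch, not part of PdfPig: the class PdfBatchTriage and the open hook are hypothetical names introduced here for illustration. With PdfPig, the hook could be path => { using var doc = UglyToad.PdfPig.PdfDocument.Open(path); }.

```csharp
using System;
using System.Collections.Generic;

public static class PdfBatchTriage
{
    // Calls `open` for every path and records the paths that throw, so a single
    // malformed file does not abort the whole batch.
    // `open` is a hypothetical hook supplied by the caller (see the lead-in).
    public static List<(string Path, Exception Error)> FindFailures(
        IEnumerable<string> paths, Action<string> open)
    {
        var failures = new List<(string Path, Exception Error)>();
        foreach (var path in paths)
        {
            try
            {
                open(path);
            }
            catch (Exception ex)
            {
                // Keep the exception so its type and stack trace can be reported later.
                failures.Add((path, ex));
            }
        }
        return failures;
    }
}
```

The returned list gives the failing paths together with their exceptions, which makes it easier to check whether every failure is the same PdfDocumentFormatException from the same parser location.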

EliotJones · 2024-07-01T15:10:07Z

Unfortunately this error can be due to basically any unexpected formatting in the source file. Without the source file it is very difficult to tell.

The stack trace suggests the error happens while locating the cross-reference information near the end of the document, which looks like:

xref
0 103
0000000000 65535 f 
0000058002 00000 n 
0000000019 00000 n 
0000001903 00000 n 
0000058273 00000 n
...
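
A confidential file can still be checked locally without sharing it. A PDF normally ends with the keyword startxref, followed by the byte offset of the last cross-reference section and then %%EOF. The sketch below reads that offset and reports what actually sits there. It is an illustration under stated assumptions, not PdfPig's own logic: the class XrefProbe and its methods are hypothetical names, and the checks are deliberately crude.

```csharp
using System;
using System.Text;

public static class XrefProbe
{
    // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte offsets.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Returns the number written after the last "startxref" keyword, or -1 if none is found.
    public static long FindStartXrefOffset(byte[] pdf)
    {
        string text = Latin1.GetString(pdf);
        int keyword = text.LastIndexOf("startxref", StringComparison.Ordinal);
        if (keyword < 0) return -1;

        int i = keyword + "startxref".Length;
        while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
        int start = i;
        while (i < text.Length && text[i] >= '0' && text[i] <= '9') i++;

        return long.TryParse(text.Substring(start, i - start), out long offset) ? offset : -1;
    }

    // Describes the content found at the recorded offset.
    public static string Describe(byte[] pdf)
    {
        long offset = FindStartXrefOffset(pdf);
        if (offset < 0 || offset >= pdf.Length) return "no usable startxref offset";

        string text = Latin1.GetString(pdf);
        int at = (int)offset;
        string snippet = text.Substring(at, Math.Min(20, text.Length - at));

        if (snippet.StartsWith("xref", StringComparison.Ordinal)) return "xref table at " + offset;
        // A cross-reference stream begins with an object header such as "12 0 obj".
        if (snippet.Contains(" obj")) return "possible xref stream object at " + offset;
        return "unexpected content at " + offset + ": " + snippet.Trim();
    }
}
```

Calling XrefProbe.Describe(System.IO.File.ReadAllBytes(path)) on a failing file and on a copy re-saved by another tool may show whether the recorded offset is what differs between them.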

It might be possible to get more information about the error locally by debugging the PdfPig code. You can clone this repository and locally set the version of .NET you have available with this script https://github.com/UglyToad/PdfPig/blob/master/tools/set-dotnet-version.ps1

Then you can load the file in a test and see what is going wrong: https://github.com/UglyToad/PdfPig/blob/master/src/UglyToad.PdfPig.Tests/Integration/LocalTests.cs

FinalFrontierPrototyping · 2024-07-18T15:24:33Z

Hello @EliotJones,

Thanks for the feedback.
I removed the Nuget package and cloned the repository.

When I run my code I see the following:

PdfTokenScanner, line 241: Debug.WriteLine("Found more than 1 token in an object.");
The exception is thrown by CrossReferenceParser (line 164):

if (!TryParseCrossReferenceStream(previousCrossReferenceLocation, pdfScanner, null, out var tablePart))
{
    if (!TryBruteForceXrefTableLocate(bytes, previousCrossReferenceLocation, out var actualOffset))
    {
        // This is the point where the exception is thrown.
        throw new PdfDocumentFormatException();
    }

    previousCrossReferenceLocation = actualOffset;
    missedAttempts++;
    continue;
}

EliotJones · 2024-09-29T15:29:58Z

Sorry, I've basically run out of will to continue with this library, so you're probably long gone by now. But based on the error you're getting now, it sounds like there could be a malformed object in this specific file that PdfPig doesn't yet have a workaround for.

What you'd need to find is which object PdfTokenScanner is complaining about. An object in a PDF file has the form:

123 0 obj
// more stuff here
endobj

PdfTokenScanner will have the value of the object number in line 249:

var reference = new IndirectReference(objectNumber.Long, generation.Int);

You can then find the object in the file by searching for the text:

{objectNumber.Long} {generation.Int} obj

using e.g. Notepad++, then copy everything up to the endobj marker into this issue. That should give more insight into what error-correction PdfPig is currently lacking, if this is indeed the problem.
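
The manual search described above can also be scripted. The sketch below looks for the object header in the raw bytes and returns everything through the endobj marker. It is only an illustration: PdfObjectExtractor and TryExtract are hypothetical names, and the approach assumes the object is stored directly in the file body rather than inside a compressed object stream.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class PdfObjectExtractor
{
    // Finds "<objectNumber> <generation> obj" and returns the text through "endobj".
    // Returns false and sets raw to an empty string when the header is not found.
    public static bool TryExtract(byte[] pdf, long objectNumber, int generation, out string raw)
    {
        // ISO-8859-1 keeps a one-to-one mapping between bytes and chars.
        string text = Encoding.GetEncoding("ISO-8859-1").GetString(pdf);

        // The lookbehind stops "23 0 obj" from matching inside "123 0 obj".
        var header = new Regex(@"(?<![0-9])" + objectNumber + @"\s+" + generation + @"\s+obj\b");
        Match match = header.Match(text);
        if (!match.Success)
        {
            raw = string.Empty;
            return false;
        }

        int end = text.IndexOf("endobj", match.Index, StringComparison.Ordinal);
        raw = end < 0
            ? text.Substring(match.Index) // malformed: no endobj, return the rest of the file
            : text.Substring(match.Index, end + "endobj".Length - match.Index);
        return true;
    }
}
```

Only the first match is returned. A file that has been saved incrementally can define the same object number more than once, so it may be worth checking whether later definitions exist.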

EliotJones added the bug and document-reading (Related to reading documents) labels on Sep 29, 2024
