I am trying to extract plain text from a PDF document using pdf.js, but for some reason I cannot get past the Invalid PDF structure error.

My code is as follows:

const pdfjslib = require('pdfjs-dist');

const pdfPath = 'https://www.corenet.gov.sg/media/2268607/dc19-07.pdf';

// Load the document straight from the URL and log the result or the error.
const loadingTask = pdfjslib.getDocument(pdfPath);
loadingTask.promise
    .then((doc) => {
        console.log(doc);
    })
    .catch((err) => {
        console.log(err);
    });

I have tried other PDF documents coming from the same domain, but they all throw the same error:

...
Warning: Ignoring invalid character "34" in hex string
Warning: Ignoring invalid character "104" in hex string
Warning: Indexing all PDF objects
{ Error
    at InvalidPDFExceptionClosure (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:658:35)
    at Object.<anonymous> (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:661:2)
    at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
    at Object.defineProperty.value (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:129:23)
    at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
    at pdfjsVersion (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:116:18)
    at .../pdf_test/node_modules/pdfjs-dist/build/pdf.js:119:10
    at webpackUniversalModuleDefinition (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:25:20)
    at Object.<anonymous> (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:32:3)
    at Module._compile (internal/modules/cjs/loader.js:776:30)
  name: 'InvalidPDFException',
  message: 'Invalid PDF structure' }

PDFs from other domains seem to work. Note that downloading the PDF from the above domain works well, and the file can be viewed in the Chrome browser, so I doubt that the PDF document is corrupted. I am not implementing any front-end code, as the intention is to host the above code in the cloud.

asked Nov 14, 2019 at 11:17 by Koh, edited Nov 30, 2019 at 23:48 by halfer
  • This answer has an alternative script for extracting data you could try - stackoverflow.com/a/29032269/2570277 – Nick Commented Nov 14, 2019 at 11:27
  • Maybe the website you are downloading the PDF from checks the request headers. Can you try downloading the PDF with chrome and load it locally? – Cr4xy Commented Nov 14, 2019 at 11:35
  • @Cr4xy yes, I could download the PDF and load it locally. It loads and extracts the plain text correctly. If the request header is the "issue", any idea how I can get around it without downloading the PDF? – Koh Commented Nov 14, 2019 at 11:45
  • You could try to copy all the headers you can find with chrome dev tools, and add them to getDocument like here – Cr4xy Commented Nov 14, 2019 at 13:02
  • "I doubt that the pdf document is corrupted" don't guess, verify. So, step 1: grab Adobe Acrobat, the free version, and try to open those PDFs. Then edit your post so that it mentions whether, according to the most authoritative PDF reader that exists, these PDF files are broken or not. Because if Acrobat doesn't like them, this isn't a problem with pdfjs in the slightest. – Mike 'Pomax' Kamermans Commented Apr 29, 2023 at 16:06
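
The header suggestion in the comments above can be sketched as follows. This is only a sketch under stated assumptions, not a confirmed fix for this server: looksLikePdf and loadWithHeaders are hypothetical helper names, and the header values have to be copied from the browser's dev tools. It relies on two things: a PDF file carries the marker %PDF- at (or very near) its start, and getDocument in pdf.js accepts an object with url and httpHeaders. If the server answers a script with an HTML page instead of the PDF, the first check makes that visible before pdf.js reports Invalid PDF structure.

```javascript
// Hypothetical helper: true when the bytes look like a PDF, i.e. the
// "%PDF-" marker appears within the first 1024 bytes.
function looksLikePdf(bytes) {
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  return head.includes('%PDF-');
}

// Hypothetical helper: pdfjsLib is passed in so the sketch does not
// assume how the library was imported. The headers are placeholders.
function loadWithHeaders(pdfjsLib, url, headers) {
  return pdfjsLib.getDocument({ url, httpHeaders: headers }).promise;
}

// Usage (assumes `const pdfjsLib = require('pdfjs-dist')`):
// loadWithHeaders(pdfjsLib, pdfPath, { 'Some-Header': 'value from dev tools' })
//   .then((doc) => console.log(doc.numPages))
//   .catch((err) => console.log(err));
```

Running looksLikePdf on the bytes that the script actually receives is a quick way to tell a rejected request apart from a genuinely malformed file.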

2 Answers

I have seen this type of poor file structure before. Whilst most readers will initially accept such files, full of deletions as they are, they cause problems later.

The info metadata has been culled along with many other redactions but the index has not been rebuilt cleanly. These files may cause distorted behaviours such as missing annotation and other processing oddities.

One important indexed entry (the one for pages) is declared as active and then later declared as deleted, so the number of pages is technically invalidated. However, as commented by @mkl, the combined methods are valid, and the file will usually be passed as acceptable according to the accessibility standards, EXCEPT that the title had been removed along with the metadata. Once the "Title" is added as part of the manual checks the file is fully up to scratch, but saving it will naturally remove redundant entries and add others, so the file size is reduced.

endobj
startxref
27552
%%EOF

Summary
The checker found no problems in this document.

Needs manual check: 0
Passed manually: 2
Failed manually: 0
Skipped: 1
Passed: 29
Failed: 0

Looking at the source file, position 125 is the list of pages.

2 0 obj <</Type/Pages/Count 2/Kids[ 3 0 R 17 0 R] >> endobj

However, the whole table is then encoded yet again.

trailer
<</Size 156/Root 1 0 R/Info 148 0 R/ID[<F8D1F9DF5B960D47837AA3719AED9B81><F8D1F9DF5B960D47837AA3719AED9B81>] >>
startxref
27702
%%EOF
xref
0 0
trailer
<</Size 156/Root 1 0 R/Info 148 0 R/ID[<F8D1F9DF5B960D47837AA3719AED9B81><F8D1F9DF5B960D47837AA3719AED9B81>] /Prev 27702/XRefStm 27153>>
startxref
30982
%%EOF
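
To see how many cross-reference sections a file announces, the startxref offsets can be listed with a few lines of script. This is a minimal sketch: findStartXrefOffsets is a hypothetical helper name, and the sample string is abbreviated from the excerpt above rather than read from the real file.

```javascript
// Hypothetical helper: list every offset announced by a "startxref"
// keyword, in file order. More than one entry means the file carries
// several cross-reference sections, one per saved revision.
function findStartXrefOffsets(pdfText) {
  const offsets = [];
  const pattern = /startxref\s+(\d+)\s+%%EOF/g;
  let match;
  while ((match = pattern.exec(pdfText)) !== null) {
    offsets.push(Number(match[1]));
  }
  return offsets;
}

// Abbreviated from the excerpt above.
const sample = [
  'startxref', '27702', '%%EOF',
  'xref', '0 0', 'trailer',
  '<</Size 156/Root 1 0 R /Prev 27702/XRefStm 27153>>',
  'startxref', '30982', '%%EOF',
].join('\n');

console.log(findStartXrefOffsets(sample)); // [ 27702, 30982 ]
```

For a real file, the text could be obtained with fs.readFileSync(path, 'latin1'), which keeps one character per byte so the binary parts do not disturb the search.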

The best thing to do is run the file through any cleaning process to rationalise the working index even if that means a percentage increase in file size.

Beware: avoid using Ghostscript to rebuild the file (without any redundant objects), as it will reduce the object count drastically, from 156 down to 33. The cleaned, optimised file is therefore much smaller, but it has lost all accessibility data.

Avoid basic optimisation such as:
gs -sDEVICE=pdfwrite -oTestIt.pdf dc19-07.pdf

0000012448 00000 n 
0000017006 00000 n 
0000020240 00000 n 
0000007646 00000 n 
0000019954 00000 n 
0000022761 00000 n 
trailer
<< /Size 33 /Root 1 0 R /Info 2 0 R
/ID [<3488BED994E047F0C0FAB9AB78CA5FF8><3488BED994E047F0C0FAB9AB78CA5FF8>]
>>
startxref
24137
%%EOF
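
The drop from 156 to 33 can be read directly from the /Size entries of the two trailers. The following is a small sketch for comparing them; trailerSize is a hypothetical helper name, and the two strings are shortened copies of the trailers quoted in this answer.

```javascript
// Hypothetical helper: read the /Size entry of a trailer dictionary.
// Returns null when no /Size entry is present.
function trailerSize(trailerText) {
  const match = /\/Size\s+(\d+)/.exec(trailerText);
  return match ? Number(match[1]) : null;
}

// Shortened copies of the trailers quoted in this answer.
const original = '<</Size 156/Root 1 0 R/Info 148 0 R>>';
const rebuilt = '<< /Size 33 /Root 1 0 R /Info 2 0 R >>';

console.log(trailerSize(original) - trailerSize(rebuilt)); // 123
```

A large difference like this is a hint that the rebuild discarded far more than redundant entries, which matches the loss of accessibility data described above.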

Mutool cleaning is more likely to keep the desired source data:
mutool clean -m -s -f -i -gggg dc19-07.pdf clean-DC19-07.pdf

Browser console log errors did not help me to fix it.

I run a PHP app (Moodle). In the PHP error log I saw that some variables that were expected to be replaced within the HTML source body of the certificate being generated were missing.

Check your backend app's error logs, and check the HTML source body provided to PDF.js for missing and undefined variables.

Rebuilding the HTML body provided to PDF.js from scratch will help in debugging the source of the exception.