javascript - PDFJS: Invalid PDF structure - Stack Overflow

I am attempting to extract plain text from a PDF document using pdf.js, and for some reason I am unable to get past the Invalid PDF structure error.

My code is as follows:

const pdfjslib = require('pdfjs-dist');

const pdfPath = 'https://www.corenet.gov.sg/media/2268607/dc19-07.pdf';

const loadingTask = pdfjslib.getDocument(pdfPath);
loadingTask.promise.then(async (doc) => {
    console.log(doc);
    return null;
})
.catch((err) => {
    console.log(err);
});

I have tried other PDF documents coming from the same domain, but they all throw the same error:

...
Warning: Ignoring invalid character "34" in hex string
Warning: Ignoring invalid character "104" in hex string
Warning: Indexing all PDF objects
{ Error
    at InvalidPDFExceptionClosure (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:658:35)
    at Object.<anonymous> (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:661:2)
    at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
    at Object.defineProperty.value (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:129:23)
    at __w_pdfjs_require__ (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:52:30)
    at pdfjsVersion (...pdf_test/node_modules/pdfjs-dist/build/pdf.js:116:18)
    at .../pdf_test/node_modules/pdfjs-dist/build/pdf.js:119:10
    at webpackUniversalModuleDefinition (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:25:20)
    at Object.<anonymous> (.../pdf_test/node_modules/pdfjs-dist/build/pdf.js:32:3)
    at Module._compile (internal/modules/cjs/loader.js:776:30)
  name: 'InvalidPDFException',
  message: 'Invalid PDF structure' }

PDFs from other domains seem to work. Note that downloading the PDF from the above domain works well, and the file can be viewed in the Chrome browser, so I doubt that the PDF document is corrupted. I am not implementing any front-end code, as the intention is to host the above code in the cloud.
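
The two "invalid character" warnings are a useful clue: 34 and 104 are the character codes of `"` and `h`, which would be consistent with pdf.js being handed something like an HTML page rather than a PDF. A minimal sketch for checking what the server actually returned, assuming the response bytes have already been fetched by some other means (the helper name `looksLikePdf` is hypothetical, not part of pdf.js):

```javascript
// Returns true when the "%PDF-" marker appears within the first 1024 bytes,
// which is where PDF readers conventionally look for the file header.
function looksLikePdf(bytes) {
  // latin1 maps every byte to exactly one character, so binary data is safe to scan.
  const head = Buffer.from(bytes).subarray(0, 1024).toString('latin1');
  return head.includes('%PDF-');
}

// Character codes taken from the warnings in the question.
console.log(String.fromCharCode(34), String.fromCharCode(104)); // prints: " h
```

If this returns false for the bytes the Node process receives but true for the file saved from Chrome, the server is probably answering the two clients differently.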

asked Nov 14, 2019 at 11:17 by Koh, edited Nov 30, 2019 at 23:48 by halfer

This answer has an alternative script for extracting data that you could try: stackoverflow.com/a/29032269/2570277 – Nick Commented Nov 14, 2019 at 11:27
Maybe the website you are downloading the PDF from checks the request headers. Can you try downloading the PDF with Chrome and loading it locally? – Cr4xy Commented Nov 14, 2019 at 11:35
@Cr4xy yes, I could download the PDF and load it locally. It loads and extracts the plain text correctly. If the request header is the "issue", any idea how I can get around it without downloading the PDF? – Koh Commented Nov 14, 2019 at 11:45
You could try to copy all the headers you can find with Chrome dev tools, and add them to getDocument like here – Cr4xy Commented Nov 14, 2019 at 13:02
"I doubt that the pdf document is corrupted": don't guess, verify. So, step 1: grab Adobe Acrobat, the free version, and try to open those PDFs. Then edit your post so that it mentions whether, according to the most authoritative PDF reader that exists, these PDF files are broken or not. Because if Acrobat doesn't like them, this isn't a problem with pdfjs in the slightest. – Mike 'Pomax' Kamermans Commented Apr 29, 2023 at 16:06
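
The header suggestion from the comments could look roughly like the sketch below. `getDocument` accepts a parameter object with `url` and `httpHeaders` instead of a bare URL; the header values here are placeholders, and the helper names `buildSource` and `countPages` are hypothetical. Whether this particular server really filters on headers is an assumption that was not confirmed in the thread.

```javascript
// Placeholder header values (assumptions): replace them with the ones Chrome
// DevTools shows for a successful request to the same URL.
const browserLikeHeaders = {
  'User-Agent': 'Mozilla/5.0 (placeholder value copied from the browser)',
  Accept: 'application/pdf',
};

// Build the parameter object for getDocument instead of passing a bare URL.
function buildSource(url, httpHeaders) {
  return { url, httpHeaders };
}

// pdfjsLib is passed in so the sketch does not depend on a particular build path.
async function countPages(pdfjsLib, url) {
  const source = buildSource(url, browserLikeHeaders);
  const doc = await pdfjsLib.getDocument(source).promise;
  return doc.numPages;
}
```

With pdfjs-dist installed, a call such as `countPages(require('pdfjs-dist'), pdfPath).then(console.log).catch(console.error)` would exercise it against the real server.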

2 Answers

I have seen this type of poor file structure before. Whilst most readers will initially accept such files, full of deletions as they are, they cause problems later.

The info metadata has been culled along with many other redactions, but the index has not been rebuilt cleanly. These files may cause distorted behaviours such as missing annotations and other processing oddities.

One important indexed entry (the one for pages) is declared as active and then later declared as deleted, so the number of pages is technically invalidated. However, as commented by @mkl, the combined methods are valid and will usually pass as acceptable according to the accessibility standards, EXCEPT that the title had been removed along with the metadata. Once the "Title" is added as part of the manual checks, the file is fully up to scratch, but saving it will naturally remove redundant entries and add others, so the file size is reduced.

endobj
startxref
27552
%%EOF

Summary
The checker found no problems in this document.

Needs manual check: 0
Passed manually: 2
Failed manually: 0
Skipped: 1
Passed: 29
Failed: 0

Looking at the source file, position 125 is the list of pages.

2 0 obj <</Type/Pages/Count 2/Kids[ 3 0 R 17 0 R] >> endobj

However, the whole table is then encoded yet again.

trailer
<</Size 156/Root 1 0 R/Info 148 0 R/ID[<F8D1F9DF5B960D47837AA3719AED9B81><F8D1F9DF5B960D47837AA3719AED9B81>] >>
startxref
27702
%%EOF
xref
0 0
trailer
<</Size 156/Root 1 0 R/Info 148 0 R/ID[<F8D1F9DF5B960D47837AA3719AED9B81><F8D1F9DF5B960D47837AA3719AED9B81>] /Prev 27702/XRefStm 27153>>
startxref
30982
%%EOF
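
To see this kind of repeated trailer in another file, one option is to list every `startxref` entry: more than one usually means the file carries incremental updates or a duplicated table, as in the dump above. A minimal sketch; the helper name `listStartxrefOffsets` is hypothetical, and the regular expression assumes the usual `startxref`, offset, `%%EOF` layout:

```javascript
// Collect the offset of every "startxref <offset> %%EOF" group, in file order.
function listStartxrefOffsets(pdfBytes) {
  // latin1 maps every byte to exactly one character, so binary streams do not break the scan.
  const text = Buffer.from(pdfBytes).toString('latin1');
  const pattern = /startxref\s+(\d+)\s+%%EOF/g;
  const offsets = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    offsets.push(Number(match[1]));
  }
  return offsets;
}
```

For the file discussed here, a call such as `listStartxrefOffsets(require('fs').readFileSync('dc19-07.pdf'))` should include the offsets 27702 and 30982 shown in the dump.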

The best thing to do is to run the file through a cleaning process to rationalise the working index, even if that means a percentage increase in file size.

Beware: avoid using Ghostscript to rebuild the file (without any redundant objects), because it reduces the object count drastically, from 156 down to 33. The cleaned, optimised file is much smaller but has lost all accessibility data.

Avoid basic optimisation:
gs -sDEVICE=pdfwrite -oTestIt.pdf dc19-07.pdf

0000012448 00000 n 
0000017006 00000 n 
0000020240 00000 n 
0000007646 00000 n 
0000019954 00000 n 
0000022761 00000 n 
trailer
<< /Size 33 /Root 1 0 R /Info 2 0 R
/ID [<3488BED994E047F0C0FAB9AB78CA5FF8><3488BED994E047F0C0FAB9AB78CA5FF8>]
>>
startxref
24137
%%EOF

Mutool cleaning is more likely to keep desired source data.
mutool clean -m -s -f -i -gggg dc19-07.pdf clean-DC19-07.pdf
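
Once a cleaned copy exists on disk, the original goal of extracting plain text can be sketched as below. It reads the file into a `Uint8Array` and passes it to `getDocument` as `data`, so no HTTP request is involved. The helper names `pageItemsToText` and `extractText` are hypothetical, and joining text items with a single space is a simplification that ignores layout.

```javascript
const fs = require('fs');

// Join the text items of one page into a single line of plain text.
function pageItemsToText(items) {
  return items
    .filter((item) => typeof item.str === 'string') // skip items that carry no text
    .map((item) => item.str)
    .join(' ');
}

// pdfjsLib is passed in so the sketch does not depend on a particular build path.
async function extractText(pdfjsLib, filePath) {
  const data = new Uint8Array(fs.readFileSync(filePath));
  const doc = await pdfjsLib.getDocument({ data }).promise;
  const pages = [];
  for (let pageNumber = 1; pageNumber <= doc.numPages; pageNumber++) {
    const page = await doc.getPage(pageNumber);
    const content = await page.getTextContent();
    pages.push(pageItemsToText(content.items));
  }
  return pages.join('\n');
}
```

A call such as `extractText(require('pdfjs-dist'), 'clean-DC19-07.pdf').then(console.log)` would print one line of text per page.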

Browser console log errors did not help me to fix it.

I run a PHP app (Moodle). In the PHP error log I found that some variables that were expected to be replaced in the HTML source body of the certificate being generated had not been replaced.

Check your backend app's error logs, and check the HTML source body provided to PDF.js for missing and undefined variables.

Rebuilding the HTML body provided to PDF.js from scratch will help you track down the source of the exception.
