pymupdf - Why can't I extract text from this pdf? - Stack Overflow

I have tried a variety of methods both online and using python to extract text from this pdf:

.pdf

Every time, I just get seemingly random characters, for instance: 1$$2!!2!"34$+5

I have tried online options as well as this:

import fitz  # PyMuPDF

path = "/home/serveracct/342.pdf"

# Concatenate the plain text of every page in the document
with fitz.open(path) as doc:
    text = "".join(page.get_text() for page in doc)

print(text)

I believe that, since it is a court document, the format is PDF/A, but I am not 100% sure. I tried to detect an embedded image but could not find one. ocrmypdf works, but the resulting files become huge. For now I am just trying to determine what about the structure is preventing me from extracting the text. Also, when I open this PDF in Adobe, random boxes show up on the page.
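To narrow down the structure question, here is a minimal diagnostic sketch. It rests on an assumption that has not been confirmed for this file: that the garbling comes from embedded fonts with no ToUnicode map, so the glyph codes cannot be mapped back to real characters. The helper names letter_ratio and fonts_without_tounicode are hypothetical, made up for this illustration. A missing ToUnicode entry is only a hint, since fonts with a standard encoding can extract fine without one.

```python
def letter_ratio(text):
    """Fraction of non-whitespace characters that are letters.

    A value near 0 for a page of prose suggests the extracted text is garbled.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def fonts_without_tounicode(path):
    """Return (page number, font name, encoding) for fonts lacking a ToUnicode map.

    Assumption: such fonts are the likely cause of unreadable extraction.
    """
    import fitz  # PyMuPDF; imported here so letter_ratio works without it

    suspects = []
    with fitz.open(path) as doc:
        for page in doc:
            # Each entry: (xref, ext, type, basefont, name, encoding)
            for xref, _ext, _type, basefont, _name, encoding in page.get_fonts():
                if xref <= 0:
                    continue  # no font object to inspect
                kind, _value = doc.xref_get_key(xref, "ToUnicode")
                if kind == "null":  # key is absent from the font dictionary
                    suspects.append((page.number + 1, basefont, encoding))
    return suspects
```

Running letter_ratio on the output of page.get_text() for each page, and fonts_without_tounicode on the file path, would show whether the garbled pages line up with fonts that carry no Unicode mapping.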
