pdf - What character encoding uses these four bytes for digits? - Stack Overflow

Updated: 2025-02-010

I've received a business PDF document that I must process programmatically, but that I can't decode.

Part of the document reads like this in Acrobat:

180CTNS 1800PCS

But when I extract the bytes from the underlying text layer, I get:

f4 80 80 94 f4 80 80 9b  f4 80 80 93 f4 80 80 a6
f4 80 80 b7 f4 80 80 b1  f4 80 80 b6 20 f4 80 80
94 f4 80 80 9b f4 80 80  93 f4 80 80 93 f4 80 80
b3 f4 80 80 a6 f4 80 80  b6

So apparently whatever software created this file encodes '0' as f4 80 80 93, '1' as f4 80 80 94, and so on. But what encoding does this? I can't decode this byte sequence into readable text with any of the codecs in the Python standard library.
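For reference, the bytes above are well-formed UTF-8; they simply decode to private-use code points, which carry no standard meaning. A minimal Python sketch that makes this visible (the variable names are illustrative, not from any library):

```python
import unicodedata

# Bytes copied from the hex dump above (whitespace is ignored by fromhex).
raw = bytes.fromhex(
    "f4 80 80 94 f4 80 80 9b f4 80 80 93 f4 80 80 a6 "
    "f4 80 80 b7 f4 80 80 b1 f4 80 80 b6 20 f4 80 80 "
    "94 f4 80 80 9b f4 80 80 93 f4 80 80 93 f4 80 80 "
    "b3 f4 80 80 a6 f4 80 80 b6"
)

# The decode succeeds, but the result is not readable text.
text = raw.decode("utf-8")

# List each character as a code point with its Unicode general category.
# "Co" means "private use"; the plain space in the middle is "Zs".
code_points = [(f"U+{ord(ch):06X}", unicodedata.category(ch)) for ch in text]

for cp, category in code_points:
    print(cp, category)
```

The first entry is U+100014 with category Co, which is the code point that Acrobat renders as '1'.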

Edit

I've removed the Python tag, since this is really about what to do with this byte sequence, as obtained e.g. by copying and pasting from Acrobat to a text file.

So the question really is:

Can I somehow interrogate the PDF document to learn that \U00100014 (the Supplementary Private Use Area-B code point encoded as f4 80 80 94) is supposed to look like the digit '1'?
Or am I better off writing my own pseudo-decoder based on the content of that file (and others I might receive from the same customer)?
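To make the second option concrete, here is a minimal Python sketch of such a pseudo-decoder. It assumes the only ground truth is the one sample above, i.e. what Acrobat displays versus what the text layer yields. In that sample every private-use code point equals U+100000 plus the ASCII code minus 0x1D, but that constant offset is inferred from a single line and may not hold for other fonts or documents, so the sketch builds an explicit table from known pairs instead of hard-coding it. The names `build_table` and `pseudo_decode` are hypothetical.

```python
def build_table(extracted: str, displayed: str) -> dict[int, str]:
    """Map each extracted code point to the character shown on screen."""
    if len(extracted) != len(displayed):
        raise ValueError("sample strings must align one-to-one")
    table: dict[int, str] = {}
    for src, dst in zip(extracted, displayed):
        # Refuse samples where one code point maps to two different glyphs.
        if table.setdefault(ord(src), dst) != dst:
            raise ValueError(f"conflicting mapping for U+{ord(src):06X}")
    return table


def pseudo_decode(text: str, table: dict[int, str]) -> str:
    # Code points missing from the table pass through unchanged,
    # so gaps in the mapping stay visible in the output.
    return text.translate(table)


# Known sample: text-layer bytes and the text Acrobat displays for them.
extracted = bytes.fromhex(
    "f4 80 80 94 f4 80 80 9b f4 80 80 93 f4 80 80 a6 "
    "f4 80 80 b7 f4 80 80 b1 f4 80 80 b6 20 f4 80 80 "
    "94 f4 80 80 9b f4 80 80 93 f4 80 80 93 f4 80 80 "
    "b3 f4 80 80 a6 f4 80 80 b6"
).decode("utf-8")
table = build_table(extracted, "180CTNS 1800PCS")

# Offsets between displayed ASCII code and (code point - 0x100000),
# ignoring the space, which is stored as an ordinary U+0020.
offsets = {ord(d) - (ord(s) - 0x100000)
           for s, d in zip(extracted, "180CTNS 1800PCS") if s != " "}

print(pseudo_decode("\U00100014\U00100013\U00100013", table))
print(offsets)
```

A table built this way only covers characters that appeared in a known sample, so it would need to be extended as more documents from the same customer arrive.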
