I've received a business PDF document that I must process programmatically, but I can't decode its text layer.
Part of the document reads like this in Acrobat:
180CTNS 1800PCS
But when I extract the bytes from the underlying text layer, I get:
f4 80 80 94 f4 80 80 9b f4 80 80 93 f4 80 80 a6
f4 80 80 b7 f4 80 80 b1 f4 80 80 b6 20 f4 80 80
94 f4 80 80 9b f4 80 80 93 f4 80 80 93 f4 80 80
b3 f4 80 80 a6 f4 80 80 b6
So apparently whatever software created this file encodes '0' as f4 80 80 93, '1' as f4 80 80 94, and so on. But what encoding does this? I can't decode this byte sequence with any of the codecs in the Python standard library.
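For reference, a minimal sketch of inspecting the dump, assuming the hex listing above is complete and unmodified. Taken as they are printed, the bytes are well-formed UTF-8; the decoded characters are simply not readable, because every one except the space lands in the range U+100000 and above. The names `HEX_DUMP`, `raw`, `text` and `code_points` are illustrative only.

```python
# Hex dump copied verbatim from the listing above.
HEX_DUMP = """
f4 80 80 94 f4 80 80 9b f4 80 80 93 f4 80 80 a6
f4 80 80 b7 f4 80 80 b1 f4 80 80 b6 20 f4 80 80
94 f4 80 80 9b f4 80 80 93 f4 80 80 93 f4 80 80
b3 f4 80 80 a6 f4 80 80 b6
"""

# Strip all whitespace before parsing so the line breaks do not matter.
raw = bytes.fromhex("".join(HEX_DUMP.split()))

# The bytes form valid UTF-8; decoding yields one character per 4-byte group.
text = raw.decode("utf-8")

# Show each character as a Unicode code point.
code_points = [f"U+{ord(ch):06X}" for ch in text]
print(len(raw), "bytes ->", len(text), "characters")
print(" ".join(code_points))
```

The first code point printed is U+100014, which sits where Acrobat displays the digit '1'.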
Edit
I've removed the Python tag, since this is really about what to do with this byte sequence, as obtained e.g. by copying and pasting from Acrobat to a text file.
So the question really is:
- Can I somehow interrogate the PDF document to learn that \U00100014 (the code point in Supplementary Private Use Area-B, plane 16, encoded in UTF-8 as f4 80 80 94) is supposed to look like the digit '1'?
- Or am I better off writing my own pseudo-decoder based on the content of that file (and others I might receive from the same customer)?
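The second option could look roughly like the sketch below. It rests on one assumption taken only from the sample in this question: every private-use code point sits at a constant distance from the ASCII character Acrobat displays (U+100013 for '0', U+100014 for '1', and so on). Nothing guarantees that this holds for other characters, fonts, or files from the same customer. The names `OFFSET` and `pseudo_decode` are hypothetical.

```python
# Offset inferred from the single sample: U+100013 is displayed as '0' (U+0030).
# ASSUMPTION: the same offset applies to every character; verify on real data.
OFFSET = 0x100013 - ord("0")

# Supplementary Private Use Area-B spans U+100000..U+10FFFD.
PUA_B_FIRST, PUA_B_LAST = 0x100000, 0x10FFFD


def pseudo_decode(text: str) -> str:
    """Map private-use code points back to ASCII using the assumed offset."""
    out = []
    for ch in text:
        cp = ord(ch)
        if PUA_B_FIRST <= cp <= PUA_B_LAST:
            candidate = cp - OFFSET
            # Accept only printable ASCII; flag anything else for manual review.
            if 0x20 <= candidate <= 0x7E:
                out.append(chr(candidate))
            else:
                out.append("\uFFFD")
        else:
            # Ordinary characters (such as the space in the sample) pass through.
            out.append(ch)
    return "".join(out)


sample = (
    "\U00100014\U0010001B\U00100013\U00100026\U00100037\U00100031\U00100036"
    " "
    "\U00100014\U0010001B\U00100013\U00100013\U00100033\U00100026\U00100036"
)
print(pseudo_decode(sample))  # prints "180CTNS 1800PCS"
```

On the sample from this question the sketch reproduces the text Acrobat shows, which supports the constant-offset guess for digits and uppercase letters at least. Emitting U+FFFD for anything outside printable ASCII makes it easier to spot where the guess breaks down.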