I've received a business PDF document that I must process programmatically, but I can't decode its text layer.

Part of the document reads like this in Acrobat:

180CTNS 1800PCS

But when I extract the bytes from the underlying text layer, I get:

f4 80 80 94 f4 80 80 9b  f4 80 80 93 f4 80 80 a6
f4 80 80 b7 f4 80 80 b1  f4 80 80 b6 20 f4 80 80
94 f4 80 80 9b f4 80 80  93 f4 80 80 93 f4 80 80
b3 f4 80 80 a6 f4 80 80  b6

So apparently whatever software created this file encodes '0' as f4 80 80 93, '1' as f4 80 80 94, and so on. But what encoding does this? I can't decode this byte sequence into readable text with any of the codecs in the Python standard library.
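
For what it's worth, the bytes above are well-formed UTF-8; they just don't decode to anything readable. Here is a minimal sketch that decodes the dump and lists the resulting code points. The names `raw`, `text` and `code_points` are only illustrative.

```python
# The hex dump from the question; bytes.fromhex() ignores the spaces.
raw = bytes.fromhex(
    "f4 80 80 94 f4 80 80 9b f4 80 80 93 f4 80 80 a6 "
    "f4 80 80 b7 f4 80 80 b1 f4 80 80 b6 20 f4 80 80 "
    "94 f4 80 80 9b f4 80 80 93 f4 80 80 93 f4 80 80 "
    "b3 f4 80 80 a6 f4 80 80 b6"
)

# Decoding as UTF-8 succeeds, but yields private-use characters.
text = raw.decode("utf-8")
code_points = [ord(ch) for ch in text]

for ch in text:
    print(f"U+{ord(ch):06X}")

# Every character except the plain space lies in
# Supplementary Private Use Area-B (U+100000..U+10FFFD).
in_pua_b = [0x100000 <= cp <= 0x10FFFD for cp in code_points]
```

So the text layer holds 15 characters, matching the 15 characters of "180CTNS 1800PCS", and only the space is a regular character.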

Edit

I've removed the Python tag, since this is really about what to do with this byte sequence, as obtained e.g. by copying and pasting from Acrobat to a text file.

So the question really is:

  • Can I somehow interrogate the PDF document to learn that \U00100014 (the Supplementary Private Use Area-B code point that UTF-8 encodes as f4 80 80 94) is supposed to look like the digit '1'?
  • Or am I better off writing my own pseudo-decoder based on the content of that file (and others I might receive from the same customer)?
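
To make the second option concrete, here is a rough sketch of such a pseudo-decoder. It rests on one observation from the sample above: every private-use code point sits at a constant distance from the ASCII character Acrobat displays ('0' is U+100013, 'C' is U+100026, and so on). That this offset holds for other fonts or other files from the same customer is purely an assumption, and `OFFSET` and `pseudo_decode` are made-up names.

```python
PUA_B_START = 0x100000
PUA_B_END = 0x10FFFD

# Inferred from the sample only: '0' (U+0030) appears as U+100013.
OFFSET = 0x100013 - ord("0")  # 0xFFFE3

def pseudo_decode(raw: bytes) -> str:
    """Map private-use code points back to ASCII; keep everything else."""
    out = []
    for ch in raw.decode("utf-8"):
        cp = ord(ch)
        if PUA_B_START <= cp <= PUA_B_END:
            out.append(chr(cp - OFFSET))
        else:
            out.append(ch)  # e.g. the ordinary space (0x20)
    return "".join(out)

sample = bytes.fromhex(
    "f4 80 80 94 f4 80 80 9b f4 80 80 93 f4 80 80 a6 "
    "f4 80 80 b7 f4 80 80 b1 f4 80 80 b6 20 f4 80 80 "
    "94 f4 80 80 9b f4 80 80 93 f4 80 80 93 f4 80 80 "
    "b3 f4 80 80 a6 f4 80 80 b6"
)
print(pseudo_decode(sample))
```

On the sample this gives back the text that Acrobat shows, but a lookup table built per font would be safer if the offset turns out not to be constant.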
