I have a block of text data, almost all of which is valid utf8. Almost all -- but not all. It contains a mixture of other popular encodings, such as iso8859-xx, windows-1252 and even EBCDIC. The encodings are NOT demarcated or indicated in this block of text. The encoding info is not available to me.
I want to be able to pass this text into systems that accept only valid utf8. That is, I would like to encode the invalid bytes into some kind of valid escapes. I'd like to do this reversibly, so that the original text can be restored, when needed. What is the most minimal, most-standard way of doing this?
For example, I could URL-encode everything. But this is ugly; blank spaces turn into %20 and so on. It's overkill: it takes perfectly valid UTF8 and converts it to hard-to-read gorp.
I could invent my own encoding. I already tried converting the invalid bytes into the so-called "surrogate code points" in the range U+D800–U+DFFF, but promptly discovered that UTF-8 subsystems hate these (they throw exceptions, errors, etc.)
I could invent my own encoding, printing the invalid bytes as ascii hexadecimal text. Not sure how to demarcate the beginning and end. Replacing the invalid bytes with "yo dude here comes binary>>deadbeef<<end of binary, resume utf8 now" seems a tad unprofessional.
Rather than inventing my own encoding, I'd like to do something that "everyone else does", ideally following some standard. Is this possible? How? Where?
(p.s. I do not need to encode null bytes. The text is null-byte-terminated, as always, from the C/C++ perspective.)
p.p.s. Some context. String escape problems are common: there must be hundreds of variants of this question here on Stack Exchange. The issue commonly occurs in SQL, where invalid bytes need to be stored seamlessly in tables; for that special case, there are many SQL tools. It also comes up in Python, where the call fooobar.path.encode('utf8', 'surrogateescape') provides the needed encoding/decoding. Similar issues arise in most programming languages; many/most are now standardizing on UTF-8 for the internal string representation. I'm coding in C/C++, and I have to support bindings in many languages, including Python, Scheme, OCaml, Haskell. Probably Rust, soon. I could use iconv and specify a conversion from UTF-8 to UTF-8//IGNORE, but that discards the invalid bytes instead of encoding them. I'm kind of stumped.
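For reference, the Python surrogateescape behavior mentioned above (PEP 383) works like this: each invalid byte 0xNN decodes to the lone surrogate U+DC00+0xNN, and the same error handler reverses the mapping on encode, so the round trip is lossless. A small illustrative sketch (the byte string is made up):

```python
# Mixed input: valid UTF-8 plus stray Latin-1/CP1252 bytes (0xE9, 0x92).
raw = b"caf\xe9 \x92quoted\x92"

text = raw.decode("utf-8", "surrogateescape")
assert "\udce9" in text          # invalid byte 0xE9 became lone surrogate U+DCE9
assert "\udc92" in text          # invalid byte 0x92 became U+DC92

# The same error handler inverts the mapping, byte-for-byte:
assert text.encode("utf-8", "surrogateescape") == raw

# But note: the lone surrogates make the string unencodable as *strict*
# UTF-8, which is exactly why non-Python subsystems choke on it.
try:
    text.encode("utf-8")
except UnicodeEncodeError:
    pass                         # strict encode rejects the surrogates
```

The catch, as noted above, is that this only helps inside Python; the escaped string is not itself valid strict UTF-8.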
p.p.p.s. More context. A classic example is Linux filenames. Linux filesystems do not care about the encoding; it's just bytes. Files sitting on older Samba servers will often have Windows-1252 encodings for the quote marks appearing in the filename. Handing that raw filename to Python results in an exception. But one does not know a priori what the encoding was. It's probably Microsoft, but who knows; it might be KOI-8 from back in the day. Microsoft was world-wide. The Linux filesystems never tracked this info; it was left to "higher layers", now gone. Other in-the-wild examples include the content at the Internet Archive/Wayback Machine and Project Gutenberg. These files may or may not declare encodings; when they do, the declarations are often wrong — human error from decades ago. The real issue is finding the bytes that need escaping: if I have some terabytes of text, I'd rather not crawl over every byte, trying to scrub it. I'd rather encode it once, and when joe-blow-user comes back and says "oh hey, that data you gave me?" I would very much prefer to say: "it's 100% valid UTF-8, and the binary chars are encoded in well-known industry-standard XYZ, look it up."
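On the "terabytes of text" worry: a strict decode is already a single linear scan, so deciding whether a blob needs escaping at all is cheap; only the failures need the reversible treatment. A minimal sketch (the helper name is mine, not any standard API):

```python
# Hypothetical helper: a strict UTF-8 decode is one linear pass and
# raises on the first bad byte, so validation costs one scan, not a
# hand-rolled byte-by-byte scrub.
def needs_escaping(raw: bytes) -> bool:
    try:
        raw.decode("utf-8")      # strict: raises on any invalid sequence
        return False
    except UnicodeDecodeError:
        return True

assert not needs_escaping("caf\u00e9".encode("utf-8"))  # valid UTF-8 passes
assert needs_escaping(b"caf\xe9")                       # bare Latin-1 byte fails
```

With a fast path like this, the escaping step only ever touches the (usually tiny) fraction of data that is not already clean UTF-8.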
p.p.p.p.s. The question was down-voted and closed, without any explanation of why it was down-voted, or what is unclear. It is conventional in text processing that one is given 3rd-party data and 3rd-party tools, and asked to do something with it: split it, chunk it, count it, lexicalize, extract, cluster, and, these days, deep-learn/neural-net it. The file data is always separate from the meta-data: the owner might know all about the encoding, but the processing pipeline does not have access to it. That's OK: almost all the processing one could ever do simply does not care about the encoding. LLMs certainly don't care, and tokenizing/counting requires knowing only that 0x20 is the separator token. The user/customer/client, who does know what the encoding is, wants the processing done without you wrecking that encoding. Here's the rub: many of the processing tools also don't care about the encoding: e.g. RocksDB works fine with binary strings. Others care: sqlite3 fails unless given UTF-8 with quotes escaped. Some languages provide tools: Python will throw errors, but it also provides encode and decode functions that do "surrogateescape". Other languages don't: for example, Guile Scheme won't work unless the string is encoded, even if the only thing you ever do is copy that string. And then there's Java ... and JavaScript ... well, it never stops.
I was hoping for an answer along the lines of "now that unicode has matured (it's been 30+ years) everyone is migrating towards a single solution that unifies all frameworks and programming languages."
If there were this kind of agreed-upon solution, some RFC or PEP or whatever, then I could get on the dozen different mailing lists and nag the three dozen different maintainers: "hey guys, let's all do it exactly the same way". But that kind of nagging is useless if there's no starting point to gel on, to nucleate, where at least some of the maintainers can reach a consensus. Given the comment thread and the downvotes, there's no light at the end of the tunnel.
For now, the best solution seems to be "take any invalid byte, and add U+F000 to it". This is what Microsoft's NTFS did; with Microsoft setting a precedent, maybe that's OK. The other tack would be to do what Python 3 did: map each invalid byte 0xNN to U+DC00 + 0xNN, i.e. into the low-surrogate range U+DC80–U+DCFF. I discovered that non-Python systems throw exceptions when given chars in this range, so that's a non-starter, at least for now.
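The "add U+F000" scheme is easy to sketch. A hedged illustration (function names are mine, not any standard API): each invalid byte 0xNN becomes the private-use code point U+F000 + 0xNN, which strict UTF-8 subsystems accept without complaint, and decoding subtracts the offset. Caveat, shared with NTFS's own scheme: it is ambiguous if the original text legitimately contains U+F001–U+F0FF.

```python
# Hypothetical sketch of the NTFS-style escape: invalid bytes -> U+F0NN.
def escape_f000(raw: bytes) -> str:
    out, i = [], 0
    while i < len(raw):
        try:
            out.append(raw[i:].decode("utf-8"))   # rest is valid: done
            break
        except UnicodeDecodeError as e:
            out.append(raw[i:i + e.start].decode("utf-8"))  # valid prefix
            for b in raw[i + e.start:i + e.end]:            # bad bytes
                out.append(chr(0xF000 + b))
            i += e.end
    return "".join(out)

def unescape_f000(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xF000 <= cp <= 0xF0FF:
            out.append(cp - 0xF000)               # undo the offset
        else:
            out += ch.encode("utf-8")
    return bytes(out)

raw = b"ok \x92 \xff end"
text = escape_f000(raw)
text.encode("utf-8")                              # strict encode succeeds
assert unescape_f000(text) == raw                 # lossless round trip
```

Unlike the surrogateescape result, the escaped string here survives any strict-UTF-8 pipeline, which is the whole point of picking the private-use area.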