encoding - Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings - Stack Overflow


I'm using the JavaScript window.atob() function to decode a base64-encoded string (specifically the base64-encoded content from the GitHub API). The problem is that I'm getting mis-decoded characters back (like â¢ instead of ™). How can I properly handle the incoming base64-encoded stream so that it's decoded as UTF-8?

asked May 7, 2015 at 16:12 by brandonscript (edited May 7, 2015 at 16:18)

The MDN page you linked has a paragraph starting with the phrase "For use with Unicode or UTF-8 strings,". – Pointy, May 7, 2015 at 16:16
Are you on node? There are better solutions than atob – Bergi, May 7, 2015 at 16:16


The Unicode Problem

Though JavaScript (ECMAScript) has matured, the fragility of Base64, ASCII, and Unicode encoding has caused a lot of headaches (much of it is in this question's history).

Consider the following example:

const ok = "a";
console.log(ok.codePointAt(0).toString(16)); //   61: fits in 1 byte

const notOK = "✓";
console.log(notOK.codePointAt(0).toString(16)); // 2713: occupies > 1 byte

console.log(btoa(ok));    // YQ==
console.log(btoa(notOK)); // error (InvalidCharacterError)

Why do we encounter this?

Base64, by design, expects binary data as its input. In terms of JavaScript strings, this means strings in which each character occupies only one byte. So if you pass a string into btoa() containing characters that occupy more than one byte, you will get an error, because this is not considered binary data.

Source: MDN (2021)

The original MDN article also covered the limitations of window.btoa and window.atob when used with Unicode strings. That article is no longer available, but it explained:

The "Unicode Problem" Since DOMStrings are 16-bit-encoded strings, in most browsers calling window.btoa on a UTF-8 string will cause a Character Out Of Range exception if a character exceeds the range of a 8-bit byte (0x00~0xFF).
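As a rough illustration of the one-byte rule described above, the following sketch checks whether a string is safe to pass to btoa(). The helper names isBinaryString and tryBtoa are made up for this example and are not part of any API.

```javascript
// Hypothetical helper: true when every UTF-16 code unit fits in one byte (0x00-0xFF).
function isBinaryString(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

// Hypothetical helper: returns the base64 string, or null if btoa() rejects the input.
function tryBtoa(str) {
  try {
    return btoa(str);
  } catch (err) {
    return null;
  }
}

console.log(isBinaryString("a"), tryBtoa("a")); // true "YQ=="
console.log(isBinaryString("à"), tryBtoa("à")); // true "4A==" (0xE0 still fits in one byte)
console.log(isBinaryString("✓"), tryBtoa("✓")); // false null
```

Note that "à" is accepted because its code point is below 0x100, but the result is the single byte 0xE0, not the two-byte UTF-8 encoding of "à".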

Solution with binary interoperability

If you're not sure which solution you want, this is probably the one you want. Keep scrolling for the ASCII base64 solution and history of this answer.

You may also be interested in some of the answers that use TextDecoder like https://stackoverflow.com/a/77383580/1214800

Source: MDN (2021)

The solution recommended by MDN is to actually encode to and from a binary string representation:

Encoding UTF-8 ⇢ binary

// convert a Unicode (UTF-16) JavaScript string to a string in which
// each 16-bit unit is split into two one-byte characters
function toBinary(string) {
  const codeUnits = new Uint16Array(string.length);
  for (let i = 0; i < codeUnits.length; i++) {
    codeUnits[i] = string.charCodeAt(i);
  }
  return btoa(String.fromCharCode(...new Uint8Array(codeUnits.buffer)));
}

// a string that contains characters occupying > 1 byte
let encoded = toBinary("✓ à la mode") // "EycgAOAAIABsAGEAIABtAG8AZABlAA=="

Decoding binary ⇢ UTF-8

function fromBinary(encoded) {
  const binary = atob(encoded);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return String.fromCharCode(...new Uint16Array(bytes.buffer));
}

// our previous Base64-encoded string
let decoded = fromBinary(encoded) // "✓ à la mode"

Where this falls a little short is that the encoded string EycgAOAAIABsAGEAIABtAG8AZABlAA== does not match the string 4pyTIMOgIGxhIG1vZGU= produced by the UTF-8 solution described below. This is because it is a binary-encoded native JavaScript string, not a UTF-8-encoded string. If this doesn't matter to you (i.e., you aren't converting strings represented in Unicode from another system or are fine with JavaScript's native UTF-16LE encoding), then you're good to go. If, however, you want to preserve the UTF-8 functionality, you're better off using the solution described below.
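One way to see what the binary form actually contains: assuming a little-endian platform (typed arrays use the platform byte order, and this is an assumption about where the encoder ran), the bytes are plain UTF-16LE, so a TextDecoder created with the "utf-16le" label can read them. The function name decodeUtf16Base64 is hypothetical.

```javascript
// Hypothetical helper: decode base64 whose payload is UTF-16LE bytes,
// as produced by the binary encoding approach on a little-endian machine.
function decodeUtf16Base64(encoded) {
  const binary = atob(encoded);
  const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
  return new TextDecoder("utf-16le").decode(bytes);
}

const utf16Result = decodeUtf16Base64("EycgAOAAIABsAGEAIABtAG8AZABlAA==");
console.log(utf16Result); // "✓ à la mode"
```

A system that expects UTF-8 inside the base64 payload would not decode this string correctly, which is the interoperability cost mentioned above.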

Solution with ASCII base64 interoperability

The entire history of this question shows just how many different ways we've had to work around broken encoding systems over the years. Though the original MDN article no longer exists, this solution is still arguably a better one, and does a great job of solving "The Unicode Problem" while maintaining plain text base64 strings that you can decode on, say, base64decode.org.

There are two possible methods to solve this problem:

the first one is to escape the whole string (see encodeURIComponent) and then encode it;

the second one is to convert the UTF-16 DOMString to an unsigned 8-bit integer array (Uint8Array) of characters and then encode it.
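The second method in the list above can be sketched with the built-in TextEncoder, which produces the UTF-8 bytes as a Uint8Array. This is only an illustration; the function name encodeBase64Utf8 is an assumption of this example, not something from the MDN article.

```javascript
// Hypothetical helper: UTF-8 encode a string, then base64-encode the bytes.
function encodeBase64Utf8(str) {
  const bytes = new TextEncoder().encode(str); // Uint8Array of UTF-8 bytes
  let binary = "";
  // Build the one-byte-per-character string in a loop rather than with
  // spread syntax, so very long inputs do not hit argument-count limits.
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

console.log(encodeBase64Utf8("✓ à la mode")); // "4pyTIMOgIGxhIG1vZGU="
```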

A note on previous solutions: the MDN article originally suggested using unescape and escape to solve the Character Out Of Range exception problem, but they have since been deprecated. Some other answers here have suggested working around this with decodeURIComponent and encodeURIComponent alone, but this has proven to be unreliable and unpredictable. The most recent update to this answer uses modern JavaScript functions to improve speed and modernize code.

If you're trying to save yourself some time, you could also consider using a library:

js-base64 (NPM, great for Node.js)
base64-js

Encoding UTF-8 ⇢ base64

function b64EncodeUnicode(str) {
    // first we use encodeURIComponent to get percent-encoded Unicode,
    // then we convert the percent encodings into raw bytes which
    // can be fed into btoa.
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
        function toSolidBytes(match, p1) {
            return String.fromCharCode('0x' + p1);
    }));
}

b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64EncodeUnicode('\n'); // "Cg=="

Decoding base64 ⇢ UTF-8

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
b64DecodeUnicode('Cg=='); // "\n"

(Why do we need ('00' + c.charCodeAt(0).toString(16)).slice(-2)? It left-pads single-digit hex values with a 0. For example, when c is "\n", c.charCodeAt(0).toString(16) returns "a", and the padding turns it into "0a" so that the percent-encoding is the valid "%0a".)
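To make the padding step concrete, here is a small sketch that isolates it. The helper names toPercentByte and decodesCleanly are invented for this example.

```javascript
// Hypothetical helper: the same expression used in b64DecodeUnicode, pulled out.
function toPercentByte(c) {
  return "%" + ("00" + c.charCodeAt(0).toString(16)).slice(-2);
}

// Hypothetical helper: reports whether decodeURIComponent accepts the input.
function decodesCleanly(s) {
  try {
    decodeURIComponent(s);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(toPercentByte("\n"));    // "%0a"
console.log(toPercentByte("a"));     // "%61"
console.log(decodesCleanly("%0a"));  // true
console.log(decodesCleanly("%a"));   // false: an unpadded escape is malformed
```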

TypeScript support

Here's the same solution with some additional TypeScript compatibility (via @MA-Maddin):

// Encoding UTF-8 ⇢ base64

function b64EncodeUnicode(str) {
    return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
        return String.fromCharCode(parseInt(p1, 16))
    }))
}

// Decoding base64 ⇢ UTF-8

function b64DecodeUnicode(str) {
    return decodeURIComponent(Array.prototype.map.call(atob(str), function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2)
    }).join(''))
}

The first solution (deprecated)

This used escape and unescape (which are now deprecated, though this still works in all modern browsers):

function utf8_to_b64( str ) {
    return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
    return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

And one last thing: I first encountered this problem when calling the GitHub API. To get this to work on (Mobile) Safari properly, I actually had to strip all white space from the base64 source before I could even decode the source. Whether or not this is still relevant in 2021, I don't know:

function b64_to_utf8( str ) {
    str = str.replace(/\s/g, '');    
    return decodeURIComponent(escape(window.atob( str )));
}

Decoding base64 to UTF8 String

Below is the currently most-voted answer, by @brandonscript:

function b64DecodeUnicode(str) {
    // Going backwards: from bytestream, to percent-encoding, to original string.
    return decodeURIComponent(atob(str).split('').map(function(c) {
        return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
    }).join(''));
}

The code above works, but it is very slow. For a very large input, for example a 30,000-character base64 string holding an HTML document, it requires a lot of computation.

Here is my answer: use the built-in TextDecoder, which was nearly 10x faster than the code above for large input.

function decodeBase64(base64) {
    const text = atob(base64);
    const length = text.length;
    const bytes = new Uint8Array(length);
    for (let i = 0; i < length; i++) {
        bytes[i] = text.charCodeAt(i);
    }
    const decoder = new TextDecoder(); // default is utf-8
    return decoder.decode(bytes);
}
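If the base64 payload might not be valid UTF-8, a variation on the function above can detect that instead of silently inserting replacement characters. This sketch relies on the fatal option of the TextDecoder constructor; the name decodeBase64Strict and the choice to return null on failure are assumptions of this example.

```javascript
// Hypothetical strict variant: returns null when the bytes are not valid UTF-8.
function decodeBase64Strict(base64) {
  const bytes = Uint8Array.from(atob(base64), (c) => c.charCodeAt(0));
  try {
    // With fatal: true, decode() throws a TypeError on malformed input.
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch (err) {
    return null;
  }
}

console.log(decodeBase64Strict("4pyTIMOgIGxhIG1vZGU=")); // "✓ à la mode"
console.log(decodeBase64Strict("4A=="));                 // null (lone byte 0xE0)
```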

Things change. The escape/unescape methods have been deprecated.

You can URI-encode the string before you Base64-encode it. Note that this doesn't produce Base64-encoded UTF-8, but rather Base64-encoded URL-encoded data. Both sides must agree on the same encoding.

See working example here: http://codepen.io/anon/pen/PZgbPW

// encode string
var base64 = window.btoa(encodeURIComponent('€ 你好 æøåÆØÅ'));
// decode string
var str = decodeURIComponent(window.atob(base64));
// str is now === '€ 你好 æøåÆØÅ'
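To illustrate why both sides must agree on the encoding, this sketch encodes the same character both ways and compares the results. The variable names are chosen for the example only.

```javascript
const sample = "€";

// Base64 of the percent-encoded text "%E2%82%AC" (9 ASCII characters).
const uriThenBase64 = btoa(encodeURIComponent(sample));

// Base64 of the raw UTF-8 bytes E2 82 AC (3 bytes).
const utf8Bytes = new TextEncoder().encode(sample);
const utf8Base64 = btoa(String.fromCharCode(...utf8Bytes));

console.log(uriThenBase64); // "JUUyJTgyJUFD"
console.log(utf8Base64);    // "4oKs"

// Each form only round-trips through its own matching decoder.
console.log(decodeURIComponent(atob(uriThenBase64)) === sample); // true
```

The URI-encoded form is also noticeably longer for non-ASCII text, since every byte becomes three characters before base64 is applied.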

For OP's problem a third party library such as js-base64 should solve the problem.

The complete article that works for me: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Base64_encoding_and_decoding

The part where we encode from Unicode/UTF-8 is

function utf8_to_b64( str ) {
   return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
   return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

This is one of the most used methods nowadays.

If treating strings as bytes is more your thing, you can use the following functions:

function u_atob(ascii) {
    return Uint8Array.from(atob(ascii), c => c.charCodeAt(0));
}

function u_btoa(buffer) {
    var binary = [];
    var bytes = new Uint8Array(buffer);
    for (var i = 0, il = bytes.byteLength; i < il; i++) {
        binary.push(String.fromCharCode(bytes[i]));
    }
    return btoa(binary.join(''));
}


// example, it works also with astral plane characters such as '
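The original usage example is cut off in this copy. As a stand-in, here is a minimal sketch of how the two byte-oriented functions might be combined with TextEncoder and TextDecoder for a round trip; the sample text (including the astral-plane emoji) and the variable names are assumptions, not the author's original example.

```javascript
// Copies of the two functions above, so this block runs on its own.
function u_atob(ascii) {
  return Uint8Array.from(atob(ascii), (c) => c.charCodeAt(0));
}

function u_btoa(buffer) {
  const binary = [];
  const bytes = new Uint8Array(buffer);
  for (let i = 0, il = bytes.byteLength; i < il; i++) {
    binary.push(String.fromCharCode(bytes[i]));
  }
  return btoa(binary.join(""));
}

// String -> UTF-8 bytes -> base64 -> UTF-8 bytes -> string
const sampleText = "✓ à la mode 😀"; // 😀 (U+1F600) lies outside the BMP
const sampleBase64 = u_btoa(new TextEncoder().encode(sampleText));
const roundTripped = new TextDecoder().decode(u_atob(sampleBase64));

console.log(roundTripped === sampleText); // true
```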