
To express, for example, the character U+10400 in JavaScript, I use "\uD801\uDC00" or String.fromCharCode(0xD801) + String.fromCharCode(0xDC00). How do I figure that out for a given Unicode character? I want the following:

var char = getUnicodeCharacter(0x10400);

How do I find 0xD801 and 0xDC00 from 0x10400?

asked Aug 19, 2011 at 19:22 by gilly3 (edited Aug 19, 2011 at 22:09)
  • See the wikipedia article on UTF-16. – hmakholm left over Monica Commented Aug 19, 2011 at 19:24
  • I can't believe that this many years later Javascript is still in the Stone Age regarding Unicode. Having only BMP characters was something that should have gone out the door with Unicode 1.1 something like 15 years ago. Why is Javascript still so broken? – tchrist Commented Aug 20, 2011 at 3:48
  • 6 @tchrist: because you can't change a language's basic string model without widespread application breakage. Java, .NET and Windows in general are in the same boat: most of the world is afflicted by the UTF-16 curse. Browser JavaScript has a further hurdle in that the DOM standard also requires strings to be indexed by UTF-16 code units. – bobince Commented Aug 20, 2011 at 10:06
  • @bobince: I agree that the UTF-16 Curse sucks, but it may not be insurmountable. There are still measures that can be taken. You can provide alternate libraries available by explicit declaration that have a code point interface sitting on top the original code unit one. On the other hand, the UCS-2 that afflicts Javascript and many aspects of narrow builds of Python is a scourge, and some of the JVM languages can't make use of the code point interfaces that Java is able to provide if you ask nicely enough. – tchrist Commented Aug 20, 2011 at 10:29
  • 3 String.fromCharCode(0xD801) + String.fromCharCode(0xDC00) can be written as String.fromCharCode(0xD801, 0xDC00). – Mathias Bynens Commented Feb 2, 2012 at 13:08

2 Answers


Based on the Wikipedia article linked by Henning Makholm, the following function returns the correct character for a given code point:

function getUnicodeCharacter(cp) {
    if (cp >= 0 && cp <= 0xD7FF || cp >= 0xE000 && cp <= 0xFFFF) {
        // BMP code point outside the surrogate range: a single code unit
        return String.fromCharCode(cp);
    } else if (cp >= 0x10000 && cp <= 0x10FFFF) {

        // supplementary-plane code point: encode as a surrogate pair.
        // Subtract 0x10000 from cp to get a 20-bit number
        // in the range 0..0xFFFFF.
        cp -= 0x10000;

        // add 0xD800 to the number formed by the top 10 bits
        // to give the high (first) surrogate code unit
        var first = ((0xFFC00 & cp) >> 10) + 0xD800;

        // add 0xDC00 to the number formed by the low 10 bits
        // to give the low (second) surrogate code unit
        var second = (0x3FF & cp) + 0xDC00;

        return String.fromCharCode(first) + String.fromCharCode(second);
    }
}
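
For the code point from the question, a quick check of the arithmetic and a usage sketch of the function above:

// 0x10400 - 0x10000 = 0x400
// high 10 bits: 0x400 >> 10 = 0x1, and 0xD800 + 0x1 = 0xD801
// low 10 bits:  0x400 & 0x3FF = 0x0, and 0xDC00 + 0x0 = 0xDC00
var char = getUnicodeCharacter(0x10400); // "\uD801\uDC00"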

How do I find 0xD801 and 0xDC00 from 0x10400?

JavaScript uses UCS-2 internally. That’s why String#charCodeAt() doesn’t work the way you’d want it to.
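
For instance, with the same character U+10400 from the question, length and charCodeAt() see the two surrogate code units rather than one character:

var s = '\uD801\uDC00'; // U+10400
s.length;        // 2
s.charCodeAt(0); // 0xD801 (high surrogate)
s.charCodeAt(1); // 0xDC00 (low surrogate)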

If you want to get the code point of every Unicode character (including non-BMP characters) in a string, you could use Punycode.js’s utility functions to convert between UCS-2 strings and Unicode code points:

// String#charCodeAt() replacement that only considers full Unicode characters
punycode.ucs2.decode('\uD801\uDC00'); // [0x10400]

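For the opposite direction, going from a code point back to the string the question asks for, a minimal sketch assuming Punycode.js is loaded as punycode and using its ucs2.encode helper:

punycode.ucs2.encode([0x10400]); // '\uD801\uDC00'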