
To express, for example, the character U+10400 in JavaScript, I use "\uD801\uDC00" or String.fromCharCode(0xD801) + String.fromCharCode(0xDC00). How do I figure that out for a given Unicode character? I want the following:

var char = getUnicodeCharacter(0x10400);

How do I find 0xD801 and 0xDC00 from 0x10400?

asked Aug 19, 2011 at 19:22 by gilly3 (edited Aug 19, 2011 at 22:09)
  • See the wikipedia article on UTF-16. – hmakholm left over Monica Commented Aug 19, 2011 at 19:24
  • I can't believe that this many years later Javascript is still in the Stone Age regarding Unicode. Having only BMP characters was something that should have gone out the door with Unicode 1.1 something like 15 years ago. Why is Javascript still so broken? – tchrist Commented Aug 20, 2011 at 3:48
  • 6 @tchrist: because you can't change a language's basic string model without widespread application breakage. Java, .NET and Windows in general are in the same boat: most of the world is afflicted by the UTF-16 curse. Browser JavaScript has a further hurdle in that the DOM standard also requires strings to be indexed by UTF-16 code units. – bobince Commented Aug 20, 2011 at 10:06
  • @bobince: I agree that the UTF-16 Curse sucks, but it may not be insurmountable. There are still measures that can be taken. You can provide alternate libraries available by explicit declaration that have a code point interface sitting on top the original code unit one. On the other hand, the UCS-2 that afflicts Javascript and many aspects of narrow builds of Python is a scourge, and some of the JVM languages can't make use of the code point interfaces that Java is able to provide if you ask nicely enough. – tchrist Commented Aug 20, 2011 at 10:29
  • 3 String.fromCharCode(0xD801) + String.fromCharCode(0xDC00) can be written as String.fromCharCode(0xD801, 0xDC00). – Mathias Bynens Commented Feb 2, 2012 at 13:08

2 Answers


Based on the Wikipedia article linked by Henning Makholm, the following function returns the correct character for a given code point:

function getUnicodeCharacter(cp) {
    if (cp >= 0 && cp <= 0xD7FF || cp >= 0xE000 && cp <= 0xFFFF) {
        // BMP code point outside the surrogate range: a single code unit
        return String.fromCharCode(cp);
    } else if (cp >= 0x10000 && cp <= 0x10FFFF) {

        // supplementary-plane code point: encode as a surrogate pair.
        // Subtract 0x10000 from cp to get a 20-bit number
        // in the range 0..0xFFFFF.
        cp -= 0x10000;

        // add 0xD800 to the number formed by the top 10 bits
        // to give the high (first) surrogate code unit
        var first = ((0xFFC00 & cp) >> 10) + 0xD800;

        // add 0xDC00 to the number formed by the low 10 bits
        // to give the low (second) surrogate code unit
        var second = (0x3FF & cp) + 0xDC00;

        return String.fromCharCode(first) + String.fromCharCode(second);
    }
}
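
For the code point from the question, a quick check of the arithmetic and a usage sketch of the function above:

// 0x10400 - 0x10000 = 0x400
// high 10 bits: 0x400 >> 10 = 0x1, and 0xD800 + 0x1 = 0xD801
// low 10 bits:  0x400 & 0x3FF = 0x0, and 0xDC00 + 0x0 = 0xDC00
var char = getUnicodeCharacter(0x10400); // "\uD801\uDC00"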

How do I find 0xD801 and 0xDC00 from 0x10400?

JavaScript uses UCS-2 internally. That’s why String#charCodeAt() doesn’t work the way you’d want it to.
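
For instance, with the same character U+10400 from the question, length and charCodeAt() see the two surrogate code units rather than one character:

var s = '\uD801\uDC00'; // U+10400
s.length;        // 2
s.charCodeAt(0); // 0xD801 (high surrogate)
s.charCodeAt(1); // 0xDC00 (low surrogate)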

If you want to get the code point of every Unicode character (including non-BMP characters) in a string, you could use Punycode.js’s utility functions to convert between UCS-2 strings and Unicode code points:

// String#charCodeAt() replacement that only considers full Unicode characters
punycode.ucs2.decode('\uD801\uDC00'); // [0x10400]

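For the opposite direction, going from a code point back to the string the question asks for, a minimal sketch assuming Punycode.js is loaded as punycode and using its ucs2.encode helper:

punycode.ucs2.encode([0x10400]); // '\uD801\uDC00'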