I have a UTF-8 encoded string that comes from an ajax response, and I want to get the substring of that string up to the first comma. For the string "Привет, мир" it would be "Привет".

Will this work and not run into "multibyte-ness" issues?

var i = text.indexOf(',');
if (i != -1) text = text.substr(0, i);

Or is it better to use split?

asked May 24, 2013 at 15:29 by galymzhan
  • Your code is 100% correct. This is one of the features of both UTF-16 and UTF-8: even though the index of a string is a code unit and not a code point, indexOf and similar searches will never match a comma (or any other ASCII character below 128) inside the encoding of a different character. More information on utf8everywhere – Pavel Radzivilovsky, commented Jun 2, 2013 at 18:03

2 Answers


JavaScript treats strings by characters, not by bytes.
As such, yes, that's fine from an encoding/string handling standpoint.
You may treat strings in JavaScript as not having any particular encoding, but as a string of characters.

> "漢字".substr(1)
  "字"

Note that the above is only a simplification, though. As pointed out in the comments, JavaScript treats strings as sequences of 16-bit code units. This lets you treat strings "by character" for the majority of common characters, but for characters that are encoded with more than 2 bytes in UTF-16, or characters composed of more than one code point, this abstraction breaks down.
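
A minimal sketch of where the abstraction holds and where it breaks, assuming an ES2015 or later engine. The helper name `beforeFirstComma` is made up for illustration and is not part of the question:

```javascript
// Hypothetical helper: returns the part of `text` before the first comma,
// or the whole string if there is no comma.
function beforeFirstComma(text) {
  const i = text.indexOf(',');
  return i === -1 ? text : text.slice(0, i);
}

// Cyrillic letters are single UTF-16 code units, so nothing special happens.
const cyrillic = beforeFirstComma('Привет, мир'); // "Привет"

// U+1F600 (😀) lies outside the Basic Multilingual Plane and takes two code
// units (a surrogate pair). The comma search still cuts at the right place,
// because a comma can never be one half of a surrogate pair.
const emoji = beforeFirstComma('\u{1F600}\u{1F600}, мир');

// Where the abstraction breaks: cutting at an arbitrary code-unit index.
const s = '\u{1F600}';
const unitCount = s.length;              // 2 code units
const pointCount = Array.from(s).length; // 1 code point
const broken = s.slice(0, 1);            // lone high surrogate, not a full character
```

So cutting at an index returned by `indexOf(',')` is safe, while cutting at a fixed numeric offset may split a surrogate pair.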

UTF-8 uses only byte values of 128 and above to encode characters other than ASCII, so an ASCII comma is never part of a multibyte sequence.
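
This can be checked at the byte level. The sketch below assumes an environment where `TextEncoder` and `TextDecoder` are available as globals (browsers and recent Node.js versions); the variable names are illustrative only:

```javascript
// TextEncoder always produces UTF-8 bytes.
const bytes = new TextEncoder().encode('Привет, мир');

// Every byte of a multibyte sequence is 0x80 or higher, so the only bytes
// below 0x80 here are the real ASCII characters: the comma and the space.
const asciiBytes = Array.from(bytes).filter((b) => b < 0x80);

// Searching the raw bytes for 0x2C (',') therefore finds the real comma.
// It sits at byte offset 12, after six 2-byte Cyrillic letters.
const commaAt = bytes.indexOf(0x2C);

// Cutting at that byte offset yields valid UTF-8 that decodes cleanly.
const prefix = new TextDecoder().decode(bytes.subarray(0, commaAt));
```

Here `asciiBytes` contains just `0x2C` and `0x20`, and `prefix` decodes back to "Привет".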

Tags: javascript. Title: Getting substring without messing up UTF8 string (Stack Overflow)