I have two strings in JavaScript: "_strange_chars_µö¬é@zendesk.eml" (f1) and "_strange_chars_µö¬é@zendesk.eml" (f2). At first glance, they look identical (and, indeed, on Stack Overflow, they may be; I'm not sure what happens when they are pasted into a form like this.) In my application, however,
f1[16] // ö
f2[16] // o
f1[17] // ¬
f2[17] // ̈
That is, where f1 uses the ö character, f2 uses an o and a diacritic ¨ as a separate character. What comparison can I do that will show these two strings to be "equal"?
- 4 One solution -- perhaps the only one -- would be to "canonicalize" (in the Unicode sense) the two strings, but I haven't been able to find a library or function for that yet. – James A. Rosen Commented Aug 17, 2011 at 18:53
- 1 Are you sure that you have declared UTF-8 in your meta tags? – cwallenpoole Commented Aug 17, 2011 at 18:56
- Great question, @cwallenpoole. I'm not, but I'll double-check now. The two strings I've described definitely can both be valid Unicode, but I'm not certain they are. – James A. Rosen Commented Aug 17, 2011 at 19:02
- @cwallenpoole the page declares <meta charset="utf-8"> and the form (a file input is the source of the first string) declares accept-charset="UTF-8". And, of course, the HTTP request and response are also UTF-8. I think this is just a case of different systems (browser vs. server) using different Unicode canonicalization. (Or using versus not using canonicalization.) – James A. Rosen Commented Aug 17, 2011 at 19:13
1 Answer
> f1 uses the ö character, f2 uses an o and a diacritic ¨ as a separate character.
f1 is in Normalization Form C (NFC, composed) and f2 is in Normalization Form D (NFD, decomposed). In general NFC is the most common on Windows and the web, with the Unicode FAQ describing it as “the best form for general text”. Unfortunately the Apple world plumped for NFD in order to be gratuitously different.
The strings are canonically equivalent by the rules of Unicode equivalence.
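To see why a plain comparison fails even though the strings are canonically equivalent, it can help to look at the UTF-16 code units directly. The following is a minimal sketch; it uses escape sequences rather than pasted characters so that neither form can be silently altered in transit, and the helper name `codeUnits` is made up for illustration.

```javascript
// Hypothetical helper: list a string's UTF-16 code units as 4-digit hex.
function codeUnits(s) {
  return Array.from({ length: s.length }, (_, i) =>
    s.charCodeAt(i).toString(16).padStart(4, "0")
  );
}

const composed = "\u00F6";    // ö as a single precomposed code point (NFC)
const decomposed = "o\u0308"; // o followed by U+0308 COMBINING DIAERESIS (NFD)

console.log(codeUnits(composed));     // [ '00f6' ]
console.log(codeUnits(decomposed));   // [ '006f', '0308' ]
console.log(composed === decomposed); // false: === compares code units
```

Both values render as the same glyph, but `===` only sees the differing code unit sequences.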
> What comparison can I do that will show these two strings to be "equal"?
In general, you convert both strings to one Normalization Form of your choosing and then compare them. For example in Python:
>>> import unicodedata
>>> a = u'\u00F6'  # ö composed
>>> b = u'o\u0308'  # o then combining umlaut (U+0308)
>>> unicodedata.normalize('NFC', a)==unicodedata.normalize('NFC', b)
True
Similarly Java has the Normalizer class, .NET has String.Normalize, and many languages have bindings available to the ICU library, which also offers this feature.
Unfortunately, when this answer was written (2011), JavaScript had no native Unicode normalisation ability; ES2015 has since added String.prototype.normalize(). In an environment without that method, this means either:
- doing it yourself, carting around large Unicode data tables to cover it all in JavaScript (example implementations exist); or
- sending it back to the server-side (e.g. via XMLHttpRequest), where you've got a better-equipped language to do it.
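In engines that support ES2015's `String.prototype.normalize()`, the same approach as the Python example can be written directly in JavaScript. The following is a minimal sketch under that assumption. `unicodeEquals` is a hypothetical helper name, not a built-in, and the decomposed é in `f2` is assumed for illustration (the question only confirms the decomposed ö).

```javascript
// Hypothetical helper: compare two strings by canonical equivalence.
// Assumes an engine with String.prototype.normalize (ES2015 or later).
function unicodeEquals(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

// Escapes are used so the composed/decomposed forms survive copy and paste.
const f1 = "_strange_chars_\u00B5\u00F6\u00AC\u00E9@zendesk.eml";   // composed
const f2 = "_strange_chars_\u00B5o\u0308\u00ACe\u0301@zendesk.eml"; // decomposed

console.log(f1 === f2);             // false
console.log(unicodeEquals(f1, f2)); // true
```

Either NFC or NFD works for the comparison, as long as both sides are converted to the same form.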