I want to match this character in the African Yoruba language 'ẹ́'. Usually this is made by combining an 'é' with a '\u0323' under dot diacritic. I found that:

'é\u0323'.match(/[é]\u0323/) works but
'ẹ́'.match(/[é]\u0323/) does not work.

I don't just want to match e. I want to match all combinations. Right now, my solution involves enumerating all combinations. Like so: /[ÁÀĀÉÈĒẸE̩Ẹ́É̩Ẹ̀È̩Ẹ̄Ē̩ÍÌĪÓÒŌỌO̩Ọ́Ó̩Ọ̀Ò̩Ọ̄Ō̩ÚÙŪṢS̩áàāéèēẹe̩ẹ́é̩ẹ̀è̩ẹ̄ē̩íìīóòōọo̩ọ́ó̩ọ̀ò̩ọ̄ō̩úùūṣs̩]/

Could there not be a shorter and thus better way to do this, or does regex matching in JavaScript of Unicode diacritic combining characters not work this easily? Thank you

asked Jun 28, 2013 at 5:20 by user2530580
  • If I have to be honest, I'd much rather read and maintain that short string of chars than decrypt and understand the \uxxxx part of a possible more clever regex. Using a lookup table will always be faster than calculating a char first. A possible way if the regex fails you is to render the char in a span and then compare – mplungjan Commented Jun 28, 2013 at 5:23
  • That's a good point. Maybe the current way is better. – user2530580 Commented Jun 28, 2013 at 5:27
  • I ended up going with the \uxxxx part because editing it in vim made a lot more sense when there weren't varying width unicode points everywhere with differing flow directions doing quite wonderful things with the cursor position: its position basically became a random variable. – user2530580 Commented Apr 28, 2015 at 15:38

2 Answers

Normally the solution would be to use Unicode properties and/or scripts, but JavaScript engines older than ES2018 do not support them natively.

But there exists the lib XRegExp that adds this support. With this lib you can use

\p{L}: to match any kind of letter from any language.

\p{M}: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).

So your character class would look like this:

[\p{L}\p{M}]+

that would match all possible letters that are in the Unicode table.

If you want to limit it, have a look at Unicode scripts and replace \p{L} with a script, which collects all the letters used by certain languages, e.g. \p{Latin} for all Latin letters or \p{Cyrillic} for all Cyrillic letters.
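As a rough sketch of the same character class without the library: engines that implement ES2018 accept \p{...} escapes natively when the regex carries the u flag. That runtime support is an assumption here, and the sample word is made up for illustration. Note that the native spelling for a script is \p{Script=Latin}.

```javascript
// Assumes an ES2018+ engine: \p{...} escapes only work with the `u` flag.
const lettersAndMarks = /[\p{L}\p{M}]+/u;

// Combining marks such as U+0301 are Script=Inherited, not Latin,
// so \p{M} stays in the class next to the script.
const latinAndMarks = /[\p{Script=Latin}\p{M}]+/u;

// Hypothetical sample: e-dot-below + acute, k, o-dot-below + acute.
const yorubaWord = '\u1EB9\u0301k\u1ECD\u0301';

// Digits and spaces are neither letters nor marks, so only the word matches.
const anyLetters = ('12 ' + yorubaWord + ' 34').match(lettersAndMarks)[0];

// The leading Cyrillic letters are skipped by the Latin-only class.
const latinOnly = ('\u041F\u0440\u0438 ' + yorubaWord).match(latinAndMarks)[0];

console.log(anyLetters === yorubaWord, latinOnly === yorubaWord);
```

Both comparisons should hold, because every code point in the sample word is either a Latin letter or a combining mark.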

Usually this is made by combining an 'é' with a '\u0323' under dot diacritic

However, that isn't what you have here:

'ẹ́'

that's not U+00E9,U+0323 but U+1EB9,U+0301 - combining an ẹ (e with dot below) with an acute diacritic.
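One way to check which sequence a string really contains is to print its code points. The helper name codePoints and both sample strings are illustrative only; they are written with escapes so the editor cannot silently change them.

```javascript
// List the code points of a string as U+XXXX labels.
function codePoints(text) {
  return Array.from(text, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));
}

const eAcuteThenDot = '\u00E9\u0323'; // é followed by combining dot below
const eDotThenAcute = '\u1EB9\u0301'; // ẹ followed by combining acute

console.log(codePoints(eAcuteThenDot).join(',')); // U+00E9,U+0323
console.log(codePoints(eDotThenAcute).join(',')); // U+1EB9,U+0301
```

The two strings render the same but share no code point, which is why a pattern written for one sequence does not match the other.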

The usual solution would be to normalise each string (typically to Unicode Normal Form C) before doing the comparison.
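A minimal sketch of that idea, assuming the runtime provides String.prototype.normalize (an ES2015 addition; older engines would need a polyfill). The variable and function names are hypothetical.

```javascript
// Assumes String.prototype.normalize (ES2015) is available.
const inputA = '\u00E9\u0323'; // é + combining dot below
const inputB = '\u1EB9\u0301'; // ẹ + combining acute

// The pattern itself is written in NFC form.
const nfcPattern = /\u1EB9\u0301/;

// Normalise the subject to NFC before testing it against the pattern.
function matchesAfterNfc(text) {
  return nfcPattern.test(text.normalize('NFC'));
}

console.log(inputA === inputB);                                   // false
console.log(inputA.normalize('NFC') === inputB.normalize('NFC')); // true
console.log(matchesAfterNfc(inputA), matchesAfterNfc(inputB));    // true true
```

Normalising both the pattern source and the subject to the same form is the design choice that matters; NFC is used here only because it is the form named above.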

I don't just want to match e. I want to match all combinations

Matching without diacriticals is typically done by normalising to Normal Form D and removing all the combining diacritical characters.
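That approach could look roughly like this, again assuming String.prototype.normalize exists in the runtime. The function name stripDiacritics is made up, and the range U+0300 to U+036F (the Combining Diacritical Marks block) is an assumption about which marks occur in the text.

```javascript
// Decompose to NFD, then drop everything in the Combining Diacritical
// Marks block (U+0300..U+036F). Marks outside that block are kept.
function stripDiacritics(text) {
  return text.normalize('NFD').replace(/[\u0300-\u036F]/g, '');
}

// Hypothetical sample: O-dot-below + grave, s-dot-below, "un".
const accented = '\u1ECC\u0300\u1E63un';
const plain = stripDiacritics(accented);

console.log(plain);               // Osun
console.log(/^osun$/i.test(plain)); // true
```

After stripping, an ordinary ASCII pattern is enough to match the base letters, at the cost of no longer distinguishing tones or dotted letters.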

Unfortunately normalisation is not available in older JS engines (String.prototype.normalize was only added in ES2015), so there you would have to drag in code to do it, which would have to include a large Unicode data table. One such effort is unorm. For picking up characters based on Unicode properties like being a combining diacritical, you'd also need a regexp engine with support for the Unicode database, such as XRegExp Unicode Categories.
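Where the engine does ship both features (ES2015 normalize and ES2018 property escapes, which is an assumption about your runtime), picking up a base character together with its combining marks might be sketched as below. The function name baseWithMarks is hypothetical.

```javascript
// Split text into "base character + following combining marks" units.
// \P{M} is any code point that is not a mark; \p{M}* takes the marks after it.
// A mark with no preceding base character is skipped by this pattern.
function baseWithMarks(text) {
  return text.normalize('NFD').match(/\P{M}\p{M}*/gu) || [];
}

// Hypothetical sample: e-dot-below + acute, k, o-dot-below + acute.
const sampleWord = '\u1EB9\u0301k\u1ECD\u0301';
const units = baseWithMarks(sampleWord);

console.log(units.length);                  // 3
console.log(units[0] === 'e\u0323\u0301');  // true
```

Each unit comes back in decomposed form, so the base letter is always its first code point, which avoids enumerating every letter and mark combination by hand.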

Server-side languages (eg Python, .NET) typically have native support for Unicode normalisation, so if you can do the processing on the server that would generally be easier.
