Is there a way to find word boundaries in a Japanese string (e.g. "私はマーケットに行きました。") with JavaScript regular expressions? (The "xregexp" JS library can be used.)
E.g.:
var xr = RegExp("\\bst","g");
xr.test("The string") // --> true
I need the same logic for Japanese strings.
Asked Oct 28, 2011 at 9:49 by Andrei; edited Feb 24, 2013 at 11:53 by hippietrail.
Comments:
- I don't understand, what is \\bst? – hippietrail, May 11, 2013 at 1:43
- A way to match the boundaries between Han, Hiragana, and Katakana would assist but not solve this problem on its own. So far I can't even find a way to match those, even with xregexp. You may be interested in a question I just asked about that: stackoverflow.com/questions/16492933/… – hippietrail, May 11, 2013 at 1:47
- For Japanese it would be better to use a full morphological analyzer. Here's one in JavaScript: github.com/takuyaa/kuromoji.js – katspaugh, Oct 1, 2015 at 8:56
2 Answers
However, the actual problem of separating a Japanese sentence into words is more complicated than it appears, since words are not separated by spaces as they are, for example, in English.
For example, the sentence 私はマーケットに行きました。 ("I went to the market") has the following words:
- 私 - watakushi
- は - wa
- マーケット - maaketto
- に - ni
- 行きました - ikimashita
- 。 - (period)
A reliable parser of Japanese sentences would, among other things, have to find where the particles (wa and ni) lie in the sentence, in order to find the remaining words.
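As a rough illustration of dictionary-based segmentation (rather than a regular expression), here is a minimal sketch using the built-in Intl.Segmenter API. This API is not mentioned in the thread; the sketch assumes a runtime that ships it with ICU data (recent Node.js and browsers), and segmentWords is a hypothetical helper name. The exact segments depend on the runtime's dictionary, so they may differ from the word list above.

```javascript
// Sketch only: relies on Intl.Segmenter being available in the runtime.
// segmentWords is a hypothetical helper, not part of any library.
function segmentWords(text, locale = "ja") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Each segment reports its text, its start offset in the input,
  // and whether the segmenter considers it word-like (vs. punctuation/space).
  return Array.from(segmenter.segment(text), (s) => ({
    text: s.segment,
    start: s.index,
    isWord: s.isWordLike,
  }));
}

const parts = segmentWords("私はマーケットに行きました。");
// The start offsets are the word boundary positions in the original string.
console.log(parts.map((p) => `${p.start}:${p.text}`).join(" | "));
```

The start offsets give the boundary positions directly, which is the information \b would provide for an English string.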
\b, as well as \w and \W, isn't Unicode-aware in JavaScript. You have to define your word boundaries as a specific character set, like (^|$|[\s.,:\u3002]+) or similar.
\u3002 is from ('。'.charCodeAt(0)).toString(16). Is it a punctuation symbol in Japanese?
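To make the point concrete, the sketch below first shows \b failing next to a Japanese character, then splits text on the explicit delimiter set suggested above. splitOnDelimiters is a hypothetical helper name, and the sample strings are only illustrative.

```javascript
// \b only sees boundaries around ASCII word characters [A-Za-z0-9_].
const asciiBoundary = /\bst/.test("The string");                      // true
const japaneseBoundary = /\b私/.test("私はマーケットに行きました。"); // false: 私 is not in \w

// Explicit delimiter set: whitespace, '.', ',', ':' and U+3002 (。).
const delimiters = /[\s.,:\u3002]+/;

// Hypothetical helper: split on the delimiter set and drop empty pieces.
function splitOnDelimiters(text) {
  return text.split(delimiters).filter(Boolean);
}

console.log(splitOnDelimiters("東京。大阪 京都。"));
// A sentence without internal delimiters stays in one piece:
console.log(splitOnDelimiters("私はマーケットに行きました。"));
```

The second call returns the whole sentence as a single piece, which shows the limit of this approach: it only finds boundaries where punctuation or whitespace actually occurs.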
Or, a contrario, define a Unicode range of word-constructing letters and negate it:
// Note: the range must use an ASCII hyphen (-), not an en dash.
var boundaries = /(^|$|\s+|[^\u30A0-\u30FA]+)/g;
The example katakana range is taken from http://www.unicode.org/charts/PDF/U30A0.pdf.
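Extending the same range idea to the other scripts, the following sketch splits a string into runs of Han, Hiragana, Katakana, and everything else. The block ranges (U+4E00 to U+9FFF, U+3040 to U+309F, U+30A0 to U+30FF) are assumptions chosen for illustration and do not cover every CJK character; splitByScript is a hypothetical helper name.

```javascript
// One alternative per script block, plus a catch-all for anything else.
const scriptRuns =
  /[\u4E00-\u9FFF]+|[\u3040-\u309F]+|[\u30A0-\u30FF]+|[^\u3040-\u30FF\u4E00-\u9FFF]+/g;

// Hypothetical helper: return consecutive runs of same-script characters.
function splitByScript(text) {
  return text.match(scriptRuns) || [];
}

console.log(splitByScript("私はマーケットに行きました。"));
// → ["私", "は", "マーケット", "に", "行", "きました", "。"]
```

Script changes are only a heuristic: the run boundaries happen to isolate 私, は, マーケット and に here, but 行きました is cut into 行 and きました, so this helps without solving the problem on its own.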