Is there a way to find word boundaries in a Japanese string (e.g. "私はマーケットに行きました。") via JavaScript regular expressions? (The "xregexp" JS library can be used.)

E.g.:

var xr = RegExp("\\bst","g");
xr.test("The string") // --> true

I need the same logic for Japanese strings.

Asked Oct 28, 2011 at 9:49 by Andrei; edited Feb 24, 2013 at 11:53 by hippietrail.
  • I don't understand, what is \\bst? – hippietrail Commented May 11, 2013 at 1:43
  • A way to match the boundaries between Han, Hiragana, and Katakana would help, but would not solve this problem on its own. So far I can't even find a way to match those, even with xregexp. You may be interested in a question I just asked about that: stackoverflow.com/questions/16492933/… – hippietrail Commented May 11, 2013 at 1:47
  • For Japanese it would be better to use a full morphological analyzer. Here's one in JavaScript: github.com/takuyaa/kuromoji.js – katspaugh Commented Oct 1, 2015 at 8:56

2 Answers


The actual problem of separating a Japanese sentence into words is more complicated than it appears, since words are not separated by spaces as they are, for example, in English.

For example, the sentence 私はマーケットに行きました。 ("I went to the market") has the following words:

  • 私 - watakushi
  • は - wa
  • マーケット - maaketto
  • に - ni
  • 行きました - ikimashita
  • 。 - (period)

A reliable parser of Japanese sentences would, among other things, have to find where the particles (wa and ni) lie in the sentence, in order to find the remaining words.
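As a rough illustration of what dictionary-based segmentation looks like in practice, here is a minimal sketch using the standard Intl.Segmenter API. This is a swapped-in technique, not something the answer above proposes, and it assumes a runtime that ships Intl.Segmenter with ICU data (recent browsers or Node.js). The helper name segmentWords is hypothetical, and the exact split depends on the runtime's dictionary, so no particular output is promised.

```javascript
// Sketch only: relies on Intl.Segmenter being available in the runtime.
// segmentWords is a hypothetical helper name, not part of any library.
function segmentWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Each entry is an object with { segment, index, input, isWordLike }.
  return Array.from(segmenter.segment(text));
}

const sentence = "私はマーケットに行きました。";
const parts = segmentWords(sentence, "ja");

// Keep only word-like segments (drops punctuation and whitespace).
const words = parts.filter((p) => p.isWordLike).map((p) => p.segment);
console.log(words); // the exact split depends on the runtime's dictionary
```

The segments always concatenate back to the original string, so the `index` of each entry can be used as a word-boundary position.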

\b, as well as \w and \W, aren't Unicode-aware in JavaScript. You have to define your word boundaries as a specific character set, such as (^|$|[\s.,:\u3002]+) or similar.

\u3002 is from ('。'.charCodeAt(0)).toString(16). It is U+3002 IDEOGRAPHIC FULL STOP, the sentence-ending period in Japanese.

Or, conversely, define a Unicode range of word-constituent letters and negate it:

var boundaries = /(^|$|\s+|[^\u30A0-\u30FF]+)/g;

The example katakana range (U+30A0 to U+30FF, which includes the long-vowel mark ー, U+30FC, used in マーケット) is taken from http://www.unicode.org/charts/PDF/U30A0.pdf.
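The same range idea can be turned around to match runs of one script instead of the gaps between them. The sketch below is an assumption-laden illustration: the name splitByScript is hypothetical, and the hiragana (U+3040 to U+309F) and basic kanji (U+4E00 to U+9FFF) ranges are added here alongside the katakana one; they are not part of the answer above.

```javascript
// One alternative per script: hiragana, katakana, CJK unified ideographs.
// Only the katakana range comes from the answer; the other two are assumptions.
const SCRIPT_RUNS = /[\u3040-\u309F]+|[\u30A0-\u30FF]+|[\u4E00-\u9FFF]+/g;

// Hypothetical helper: split text into maximal runs of a single script.
function splitByScript(text) {
  return text.match(SCRIPT_RUNS) || [];
}

const runs = splitByScript("私はマーケットに行きました。");
console.log(runs);
// ["私", "は", "マーケット", "に", "行", "きました"]
```

This gets マーケット right, but it cuts 行きました into 行 and きました, and it would glue adjacent hiragana words together. That matches the comment on the question: script boundaries help, but do not solve word segmentation on their own.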
