On a page of text, I would like to examine each word. What is the best way to read one word at a time? It is easy to find words that are surrounded by spaces, but once you get into parsing words out of text it can get complicated.

Instead of defining my own way of parsing the words from text, is there something already built that parses out the words, using regular expressions or other methods?

Some examples of words in text:

  word word. word(word) word's word word' "word" .word. 'word' sub-word 

Asked May 25, 2011 at 12:13 by Mike; edited May 25, 2011 at 12:18.

4 Answers


You can use:

const text = "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";
const words = text.match(/[-\w]+/g);

This will give you an array with all your words.

In regular expressions, \w means any character that is a-z, A-Z, 0-9 or _. [-\w] means any character that is a \w or a -. [-\w]+ means any of these characters appearing one or more times.
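To make the behaviour concrete, here is a small sketch that runs the pattern above over the sample text from the question. The variable names are only illustrative.

```javascript
// The sample text from the question.
const sampleText = "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";

// [-\w]+ : one or more characters that are letters, digits, underscores or hyphens.
// The g flag makes match() return every occurrence rather than only the first.
const basicWords = sampleText.match(/[-\w]+/g);

console.log(basicWords);
console.log(basicWords.length); // 12
```

Note that the apostrophe is not part of the character class, so word's comes back as two entries, "word" and "s", while sub-word stays in one piece.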

If you would like to define a word as something more than the above expression, add the other characters that compose your words to the [-\w] character class. For example, if you'd like words to also contain ( and ), make the character class [-\w()].
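As an illustration of extending the class, the sketch below adds the apostrophe so that word's stays whole. It also shows one possible stricter variant that only accepts an apostrophe or hyphen between word characters; that second pattern is an assumption of this example, not part of the answer.

```javascript
const quotedText = "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";

// Apostrophe added to the character class, as the answer suggests for extra characters.
// Leading and trailing quotes are kept too: "word'" and "'word'" appear in the result.
const withApostrophes = quotedText.match(/[-\w']+/g);

// Hypothetical stricter variant: word characters, optionally followed by
// groups of (apostrophe or hyphen + more word characters).
const strictWords = quotedText.match(/\w+(?:['-]\w+)*/g);

console.log(withApostrophes);
console.log(strictWords);
```

Which variant fits depends on whether surrounding quotes should count as part of a word; the stricter pattern drops them, at the cost of a slightly longer expression.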

For an introduction to regular expressions, check out the great tutorial at regular-expressions.info.

What you're talking about is tokenisation. It's non-trivial to say the least, and a subject of intense research at the major search engines. There are a number of open source tokenisation libraries in various server-side languages (e.g., see the Stanford NLP and Lucene projects), but as far as I am aware there's nothing that would even come close to these in JavaScript. You may have to roll your own :) or perhaps do the processing server-side and load the results via AJAX?
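Since this answer was written, JavaScript runtimes have gained a built-in `Intl.Segmenter`, which splits text along Unicode word boundaries. The following is a minimal sketch, assuming a runtime that supports it (recent browsers, Node.js 16 or later); the helper name `segmentWords` is made up for this example.

```javascript
// Word-granularity segmenter for English text.
const segmenter = new Intl.Segmenter("en", { granularity: "word" });

// Hypothetical helper: collect only the segments flagged as word-like,
// skipping whitespace and punctuation segments.
function segmentWords(input) {
  const result = [];
  for (const { segment, isWordLike } of segmenter.segment(input)) {
    if (isWordLike) result.push(segment);
  }
  return result;
}

console.log(segmentWords("It's a sub-word test."));
```

Its notion of a word follows Unicode rules rather than a hand-written pattern, so results can differ from the regex approach; hyphenated words, for instance, may be split into their parts.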

I support Richard's answer here, but to add to it: one of the easiest ways of building a tokeniser (imho) is ANTLR, and some maniac has built a JavaScript target for it, which lets you run a grammar in the web browser (look under the 'runtime libraries' section here).

I won't pretend that there's not a learning curve there though.

Take a look at regular expressions - you can define almost any parsing algorithm you want.
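For examining words one at a time, a regular expression can also be used to walk through the text rather than collect an array up front. This is a rough sketch using `String.prototype.matchAll` (ES2020) with the [-\w]+ pattern from the first answer; the sample string and variable names are invented for illustration.

```javascript
const pageText = "First word, second-word; third.";
const wordPattern = /[-\w]+/g; // matchAll requires the g flag

// Visit each word in order, recording where it starts in the text.
const found = [];
for (const match of pageText.matchAll(wordPattern)) {
  found.push({ word: match[0], index: match.index });
}

console.log(found);
```

Each entry carries the matched word and its starting offset, which is handy if the surrounding text needs to be inspected or highlighted later.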

Tags: How to parse for a word in text in JavaScript - Stack Overflow