In a page of text, I would like to examine each word. What is the best way to read one word at a time? It is easy to find words that are surrounded by spaces, but once you get into parsing words out of text it can get complicated.
Instead of defining my own way of parsing the words from text, is there something already built that parses out the words, using regular expressions or other methods?
Some examples of words in text:
word word. word(word) word's word word' "word" .word. 'word' sub-word
asked May 25, 2011 at 12:13 by Mike (edited May 25, 2011 at 12:18)
4 Answers
You can use:
const text = "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";
const words = text.match(/[-\w]+/g);
This will give you an array with all your words.
In regular expressions, \w means any character that is either a-z, A-Z, 0-9 or _. [-\w] means any character that is a \w or a -. [-\w]+ means any of these characters appearing 1 or more times.
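To make the behaviour of that expression concrete, here is a minimal sketch applied to the sample line from the question. The helper name `extractWords` is an assumption for illustration and is not part of the answer; it also guards against `match` returning `null` when nothing matches.

```javascript
// Hypothetical helper: returns every run of word characters and hyphens.
function extractWords(text) {
  // With a /g regex, String.prototype.match returns null when nothing matches,
  // so fall back to an empty array.
  return text.match(/[-\w]+/g) || [];
}

const sampleText = "word word. word(word) word's word word' \"word\" .word. 'word' sub-word";
const found = extractWords(sampleText);

console.log(found.length);        // 12
console.log(found[5]);            // "s" (the apostrophe in word's splits the token)
console.log(found[11]);           // "sub-word" (the hyphen is kept)
console.log(extractWords("...")); // []
```

Note that the apostrophe is not in the character class, so word's comes back as two entries, word and s.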
If you would like to define a word as being something more than the above expression, add the other characters that compose your words inside the [-\w] character class. For example, if you'd like words to also contain ( and ), make the character class be [-\w()].
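As a sketch of how extending the character class plays out, the example below adds the apostrophe so that word's stays in one piece. The second pattern is one possible refinement rather than anything the answer prescribes: it allows an apostrophe or hyphen only between two runs of word characters, so surrounding quote marks are dropped. The variable names are assumptions for illustration.

```javascript
// Shortened sample based on the question's examples.
const quoted = "word's word' 'word' sub-word";

// Adding ' to the character class keeps every apostrophe, including quote marks.
const loose = quoted.match(/[-\w']+/g) || [];
console.log(loose);  // ["word's", "word'", "'word'", "sub-word"]

// Alternative: allow ' or - only when word characters follow on both sides.
const strict = quoted.match(/\w+(?:['-]\w+)*/g) || [];
console.log(strict); // ["word's", "word", "word", "sub-word"]
```

Which variant fits depends on whether leading and trailing apostrophes count as part of a word in your text.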
For an introduction to regular expressions, check out the great tutorial at regular-expressions.info.
What you're talking about is tokenisation. It's non-trivial, to say the least, and a subject of intense research at the major search engines. There are a number of open source tokenisation libraries in various server-side languages (e.g., see the Stanford NLP and Lucene projects), but as far as I am aware there's nothing that would even come close to these in JavaScript. You may have to roll your own :) or perhaps do the processing server-side and load the results via AJAX?
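As one built-in option that may be available depending on the runtime, current JavaScript engines provide `Intl.Segmenter`, which performs locale-aware word segmentation. The sketch below assumes an environment that supports it (recent browsers, or Node.js 16 and later); the helper name `segmentWords` is hypothetical, and this is not a replacement for the full tokenisation libraries mentioned above.

```javascript
// Hypothetical helper built on the standard Intl.Segmenter API.
function segmentWords(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const part of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (part.isWordLike) {
      words.push(part.segment);
    }
  }
  return words;
}

console.log(segmentWords("Hello, world!")); // ["Hello", "world"]
```

How hyphens and apostrophes are treated is decided by the engine's segmentation rules, so it is worth checking the output against your own text before relying on it.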
I support Richard's answer here, but to add to it: one of the easiest ways of building a tokeniser (imho) is ANTLR, and some maniac has built a JavaScript target for it, allowing you to run and execute a grammar in the web browser (look under the 'runtime libraries' section here).
I won't pretend that there's not a learning curve there though.
Take a look at regular expressions - you can define almost any parsing algorithm you want.