Using JavaScript I need to efficiently remove ~10000 keywords from a ~100000 word document, of which ~1000 will be keywords. What approach would you suggest?
Would a massive regular expression be practical? Or should I just iterate through the document characters looking for keywords (boring)?
Edit:
Good point - only whole words, not parts. And some keywords contain spaces.
I am trying to do it all client side to reduce pressure on the backend.
- 1 Interesting question. On one hand, a state machine handwritten in a compiled language would beat the hell out of regex, but on the other, JavaScript itself is quite slow, so you'd need to try and benchmark whether the regex engine is faster due to being compiled. – Max Shawabkeh Commented Feb 3, 2010 at 8:15
- Does it have to be JavaScript or can you push it to the server for transformation? It's hard to say which will be more efficient without some data to test it on. If you're using Python, for example, you can segment the data and thread the process if you really need to. – Justin Johnson Commented Feb 3, 2010 at 8:18
- Are you required to replace only whole words, or parts of words too? E.g. word, keyword, and word-stem all contain 'word'; how must they be treated? – meouw Commented Feb 3, 2010 at 8:34
3 Answers
Using a regular expression might be a good option:
var words = ['bon', 'mad'];
'joe bon joe mad'.replace(new RegExp('(' + words.join('|') + ')', 'g'), '');
// 'joe  joe ' (a double space is left where 'bon' was removed)
The regex (see note 1) isn't complicated by things like look-ahead, and the regexp engine is written in C/C++, so you can expect it to be quite fast. Nevertheless, benchmark and see if the performance fits your needs.
I don't think that implementing your own parser will be faster, but I might be wrong - benchmark.
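For the benchmarking itself, a rough timing harness could look like the sketch below. The helper name `medianTimeMs`, the synthetic document, and the stand-in keyword `kw7` are all made up for illustration; real numbers depend on the browser and on your actual keyword list.

```javascript
// Hypothetical helper: median wall-clock time of fn over several runs, in ms.
function medianTimeMs(fn, runs = 5) {
  // Prefer the high-resolution timer when the environment provides one.
  const now = typeof performance !== 'undefined'
    ? () => performance.now()
    : () => Date.now();
  const times = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    fn();
    times.push(now() - start);
  }
  times.sort((a, b) => a - b);
  return times[Math.floor(times.length / 2)];
}

// Synthetic input (an assumption): 100,000 words, every 100th one is "kw7".
const doc = Array.from({ length: 100000 }, (_, i) =>
  i % 100 === 0 ? 'kw7' : 'word' + i
).join(' ');

const re = /\bkw7\b/g;
const ms = medianTimeMs(() => doc.replace(re, ''));
console.log('median replace time (ms):', ms);
```

Taking the median rather than a single run smooths out JIT warm-up and garbage-collection pauses a little, but it is still only a coarse measurement.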
Sending the document to the server doesn't sound very good to me. With 100k words you are looking at a payload in the megabytes range, and you still have to do something with it on the server and push it back.
1 You might have to tune the regexp to do something with the spaces.
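To cover the "whole words only" and "keywords may contain spaces" requirements from the question, the pattern can be tightened roughly as follows. This is a sketch, not a tuned solution: the function names are made up here, and `\b` only treats ASCII letters, digits and underscore as word characters, so keywords with accented letters or leading/trailing punctuation would need different boundary handling.

```javascript
// Escape regex metacharacters so each keyword is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one alternation wrapped in word boundaries.
// Longer keywords come first so that "mad dog" wins over "mad".
function buildKeywordRegex(keywords) {
  const alternation = keywords
    .slice()
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join('|');
  return new RegExp('\\b(?:' + alternation + ')\\b', 'g');
}

// Remove every whole-word occurrence of any keyword from text.
function removeKeywords(text, keywords) {
  return text.replace(buildKeywordRegex(keywords), '');
}

console.log(removeKeywords('bonus bon', ['bon']));                 // 'bonus '
console.log(removeKeywords('a mad dog ran', ['mad', 'mad dog']));  // 'a  ran'
```

The surrounding spaces are left in place, so a follow-up pass such as `.replace(/ {2,}/g, ' ')` may be wanted if doubled spaces matter.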
My instinct tells me that for such a large number of keywords, sorting the keywords and building a per-character state machine would be much faster than a regular expression. The state machine is trivial, so it can be generated automatically.
State machines seem to be used often for similar tasks, e.g. http://www.codeproject.com/KB/string/civstringset.aspx
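One way to read "per-character state machine" is a trie built from the keywords and walked character by character. The sketch below is my own illustration of that idea, not the code from the linked article; the names `buildTrie` and `removeKeywordsWithTrie` are made up, and "word character" is assumed to mean what `\w` matches. It is not a full Aho-Corasick automaton: it restarts the walk at each word start, which is fine when keywords are short.

```javascript
// Build a per-character trie; each node is one state of the machine.
function buildTrie(keywords) {
  const root = { next: new Map(), end: false };
  for (const word of keywords) {
    let node = root;
    for (let k = 0; k < word.length; k++) {
      const ch = word[k];
      if (!node.next.has(ch)) node.next.set(ch, { next: new Map(), end: false });
      node = node.next.get(ch);
    }
    node.end = true; // a keyword ends in this state
  }
  return root;
}

// Assumption: word characters are those matched by \w.
const isWordChar = (ch) => ch !== undefined && /\w/.test(ch);

function removeKeywordsWithTrie(text, keywords) {
  const root = buildTrie(keywords);
  let out = '';
  let i = 0;
  while (i < text.length) {
    let matchEnd = -1;
    // Only start a match at the beginning of a word.
    if (!isWordChar(text[i - 1])) {
      let node = root;
      let j = i;
      while (j < text.length && (node = node.next.get(text[j]))) {
        j++;
        // Remember the longest keyword that also ends on a word boundary.
        if (node.end && !isWordChar(text[j])) matchEnd = j;
      }
    }
    if (matchEnd >= 0) {
      i = matchEnd; // skip over the keyword
    } else {
      out += text[i++]; // copy one character through
    }
  }
  return out;
}

console.log(removeKeywordsWithTrie('a mad dog ran', ['mad', 'mad dog'])); // 'a  ran'
```

Whether this beats a single large regex in practice is exactly the thing to benchmark; the trie only needs to be built once per keyword list, so in real use it should be cached rather than rebuilt per call as it is here.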