Using JavaScript I need to efficiently remove ~10000 keywords from a ~100000 word document, of which ~1000 will be keywords. What approach would you suggest?

Would a massive regular expression be practical? Or should I just iterate through the document characters looking for keywords (boring)?

Edit:
Good point - only whole words, not parts. And some keywords contain spaces.
I am trying to do it all client side to reduce pressure on the backend.

asked Feb 3, 2010 at 8:07 by hoju, edited Feb 3, 2010 at 13:53
  • Interesting question. On one hand, a state machine handwritten in a compiled language would beat the hell out of regex, but on the other, JavaScript itself is quite slow, so you'd need to try and benchmark whether the regex engine is faster due to being compiled. – Max Shawabkeh Commented Feb 3, 2010 at 8:15
  • Does it have to be JavaScript or can you push it to the server for transformation? It's hard to say which will be more efficient without some data to test it on. If you're using Python, for example, you can segment the data and thread the process if you really need to. – Justin Johnson Commented Feb 3, 2010 at 8:18
  • Are you required to replace only whole words, or parts of words too? For example, word, keyword and word-stem all contain 'word'; how must they be treated? – meouw Commented Feb 3, 2010 at 8:34

3 Answers

Using a regular expression might be a good option:

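// Join the keywords into one alternation and strip every match in a single pass.
var words = ['bon', 'mad'];
'joe bon joe mad'.replace(new RegExp('(' + words.join('|') + ')', 'g'), '');
// 'joe  joe ' (the spaces around each removed word are left behind)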

The regex [1] isn't very complicated (it uses nothing like look-ahead), and the regexp engine is written in C/C++, so you can expect it to be quite fast. Nevertheless, benchmark it and see whether the performance fits your needs.

I don't think that implementing your own parser will be faster, but I might be wrong - benchmark.
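A rough way to run such a benchmark is sketched below. Everything in it is an assumption made for illustration: the `benchmarkRegex` helper, the synthetic `kw…`/`word…` vocabulary, and the sizes, which are a tenth of those in the question so that it finishes quickly. Raise them to 10000 and 100000 to approximate the real workload, and swap in real data if you have it, since timings vary a lot between browsers.

```javascript
// Hypothetical helper: builds synthetic keywords and a synthetic document,
// then times one regex-based removal pass over it.
function benchmarkRegex(keywordCount, wordCount) {
  const keywords = [];
  for (let i = 0; i < keywordCount; i++) keywords.push('kw' + i);

  const words = [];
  for (let i = 0; i < wordCount; i++) {
    // Every 100th word is a keyword, roughly the 1% hit rate in the question.
    words.push(i % 100 === 0 ? keywords[i % keywordCount] : 'word' + i);
  }
  const doc = words.join(' ');

  const start = Date.now();
  // Compiling the regex is part of the cost, so it is timed as well.
  const regex = new RegExp('\\b(?:' + keywords.join('|') + ')\\b', 'g');
  const cleaned = doc.replace(regex, '');
  const elapsedMs = Date.now() - start;

  return { elapsedMs: elapsedMs, cleaned: cleaned };
}

const result = benchmarkRegex(1000, 10000);
console.log('regex pass took ' + result.elapsedMs + ' ms');
```

Any alternative implementation can be timed the same way by replacing the three lines between the two `Date.now()` calls.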

Sending the document to the server doesn't sound very good to me. With 100k words you are looking at a payload in the megabytes range, and you still have to do something with it on the server and push it back.


[1] You might have to tune the regexp to deal with the spaces.
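One possible way to tune the regexp for the requirements added in the question (whole words only, keywords that may contain spaces) is sketched here. The helper names `escapeRegExp`, `buildKeywordRegex` and `removeKeywords` are made up for this example, and it assumes every keyword starts and ends with a letter, digit or underscore, because `\b` only marks a boundary next to such characters.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: one regex that matches any keyword as a whole word.
function buildKeywordRegex(keywords) {
  const alternatives = keywords
    .slice()
    // Longest first, so 'new york' is tried before a shorter 'new'.
    .sort(function (a, b) { return b.length - a.length; })
    // Let a space inside a keyword match any run of whitespace.
    .map(function (k) { return escapeRegExp(k).replace(/\s+/g, '\\s+'); });
  // Non-capturing group wrapped in word boundaries.
  return new RegExp('\\b(?:' + alternatives.join('|') + ')\\b', 'g');
}

function removeKeywords(text, keywords) {
  return text.replace(buildKeywordRegex(keywords), '');
}

const cleaned = removeKeywords('joe bon joe mad bonbon new york',
                               ['bon', 'mad', 'new york']);
console.log(JSON.stringify(cleaned)); // "joe  joe  bonbon "
```

Note that 'bonbon' survives because neither 'bon' inside it sits on a word boundary. The surrounding spaces are still left behind; a follow-up such as `cleaned.replace(/\s{2,}/g, ' ')` would collapse them if that matters.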

My instinct tells me that for such a large number of keywords, sorting the keywords and building a per-character state machine would be much faster than a regular expression. Since the state machine is trivial, it can be generated automatically.

State machines seem to be used often for similar tasks, e.g. http://www.codeproject.com/KB/string/civstringset.aspx
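A minimal sketch of that idea in JavaScript follows, using a character trie as the state machine. It is an illustration, not the implementation from the linked article: `buildTrie`, `isWordChar` and `removeKeywordsWithTrie` are hypothetical names, keywords are assumed to start with a letter, digit or underscore, and spaces inside keywords are matched literally. Whether it beats the regex engine is exactly what a benchmark would have to show.

```javascript
// Build a character trie: each node maps a character to a child node,
// and `end` marks the last character of a complete keyword.
function buildTrie(keywords) {
  const root = { next: new Map(), end: false };
  for (const keyword of keywords) {
    let node = root;
    for (let k = 0; k < keyword.length; k++) {
      const ch = keyword[k];
      if (!node.next.has(ch)) node.next.set(ch, { next: new Map(), end: false });
      node = node.next.get(ch);
    }
    node.end = true;
  }
  return root;
}

// `undefined` (before the start or past the end of the text) is not a word character.
function isWordChar(ch) {
  return ch !== undefined && /[A-Za-z0-9_]/.test(ch);
}

function removeKeywordsWithTrie(text, keywords) {
  const root = buildTrie(keywords);
  const pieces = [];
  let copyFrom = 0; // start of the text not yet copied to the output
  let i = 0;
  while (i < text.length) {
    let matchEnd = -1;
    // Only attempt a match at the first character of a word.
    if (isWordChar(text[i]) && !isWordChar(text[i - 1])) {
      let node = root;
      let j = i;
      while (j < text.length && (node = node.next.get(text[j]))) {
        j++;
        // Remember the longest keyword that ends on a word boundary.
        if (node.end && !isWordChar(text[j])) matchEnd = j;
      }
    }
    if (matchEnd >= 0) {
      pieces.push(text.slice(copyFrom, i)); // keep the text before the keyword
      i = matchEnd;                         // skip the keyword itself
      copyFrom = i;
    } else {
      i++;
    }
  }
  pieces.push(text.slice(copyFrom));
  return pieces.join('');
}

console.log(JSON.stringify(
  removeKeywordsWithTrie('joe bon joe mad bonbon new york', ['bon', 'mad', 'new york'])
)); // "joe  joe  bonbon "
```

The trie is built once and can be reused across documents, and each scan advances from a word start only as far as some keyword prefix matches, so the cost per position does not grow with the number of keywords.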
