
How do I use regex to match any word (\w) except a list of certain words? For example:

I want to match the words use and utilize and any word that follows them, except when that word is something or fish.

use this  <-- match
utilize that  <-- match
use something   <-- don't want to match this
utilize fish <-- don't want to match this

How do I specify a list of words I don't want to match against?


asked Jan 13, 2012 by Jake Wilson; edited Nov 24, 2016 by Todd A. Jacobs
  • What language? It matters. – Todd A. Jacobs, Jan 15, 2016

4 Answers


You can use a negative lookahead to assert that the word you are about to match is not one of a given set. The following regex does this:

(use|utilize)\s(?!fish|something)(\w+)

This will match "use" or "utilize" followed by a space, and then if the following word is not "fish" or "something", it will match that next word.
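
As a quick illustration (my sketch, not part of the original answer), here is how that lookahead behaves on the question's sample inputs in Python:

import re

# Pattern from the answer above; the lookahead rejects "fish"/"something".
pattern = re.compile(r"(use|utilize)\s(?!fish|something)(\w+)")

for line in ["use this", "utilize that", "use something", "utilize fish"]:
    m = pattern.search(line)
    print(line, "->", m.groups() if m else "no match")

# use this -> ('use', 'this')
# utilize that -> ('utilize', 'that')
# use something -> no match
# utilize fish -> no match

Note that, as written, the lookahead also rejects words that merely start with "fish" or "something" (e.g. "fishing"); writing it as (?!(?:fish|something)\b) limits the exclusion to whole words.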

This should do it:

/(?:use|utilize)\s+(?!something|fish)\w+/

Don't Hard-Code Your Regular Expressions

Rather than trying to put all your search and exclusion terms into a single, hard-coded regular expression, it's often more maintainable (and certainly more readable) to use short-circuit evaluation to select strings that match desirable terms, and then reject strings that contain undesirable terms.

You can then encapsulate this testing into a function that returns a Boolean value based on the run-time values of your arrays. For example:

'use strict';

// Join arrays of terms with the alternation operator.
var searchTerms   = new RegExp(['use', 'utilize'].join('|'));
var excludedTerms = new RegExp(['fish', 'something'].join('|'));

// Return true if a string contains only valid search terms without any
// excluded terms.
var isValidStr = function (str) {
    return (searchTerms.test(str) && !excludedTerms.test(str));
};

isValidStr('use fish');       // false
isValidStr('utilize hammer'); // true

Some people, when confronted with a problem, think “I know, I'll use regular expressions.”

Now they have two problems.

Regular expressions are suited to matching regular sequences of symbols, not words. Any lexer+parser would be much more suitable. For example, the grammar for this task would look very simple in ANTLR. If you can't afford the learning curve of lexers/parsers (they are pretty easy for your given task), then splitting your text into words with a regular expression and then doing a simple search with a look-ahead of one token is enough.

Regular expressions with words get very complex very fast. They are hard to read and hard to maintain.

Update: Thanks for all the downvotes. Here's an example of what I meant.

import re

def Tokenize(text):
    # Split the text into a flat list of word tokens.
    return re.findall(r"\w+", text)

def ParseWhiteListedWordThenBlackListedWord(tokens, whiteList, blackList):
    # Yield each (word, next_word) pair where the word is white-listed
    # and the word immediately after it is not black-listed.
    for i in range(len(tokens) - 1):
        if tokens[i] in whiteList and tokens[i + 1] not in blackList:
            yield (tokens[i], tokens[i + 1])
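
For completeness, a hypothetical usage sketch (not in the original answer) calling these two functions with the question's word lists:

# Hypothetical usage with the question's word lists.
tokens = Tokenize("use this, utilize that, use something, utilize fish")
pairs = ParseWhiteListedWordThenBlackListedWord(
    tokens, whiteList={"use", "utilize"}, blackList={"something", "fish"})
print(list(pairs))  # [('use', 'this'), ('utilize', 'that')]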

Here is some performance testing:

>>> timeit.timeit( 'oldtime()', 'from __main__ import oldtime', number=1 )
0.02636446265387349

>>> timeit.timeit( 'oldtime()', 'from __main__ import oldtime', number=1000 )
28.80968123656703

>>> timeit.timeit( 'newtime()', 'from __main__ import newtime', number=100 )
44.24506212427741

>>> timeit.timeit( 'newtime11()', 'from __main__ import newtime11', number=1 ) +
timeit.timeit( 'newtime13()', 'from __main__ import newtime13', number=1000 )
103.07938725936083

>>> timeit.timeit( 'newtime11()', 'from __main__ import newtime11', number=1 ) + 
timeit.timeit( 'newtime12()', 'from __main__ import newtime12', number=1000 )
0.3191265909927097

Some notes: testing was done over the English text of Pride and Prejudice by Jane Austen; the first words were 'Mr' and 'my', the second words 'Bennet' and 'dear'.

oldtime() is the regular expression. newtime() is the Tokenizer+Parser; mind that it was run 100 times, not 1000, so a comparable time for it would be ~442.

The next two tests simulate repeated runs of the Parser over the same text, reusing the Tokenizer results.

newtime11() is Tokenizer only. newtime13() is Parser with results converted to list (to simulate traversal of the results). newtime12() is just Parser.

Well, regular expressions are faster, by quite a lot for a single pass, even against the generator version (with the Tokenizer+Parser, the bulk of the time is spent tokenizing the text). But the generator approach is extremely fast when you can reuse the tokenized text and evaluate the parser results lazily.
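
The timing harness itself is not shown above, but as a rough, hypothetical sketch of the reuse pattern being measured (assuming the book's text has already been loaded into a string named text, and using the word lists from the test description):

tokens = Tokenize(text)  # tokenize once: this is the expensive step

for _ in range(1000):
    # Creating the generator is cheap; nothing is scanned until the results
    # are consumed (e.g. by list(pairs), which the list-conversion timing simulates).
    pairs = ParseWhiteListedWordThenBlackListedWord(tokens, ['Mr', 'my'], ['Bennet', 'dear'])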

There is quite a bit of performance optimization possible, but it would complicate the solution, possibly to the point where regular expressions become the best implementation.

The tokenizer+parser approach has both advantages and disadvantages:

- the structure of the solution is more complex (more elements), but each element is simpler
- elements are easy to test, including automatic testing
- it IS slow, but it gets better with reusing the same text and evaluating results lazily
- due to generators and lazy evaluation, some work may be avoided
- it is trivial to change the white list and/or black list
- it is trivial to have several white lists, several black lists and/or their combinations
- it is trivial to add new parsers reusing tokenizer results

Now to that thorny You Ain't Gonna Need It question. You aren't going to need the solution to the original question either, unless it is part of a bigger task, and that bigger task should dictate the best approach.

Update: There is a good discussion of regular expressions in lexing and parsing at http://commandcenter.blogspot.ru/2011/08/regular-expressions-in-lexing-and.html. I'll summarize it with a quote:

Encouraging regular expressions as a panacea for all text processing problems is not only lazy and poor engineering, it also reinforces their use by people who shouldn't be using them at all.
