
How do I use regex to match any word (\w) except a list of certain words? For example:

I want to match the words use and utilize and any word that follows them, except when that word is something or fish.

use this  <-- match
utilize that  <-- match
use something   <-- don't want to match this
utilize fish <-- don't want to match this

How do I specify a list of words I don't want to match against?


asked Jan 13, 2012 by Jake Wilson; edited Nov 24, 2016 by Todd A. Jacobs
  • What language? It matters. – Todd A. Jacobs, Jan 15, 2016

4 Answers


You can use a negative lookahead to assert that the word you are about to match is not one of a given set. The following regex does this:

(use|utilize)\s(?!fish|something)(\w+)

This will match "use" or "utilize" followed by a space, and then if the following word is not "fish" or "something", it will match that next word.
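
As a quick illustration (my sketch, not part of the original answer), here is how that lookahead behaves on the question's sample inputs in Python:

import re

# Pattern from the answer above; the lookahead rejects "fish"/"something".
pattern = re.compile(r"(use|utilize)\s(?!fish|something)(\w+)")

for line in ["use this", "utilize that", "use something", "utilize fish"]:
    m = pattern.search(line)
    print(line, "->", m.groups() if m else "no match")

# use this -> ('use', 'this')
# utilize that -> ('utilize', 'that')
# use something -> no match
# utilize fish -> no match

Note that, as written, the lookahead also rejects words that merely start with "fish" or "something" (e.g. "fishing"); writing it as (?!(?:fish|something)\b) limits the exclusion to whole words.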

This should do it:

/(?:use|utilize)\s+(?!something|fish)\w+/

Don't Hard-Code Your Regular Expressions

Rather than trying to put all your search and exclusion terms into a single, hard-coded regular expression, it's often more maintainable (and certainly more readable) to use short-circuit evaluation to select strings that match desirable terms, and then reject strings that contain undesirable terms.

You can then encapsulate this testing into a function that returns a Boolean value based on the run-time values of your arrays. For example:

'use strict';

// Join arrays of terms with the alternation operator.
var searchTerms   = new RegExp(['use', 'utilize'].join('|'));
var excludedTerms = new RegExp(['fish', 'something'].join('|'));

// Return true if a string contains only valid search terms without any
// excluded terms.
var isValidStr = function (str) {
    return (searchTerms.test(str) && !excludedTerms.test(str));
};

isValidStr('use fish');       // false
isValidStr('utilize hammer'); // true

Some people, when confronted with a problem, think “I know, I'll use regular expressions.”

Now they have two problems.

Regular expressions are suited to matching regular sequences of symbols, not words. Any lexer+parser would be much more suitable. For example, the grammar for this task would look very simple in ANTLR. If you can't afford the learning curve of lexers/parsers (they are pretty easy for your given task), then splitting your text into words with a regular expression and then doing a simple search with a look-ahead of one token is enough.

Regular expressions with words get very complex very fast. They are hard to read and hard to maintain.

Update: Thanks for all the downvotes. Here's an example of what I meant.

import re

def Tokenize(text):
    # Split the text into a flat list of word tokens.
    return re.findall(r"\w+", text)

def ParseWhiteListedWordThenBlackListedWord(tokens, whiteList, blackList):
    # Yield each (word, next_word) pair where the word is white-listed
    # and the word immediately after it is not black-listed.
    for i in range(len(tokens) - 1):
        if tokens[i] in whiteList and tokens[i + 1] not in blackList:
            yield (tokens[i], tokens[i + 1])
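
For completeness, a hypothetical usage sketch (not in the original answer) calling these two functions with the question's word lists:

# Hypothetical usage with the question's word lists.
tokens = Tokenize("use this, utilize that, use something, utilize fish")
pairs = ParseWhiteListedWordThenBlackListedWord(
    tokens, whiteList={"use", "utilize"}, blackList={"something", "fish"})
print(list(pairs))  # [('use', 'this'), ('utilize', 'that')]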

Here is some performance testing:

>>> timeit.timeit( 'oldtime()', 'from __main__ import oldtime', number=1 )
0.02636446265387349

>>> timeit.timeit( 'oldtime()', 'from __main__ import oldtime', number=1000 )
28.80968123656703

>>> timeit.timeit( 'newtime()', 'from __main__ import newtime', number=100 )
44.24506212427741

>>> timeit.timeit( 'newtime11()', 'from __main__ import newtime11', number=1 ) +
timeit.timeit( 'newtime13()', 'from __main__ import newtime13', number=1000 )
103.07938725936083

>>> timeit.timeit( 'newtime11()', 'from __main__ import newtime11', number=1 ) + 
timeit.timeit( 'newtime12()', 'from __main__ import newtime12', number=1000 )
0.3191265909927097

Some notes: testing was done over the English text of Pride and Prejudice by Jane Austen; the first words were 'Mr' and 'my', the second words 'Bennet' and 'dear'.

oldtime() is the regular expression. newtime() is the Tokenizer+Parser; mind that it was run 100 times, not 1000, so a comparable time for it would be ~442.

The next two tests simulate repeated runs of the Parser over the same text, reusing the Tokenizer results.

newtime11() is Tokenizer only. newtime13() is Parser with results converted to list (to simulate traversal of the results). newtime12() is just Parser.

Well, regular expressions are faster, by quite a lot for a single pass, even against the generator version (with the Tokenizer+Parser, the bulk of the time is spent tokenizing the text). But the generator approach is extremely fast when you can reuse the tokenized text and evaluate the parser results lazily.
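
The timing harness itself is not shown above, but as a rough, hypothetical sketch of the reuse pattern being measured (assuming the book's text has already been loaded into a string named text, and using the word lists from the test description):

tokens = Tokenize(text)  # tokenize once: this is the expensive step

for _ in range(1000):
    # Creating the generator is cheap; nothing is scanned until the results
    # are consumed (e.g. by list(pairs), which the list-conversion timing simulates).
    pairs = ParseWhiteListedWordThenBlackListedWord(tokens, ['Mr', 'my'], ['Bennet', 'dear'])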

There is quite a bit of performance optimization possible, but it would complicate the solution, possibly to the point where regular expressions become the best implementation.

The tokenizer+parser approach has both advantages and disadvantages:

- the structure of the solution is more complex (more elements), but each element is simpler
- elements are easy to test, including automatic testing
- it IS slow, but it gets better with reusing the same text and evaluating results lazily
- due to generators and lazy evaluation, some work may be avoided
- it is trivial to change the white list and/or black list
- it is trivial to have several white lists, several black lists and/or their combinations
- it is trivial to add new parsers reusing tokenizer results

Now to that thorny You Ain't Gonna Need It question. You aren't going to need the solution to the original question either, unless it is part of a bigger task, and that bigger task should dictate the best approach.

Update: There is a good discussion of regular expressions in lexing and parsing at http://commandcenter.blogspot.ru/2011/08/regular-expressions-in-lexing-and.html. I'll summarize it with a quote:

Encouraging regular expressions as a panacea for all text processing problems is not only lazy and poor engineering, it also reinforces their use by people who shouldn't be using them at all.
