I'm writing a rudimentary lexer using regular expressions in JavaScript, and I have two regular expressions (one for single-quoted strings and one for double-quoted strings) which I wish to combine into one. These are my two regular expressions (I added the ^ and $ characters for testing purposes):

var singleQuotedString = /^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$/gi;
var doubleQuotedString = /^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$/gi;

I tried to combine them into a single regular expression as follows:

var string = /^(["'])(?:[^\1\\]|\\\1|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*\1$/gi;

However when I test the input "Hello"World!" it returns true instead of false:

alert(string.test('"Hello"World!"')); //should return false as a double quoted string must escape double quote characters

I figured that the problem is in [^\1\\] which should match any character besides matching group \1 (which is either a single or a double quote - the delimiter of the string) and \\ (which is the backslash character).

The regular expression correctly filters out backslashes and matches the delimiters, but it doesn't filter out the delimiter within the string. Any help will be greatly appreciated. Note that I referred to Crockford's railroad diagrams to write the regular expressions.
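The suspicion about [^\1\\] can be checked in isolation. The following is a minimal sketch, not part of the original post; the names classWithBackref and unicodeModeError are made up for illustration. It relies on the fact that, without the u flag, JavaScript engines read \1 inside a character class as a legacy octal escape for the character U+0001 rather than as a backreference.

```javascript
// In non-unicode mode, \1 inside [...] is a legacy octal escape for U+0001,
// not a backreference to group 1.
const classWithBackref = /^(["'])[^\1\\]*\1$/;

console.log(/[\1]/.test("\u0001"));                   // true
console.log(classWithBackref.test('"Hello"World!"')); // true: the inner quote is accepted

// With the u flag the same class escape is rejected when the regex is compiled.
let unicodeModeError = null;
try {
  new RegExp("[^\\1]", "u");
} catch (err) {
  unicodeModeError = err;
}
console.log(unicodeModeError instanceof SyntaxError); // true
```

The sketch leaves out the g flag on purpose: with g, test() advances lastIndex between calls, so repeated tests on the same regex object can give different results for the same input.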

Asked Apr 27, 2012 at 18:36 by Aadit M Shah. Edited Jan 31, 2020 at 22:26 by rogerdpack.

3 Answers

You can't refer to a matched group inside a character class: in (['"])[^\1\\], the \1 is not treated as a backreference. Try something like this instead:

(['"])((?!\1|\\).|\\[bnfrt]|\\u[a-fA-F\d]{4}|\\\1)*\1

(you'll need to add some more escapes, but you get my drift...)

A quick explanation:

(['"])             # match a single or double quote and store it in group 1
(                  # start group 2
  (?!\1|\\).       #   if group 1 or a backslash isn't ahead, match any non-line break char
  |                #   OR
  \\[bnfrt]        #   match an escape sequence
  |                #   OR
  \\u[a-fA-F\d]{4} #   match a Unicode escape
  |                #   OR
  \\\1             #   match an escaped quote
)*                 # close group 2 and repeat it zero or more times
\1                 # match whatever group 1 matched
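As a rough sketch of how the lookahead version might look once the remaining escapes are filled in, here is one possible completion. It is an assumption about what "some more escapes" means: the escaped backslash and escaped forward slash are taken from the question's original expressions, and the ^ and $ anchors are added for testing only. The name quotedString is illustrative.

```javascript
// Lookahead-based pattern: (?!\1|\\). matches any character (except a line
// break) that is neither the opening quote nor a backslash.
const quotedString =
  /^(['"])(?:(?!\1|\\).|\\[\\\/bfnrt]|\\u[a-fA-F\d]{4}|\\\1)*\1$/;

console.log(quotedString.test('"Hello"World!"'));   // false: unescaped inner quote
console.log(quotedString.test('"Hello\\"World!"')); // true: inner quote is escaped
console.log(quotedString.test("'it\\'s'"));         // true: escaped single quote
console.log(quotedString.test("'say \"hi\"'"));     // true: the other quote needs no escape
console.log(quotedString.test('"\\u00e9"'));        // true: Unicode escape
console.log(quotedString.test('"\\x"'));            // false: \x is not an allowed escape
```

Because the backreference sits in a lookahead rather than in a character class, the same pattern serves both delimiters without duplicating the list of escapes.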

This should work too (raw regex).
If speed is a factor, this is the 'unrolled' method, said to be the fastest for this kind of thing.

(['"])(?:(?!\\|\1).)*(?:\\(?:[\/bfnrt]|u[0-9A-F]{4}|\1)(?:(?!\\|\1).)*)*\1

Expanded

(['"])            # Capture a quote
(?:
   (?!\\|\1).             # As many non-escape and non-quote chars as possible
)*

(?:                       
    \\                     # escape plus,
    (?:
        [\/bfnrt]          # /,b,f,n,r,t or u[0-9A-F]{4} or captured quote
      | u[0-9A-F]{4}
      | \1
    )
    (?:                
        (?!\\|\1).         # As many non-escape and non-quote chars as possible
    )*
)*

\1                # Captured quote
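A quick way to try the unrolled pattern as a JavaScript literal is sketched below. It makes a few assumptions that are not in the answer: the closing backreference is written \1, the escape class is extended with \\ so that an escaped backslash is accepted as in the question's expressions, the hex class is spelled [0-9A-Fa-f] so that no i flag is needed, and ^ and $ are added for testing. The name unrolledString is illustrative.

```javascript
// Unrolled form: a run of ordinary characters, then zero or more
// repetitions of (one escape sequence followed by another run).
const unrolledString =
  /^(['"])(?:(?!\\|\1).)*(?:\\(?:[\\\/bfnrt]|u[0-9A-Fa-f]{4}|\1)(?:(?!\\|\1).)*)*\1$/;

console.log(unrolledString.test('"Hello"World!"'));   // false: unescaped inner quote
console.log(unrolledString.test('"Hello\\"World!"')); // true: inner quote is escaped
console.log(unrolledString.test("'tab\\there'"));     // true: \t escape
console.log(unrolledString.test('"\\u12"'));          // false: Unicode escape too short
console.log(unrolledString.test("'unterminated"));    // false: no closing quote
```

Whether the unrolled form is actually faster depends on the engine and the input, so it is worth measuring on real data before preferring it over the simpler alternation.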

Well, you can always create a larger regex by using the alternation operator on the smaller regexes:

/(?:single-quoted-regex)|(?:double-quoted-regex)/

Or explicitly:

var string = /(?:^'(?:[^'\\]|\\'|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*'$)|(?:^"(?:[^"\\]|\\"|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*"$)/gi;

Finally, if you want to avoid the code duplication, you can build up this regex dynamically, using the RegExp constructor.

var quoted_string = function(delimiter){
    return ('^' + delimiter + '(?:[^' + delimiter + '\\]|\\' + delimiter + '|\\\\|\\\/|\\b|\\f|\\n|\\r|\\t|\\u[0-9A-F]{4})*' + delimiter + '$').replace(/\\/g, '\\\\');
    //in the general case you could consider using a regex escaping function to avoid backslash hell.
};

var string = new RegExp( '(?:' + quoted_string("'") + ')|(?:' + quoted_string('"') + ')' , 'gi' );
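One way the "regex escaping function" idea could look is sketched below. The helpers escapeRegExp and quotedStringSource are hypothetical names introduced here, not part of the answer. String.raw keeps the backslashes in the template literal as written, so the pattern source reads much like a regex literal. The hex class is spelled out as [0-9A-Fa-f] instead of relying on the i flag, and the g flag is left off because test() on a global regex is stateful.

```javascript
// Hypothetical helper: escape characters that are special in a regex.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build the pattern source for a string delimited by one quote character.
function quotedStringSource(delimiter) {
  const d = escapeRegExp(delimiter);
  return String.raw`${d}(?:[^${d}\\]|\\${d}|\\[\\/bfnrt]|\\u[0-9A-Fa-f]{4})*${d}`;
}

// Alternate between the two delimiters; ^ and $ are for testing only.
const anyQuotedString = new RegExp(
  "^(?:" + quotedStringSource("'") + "|" + quotedStringSource('"') + ")$"
);

console.log(anyQuotedString.test('"Hello"World!"'));   // false: unescaped inner quote
console.log(anyQuotedString.test('"Hello\\"World!"')); // true: inner quote is escaped
console.log(anyQuotedString.test("'it\\'s'"));         // true: escaped single quote
```

Since each alternative has its delimiter baked in, this version needs no backreference at all, at the cost of a pattern that is roughly twice as long.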
