I am building a JavaScript script that splits a paragraph into sentences. Right now I am using this code with a regex to do that:

paragraph.match( /[^\.!\?]+[\.!\?(?="|')]+(\s|$)/g );

This works great, except that it also splits wherever a word inside the sentence contains punctuation. For example, given a sentence like: Why is about.me so popular? I want it to treat that as one sentence and return an array like ['Why is about.me so popular?'], but this regex splits it at the . in about.me. I know the issue is in the [^\.!\?] part of the regex, because it says the sentence cannot contain any end-mark punctuation at all. What I really need is a pattern that only excludes punctuation that is followed by a space. The issue is I cannot figure out how to do that.

Any ideas? I tried [^\.!\?(?=\s)], but that did not work.

Clarification:

I need to use .match() because I want to be able to retain the punctuation.

asked Jul 20, 2013 at 18:46 by chromedude, edited Jul 20, 2013 at 19:21
  • you cannot have a lookahead within a class – Walter Tross Commented Jul 20, 2013 at 18:49
  • @WalterTross ok that's what I thought so any idea how to do the same thing, but with a different method? – chromedude Commented Jul 20, 2013 at 18:50
  • I don't think you should assume [Why is about.me so popular?] to be a single sentence. – Sebas Commented Jul 20, 2013 at 19:18
  • @Sebas um... why not? – chromedude Commented Jul 20, 2013 at 19:21
  • Because it is a character class containing the characters "W", "h", "y", " ", "i", "s", "a", "b", "o", "u", "t", ".", "m", "e", "p", "l", "r", and "?". IOW, it is equivalent to [Why isabout.meplr?] or [ .?Wabehilmoprstuy] or [ .?Wabehilmopr-uy]. – PointedEars Commented Jul 20, 2013 at 20:33

5 Answers

You could use a "lazy plus" (+?):

paragraph.match(/([\S\s]+?)[.!?](\s|$)/g);

This way, the match will end as soon as it hits the end of a sentence.

[\S\s] stands for "any character".
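
To make the effect of the lazy quantifier concrete, here is a minimal sketch comparing it with the greedy form. The sample text is made up for illustration, and the final trim step is an optional addition, not part of the answer above.

```javascript
// Made-up sample text for illustration.
const paragraph = "Why is about.me so popular? I have no idea. Try it!";

// Lazy: each match stops at the first end mark that is followed by
// whitespace or the end of the string.
const lazy = paragraph.match(/[\S\s]+?[.!?](\s|$)/g);

// Greedy: a single match that runs all the way to the last end mark.
const greedy = paragraph.match(/[\S\s]+[.!?](\s|$)/g);

// The lazy matches keep the trailing whitespace, so trim if needed.
const sentences = lazy.map(function (s) { return s.trim(); });

console.log(lazy);      // three matches, the first two ending in a space
console.log(greedy);    // one match: the whole paragraph
console.log(sentences); // the three sentences without trailing spaces
```

Note that the . in about.me does not end a match, because it is followed by a letter rather than whitespace.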

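// The capturing group makes split() keep each delimiter in the result:
// [text, mark, text, mark, ..., text]
var arry = paragraph.split(/([.!?])\s/);
var sentences = [];
for (var i = 0; i < arry.length; i += 2) {
  if (i < arry.length - 1) {
    // Re-attach the end mark that followed this piece of text
    sentences.push(arry[i] + arry[i + 1]);
  } else if (arry[i] !== "") {
    // In case the last sentence is not delimited; skip an empty remainder
    sentences.push(arry[i]);
  }
}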

Using a capturing group for the delimiter makes split include the delimiter in the returned array. Then you just need to fold the array to put each delimiter back on the end of its sentence. This could be done a lot more compactly using a reduce or foldl method (Array.prototype.reduce in ES5, or a framework equivalent), but I kept it to a plain JavaScript loop for this example.
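
As a rough sketch of the fold mentioned above, the same idea can be written with the built-in Array.prototype.reduce. The helper name splitSentences is hypothetical and only used for illustration.

```javascript
// splitSentences is a hypothetical helper name, not from the answer above.
function splitSentences(paragraph) {
  // With a capturing group, split() returns [text, mark, text, mark, ..., text].
  var parts = paragraph.split(/([.!?])\s/);
  return parts
    .reduce(function (sentences, part, index) {
      if (index % 2 === 0) {
        // Even index: a piece of sentence text starts a new entry.
        sentences.push(part);
      } else {
        // Odd index: the captured end mark goes back onto the previous entry.
        sentences[sentences.length - 1] += part;
      }
      return sentences;
    }, [])
    // Drop the empty entry left behind when the input ends in "mark + space".
    .filter(function (s) { return s !== ""; });
}

console.log(splitSentences(
  "The quick brown fox jumped over the lazy dog. Why is about.me so popular? Give me a break!"
));
```

Because the split pattern consumes only one whitespace character, input with several spaces between sentences would leave leading spaces on the following sentence; trimming each entry is one way to handle that.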

So for example if I have a sentence like: "Why is about.me so popular?" I want it to say that that is one sentence and return an array like ['Why is about.me so popular?'], but with this regex it splits it at the "." in "about.me".

For a start, you can make the assumption that sentence-ending punctuation must be followed by whitespace or the end of input. A sentence is then the shortest possible sequence of characters followed by sentence-ending punctuation, which is in turn followed by either whitespace or the end of input. “Shortest possible sequence” means that the matching must be non-greedy (…+?):

/*
 * ["The quick brown fox jumped over the lazy dog. ",
 *  "Why is about.me so popular? ",
 *  "Give me a break!"]
 */
("The quick brown fox jumped over the lazy dog."
  + " Why is about.me so popular?"
  + " Give me a break!").match(/[\S\s]+?[.!?](?:\s+|$)/g)

Your expression

/[^\.!\?]+[\.!\?(?="|')]+(\s|$)/g

is mostly nonsense; it is equivalent to

/[^.!?]+[=|!.'"()?]+(\s|$)/g

You do not need to escape most special characters in character classes (the exceptions are \, ], ^ at the start of the class, and - when between two other characters), and escaping the others has no effect (whereas \- means a literal -). In particular, you cannot use assertions like (?=…) in character classes: their characters are simply treated as literal members of the class. A character class always matches exactly one character, so a zero-width assertion cannot be part of it.
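
The equivalence claimed above can be checked with a short sketch. The second sample string is made up; it includes = and parentheses because the class contains them as literal characters.

```javascript
// The expression from the question and the simplified equivalent.
const original = /[^\.!\?]+[\.!\?(?="|')]+(\s|$)/g;
const simplified = /[^.!?]+[=|!.'"()?]+(\s|$)/g;

const question = "Why is about.me so popular?";
// Made-up sample containing characters that are literal class members.
const sample = "Why is about.me so popular? a = b (c) done.";

// The "." in about.me is not followed by whitespace, so nothing before it
// can match; only the tail of the sentence is returned.
console.log(question.match(original)); // → [ 'me so popular?' ]

// Both expressions contain the same set of characters in each class,
// so they should return identical results for any input.
console.log(sample.match(original));
console.log(sample.match(simplified));
```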

Instead of match, use split:

var sentences=paragraph.split(/\.\s/);
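
Note that this split consumes the period and ignores ! and ?. If the punctuation has to be retained, as the question asks, one possible variant is to split on whitespace preceded by an end mark using a lookbehind. This is only a sketch and assumes an engine with lookbehind support (ES2018 or later); the sample text is the one used in an earlier answer.

```javascript
const paragraph = "The quick brown fox jumped over the lazy dog."
  + " Why is about.me so popular?"
  + " Give me a break!";

// Plain split from the answer above: the ". " delimiter is dropped and
// "?" is not treated as a sentence end.
const plain = paragraph.split(/\.\s/);

// Lookbehind variant: only the whitespace is consumed, because the
// lookbehind (?<=[.!?]) is zero-width. Requires ES2018 lookbehind support.
const sentences = paragraph.split(/(?<=[.!?])\s+/);

console.log(plain);
console.log(sentences);
```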

Grab everything that isn't a period, then the period. ([^.]*\.)

http://rubular.com/r/pVxAPNCNxO

Edit:
(.*?(?:\. ))

http://rubular.com/r/yv9kEPrKU2

Tags: javascript - Regex to match a string at all punctuation followed by a space or the end - Stack Overflow