I'm trying to split this string into sentences, but I need to handle abbreviations (which have the fixed format x.y.) as a word:
content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool."
I tried this regex:
content.replace(/([.?!])\s+(?=[A-Za-z])/g, "$1|").split("|");
But as you can see, there are problems with abbreviations: the string is also split after e.g. and i.e. Since all the abbreviations have the format x.y., it should be possible to handle them as a word, without splitting the string at that point. The current output is:
"This is a long string with some numbers 123.456,78 or 100.000 and e.g.",
"some abbreviations in it, which shouldn't split the sentence.",
"Sometimes there are problems, i.e.",
"in this one.",
"here and abbr at the end x.y..",
"cool."
The result should be:
"This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence.",
"Sometimes there are problems, i.e. in this one.",
"here and abbr at the end x.y..",
"cool."
asked Jan 14, 2016 at 8:21
user3142695
- What do you mean by string manipulation? – user3142695 Commented Jan 14, 2016 at 8:26
2 Answers
The solution is to match and capture the abbreviations and build the replacement using a callback:
var re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g;
var str = 'This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn\'t split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.';
// Abbreviations (Group 1) are put back unchanged; after sentence-ending
// punctuation (Group 2) the whitespace is replaced with a "\r" marker.
var result = str.replace(re, function (m, g1, g2) {
    return g1 ? g1 : g2 + "\r";
});
// Split on the marker to get the sentences.
var arr = result.split("\r");
document.body.innerHTML = "<pre>" + JSON.stringify(arr, null, 4) + "</pre>";
Regex explanation:
- \b(\w\.\w\.) - match and capture into Group 1 the abbreviation (consisting of a word character, then a ., and again a word character and a .) as a whole word
- | - or...
- ([.?!])\s+(?=[A-Za-z]):
  - ([.?!]) - match and capture into Group 2 either . or ? or !
  - \s+ - match 1 or more whitespace symbols...
  - (?=[A-Za-z]) - that are before an ASCII letter.
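The same pattern can also be wrapped in a reusable function that collects the sentences directly, which avoids relying on "\r" never appearing in the input. This is only a minimal sketch and not part of the answer above: the name splitSentences is hypothetical, and it assumes every abbreviation has the x.y. shape described in the question.

```javascript
// Hypothetical helper built around the regex from the answer above.
function splitSentences(text) {
  // Group 1: an abbreviation such as "e.g." (must not end a sentence).
  // Group 2: sentence-ending punctuation followed by whitespace and a letter.
  const re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g;
  const sentences = [];
  let start = 0;
  let match;
  while ((match = re.exec(text)) !== null) {
    if (match[2] !== undefined) {
      // Sentence boundary: keep the punctuation, skip the whitespace.
      sentences.push(text.slice(start, match.index + 1));
      start = re.lastIndex;
    }
    // Abbreviation matches are skipped; matching them consumes their
    // periods so they cannot be mistaken for a sentence end.
  }
  if (start < text.length) {
    sentences.push(text.slice(start)); // trailing sentence, if any
  }
  return sentences;
}

const content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.";
console.log(splitSentences(content));
```

For the string from the question this should log the four sentences listed under "The result should be".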
Given your example, I managed to achieve what you are after with this expression: (?<!\..)[.?!]\s+
This expression looks for a period, question mark or exclamation mark followed by whitespace, where the mark is not preceded by a period and one further character (that is, it does not close an abbreviation such as e.g.).
You would then replace each match with the | character and, finally, replace each | with .\n.
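As a rough sketch of the lookbehind idea, the variant below splits on the whitespace directly instead of going through the | replacement, so the original punctuation (including ? and !) is kept. It is not the answerer's exact procedure: the function name splitWithLookbehind is hypothetical, and it assumes a JavaScript engine with lookbehind support (ES2018 or later).

```javascript
// Hypothetical helper; needs lookbehind assertions (ES2018+).
function splitWithLookbehind(text) {
  // (?<=[.?!])        the whitespace directly follows ., ? or !
  // (?<!\..[.?!])     ...and that mark is not preceded by "." plus one
  //                   character, i.e. it does not close an x.y. abbreviation
  // \s+               the whitespace itself is the separator and is dropped
  return text.split(/(?<=[.?!])(?<!\..[.?!])\s+/);
}

const sample = "Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.";
console.log(splitWithLookbehind(sample));
```

Unlike the first answer's pattern, this one does not require the next sentence to start with an ASCII letter, so it also splits before digits or quotes.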