If I have a string with any type of non-alphanumeric character in it:
"This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation"
How would I get a no-punctuation version of it in JavaScript:
"This is an example of a string with punctuation"
asked Dec 1, 2010 at 19:58 by Quentin Fisk
edited May 27, 2016 at 3:35 by hichris123
19 Answers
If you want to remove specific punctuation from a string, it is probably best to explicitly remove exactly what you want, like this:
replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"")
Doing the above still doesn't return the string as you have specified it. If you also want to remove the extra spaces left over from removing the punctuation, you will want to do something like this:
replace(/\s{2,}/g," ");
My full example:
var s = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var punctuationless = s.replace(/[.,\/#!$%\^&\*;:{}=\-_`~()]/g,"");
var finalString = punctuationless.replace(/\s{2,}/g," ");
Running this code in the Firebug console leaves finalString as "This is an example of a string with punctuation".
str = str.replace(/[^\w\s\']|_/g, "")
.replace(/\s+/g, " ");
Removes everything except alphanumeric characters, whitespace and single quotes, then collapses multiple adjacent whitespace characters to single spaces.
Detailed explanation:
1. \w is any digit, letter, or underscore.
2. \s is any whitespace.
3. [^\w\s\'] is anything that's not a digit, letter, whitespace, underscore or a single quote.
4. [^\w\s\']|_ is the same as #3 except with the underscores added back in.
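To see what each part of that pattern contributes, here is a small sketch. The sample string and variable names are made up for illustration and are not from the answer.

```javascript
// Hypothetical sample input with an apostrophe, an underscore and other punctuation.
const sample = "Don't stop_me -- now!";

// [^\w\s'] alone keeps word characters (which include "_"), whitespace and "'".
const keepsUnderscore = sample.replace(/[^\w\s']/g, "");

// Adding |_ to the pattern removes the underscore as well.
const noUnderscore = sample.replace(/[^\w\s']|_/g, "");

// Collapse runs of whitespace into a single space.
const collapsed = noUnderscore.replace(/\s+/g, " ");

console.log(keepsUnderscore); // Don't stop_me  now
console.log(noUnderscore);    // Don't stopme  now
console.log(collapsed);       // Don't stopme now
```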
Here are the standard punctuation characters for US-ASCII: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
For Unicode punctuation (such as curly quotes, em-dashes, etc), you can easily match on specific block ranges. The General Punctuation block is \u2000-\u206F, and the Supplemental Punctuation block is \u2E00-\u2E7F.
Put together, and properly escaped, you get the following RegExp:
/[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/
That should match pretty much any punctuation you encounter. So, to answer the original question:
var punctRE = /[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/g;
var spaceRE = /\s+/g;
var str = "This, -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
str.replace(punctRE, '').replace(spaceRE, ' ');
>> "This is an example of a string with punctuation"
US-ASCII source: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#posix
Unicode source: http://kourge.net/projects/regexp-unicode-block
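As a quick check of the block ranges, the sketch below runs the same two expressions over a string containing curly quotes, an em dash and an ellipsis. The sample string is an assumption for illustration, written with escapes so it survives copy and paste.

```javascript
const punctRE = /[\u2000-\u206F\u2E00-\u2E7F\\'!"#$%&()*+,\-.\/:;<=>?@\[\]^_`{|}~]/g;
const spaceRE = /\s+/g;

// Left and right curly quotes (U+201C, U+201D), em dash (U+2014), ellipsis (U+2026).
const fancy = "\u201CHello\u201D \u2014 world\u2026";

const cleaned = fancy.replace(punctRE, "").replace(spaceRE, " ");
console.log(cleaned); // Hello world
```

Note that the General Punctuation block also contains several space characters (U+2000 to U+200A), so those are removed rather than collapsed.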
As of 2021, many modern browsers support the JavaScript built-in RegExp Unicode property escapes, so you can now simply use \p{P}:
str.replace(/[\p{P}$+<=>^`|~]/gu, '')
The regex can be further simplified if you want to remove all symbols (\p{S}) as well as punctuation:
str.replace(/[\p{P}\p{S}]/gu, '')
If you want to strip everything except letters (\p{L}), numbers (\p{N}) and separators (\p{Z}), you may use a negated character set like this (it works for non-English alphanumeric characters too):
str.replace(/[^\p{L}\p{N}\p{Z}]/gu, '')
The above regex works, but the more common use case is to use the regex whitespace class (\s) instead of the Unicode separator category, as the latter does not include tabs and line feeds. Try this:
str.replace(/[^\p{L}\p{N}\s]/gu, '')
const str = 'This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation';
console.log(str.replace(/[\p{P}$+<=>^`|~]/gu, ''));
console.log(str.replace(/[\p{P}\p{S}]/gu, ''));
console.log(str.replace(/[^\p{L}\p{N}\p{Z}]/gu, ''));
console.log(str.replace(/[^\p{L}\p{N}\s]/gu, ''));
You may also like to chain a .replace(/ +/g, ' ') to remove consecutive spaces.
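Putting the pieces together, a minimal sketch of a reusable helper might look like this. The function name stripPunctuation and the sample text are hypothetical, and the code assumes an engine that supports Unicode property escapes.

```javascript
// Hypothetical helper; the name stripPunctuation is not from any library.
function stripPunctuation(input) {
  return input
    .replace(/[\p{P}\p{S}]/gu, "") // drop Unicode punctuation and symbols
    .replace(/\s+/g, " ")          // collapse runs of whitespace
    .trim();                       // drop leading and trailing whitespace
}

// Escapes are used so the sample survives copy and paste: inverted question mark,
// e-acute, em dash, guillemets, o-umlaut, sharp s and euro sign.
const sample = "  \u00BFQu\u00E9 tal? \u2014 \u00ABGr\u00F6\u00DFe\u00BB: 10\u20AC!  ";
const cleaned = stripPunctuation(sample);
console.log(cleaned);
```

Letters with diacritics are kept because they are in \p{L}, while the currency sign is removed because it is in \p{S}.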
Feel free to play around with these! Ref:
Unicode Character Properties - Wikipedia
Unicode Property Escapes - MDN
I ran across the same issue. This solution did the trick and was very readable:
var sentence = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var newSen = sentence.match(/[^_\W]+/g).join(' ');
console.log(newSen);
Result:
"This is an example of a string with punctuation"
The trick was to create a negated set, which matches anything that is not within the set. For example, [^abc] matches anything that is not a, b or c.
\W is any non-word character, so [^\W]+ matches one or more characters that are not non-word characters, i.e. word characters.
By adding _ (underscore) to the set you exclude that as well.
Apply it globally with /g, and you can run any string through it and clear out the punctuation:
/[^_\W]+/g
Nice and clean ;)
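One caveat with this approach: with the g flag, match returns null when nothing matches, so calling .join on the result throws for input that contains no word characters. A minimal sketch of a guarded version follows; the function name wordsOnly is hypothetical.

```javascript
// Hypothetical wrapper around the same pattern, falling back to an empty list
// when the string contains no word characters at all.
function wordsOnly(sentence) {
  const words = sentence.match(/[^_\W]+/g) || [];
  return words.join(" ");
}

const question = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
console.log(wordsOnly(question)); // This is an example of a string with punctuation
console.log(wordsOnly("?!..."));  // prints an empty line
```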
/[^A-Za-z0-9\s]/g should match all punctuation but keep the spaces.
You can then use .replace(/\s{2,}/g, " ") to replace extra spaces if you need to do so. You can test the regex at http://rubular.com/
.replace(/[^A-Za-z0-9\s]/g,"").replace(/\s{2,}/g, " ")
Update: Will only work if the input is ANSI English.
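To illustrate that limitation, the sketch below compares the ASCII-only pattern with a Unicode-aware one on a sample containing accented letters. The sample text is made up, and the second pattern assumes an engine with Unicode property escape support.

```javascript
// Sample with accented letters, written with escapes: "Café, naïve!"
const input = "Caf\u00E9, na\u00EFve!";

// ASCII-only: the accented letters are treated like punctuation and removed.
const asciiResult = input.replace(/[^A-Za-z0-9\s]/g, "");

// Unicode-aware: any letter (\p{L}) or number (\p{N}) is kept.
const unicodeResult = input.replace(/[^\p{L}\p{N}\s]/gu, "");

console.log(asciiResult);   // Caf nave
console.log(unicodeResult); // Café naïve
```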
In a Unicode-aware language, the Unicode Punctuation character property is \p{P}, which you can usually abbreviate to \pP and sometimes expand to \p{Punctuation} for readability.
Are you using a Perl Compatible Regular Expression library?
If you want to remove punctuation from any string you should use the P Unicode class.
But, because Unicode classes are not accepted in JavaScript regular expressions, you could try this RegEx, which should match all the punctuation. It matches the following categories: Pc Pd Pe Pf Pi Po Ps Sc Sk Sm So GeneralPunctuation SupplementalPunctuation CJKSymbolsAndPunctuation CuneiformNumbersAndPunctuation.
I created it using this online tool that generates regular expressions specifically for JavaScript. Here is the code to reach your goal:
var punctuationRegEx = /[!-/:-@[-`{-~¡-©«-¬®-±´¶-¸»¿×÷˂-˅˒-˟˥-˫˭˯-˿͵;΄-΅·϶҂՚-՟։-֊־׀׃׆׳-״؆-؏؛؞-؟٪-٭۔۩۽-۾܀-܍߶-߹।-॥॰৲-৳৺૱୰௳-௺౿ೱ-ೲ൹෴฿๏๚-๛༁-༗༚-༟༴༶༸༺-༽྅྾-࿅࿇-࿌࿎-࿔၊-၏႞-႟჻፠-፨᎐-᎙᙭-᙮᚛-᚜᛫-᛭᜵-᜶។-៖៘-៛᠀-᠊᥀᥄-᥅᧞-᧿᨞-᨟᭚-᭪᭴-᭼᰻-᰿᱾-᱿᾽᾿-῁῍-῏῝-῟῭-`´-῾\u2000-\u206e⁺-⁾₊-₎₠-₵℀-℁℃-℆℈-℉℔№-℘℞-℣℥℧℩℮℺-℻⅀-⅄⅊-⅍⅏←-⏧␀-␦⑀-⑊⒜-ⓩ─-⚝⚠-⚼⛀-⛃✁-✄✆-✉✌-✧✩-❋❍❏-❒❖❘-❞❡-❵➔➘-➯➱-➾⟀-⟊⟌⟐-⭌⭐-⭔⳥-⳪⳹-⳼⳾-⳿⸀-\u2e7e⺀-⺙⺛-⻳⼀-⿕⿰-⿻\u3000-〿゛-゜゠・㆐-㆑㆖-㆟㇀-㇣㈀-㈞㈪-㉃㉐㉠-㉿㊊-㊰㋀-㋾㌀-㏿䷀-䷿꒐-꓆꘍-꘏꙳꙾꜀-꜖꜠-꜡꞉-꞊꠨-꠫꡴-꡷꣎-꣏꤮-꤯꥟꩜-꩟﬩﴾-﴿﷼-﷽︐-︙︰-﹒﹔-﹦﹨-﹫!-/:-@[-`{-・¢-₩│-○-�]|\ud800[\udd00-\udd02\udd37-\udd3f\udd79-\udd89\udd90-\udd9b\uddd0-\uddfc\udf9f\udfd0]|\ud802[\udd1f\udd3f\ude50-\ude58]|\ud809[\udc00-\udc7e]|\ud834[\udc00-\udcf5\udd00-\udd26\udd29-\udd64\udd6a-\udd6c\udd83-\udd84\udd8c-\udda9\uddae-\udddd\ude00-\ude41\ude45\udf00-\udf56]|\ud835[\udec1\udedb\udefb\udf15\udf35\udf4f\udf6f\udf89\udfa9\udfc3]|\ud83c[\udc00-\udc2b\udc30-\udc93]/g;
var string = "This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation";
var newString = string.replace(punctuationRegEx, '').replace(/(\s){2,}/g, '$1');
console.log(newString)
I'll just put it here for others.
Match all punctuation characters for all languages:
Constructed from the Unicode punctuation category, with some common keyboard symbols added, like $, brackets and \-=_
http://www.fileformat.info/info/unicode/category/Po/list.htm
Basic replace:
".test'da, te\"xt".replace(/[\-=_!"#%&'*{},.\/:;?\(\)\[\]@\\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g,"")
"testda text"
Added \s so that whitespace is matched as well:
".da'fla, te\"te".split(/[\s\-=_!"#%&'*{},.\/:;?\(\)\[\]@\\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g)
Added ^ to invert the pattern so that it matches not the punctuation but the words themselves:
".test';the, te\"xt".match(/[^\s\-=_!"#%&'*{},.\/:;?\(\)\[\]@\\$\^*+<>~`\u00a1\u00a7\u00b6\u00b7\u00bf\u037e\u0387\u055a-\u055f\u0589\u05c0\u05c3\u05c6\u05f3\u05f4\u0609\u060a\u060c\u060d\u061b\u061e\u061f\u066a-\u066d\u06d4\u0700-\u070d\u07f7-\u07f9\u0830-\u083e\u085e\u0964\u0965\u0970\u0af0\u0df4\u0e4f\u0e5a\u0e5b\u0f04-\u0f12\u0f14\u0f85\u0fd0-\u0fd4\u0fd9\u0fda\u104a-\u104f\u10fb\u1360-\u1368\u166d\u166e\u16eb-\u16ed\u1735\u1736\u17d4-\u17d6\u17d8-\u17da\u1800-\u1805\u1807-\u180a\u1944\u1945\u1a1e\u1a1f\u1aa0-\u1aa6\u1aa8-\u1aad\u1b5a-\u1b60\u1bfc-\u1bff\u1c3b-\u1c3f\u1c7e\u1c7f\u1cc0-\u1cc7\u1cd3\u2016\u2017\u2020-\u2027\u2030-\u2038\u203b-\u203e\u2041-\u2043\u2047-\u2051\u2053\u2055-\u205e\u2cf9-\u2cfc\u2cfe\u2cff\u2d70\u2e00\u2e01\u2e06-\u2e08\u2e0b\u2e0e-\u2e16\u2e18\u2e19\u2e1b\u2e1e\u2e1f\u2e2a-\u2e2e\u2e30-\u2e39\u3001-\u3003\u303d\u30fb\ua4fe\ua4ff\ua60d-\ua60f\ua673\ua67e\ua6f2-\ua6f7\ua874-\ua877\ua8ce\ua8cf\ua8f8-\ua8fa\ua92e\ua92f\ua95f\ua9c1-\ua9cd\ua9de\ua9df\uaa5c-\uaa5f\uaade\uaadf\uaaf0\uaaf1\uabeb\ufe10-\ufe16\ufe19\ufe30\ufe45\ufe46\ufe49-\ufe4c\ufe50-\ufe52\ufe54-\ufe57\ufe5f-\ufe61\ufe68\ufe6a\ufe6b\uff01-\uff03\uff05-\uff07\uff0a\uff0c\uff0e\uff0f\uff1a\uff1b\uff1f\uff20\uff3c\uff61\uff64\uff65]+/g)
For a language like Hebrew, maybe remove the single and the double quote (' and ") from the character class; this needs more thought.
The pattern was built using this script:
Step 1: In Firefox, hold Control and select a column of U+1234 numbers, then copy it. Do not copy five-digit values such as U+12456; they end up replacing English characters.
Step 2 (I did this in Chrome): find some textarea and paste the column into it, then right-click it and click Inspect. You can then access the selected element with $0.
var x=$0.value
var z=x.replace(/U\+/g,"").split(/[\r\n]+/).map(function(a){return parseInt(a,16)})
var ret=[];z.forEach(function(a,k){if(z[k-1]===a-1 && z[k+1]===a+1) { if(ret[ret.length-1]!="-")ret.push("-");} else { var c=a.toString(16); var prefix=c.length<3?"\\u0000":c.length<5?"\\u0000":"\\u000000"; var uu=prefix.substring(0,prefix.length-c.length)+c; ret.push(c.length<3?String.fromCharCode(a):uu)}});ret.join("")
Step 3: I copied over the first characters, the ASCII ones, as separate characters rather than ranges, because someone might add or remove individual characters.
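The steps above can also be sketched as a standalone function that needs no browser console. This is a rough reworking under stated assumptions, not the answerer's script: the name toCharClassBody is hypothetical, the input is assumed to be one "U+XXXX" value per line, and only four-digit code points (U+0000 to U+FFFF) are handled.

```javascript
// Hypothetical helper: build the body of a character class from "U+XXXX" lines.
// Assumes every code point fits in four hex digits (U+0000 to U+FFFF).
function toCharClassBody(text) {
  const codePoints = text
    .split(/[\r\n]+/)
    .filter((line) => line.trim() !== "")
    .map((line) => parseInt(line.trim().replace(/^U\+/, ""), 16));
  const sorted = [...new Set(codePoints)].sort((a, b) => a - b);
  const esc = (cp) => "\\u" + cp.toString(16).padStart(4, "0");

  const parts = [];
  let start = sorted[0];
  let prev = sorted[0];
  for (let i = 1; i <= sorted.length; i++) {
    const cp = sorted[i]; // undefined on the last pass, which closes the final run
    if (cp === prev + 1) {
      prev = cp;
      continue;
    }
    if (start === prev) parts.push(esc(start));                      // single value
    else if (prev === start + 1) parts.push(esc(start) + esc(prev)); // pair, no range
    else parts.push(esc(start) + "-" + esc(prev));                   // run of 3 or more
    start = prev = cp;
  }
  return parts.join("");
}

const body = toCharClassBody("U+055A\nU+055B\nU+055C\nU+0589\nU+05C0");
const re = new RegExp("[" + body + "]+", "g");
console.log(body); // \u055a-\u055c\u0589\u05c0
```

Runs of three or more consecutive code points become a range, while isolated values and pairs are listed individually, which mirrors how the ranges in the patterns above are laid out.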
For en-US ( American English ) strings this should suffice:
"This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation".replace( /[^a-zA-Z ]/g, '').replace( /\s\s+/g, ' ' )
Be aware that if you support UTF-8 and characters from languages like Chinese or Russian, this will remove them as well, so you really have to specify what you want.
If you want to retain only letters and spaces, you can do:
str.replace(/[^a-zA-Z ]+/g, '').replace(/ {2,}/g, ' ')
If you are using lodash:
_.words('This, is : my - test,line:').join(' ')
For the example from the question:
_.words('"This., -/ is #! an $ % ^ & * example ;: {} of a = -_ string with `~)() punctuation"').join(' ')
Based on Wikipedia's list of punctuation marks, I built the following regex, which detects punctuation:
[\.’'\[\](){}⟨⟩:,،、‒–—―…!.‹›«»‐\-?‘’“”'";/⁄·\&*@\•^†‡°”¡¿※#№÷׺ª%‰+−=‱¶′″‴§~_|‖¦©℗®℠™¤₳฿₵¢₡₢$₫₯֏₠€ƒ₣₲₴₭₺₾ℳ₥₦₧₱₰£៛₽₹₨₪৳₸₮₩¥]
I think the simplest solution is:
.replaceAll(/[^a-zA-Z0-9]/g,"");
Instead of filtering out every single non-character item, you just check if the character doesn't fit what you're looking for.
It depends on what you are trying to return. I used this recently:
return text.match(/[a-z]/i);
If you are targeting modern browsers (not IE) you can utilize Unicode character classes. This is especially helpful when you also need to support characters like German umlauts (äöü) and the like.
Here is what I ended up with. It removes everything that is not a letter, apostrophe or whitespace, and replaces multiple whitespace characters in a row with a single space.
const textStripped = text
  .replace(/[’]/g, "'") // replace ’ with '
  .replace(/[^\p{Letter}\p{Mark}\s']/gu, "") // remove everything that is not a letter, mark, whitespace or '
  .replace(/\s+/g, " "); // collapse runs of whitespace into a single space
.replace(/[’]/g, "'")
First, this replaces ’ (typographic apostrophe) with ' (typewriter apostrophe), as both may be used in words like "don’t".
.replace(/[^\p{Letter}\p{Mark}\s']/gu, "")
\p{Letter} stands for any character that is categorized as a letter in Unicode.
The \p{Mark} category needs to be included to also cover letter-mark combinations. For example, a German ä can be encoded as a single character or as a combination of "a" and a mark. This happens quite regularly when copying German texts from PDFs.
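The effect of \p{Mark} can be checked with String.prototype.normalize, which converts between the single-character and the combined encoding. The sample word is an assumption chosen for illustration.

```javascript
const composed = "M\u00E4dchen";              // ä as a single code point (NFC form)
const decomposed = composed.normalize("NFD"); // "a" followed by U+0308, a combining mark

// Without \p{Mark}, the combining diaeresis is stripped and the umlaut is lost.
const withoutMark = decomposed.replace(/[^\p{Letter}\s']/gu, "");

// With \p{Mark}, the combination survives and can be recomposed afterwards.
const withMark = decomposed.replace(/[^\p{Letter}\p{Mark}\s']/gu, "");

console.log(decomposed.length);                      // 8
console.log(withoutMark);                            // Madchen
console.log(withMark.normalize("NFC") === composed); // true
```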
Source: https://dev.to/tillsanders/let-s-stop-using-a-za-z-4a0m
Correct way is:
str = str.replace(/[.,?\/#!$%\^&\*;:{}=\-_`~()\s]/g, "");
Here I included question mark as well.
A weird way to do it is to loop over the characters in the string, use a conditional that checks whether each character is a number or a letter, and append the matching characters to a new variable. For example:
let str = "hello!!!,.?"
let newStr = ''
for (let i = 0; i < str.length; i++) {
  if (/[a-zA-Z0-9]/.test(str[i])) {
    newStr += str[i]
  }
}
console.log(newStr) //hello
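The same idea can be written without an explicit loop by filtering the characters. This is only a sketch, and the function name keepAlphanumerics is hypothetical.

```javascript
// Hypothetical helper: keep only ASCII letters and digits, one character at a time.
function keepAlphanumerics(str) {
  return Array.from(str)
    .filter((ch) => /[a-zA-Z0-9]/.test(ch))
    .join("");
}

console.log(keepAlphanumerics("hello!!!,.?")); // hello
```

Like the loop, this also drops spaces, so it suits single words better than whole sentences.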
It's simple: just replace every character that is not a word character:
.replace(/[^\w]/g, ' ')