I use the following code to find words that start with hashtags:
var regex = /(?:^|\W)#(\w+)(?!\w)/g;
But it only matches English words and cannot match hashtags in other languages such as Arabic. So how can I find hashtags in a text like this:
this is a simple #text
هذا #نص بسیط
Asked Oct 10, 2020 at 9:37 by user6931342
3 Answers
If the value after the # should not contain a # itself, you could use a negated character class [^\s#], which matches any character except whitespace or #, on either side of the # using an alternation |.
The value is in capture group 1.
(?:^|\s)(#[^\s#]+|[^\s#]+#)(?=$|\s)
const pattern = /(?:^|\s)(#[^\s#]+|[^\s#]+#)(?=$|\s)/;
[
"this is a simple #test1",
"هذا #نص بسیط",
"test #test2#",
"test #test3#test3",
"test ##test4",
"test test5##",
].forEach(s => {
const m = s.match(pattern);
if (m) console.log(m[1]);
});
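The snippet above reports only the first hashtag per string, because the pattern has no g flag. A minimal sketch of collecting every hashtag in a text, assuming an environment that supports String.prototype.matchAll; the helper name findAll is hypothetical and not part of the original answer:

```javascript
// Same pattern as above, with the g flag added so every hashtag is visited.
const pattern = /(?:^|\s)(#[^\s#]+|[^\s#]+#)(?=$|\s)/g;

// Hypothetical helper: returns the text of capture group 1 for each match.
const findAll = (text) => Array.from(text.matchAll(pattern), (m) => m[1]);

console.log(findAll("#one and #two then three#"));
// → ["#one", "#two", "three#"]
```

Because the trailing boundary is a lookahead, the whitespace after one hashtag is not consumed and can still serve as the leading boundary of the next one.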
You may use the following regex alternation:
(?<!\S)#\S+|\S+#(?!\S)
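A short sketch of how this alternation behaves on the sample strings from the question, assuming an engine with lookbehind support (recent Node.js and current browsers). The g flag and the variable names are additions for illustration:

```javascript
// Pattern from the answer above, with the g flag added to collect every match.
const hashtagRegex = /(?<!\S)#\S+|\S+#(?!\S)/g;

const samples = [
  "this is a simple #text",
  "هذا #نص بسیط",
  "trailing tag# here",
  "mid#word",
];

for (const s of samples) {
  console.log(s, "=>", s.match(hashtagRegex));
}
```

The last sample yields null: a # in the middle of a token satisfies neither the lookbehind of the first alternative nor the lookahead of the second.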
Bearing in mind that a Unicode-aware \w can be represented with [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}] (see What's the correct regex range for javascript's regexes to match all the non word characters in any script?), the direct Unicode equivalent of your pattern is:
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`; // uw = Unicode \w
const regex = new RegExp(`(?<!${uw})#(${uw}+)(?!${uw})`, "gu");
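To make the difference concrete, here is a small comparison of the original ASCII-only pattern and its Unicode equivalent on the Arabic sample from the question. This is only a sketch; the variable names asciiRegex, unicodeRegex, and sample are illustrative:

```javascript
// Original pattern from the question: \w covers only [A-Za-z0-9_].
const asciiRegex = /(?:^|\W)#(\w+)(?!\w)/g;

// Unicode-aware stand-in for \w, as defined above.
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;
const unicodeRegex = new RegExp(`(?<!${uw})#(${uw}+)(?!${uw})`, "gu");

const sample = "هذا #نص بسیط";

// The ASCII pattern finds nothing, because no Arabic letter is matched by \w.
console.log(sample.match(asciiRegex));
// The Unicode pattern finds the hashtag.
console.log(sample.match(unicodeRegex));
```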
Now, to match both directions, you may use
const regex = new RegExp(`(?<!${uw})(?:#(${uw}+)|${uw}+#)(?!${uw})`, "gu");
That is, a non-capturing group with an alternation | is used with two alternatives: one matches # followed by Unicode word chars, the other matches Unicode word chars followed by a #. Details:
(?<!${uw}) - a negative lookbehind that fails the match if there is a Unicode word char immediately on the left
(?:#(${uw}+)|${uw}+#) - a non-capturing group that matches either
    #(${uw}+) - a # char followed by one or more Unicode word chars
    | - or
    ${uw}+# - one or more Unicode word chars followed by a # char
(?!${uw}) - a negative lookahead that fails the match if there is a Unicode word char immediately on the right.
The g flag ensures multiple matches, and u enables support for Unicode property classes in the pattern.
A JavaScript demo:
const strings = ["this is a simple #text #text2", "هذا #نن*&ص بسیط","#نص2 هذا #نص بسیط"];
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`; // uw = Unicode \w
const regex = new RegExp(`(?<!${uw})(?:#(${uw}+)|${uw}+#)(?!${uw})`, "gu");
strings.forEach( string => console.log(string, '=>', string.match(regex)))
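In the pattern above, capture group 1 is only filled when the # comes first. If the tag name itself is needed in both cases, one possible variation is to give each alternative its own named group. This is a sketch, not part of the original answer: the group names lead and trail and the helper extractHashtags are hypothetical, and it assumes support for named groups, matchAll, and the ?? operator:

```javascript
// Unicode-aware stand-in for \w, copied from the answer above.
const uw = String.raw`[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]`;

// "lead" captures the name in "#name"; "trail" captures it in "name#".
// Both group names are illustrative.
const tagRegex = new RegExp(
  `(?<!${uw})(?:#(?<lead>${uw}+)|(?<trail>${uw}+)#)(?!${uw})`,
  "gu"
);

// Hypothetical helper: returns the hashtag names without the "#".
function extractHashtags(text) {
  return Array.from(text.matchAll(tagRegex), (m) => m.groups.lead ?? m.groups.trail);
}

console.log(extractHashtags("this is a simple #text #text2")); // → ["text", "text2"]
console.log(extractHashtags("end tag# here"));                 // → ["tag"]
```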