javascript - Regex: accurately match bold (**) and italics (*) item(s) from the input - Stack Overflow

I am trying to parse Markdown content using a regex. To grab bold and italic items from the input, I'm currently using this regex:

/(\*\*)(?<bold>[^**]+)(\*\*)|(?<normal>[^`*[~]+)|\*(?<italic>[^*]+)\*/g

Regex101 link: https://regex101.com/r/2zOMid/1

The problems with this regex are:

if there is a single * inside the bold text, the match breaks
if there is a long run like ******* anywhere in the input, the match breaks
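The first problem can be reproduced with a short sketch. The regex is the one quoted above; the helper name `describeMatches` and the sample inputs are only illustrative assumptions.

```javascript
// The regex from the question, unchanged.
const original = /(\*\*)(?<bold>[^**]+)(\*\*)|(?<normal>[^`*[~]+)|\*(?<italic>[^*]+)\*/g;

// Hypothetical helper: report which named group produced each match.
function describeMatches(input) {
  return [...input.matchAll(original)].map((m) => {
    const [kind, text] = Object.entries(m.groups).find(([, v]) => v !== undefined);
    return `${kind}:${text}`;
  });
}

const fine = describeMatches("**abcd**");    // bold is found
const broken = describeMatches("**ab*cd**"); // a single * inside breaks it

console.log(fine);   // [ 'bold:abcd' ]
console.log(broken); // [ 'italic:ab', 'normal:cd' ]
```

With the single * inside, the bold alternative never matches, and the input is split into an italic piece and a normal piece instead.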

What I tried: I removed the [^**] part from the bold group, but that broke the bold match: it then extended to the last ** occurrence and included every ** in between.

What I want to have:

accurate bold
* allowed inside bold
accurate italics

Language: JavaScript

Assumptions:

Bold text is wrapped inside **
Italic text is wrapped inside *

asked Jul 16, 2022 at 8:48 by Kiran Parajuli; edited Oct 2, 2022 at 7:17 by Tyzoid

Do not use a single regex here, since the matches overlap. Use the bold regex first, then the italics one. – Wiktor Stribiżew, commented Jul 16, 2022 at 9:09
Yes, I'm trying to do the same. For that, the bold match in the above regex should be allowed to contain a single * character. If I do that, the bold match gets messed up. Can I do that properly with regex? – Kiran Parajuli, commented Jul 16, 2022 at 11:13
By Markdown rules, shouldn't one escape an asterisk * in order to show it literally, as in ***\****, for exactly this reason? – Roko C. Buljan, commented Jul 16, 2022 at 11:19
For me, ***** and **\*** are normal text. If we want just an asterisk in bold, maybe using raw HTML is better (Markdown supports that). But if the input is like **ab*cd**, then ab*cd should be a match. – Kiran Parajuli, commented Jul 16, 2022 at 11:40

5 Answers

There was some discussion going on in the chat. Just to mention it: there is no requirement yet on how to deal with escaped characters like \*, so I did not account for them.

Depending on the desired outcome, I'd pick a two-step solution and keep the patterns simple:

str = str.replace(/\*\*(.+?)\*\*(?!\*)/g,'<b>$1</b>').replace(/\*([^*><]+)\*/g,'<i>$1</i>');

Step 1: Replace bold parts
Replace \*\*(.+?)\*\*(?!\*) with <b>$1</b> -> Regex101 demo
It lazily captures (.+?), one or more characters between **, into $1
and uses a lookahead to make sure the closing ** ends at the outermost * .
Step 2: Now that fewer * remain, replace the italic parts
Replace the remaining \*([^*><]+)\* with <i>$1</i> -> Regex101 demo
[^*><]+ matches one or more characters that are not *, > or <.
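The two steps above can be tried out with a small sketch. The patterns are the ones from the steps; the wrapper name `boldThenItalic` and the sample inputs are assumptions made for illustration.

```javascript
// Hypothetical wrapper around the two replace calls described above.
function boldThenItalic(str) {
  return str
    .replace(/\*\*(.+?)\*\*(?!\*)/g, "<b>$1</b>") // step 1: bold, lazy capture plus lookahead
    .replace(/\*([^*><]+)\*/g, "<i>$1</i>");       // step 2: italic on what is left
}

console.log(boldThenItalic("**ab*cd** and *it*")); // <b>ab*cd</b> and <i>it</i>
console.log(boldThenItalic("**x***"));             // <b>x*</b>
```

In the second call the lookahead rejects the first possible closing **, so the capture grows by one * and the match ends at the outermost asterisk. Excluding > and < in step 2 keeps the lone * inside <b>ab*cd</b> from pairing up with anything across the tag. Note that this sketch does not escape HTML that is already present in the input.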

Here is the JS-demo at tio.run

Personally, I don't think it's a good idea to rely on the number of repetitions of the same character to distinguish between kinds of replacement. How it should work is ultimately a matter of taste.

[^**] will not avoid two consecutive *. It is a character class that is no different from [^*]. The repeated asterisk has no effect.
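A quick check of that claim, as a minimal sketch (the sample string is an arbitrary assumption):

```javascript
const doubled = /[^**]+/g; // the class used in the question
const single = /[^*]+/g;   // the same class written once

const sample = "ab*cd**ef";
const withDoubled = sample.match(doubled);
const withSingle = sample.match(single);

console.log(withDoubled); // [ 'ab', 'cd', 'ef' ]
console.log(withSingle);  // [ 'ab', 'cd', 'ef' ]
```

Both classes stop at every single *, which is why the bold group in the question cannot contain one.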

The pattern for italic should come before the normal part, and the normal part should capture anything that remains. This could even be a sole asterisk (for example), so the pattern for normal text should allow for that.

It will be easier to use split and use the bold/italic pattern for matching the "delimiter" of the split, while still capturing it. All the rest will then be "normal". The downside of split is that you cannot benefit from named capture groups, but they will just be represented by separate entries in the returned array.

I will ignore the other syntax that markdown can have (like you seem to hint at with [ and ~ in your regex). On the other hand, it is important to deal well with backslash, as it is used to escape an asterisk.

Here is the regular expression (link):

(\*\*?)(?![\s\*])((?:[\s*]*(?:\\[\\*]|[^\\\s*]))+?)\1
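To see how split behaves with this pattern, here is a small sketch. The regex is the one above; the helper names `splitOnMarks` and `removeEscapes` and the sample inputs are assumptions made for illustration.

```javascript
// The pattern from above; split() ignores the g flag, so it is left out here.
const markPattern = /(\*\*?)(?![\s\*])((?:[\s*]*(?:\\[\\*]|[^\\\s*]))+?)\1/;

// Hypothetical helper: split() keeps both captured groups in its result,
// so each match contributes the mark and the formatted text as separate entries.
const splitOnMarks = (s) => s.split(markPattern);

console.log(splitOnMarks("x **b** y"));        // [ 'x ', '**', 'b', ' y' ]
console.log(splitOnMarks("**fi*rst b** end")); // [ '', '**', 'fi*rst b', ' end' ]
console.log(splitOnMarks("a * b"));            // [ 'a * b' ]

// Backslash escapes are removed afterwards from each piece of text.
const removeEscapes = (text) => text.replace(/\\([\\*])/g, "$1");
console.log(removeEscapes(String.raw`2 \* 3`)); // 2 * 3
```

The entries follow a repeating order of normal text, mark, formatted text. A lone asterisk followed by whitespace is not treated as a mark, so it stays in the normal text.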

Here is a snippet with two functions:

a function that first splits the input into tokens, where each token is a pair, like ["normal", " this is normal text "] and ["i", "text in italics"]
another function that uses these tokens to generate HTML

The snippet is interactive. Just type the input, and the output will be rendered in HTML using the above sequence.

function tokeniseMarkdown(s) {
    const regex = /(\*\*?)(?![\s\*])((?:[\s*]*(?:\\[\\*]|[^\\\s*]))+?)\1/gs;
    const styles = ["i", "b"];
    // Matches follow this cyclic order: 
    //   normal text, mark (= "*" or "**"), formatted text, normal text, ...
    const types = ["normal", "mark", ""];
    return s.split(regex).map((match, i, matches) =>
        types[i%3] !== "mark" && match &&
            [types[i%3] || styles[matches[i-1].length-1], 
             match.replace(/\\([\\*])/g, "$1")]
    ).filter(Boolean); // Exclude empty matches and marks
}

function tokensToHtml(tokens) {
    const container = document.createElement("span");
    for (const [style, text] of tokens) {
        let node = style === "normal" ? document.createTextNode(text) 
                                      : document.createElement(style);
        node.textContent = text;
        container.appendChild(node);
    }
    return container.innerHTML;
}


// I/O management
document.addEventListener("input", refresh);

function refresh() {
    const s = document.querySelector("textarea").value;
    const tokens = tokeniseMarkdown(s);
    document.querySelector("div").innerHTML = tokensToHtml(tokens);
}
refresh();

textarea { width: 100%; height: 6em }
div { font: 22px "Times New Roman" }

<textarea>**fi*rst b** some normal text here **second b**  *first i* normal *second i* normal again</textarea><br>

<div></div>

After looking further into negative lookaheads, I came up with this regex:

/\*\*(?<bold>(?:(?!\*\*).)+)\*\*|`(?<code>[^`]+)`|~~(?<strike>(?:(?!~~).)+)~~|\[(?<linkTitle>[^\]]+)]\((?<linkHref>.*)\)|(?<normal>[^`[*~]+)|\*(?<italic>[^*]+)\*|(?<tara>[*~]{3,})|(?<sitara>[`[]+)/g

Regex101

This pretty much works for my input scenarios. If someone has a more optimized regex, please comment.
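The bold alternative on its own shows what the negative lookahead does. This is a minimal sketch that uses only that part of the regex; the sample inputs are assumptions.

```javascript
// Only the bold alternative: each character is consumed
// as long as it does not start a "**" sequence.
const bold = /\*\*(?<bold>(?:(?!\*\*).)+)\*\*/g;

const found = [..."**ab*cd** plain **ef**".matchAll(bold)].map((m) => m.groups.bold);
const inLongRun = [..."*******".matchAll(bold)];

console.log(found);            // [ 'ab*cd', 'ef' ]
console.log(inLongRun.length); // 0
```

A single * is allowed inside the bold text because the lookahead only rejects positions where ** begins, and a bare run of asterisks yields no bold match at all. Since . does not match line breaks without the s flag, a bold span cannot cross lines in this form.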

You can choose the tags depending on the number of asterisks. (1 → italic, 2 → bold, 3 → bold+italic)

function simpleMarkdownTransform(markdown) {
  return markdown
    .replace(/</g, '&lt;') // disallow tags
    .replace(/>/g, '&gt;')
    .replace(
      /(\*{1,3})(.+?)\1(?!\*)/g,
      (match, { length }, text) => {
        // 1 or 3 asterisks add italics; 2 or 3 asterisks add bold
        if (length !== 2) text = `<i>${text}</i>`
        return length === 1 ? text : `<b>${text}</b>`
      }
    )
    .replace(/\n/g, '<br>') // new line
}

Example:

simpleMarkdownTransform('abcd **bold** efgh *italic* ijkl ***bold-italic*** mnop') 
// "abcd <b>bold</b> efgh <i>italic</i> ijkl <b><i>bold-italic</i></b> mnop"

italic: ((?<!\s)\*(?!\s)(?:(?:[^\*\*]*(?:(?:\*\*[^\*\*]*){2})+?)+?|[^\*\*]+?)(?<!\s)\*)

(?<!\s)\*(?!\s) matches the opening * with no whitespace on either side.
(?:(?:[^\*\*]*(?:(?:\*\*[^\*\*]*){2})+?)+? matches ** only when it occurs an even number of times, which cancels out meaningless ** inside the italic text.
|[^\*\*]+? means: if no ** pair can be matched, match a run of characters other than * instead (the order of the two alternatives is important).
(?<!\s)\*) matches the closing * with no whitespace before it.
(?:...) is a non-capturing group in JavaScript; you can remove the ?: if you do not need the group to be non-capturing.

bold: ((?<!\s)\*\*(?!\s)(?:[^\*]+?|(?:[^\*]*(?:(?:\*[^\*]*){2})+?)+?)(?<!\s)\*\*)

Similar to italic, except that the order of the two alternatives (the * pairs and the other characters) is swapped.

Together you can get:
((?<!\s)\*(?!\s)(?:(?:[^\*\*]*(?:(?:\*\*[^\*\*]*){2})+?)+?|[^\*\*]+?)(?<!\s)\*)|((?<!\s)\*\*(?!\s)(?:[^\*]+?|(?:[^\*]*(?:(?:\*[^\*]*){2})+?)+?)(?<!\s)\*\*)

See the result here: https://regex101.com/r/9gTBpj/1
