I'm looking for an easy way to test whether a string contains Markdown. Currently I'm thinking of converting the string to HTML and then testing whether the result contains HTML with a simple regex, but I wonder if there is a more succinct way to do it.

Here's what I've got so far:

/<[a-z][\s\S]*>/i.test( markdownToHtml(string) )

asked Jul 10, 2014 at 21:26 by jwerre
  • I wonder if github.com/github/linguist would help you recognize the content – rthbound Commented Jul 10, 2014 at 21:30
  • The first character in HTML (excluding white spaces) will always be < – Reactgular Commented Jul 10, 2014 at 21:30
  • These lines seem to indicate that linguist can detect markdown. – rthbound Commented Jul 11, 2014 at 16:52
  • @rthboun I'm looking for a javascript solution but Linguist appears to be 100% Ruby. Thanks for taking the time to look at this though. – jwerre Commented Jul 11, 2014 at 18:08
  • @jwerre Ah, well there seems to be a .js port - github.com/shamoons/linguist.js – rthbound Commented Jul 11, 2014 at 19:27

4 Answers


I think you have to accept that it's impossible to know with certainty. Markdown borrows its syntax from existing customs. For example, using underscores for italics was popular on Usenet (though single asterisks meant bold there, not italics as they also do in Markdown). And of course, people were using dashes as obvious plaintext bullet points long before Markdown.

Having decided it's subjective though, we may now embark on the task of determining degrees of likelihood that a piece of text contains Markdown. Here are some things I'd consider evidence for Markdown, in order of decreasing strength:

  1. Consecutive lines beginning with 1., e.g. (^|[\n\r])\s*1\.\s.*\s+1\.\s. (See the Markdown behind this answer, for example.) I'd consider this a dead giveaway, because there's even that joke:

    There are only two kinds of people in this world.

    1. Those who understand Markdown.

    1. And those who don't.

  2. Link markdown, e.g. \[[^\]]+\]\(https?:\/\/\S+\).

  3. Double underscores or asterisks when a left-right pair (indicated by whether the whitespace is to the left or right, respectively) can be found, e.g. \s(__|\*\*)(?!\s)(.(?!\1))+(?!\s(?=\1)). Let me know if you want me to explain this one.

And so on. Ultimately, you'll have to come up with your own "scoring" system to determine the weight of each of these things. A good way to go about this would be to gather some sample inputs (real ones, if you have them), classify them manually as having Markdown or not, and run your regexes and scoring system to see which weights sort them most accurately.
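As a rough illustration of such a scoring system, here is a minimal sketch. The weights, the threshold, and the names `MarkdownSignal`, `markdownScore` and `looksLikeMarkdown` are all made up for this example, and the third pattern is a simplified stand-in for the double underscore/asterisk regex above, not a copy of it. You would tune all of this against your own labelled samples.

```typescript
// One piece of evidence: a pattern plus a hypothetical weight in [0, 1].
interface MarkdownSignal {
  name: string;
  pattern: RegExp;
  weight: number;
}

// Weights are placeholders, ordered from strongest to weakest evidence.
const markdownSignals: MarkdownSignal[] = [
  // Consecutive list items that both start with "1."
  { name: "repeated-1.", pattern: /(^|[\n\r])\s*1\.\s.*\s+1\.\s/, weight: 0.9 },
  // Inline link: [text](http://...)
  { name: "link", pattern: /\[[^\]]+\]\(https?:\/\/\S+\)/, weight: 0.7 },
  // Simplified stand-in: ** or __ opening against text and closed on the same line
  { name: "strong", pattern: /(^|\s)(__|\*\*)(?=\S)[^\n]*?\S\2/, weight: 0.5 },
];

// Combine the matching signals as 1 - product(1 - weight):
// every hit raises the score, and the result stays within [0, 1].
function markdownScore(text: string): number {
  let miss = 1;
  for (const signal of markdownSignals) {
    if (signal.pattern.test(text)) {
      miss *= 1 - signal.weight;
    }
  }
  return 1 - miss;
}

// The default threshold of 0.5 is an arbitrary starting point.
function looksLikeMarkdown(text: string, threshold = 0.5): boolean {
  return markdownScore(text) >= threshold;
}
```

Multiplying the "miss" probabilities is only one way to combine evidence; a plain weighted sum with a cutoff would work as well and may be easier to tune by hand.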

As @andrew-cheong pointed out, there is no way to know whether a string is a Markdown document or just plaintext with text structured in a Markdown fashion.

If you want to determine the degree of likelihood that a text is supposed to be Markdown, you can use the marked package as an alternative to using the regex approach:

import { marked } from 'marked';

export function isMarkdownValue(value: string): boolean {
  const tokenTypes: string[] = [];
  
  // https://marked.js.org/using_pro#tokenizer
  marked(value, {
    walkTokens: (token) => {
      tokenTypes.push(token.type);
    },
  });

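  // Token types treated as evidence of Markdown. 'space' is deliberately
  // left out: marked emits it for blank lines, which plain text has too.
  const isMarkdown = [
    'code',
    'fences',
    'heading',
    'hr',
    'link',
    'blockquote',
    'list',
    'html',
    'def',
    'table',
    'lheading',
    'escape',
    'tag',
    'reflink',
    'strong',
    'codespan',
    'url',
  ].some((tokenType) => tokenTypes.includes(tokenType));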

  return isMarkdown;
}

This is just a simple example implementation using the walkTokens option of the marked package: https://marked.js.org/using_pro#tokenizer

This way you can easily implement any kind of detection logic based on the actual parsing of the potential markdown tokens. You could also implement a likelihood score instead of returning true or false.

You can get the list of tokens from the marked library's lexer and recursively check whether it contains any Markdown-related token types (strong, link, etc.). This is similar to derbenoo's answer but more complete:

import * as marked from "marked";

function isMarkdownValue(text: string): boolean {
  function containsNonTextTokens(tokens) {
    return tokens.some(token => {
      // 'space' is what marked emits for blank lines, so it is not counted as Markdown.
      // Change this list as per your needs.
      if (!['text', 'paragraph', 'space'].includes(token.type)) {
        return true;
      }
      // Check recursively for nested tokens
      if (token.tokens && containsNonTextTokens(token.tokens)) {
        return true;
      }
      return false;
    });
  }
  // Use the lexer to tokenize the input without rendering it to HTML
  const tokens = marked.lexer(text);
  // Check if the tokens contain any Markdown elements
  return containsNonTextTokens(tokens);
}

I've implemented the regular expression approach in very-small-parser; the code looks something like this:

// Headings H1-H6.
const h1 = /(^|\n) {0,3}#{1,6} {1,8}[^\n]{1,64}\r?\n\r?\n\s{0,32}\S/;

// Bold, italic, underline, strikethrough, highlight.
const bold = /(?:\s|^)(_|__|\*|\*\*|~~|==|\+\+)(?!\s).{1,64}(?<!\s)(?=\1)/;

// Basic inline link (also captures images).
const link = /\[[^\]]{1,128}\]\(https?:\/\/\S{1,999}\)/;

// Inline code.
const code = /(?:\s|^)`(?!\s)[^`]{1,48}(?<!\s)`([^\w]|$)/;

// Unordered list.
const ul = /(?:^|\n)\s{0,5}\-\s{1}[^\n]+\n\s{0,15}\-\s/;

// Ordered list.
const ol = /(?:^|\n)\s{0,5}\d+\.\s{1}[^\n]+\n\s{0,15}\d+\.\s/;

// Horizontal rule.
const hr = /\n{2} {0,3}\-{2,48}\n{2}/;

// Fenced code block.
const fences = /(?:\n|^)(```|~~~|\$\$)(?!`|~)[^\s]{0,64} {0,64}[^\n]{0,64}\n[\s\S]{0,9999}?\s*\1 {0,64}(?:\n+|$)/;

// Classical underlined H1 and H2 headings.
const title = /(?:\n|^)(?!\s)\w[^\n]{0,64}\r?\n(\-|=)\1{0,64}\n\n\s{0,64}(\w|$)/;

// Blockquote.
const blockquote = /(?:^|(\r?\n\r?\n))( {0,3}>[^\n]{1,333}\n){1,999}($|(\r?\n))/;

/**
 * Returns `true` if the source text might be a markdown document.
 *
 * @param src Source text to analyze.
 */
export const is = (src: string): boolean =>
  h1.test(src) ||
  bold.test(src) ||
  link.test(src) ||
  code.test(src) ||
  ul.test(src) ||
  ol.test(src) ||
  hr.test(src) ||
  fences.test(src) ||
  title.test(src) ||
  blockquote.test(src);

Call the is function to get the answer:

is('__bold__');    // true
is('hello world'); // false

The advantage of the regular expression approach is that the whole code fits on the screen above, so you don't need to ship a parser that weighs tens of KB.

Tags: javascript - How to test if a string has Markdown in it - Stack Overflow