How can I check if a BeautifulSoup Tag is a block-level element (e.g. <p>, <div>, <h2>), or a "phrasing content" element like <span>, <strong>?

Basically I want a function that returns True for any Tag that is allowed inside a <p> element according to the HTML spec, and False for any Tag that is not.

I'm asking this question because I don't want to hardcode the list of allowed tags myself, but I can't find anything from bs4 or html docs about judging whether a Tag is phrasing content or not.

BeautifulSoup already knows which elements are allowed inside of <p> and which are not:

>>> BeautifulSoup('<p><h2>')
<html><body><p></p><h2></h2></body></html>
>>> BeautifulSoup('<p><em>')
<html><body><p><em></em></p></body></html>

I would also be happy to use Python's html module if it can give me the answer.

Asked Jan 7 at 19:20 by Nils
  • I don't think there's any built-in way; you'll need to hard-code the list. – Barmar Commented Jan 7 at 20:04
  • BeautifulSoup doesn't have such functionality built in. It is essentially a parser. You will have to hard-code a list of the tags you care about. Why do you want to do this? – Manos Kounelakis Commented Jan 7 at 20:07

3 Answers

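You can let the parser decide: wrap the tag in a <p>, parse the snippet, and check whether the tag is still inside the <p>. This only works with a parser that applies the HTML nesting rules, such as html5lib or lxml; Python's built-in html.parser keeps the tree as written, so it would report True for every tag. The result reflects the parser's behaviour, which approximates but does not exactly match the spec's phrasing-content category.

from bs4 import BeautifulSoup

def is_phrasing_content(tag_name, parser="html5lib"):
    """Return True if <tag_name> is still a descendant of <p> after parsing.

    Requires a tree-fixing parser (pip install html5lib, or pass "lxml").
    """
    snippet = f"<p><{tag_name}></{tag_name}></p>"
    soup = BeautifulSoup(snippet, parser)

    # The first <p> in the tree is the one opened by the snippet.
    p_tag = soup.find("p")
    if p_tag is None:
        return False

    # A block-level tag closes the <p> implicitly and ends up as its sibling.
    return p_tag.find(tag_name) is not None

print(is_phrasing_content("em"))    # True
print(is_phrasing_content("span"))  # True
print(is_phrasing_content("div"))   # False
print(is_phrasing_content("h2"))    # False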
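
The choice of parser matters for this approach. As a rough illustration using only the standard library, the sketch below logs the tag events that html.parser reports; the names EventLogger and tag_events are made up for this example. It suggests that html.parser passes tags through in source order and never closes the <p> on its own when it meets <h2>.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start/end tag events exactly as html.parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Feed markup to the logger and return the recorded events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# No implied </p> is reported before <h2>: the events mirror the source.
print(tag_events("<p><h2></h2></p>"))
# [('start', 'p'), ('start', 'h2'), ('end', 'h2'), ('end', 'p')]
```

Because no implicit end tag is generated, a tree built from these events keeps the <h2> inside the <p>, so a containment check would not distinguish block-level from phrasing elements.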

I hope this will help you a little.

Since BS doesn't appear to provide a hard-coded list of elements in the phrasing category, you'll have to resort to the definition in the HTML standard you're targeting. For the WHATWG HTML review draft (January 2022), the list of phrasing content is a, abbr, area, audio, b, bdi, bdo, br, button, canvas, cite, code, data, datalist, del, dfn, em, embed, i, iframe, img, input, ins, kbd, label, link, map, mark, math, meta, meter, noscript, object, output, picture, progress, q, ruby, s, samp, script, select, slot, small, span, strong, sub, sup, svg, template, textarea, time, u, var, video, wbr, plus the obsolete keygen (but check section 3.2.5.2.5 at https://html.spec.whatwg.org/multipage/dom.html#phrasing-content-2 for an up-to-date list).
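
If hard-coding is acceptable, the list above can be turned into a small lookup. The following is a minimal sketch, not part of bs4: PHRASING_CONTENT and is_phrasing_content are hypothetical names, the set is a snapshot that should be re-checked against the spec, and it ignores the spec's conditional cases (for example, area only counts inside a map, and link and meta only with certain attributes). It uses wbr and script as the spec lists them and leaves out the obsolete keygen.

```python
# Phrasing-content element names, transcribed from the WHATWG HTML spec
# (section 3.2.5.2.5). This is a hard-coded snapshot (an assumption of this
# sketch); conditional entries such as area, link and meta are included
# unconditionally.
PHRASING_CONTENT = frozenset(
    """
    a abbr area audio b bdi bdo br button canvas cite code data datalist del
    dfn em embed i iframe img input ins kbd label link map mark math meta
    meter noscript object output picture progress q ruby s samp script select
    slot small span strong sub sup svg template textarea time u var video wbr
    """.split()
)


def is_phrasing_content(tag_name):
    """Return True if tag_name (a string) names a phrasing-content element."""
    return tag_name.lower() in PHRASING_CONTENT


print(is_phrasing_content("em"))   # True
print(is_phrasing_content("div"))  # False
```

With a BeautifulSoup Tag you would pass its name, e.g. is_phrasing_content(tag.name).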

But: even though the spec says phrasing content is accepted as content of <p> elements, it also says that a <p> element's end tag can be omitted (i.e. the <p> element is implicitly terminated) when it is immediately followed by an address, article, aside, blockquote, details, dialog, div, dl, fieldset, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, nav, ol, p, pre, section, style, table, ul, or menu element (again, you need to check your target HTML spec; e.g. the new search element isn't included in this list), which may or may not be relevant to your application.
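
The second list can be captured the same way. This is only a sketch: P_CLOSING_ELEMENTS and closes_open_p are hypothetical names, and the set is transcribed from the paragraph above rather than independently checked against a particular version of the spec.

```python
# Elements that implicitly end an open <p>, transcribed from the list in
# this answer (an assumption of this sketch; verify against your target spec).
P_CLOSING_ELEMENTS = frozenset(
    """
    address article aside blockquote details dialog div dl fieldset figure
    footer form h1 h2 h3 h4 h5 h6 header hgroup hr main menu nav ol p pre
    section style table ul
    """.split()
)


def closes_open_p(tag_name):
    """Return True if a start tag with this name would end an open <p>."""
    return tag_name.lower() in P_CLOSING_ELEMENTS


print(closes_open_p("div"))  # True
print(closes_open_p("em"))   # False
```

This set is not simply the complement of phrasing content: li, for example, appears in neither list, so which check you need depends on whether you care about the content model or about where a <p> ends.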

You can read a bit more on the interpretation of that part of the spec in the context of the older SGML-based HTML specs at https://sgmljs.net/docs/html5.html.

I'm not sure Beautiful Soup knows this in the way you describe. It's more that it relies on an underlying parser to parse and repair the HTML. It has the method soup.get_text(), which returns all the text in the document; maybe that is what you are looking for. If not, it would help to understand why you need such a function.
