How can I check if a BeautifulSoup Tag is a block-level element (e.g. <p>, <div>, <h2>), or a "phrasing content" element like <span> or <strong>?
Basically I want a function that returns True for any Tag that is allowed inside a <p> tag according to the HTML spec, and False for any Tag that is not allowed inside a <p> tag.
I'm asking this question because I don't want to hardcode the list of allowed tags myself, but I can't find anything in the bs4 or html docs about judging whether a Tag is phrasing content or not.
BeautifulSoup already knows which elements are allowed inside <p> and which are not:
>>> BeautifulSoup('<p><h2>')
<html><body><p></p><h2></h2></body></html>
>>> BeautifulSoup('<p><em>')
<html><body><p><em></em></p></body></html>
I would also be happy to use Python's html module if it can give me the answer.
- I don't think there's any built-in way; you'll need to hard-code the list. – Barmar Commented Jan 7 at 20:04
- BeautifulSoup doesn't have such functionality built in. It is essentially a parser. You will have to hardcode some kind of list of the tags you want in order to achieve that. Why do you want to do this? – Manos Kounelakis Commented Jan 7 at 20:07
3 Answers
You can try this.
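from bs4 import BeautifulSoup

def is_phrasing_content(tag_name, parser="html5lib"):
    # Use a parser that applies HTML nesting rules ("html5lib" or "lxml", both
    # installed separately). "html.parser" does not move <div> or <h2> out of
    # <p>, so with it this check would return True for them as well.
    snippet = f"<p><{tag_name}></{tag_name}></p>"
    soup = BeautifulSoup(snippet, parser)
    p_tag = soup.find("p")
    if not p_tag:
        return False
    # If the parser left the tag inside <p>, treat it as allowed there.
    return p_tag.find(tag_name) is not None

print(is_phrasing_content("em"))    # True
print(is_phrasing_content("span"))  # True
print(is_phrasing_content("div"))   # False
print(is_phrasing_content("h2"))    # False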
I hope this will help you a little.
Since BS doesn't appear to provide a hard-coded list of elements in the phrasing category, you'll have to resort to the definition in the HTML standard you're targeting. For the WHATWG HTML review draft (January 2022), the phrasing content elements are a, abbr, area, audio, b, bdi, bdo, br, button, canvas, cite, code, data, datalist, del, dfn, em, embed, i, iframe, img, input, ins, kbd, label, link, map, mark, math, meta, meter, noscript, object, output, picture, progress, q, ruby, s, samp, select, slot, small, span, strong, sub, sup, svg, template, textarea, time, u, var, video, wbr, and keygen (but check section 3.2.5.2.5 at https://html.spec.whatwg.org/multipage/dom.html#phrasing-content-2 for an up-to-date list).
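As a rough sketch of the hard-coded approach, the list above can be turned into a set and checked by tag name. The names PHRASING_TAGS and is_phrasing_content are made up for this illustration, and the set is only a transcription of the list in this answer, not an authoritative copy of the spec. The function accepts either a tag name string or any object with a .name attribute, such as a bs4 Tag.

```python
# Hypothetical helper: the set below is transcribed from the list in this
# answer and should be re-checked against the HTML spec you target.
PHRASING_TAGS = frozenset(
    """
    a abbr area audio b bdi bdo br button canvas cite code data datalist del
    dfn em embed i iframe img input ins kbd label link map mark math meta
    meter noscript object output picture progress q ruby s samp select slot
    small span strong sub sup svg template textarea time u var video wbr
    keygen
    """.split()
)


def is_phrasing_content(tag):
    """Return True if the tag's name is in PHRASING_TAGS.

    `tag` may be a plain tag name or an object with a `.name` attribute
    (for example a bs4 Tag). Text nodes are not handled here.
    """
    name = getattr(tag, "name", tag)
    return isinstance(name, str) and name.lower() in PHRASING_TAGS


if __name__ == "__main__":
    for name in ("em", "span", "div", "h2"):
        print(name, is_phrasing_content(name))
```

Note that the spec treats some of these as phrasing content only under conditions (area inside a map, link when it is allowed in the body, meta with an itemprop attribute), which a plain name lookup like this ignores.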
But even though the spec says phrasing content is accepted as content of <p> elements, it also says that a <p> element's end tag can be omitted (i.e. the <p> element is implicitly terminated) before any address, article, aside, blockquote, details, dialog, div, dl, fieldset, figure, footer, form, h1, h2, h3, h4, h5, h6, header, hgroup, hr, main, nav, ol, p, pre, section, style, table, ul, or menu element (again, you need to check your target HTML spec; e.g. the new search element isn't included in this list), which may or may not be relevant to your application.
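To illustrate the end-tag omission rule with Python's own html module, here is a minimal sketch that walks a fragment with html.parser.HTMLParser and reports which start tags would implicitly end an open <p>. The names P_CLOSING_TAGS, PCloserFinder and find_p_closers are invented for this example, the set is copied from the list in this answer rather than from the spec, and the tracking of open <p> elements is deliberately simplified.

```python
from html.parser import HTMLParser

# Transcribed from the list in this answer; verify against your target spec.
P_CLOSING_TAGS = frozenset(
    """
    address article aside blockquote details dialog div dl fieldset figure
    footer form h1 h2 h3 h4 h5 h6 header hgroup hr main nav ol p pre section
    style table ul menu
    """.split()
)


class PCloserFinder(HTMLParser):
    """Records start tags that occur while a <p> is open and would end it."""

    def __init__(self):
        super().__init__()
        self.p_open = False
        self.closers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if self.p_open and tag in P_CLOSING_TAGS:
            self.closers.append(tag)
            self.p_open = False
        if tag == "p":
            self.p_open = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.p_open = False


def find_p_closers(html):
    """Return the tags that would implicitly terminate a <p> in `html`."""
    finder = PCloserFinder()
    finder.feed(html)
    finder.close()
    return finder.closers


if __name__ == "__main__":
    print(find_p_closers("<p><h2>"))
    print(find_p_closers("<p><em>"))
```

This only flags candidates by name; it does not build a tree the way html5lib or lxml would.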
You can read a bit more on the interpretation of that part of the spec in the context of the older SGML-based HTML specs at https://sgmljs.net/docs/html5.html.
I'm not sure that BeautifulSoup knows what you are describing.
It's more like it uses some engine to parse and fix the HTML.
It has the method soup.get_text(), which returns all the text in the HTML.
Maybe that is what you are looking for.
If not, it would help to understand why you need such a function.