web scraping - Playwright Python can't find HTML tag which shows up in debugger and in a print statement - Stack Overflow

Updated: 2025-03-090


I am trying to scrape a product detail page, but I cannot find the tag when the code runs. When I print the parent tag, I see the h2 tag I want, and when I enter the debugger I can also get what I want.

import time

from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright


def playwright_get_soup(url, selector_to_wait_for=None, wait_after_page_load=None):
    with sync_playwright() as this_playwright:
        browser = this_playwright.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        try:
            page.wait_for_load_state("load")
            if wait_after_page_load:
                time.sleep(wait_after_page_load)
        except:
            pass
        
        if selector_to_wait_for:
            page.wait_for_selector(selector_to_wait_for, timeout=15000)

        soup = BeautifulSoup(page.content(), "html.parser")
        browser.close()
    return soup


def parse_product_detail_page(soup):
    parent_block = soup.find("div", class_="primary_block")
    name_and_id_box = parent_block.find("div", class_="item-box")

    print(name_and_id_box) # the h2 tag is visible here

    name_and_id_header = name_and_id_box.find("h2", class_="col-xs-6 ")

    # import ipdb; ipdb.set_trace() # the h2 tag is also visible here

    id_and_raw_name = name_and_id_header.split("#", maxsplit=1) # this is where the program errors out


def scrape_product_detail_page(product_detail_url):
    try:
        soup = playwright_get_soup(product_detail_url, selector_to_wait_for=".item-box")
    except:
        return None
    parsed_data = parse_product_detail_page(soup)
    return parsed_data


result = scrape_product_detail_page("https://www.innovation-line/four-color-photoimage-products/ventoux-210d-polyester-drawstring-cinch-pack-backpack-907.html")

I would appreciate some help determining why name_and_id_header keeps showing up as None. Thank you.

asked Feb 19 at 17:27 by Cody Childers; edited Feb 21 at 22:43


2 Answers

There is a trailing space in the class string you pass to BeautifulSoup:

name_and_id_box.find("h2", class_="col-xs-6 ")

should be "col-xs-6":

name_and_id_box.find("h2", class_="col-xs-6").get_text()

or simply, because it is the only <h2> there:

name_and_id_box.h2.get_text()
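To see the difference in isolation, here is a minimal sketch. The HTML string is a made-up stand-in shaped like the question's markup, not the real page, and it assumes BeautifulSoup compares the class string you pass against each class token exactly, which is the behavior I'm aware of in current bs4 releases:

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the question's product page (not the real page).
html = """
<div class="item-box">
  <h2 class="col-xs-6">Item #907 - Example Backpack</h2>
</div>
"""

box = BeautifulSoup(html, "html.parser").find("div", class_="item-box")

# Trailing space in the search string: nothing matches, so find() gives back None.
with_space = box.find("h2", class_="col-xs-6 ")

# Exact class token: the tag is found.
without_space = box.find("h2", class_="col-xs-6")

print(with_space)
print(without_space.get_text())
```

Printing the parent still shows the h2, which is why the tag looks like it is there even though the lookup with the extra space comes back empty.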

The code seems overengineered. I suggest not using BeautifulSoup with Playwright, because BeautifulSoup requires that you dump the entire page to string and re-parse the string before you can select elements, adding another layer of indirection and confusion between you and your goal, not to mention another dependency.

Simply use Playwright directly, with an auto-waiting locator:

from playwright.sync_api import sync_playwright  # 1.48.0

url = "<Your URL>"

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until="domcontentloaded")
    print(page.locator("h2").text_content())

Output:

Item #907 - 14-1/2" W x 17-1/2" H - "VENTOUX" 210D Polyester Drawstring Cinch Pack Backpack

This locator is strict, so if the page changes and adds another <h2>, you'll get a clear error that lists the selectors you can use to fix the problem, which is a nice developer experience.

You can also use page.get_by_role("heading", level=2) if you want to avoid CSS selectors entirely.

Only once it's working should you worry about breaking out functions. First order of business is correctness.

Another approach is to recognize that the data you want is in the static HTML, so you can skip Playwright and simply use requests and BeautifulSoup:

import requests  # 2.31.0
from bs4 import BeautifulSoup  # 4.11.2

url = "<Your URL>"

# The "lxml" parser requires the lxml package to be installed.
soup = BeautifulSoup(requests.get(url).text, "lxml")
print(soup.select_one("h2").text)
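Whichever way you get the heading text, the split on "#" that the question attempts has to run on a string, not on a Tag. A rough sketch of that step follows; split_item_heading is a hypothetical helper name, and it assumes the heading always follows the "Item #<id> - <name>" shape seen in the output above:

```python
def split_item_heading(heading_text):
    """Split 'Item #<id> - <name>' into (item_id, name)."""
    # Everything after the first "#" holds the id and the name.
    _, _, after_hash = heading_text.partition("#")
    # The id ends at the first " - " separator; the rest is the name.
    item_id, _, name = after_hash.partition(" - ")
    return item_id.strip(), name.strip()


heading = 'Item #907 - 14-1/2" W x 17-1/2" H - "VENTOUX" 210D Polyester Drawstring Cinch Pack Backpack'
item_id, name = split_item_heading(heading)
print(item_id)
print(name)
```

If a heading has no "#", both values come back as empty strings, so you may want to check for that before using them.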
