python - Getting a blank output while using scrapy playwright - Stack Overflow

Updated: 2025-02-05

I tried using scrapy-playwright to scrape some content from this website: https://www.scrapethissite.com/pages/ajax-javascript/.

Here is the HTML I was trying to scrape:

[screenshot: html code]

My code is below:

import scrapy
from scrapy_playwright.page import PageMethod


class OscarSpider(scrapy.Spider):
    name = "OscarSpider"

    def start_requests(self):
        yield scrapy.Request(
            url="https://www.scrapethissite.com/pages/ajax-javascript/",
            callback=self.parse,
            meta={
                "playwright": True,
                "playwright_include_page": True,
                "playwright_page_methods": [
                    PageMethod("wait_for_selector", "a#2010"),  # Ensure button exists
                    PageMethod("click", "a#2010"),  # Click the button
                    PageMethod("wait_for_selector", "tr.film"),  # Wait for data to load
                    PageMethod("evaluate", "window.scrollTo(0, document.body.scrollHeight)"),
                    PageMethod("wait_for_timeout", 6000)  # Wait for AJAX data
                ]
            }
        )

    async def parse(self, response):
        for row in response.css("tr.film"):
            yield {
                "title": row.css("td.film-title::text").get(default="").strip(),
                "nominations": row.css("td.film-nominations::text").get(default="").strip(),
                "awards": row.css("td.film-awards::text").get(default="").strip(),
            }

But when I run it with the command below, the output is empty:

scrapy crawl OscarSpider -O Oscar.json

I'm expecting output as below (in JSON):

Title                   Nominations Awards  
The King's Speech            12      4  
Inception                     8      4

Please help me in this regard.

Asked Jan 19 at 9:52 by Nitish K; edited Jan 19 at 9:55 by ouroboros1.

3 Answers

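If you read the data through the Playwright page object rather than the Scrapy response, you don't need ::text in the CSS selector (::text is a Scrapy/parsel extension, not standard CSS). Use .inner_text() on a locator instead, and await it, because scrapy-playwright uses Playwright's async API:

# row is assumed to be a Playwright Locator for one tr.film element
item = {
    "title": (await row.locator("td.film-title").inner_text()).strip(),
    "nominations": (await row.locator("td.film-nominations").inner_text()).strip(),
    "awards": (await row.locator("td.film-awards").inner_text()).strip(),
}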

Alternatively, you can request the page's AJAX endpoint directly, without a browser. The data comes back as JSON, which you can parse to get the records. The code is below:

import json

import scrapy


class OscarSpider(scrapy.Spider):
    name = "OscarSpider"
    year_to_crawl = 2010

    def start_requests(self):
        yield scrapy.Request(
            url=f"https://www.scrapethissite.com/pages/ajax-javascript/?ajax=true&year={self.year_to_crawl}",
            callback=self.parse,
        )

    def parse(self, response):
        raw_response = json.loads(response.text)
        for row in raw_response:
            yield {
                "title": row['title'],
                "nominations": row['nominations'],
                "awards": row['awards'],
            }

In the above code, you just need to change year_to_crawl if you want to get the data for any other year.

Output from Spider

{'title': "The King's Speech", 'nominations': 12, 'awards': 4}
{'title': 'Inception', 'nominations': 8, 'awards': 4}
{'title': 'The Social Network', 'nominations': 8, 'awards': 3}
{'title': 'The Fighter', 'nominations': 7, 'awards': 2}
{'title': 'Toy Story 3', 'nominations': 5, 'awards': 2}
{'title': 'Alice in Wonderland', 'nominations': 3, 'awards': 2}
{'title': 'Black Swan', 'nominations': 5, 'awards': 1}
{'title': 'In a Better World', 'nominations': 1, 'awards': 1}
{'title': 'The Lost Thing', 'nominations': 1, 'awards': 1}
{'title': 'God of Love', 'nominations': 1, 'awards': 1}
{'title': 'The Wolfman', 'nominations': 1, 'awards': 1}
{'title': 'Strangers No More', 'nominations': 1, 'awards': 1}
{'title': 'Inside Job', 'nominations': 1, 'awards': 1}
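
The AJAX approach above extends naturally to more than one year. The following is a minimal standard-library sketch, not part of the original answer: `build_ajax_url` and `parse_films` are hypothetical helper names, the year 2011 is an assumed second value, and the sample payload is just two rows copied from the spider output above.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://www.scrapethissite.com/pages/ajax-javascript/"


def build_ajax_url(year: int) -> str:
    """Return the JSON endpoint URL for one year (same query string as the spider)."""
    return f"{BASE_URL}?{urlencode({'ajax': 'true', 'year': year})}"


def parse_films(body: str) -> list[dict]:
    """Parse a JSON response body and keep the three fields the question asks for."""
    return [
        {
            "title": row["title"],
            "nominations": row["nominations"],
            "awards": row["awards"],
        }
        for row in json.loads(body)
    ]


# Two rows copied from the spider output above, used as a stand-in response body.
sample_body = json.dumps(
    [
        {"title": "The King's Speech", "nominations": 12, "awards": 4},
        {"title": "Inception", "nominations": 8, "awards": 4},
    ]
)

urls = [build_ajax_url(year) for year in (2010, 2011)]  # 2011 is an assumed example
films = parse_films(sample_body)

print(urls[0])
print(films[0]["title"])
```

In a spider, each URL in `urls` could be yielded as its own `scrapy.Request`, and the body of `parse_films` is what the `parse` callback already does with `response.text`.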

Root cause

The cause of this issue can be seen in the logs when running the spider:

playwright._impl._errors.Error: Page.wait_for_selector: SyntaxError: Failed to 
execute 'querySelectorAll' on 'Document': 'a#2010' is not a valid selector.

The error comes from these two page methods, which both pass an invalid selector:

PageMethod("wait_for_selector", "a#2010"),  # Ensure button exists
PageMethod("click", "a#2010"), # Click the button

HTML allows id="2010", but in a CSS selector the name after # must be a valid CSS identifier, and an identifier cannot start with a digit. So "a#2010" is rejected as an invalid selector before any matching happens. You can reproduce this in the browser devtools console on https://www.scrapethissite.com/pages/ajax-javascript. See the examples below.

Invalid:

> document.querySelector("a#2010")
Uncaught SyntaxError: Failed to execute 'querySelector' on 'Document': 'a#2010' is
not a valid selector.

Valid:

> document.querySelector("a[id='2010']")
<a href="#" class="year-link" id="2010">2010</a>
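
To check in advance whether an id value can be written directly after #, a rough test like the one below may help. It is a simplified, ASCII-only approximation of the CSS identifier rule (it ignores escapes, non-ASCII characters, and the double-hyphen form), `usable_after_hash` is a hypothetical name, and "year-2010" is a made-up id used only for contrast.

```python
import re

# Simplified approximation of a CSS identifier: an optional leading hyphen,
# then a letter or underscore, then letters, digits, underscores, or hyphens.
_SIMPLE_IDENT = re.compile(r"-?[A-Za-z_][A-Za-z0-9_-]*")


def usable_after_hash(id_value: str) -> bool:
    """Return True if id_value can be used as-is in a '#id' selector."""
    return _SIMPLE_IDENT.fullmatch(id_value) is not None


print(usable_after_hash("2010"))       # False: starts with a digit
print(usable_after_hash("year-2010"))  # True: starts with a letter
```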

Solution

Since you cannot change the site's HTML to give the link an id that works after #, a workaround is to match on the id attribute instead and query for "a[id='2010']" in your spider:

PageMethod("wait_for_selector", "a[id='2010']"),  # Ensure button exists
PageMethod("click", "a[id='2010']"),  # Click the button
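
If the spider needs to click other year links too, the workaround can be wrapped in a small helper. This is only a sketch: `id_attr_selector` and `escaped_id_selector` are hypothetical names, not part of Scrapy or Playwright. The attribute form is the one confirmed above; the escaped form follows the CSS rule for escaping a leading digit and is shown as an untested alternative.

```python
def id_attr_selector(tag: str, id_value: str) -> str:
    """Build tag[id='value'], the form used in the workaround above."""
    if "'" in id_value or "\\" in id_value:
        raise ValueError("quotes and backslashes are not handled in this sketch")
    return f"{tag}[id='{id_value}']"


def escaped_id_selector(tag: str, id_value: str) -> str:
    """Build tag#value, escaping a leading ASCII digit as a hex code point."""
    if id_value and id_value[0] in "0123456789":
        # CSS escape: backslash, hex code point, then a terminating space.
        return f"{tag}#\\{ord(id_value[0]):x} {id_value[1:]}"
    return f"{tag}#{id_value}"


selector = id_attr_selector("a", "2010")
# Plain (method, argument) pairs mirroring the two page methods in the fix.
steps = [("wait_for_selector", selector), ("click", selector)]

print(selector)                          # a[id='2010']
print(escaped_id_selector("a", "2010"))  # a#\32 010
```

Each pair in `steps` could then be unpacked into `PageMethod(*step)` inside the spider's `playwright_page_methods` list.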
