I'm starting web scraping with Python and Playwright, but when I run the code below I get an error. How can I find the type of the response data (binary, text, ...), handle the conversion error, and save the data as text in a file?
from playwright.sync_api import sync_playwright
import json

def handle_response(response):
    with open("copy.txt", "w", encoding="utf-8") as file:
        file.write(response.text())

def main():
    playwright = sync_playwright().start()
    browser = playwright.chromium.launch(headless=True)
    browser.new_context(no_viewport=True)
    page = browser.new_page()
    page.on('response', lambda response: handle_response(response))
    page.goto("https://www.booking/hotel/it/hotelnordroma.en-gb.html?aid=304142&checkin=2025-05-15&checkout=2025-05-16#map_opened-map_trigger_header_pin")
    page.wait_for_timeout(1000)
    browser.close()
    playwright.stop()

if __name__ == '__main__':
    main()
Error:

Exception has occurred: UnicodeDecodeError
  File "J:\SeSa\Playwright\sample.py", line 6, in handle_response
    file.write(response.text())
  File "J:\SeSa\Playwright\sample.py", line 15
    page.on('response', lambda response: handle_response(response))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
asked Mar 8 at 19:43 by Mojsa; edited Mar 9 at 1:35 by ggorlen
1 Answer
I'm not sure what you're trying to achieve, but since many responses are binary files like images, use the "wb" mode when opening the file, and call .body() on the response (rather than .text()).
Also, choose a different name for each file, otherwise copy.txt will simply contain only the last response received.
import os
from playwright.sync_api import sync_playwright  # 1.48.0

url = "<Your URL>"
output_directory = "site_content"

def handle_response(response):
    file_name = response.url.split("/")[-1][-100:]
    if response.ok and file_name:
        with open(os.path.join(output_directory, file_name), "wb") as file:
            file.write(response.body())

def main():
    os.makedirs(output_directory, exist_ok=True)
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch()
        page = browser.new_page()
        page.on("response", handle_response)
        page.goto(url, wait_until="networkidle")

if __name__ == "__main__":
    main()
In general, it's a bit unusual to want to capture all responses from a site like this. Most of the data that will be written is junk. Usually you're just after one JSON blob or something like that.
You might want to clarify your actual goal, because there's probably a more straightforward way to achieve it.
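For instance, if your goal is a single JSON payload, a small predicate can filter responses down to just that endpoint before saving. This is a sketch: the endpoint substring "graphql" and the helper name is_target_json are hypothetical placeholders, not taken from the question.

```python
import json

def is_target_json(url: str, content_type: str, needle: str = "graphql") -> bool:
    """Return True only for JSON responses from the endpoint of interest.

    `needle` is a hypothetical URL fragment; replace it with whatever
    identifies the XHR endpoint you actually care about.
    """
    return needle in url and content_type.startswith("application/json")

# Inside the response handler, you might then do something like:
# def handle_response(response):
#     if is_target_json(response.url, response.headers.get("content-type", "")):
#         with open("data.json", "w", encoding="utf-8") as f:
#             json.dump(response.json(), f, indent=2)
```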
Note that response.headers["content-type"] and response.request.resource_type can also be useful tools for taking different actions depending on the data and request type.
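As a minimal sketch of using the content-type header this way, the helper below decides whether a response is safe to decode as text or should be written as raw bytes. The set of content types treated as text is an assumption; extend it for your site.

```python
# Content types treated as text here are an assumption; adjust as needed.
TEXT_PREFIXES = ("text/",)
TEXT_TYPES = {"application/json", "application/javascript",
              "application/xml", "image/svg+xml"}

def is_text_response(headers: dict) -> bool:
    """Decide text vs binary from a response's headers.

    Playwright lowercases header names; parameters such as
    "; charset=utf-8" are stripped before comparison.
    """
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    return content_type.startswith(TEXT_PREFIXES) or content_type in TEXT_TYPES

# Usage inside the handler (sketch):
# def handle_response(response):
#     if is_text_response(response.headers):
#         data = response.text()   # safe to decode
#     else:
#         data = response.body()   # keep raw bytes, open file with "wb"
```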