Python request taking too long to get PDF from website

I'm trying to create a single, lightweight Python script to open a website hosting a guaranteed PDF file, download it, and extract its text.

I’ve reviewed many posts here and across the internet and settled on a combination of the requests and PyPDF2 libraries. While PyPDF2 efficiently extracts text once the PDF is in memory, the process of retrieving the PDF data using requests is quite slow. Below is my code and the time it took to fetch the PDF file (before text extraction).
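For context, the extraction step itself is quick; once the download code below has produced pdf_data as bytes, the PyPDF2 side is roughly this (a minimal sketch, not my exact code):

import io
from PyPDF2 import PdfReader

# pdf_data is the bytes object produced by the download code below
reader = PdfReader(io.BytesIO(pdf_data))
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])  # preview the first 500 characters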

This is my original code:

import urllib.request
from urllib.parse import urlparse
import time


url = "https://www.ohchr./sites/default/files/UDHR/Documents/UDHR_Translations/eng.pdf"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept": "application/pdf",  # Indicating we want a PDF file
}

# Extract the base domain from the URL to set as the Referer header
parsed_url = urlparse(url)
referer = f"{parsed_url.scheme}://{parsed_urlloc}"  # Extract base domain (e.g., "https://example")

# Update the headers with dynamic Referer
headers["Referer"] = referer

start_time=time.time()
# Step 1: Fetch PDF content directly from the URL with headers
req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    pdf_data = response.read()
   
print(time.time() - start_time)

Output: 65.53884482383728

It took more than a minute to get the data from the page, while opening this URL in my browser is lightning fast.

And another version using urllib3 adapters and retry logic:

import requests
import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = "https://www.ohchr./sites/default/files/UDHR/Documents/UDHR_Translations/eng.pdf"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept": "application/pdf",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
}

start_time = time.time()

# Configure retries for requests
session = requests.Session()
retries = Retry(total=3, backoff_factor=0.3, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retries)
session.mount("https://", adapter)

response = session.get(url, headers=headers, timeout=5)

if response.status_code == 200:
    pdf_data = response.content
    print(f"Time taken: {time.time() - start_time:.2f} seconds")
else:
    print(f"Failed to fetch the PDF. Status code: {response.status_code}")

Time taken: 105.47 seconds

Both methods work for downloading the PDF, but the process is still too slow for production. For example, using a URL from the United Nations, my browser loads the PDF in 1–2 seconds, while the script takes much longer. My internet connection is fast and stable.

What alternative approaches, libraries, or programming strategies can I use to speed up this process (making it as fast as a browser)? I’ve read about tweaking user agents and headers, but these don’t seem to help on my end.
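For what it's worth, a rough way to see where the time goes is to separate the time until the response headers arrive from the time spent reading the body (a sketch only, reusing the url and headers defined above):

import time
import requests

start = time.time()
# stream=True returns as soon as the status line and headers are in
response = requests.get(url, headers=headers, stream=True, timeout=30)
print(f"Headers received after {time.time() - start:.2f} s")

body = b"".join(response.iter_content(chunk_size=8192))
print(f"Body ({len(body) / 1024:.1f} KB) finished after {time.time() - start:.2f} s")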

Update

I just ran the code on Colab and it is fast there.

What could possibly be wrong or missing in my configuration? I'm on Windows 10 with a fast, stable internet connection.
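One thing suggested in the comments that I haven't fully ruled out is a proxy or VPN mismatch between the browser and Python. A minimal check of which proxy settings Python actually picks up (nothing here is specific to my machine):

import urllib.request
import requests

# System/registry proxy settings that urllib (and requests, by default) will use
print(urllib.request.getproxies())

# Proxies requests would apply to this particular URL (url as defined above)
print(requests.utils.get_environ_proxies(url))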

Output generated by the code from Chaitanya Rahalkar's answer:

>>> import requests
>>> from urllib3.util.retry import Retry
>>> from requests.adapters import HTTPAdapter
>>> import io
>>> import time
>>> def download_pdf(url):
...     # Configure session with optimized settings
...     session = requests.Session()
...     retries = Retry(total=3, backoff_factor=0.1, status_forcelist=[500, 502, 503, 504])
...     adapter = HTTPAdapter(max_retries=retries, pool_connections=10, pool_maxsize=10)
...     session.mount('https://', adapter)
...     headers = {
...         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
...         'Accept': 'application/pdf',
...     }
...     # Use streaming to download in chunks
...     response = session.get(url, headers=headers, stream=True, timeout=10)
...     response.raise_for_status()
...     # Stream content into memory
...     pdf_buffer = io.BytesIO()
...     for chunk in response.iter_content(chunk_size=8192):
...         if chunk:
...             pdf_buffer.write(chunk)
...     return pdf_buffer.getvalue()
...
>>> url = "https://www.ohchr./sites/default/files/UDHR/Documents/UDHR_Translations/eng.pdf"
>>> start_time = time.time()
>>> pdf_data = download_pdf(url)
>>> print(f"Download completed in {time.time() - start_time:.2f} seconds")
Download completed in 68.41 seconds
>>> print(f"PDF size: {len(pdf_data) / 1024:.1f} KB")
PDF size: 190.6 KB
>>>
asked Nov 20, 2024 at 0:35 by R_Student; edited Nov 27, 2024 at 0:47
  • 2 Are you sure your browser isn't just loading a cached copy of the file, instead of actually fetching it from the remote server? – chepner Commented Nov 20, 2024 at 1:09
  • 3 I just ran both of your code samples, and the first took 0.43 s, the second 0.52 s. I tried this with Python 3.11.5 on Windows 11. I'm not sure what's going on on your machine, but it appears to me that your code is fine and something else might be going on. Firewall? Weird network configuration? – joanis Commented Nov 20, 2024 at 1:54
  • 2 Reboot your host and see if that helps. Using curl instead of what you've shown makes little sense; if it's slow the way you are doing it, then curl will likely exhibit the same 'problem'. I ran your code a few hundred iterations, all taking less than 0.3 of a second. – ticktalk Commented Nov 20, 2024 at 17:49
  • 1 Is it possible that your browser is using different proxy settings than your Python environment? This is quite common. As an alternative, besides using and fine-tuning other libraries, you may also consider API services like pagesnap.co. – Kenaz Chan Commented Nov 21, 2024 at 15:41
  • 1 1. Are you using any sort of VPN, or anything else configured on your system that could interfere with the network connection? Test whether it works with the VPN and everything else turned off. Also, when you run the request for the first time, Windows asks for permission for public/private network usage; what did you choose there? 2. Have you tried a different URL, hosted on a different file-hosting service, and compared it with the browser speed? Are all URLs slow? 3. Add 'Connection': 'keep-alive' to the headers. All these points are equally important; try them all. – redoc Commented Nov 26, 2024 at 12:57

2 Answers


Are you able to time the individual lines?

The most straightforward way might be to pull down the library's source and add timing/profiling lines to see which part of the process takes the longest.

The question "How do I profile a Python script?" might be helpful for you.
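A minimal sketch of that idea, wrapping a plain urllib download in cProfile so the slow calls (DNS lookup, TLS handshake, socket reads) show up by function; the URL is just the one from the question:

import cProfile
import pstats
import urllib.request

def fetch(url):
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as response:
        return response.read()

profiler = cProfile.Profile()
profiler.enable()
fetch("https://www.ohchr.org/sites/default/files/UDHR/Documents/UDHR_Translations/eng.pdf")
profiler.disable()

# Show the 15 entries with the largest cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)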

Try using stream=True in the section below so the file is downloaded in chunks rather than being read fully into memory first.

req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    pdf_data = response.read()
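Note that stream=True is a requests parameter rather than an urllib.request one, so the chunked version would look roughly like this (a sketch; "downloaded.pdf" is just a placeholder output path):

import requests

with requests.get(url, headers=headers, stream=True, timeout=30) as response:
    response.raise_for_status()
    # Write the body to disk in 8 KB chunks instead of holding it all in memory
    with open("downloaded.pdf", "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)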
