Python request taking too long to get PDF from website

I'm trying to create a single, lightweight Python script to open a website hosting a guaranteed PDF file, download it, and extract its text.

I’ve reviewed many posts here and across the internet and settled on a combination of the requests and PyPDF2 libraries. While PyPDF2 efficiently extracts text once the PDF is in memory, the process of retrieving the PDF data using requests is quite slow. Below is my code and the time it took to fetch the PDF file (before text extraction).
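For context, the extraction step itself is quick; once the download code below has produced pdf_data as bytes, the PyPDF2 side is roughly this (a minimal sketch, not my exact code):

import io
from PyPDF2 import PdfReader

# pdf_data is the bytes object produced by the download code below
reader = PdfReader(io.BytesIO(pdf_data))
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])  # preview the first 500 characters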

This is my original code:

import urllib.request
from urllib.parse import urlparse
import time


url = "https://www.ohchr./sites/default/files/UDHR/Documents/UDHR_Translations/eng.pdf"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept": "application/pdf",  # Indicating we want a PDF file
}

# Extract the base domain from the URL to set as the Referer header
parsed_url = urlparse(url)
referer = f"{parsed_url.scheme}://{parsed_urlloc}"  # Extract base domain (e.g., "https://example")

# Update the headers with dynamic Referer
headers["Referer"] = referer

start_time=time.time()
# Step 1: Fetch PDF content directly from the URL with headers
req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    pdf_data = response.read()
   
print(time.time() - start_time)

Output: 65.53884482383728

It took more than a minute to get the data from the page, while opening this URL in my browser is lightning fast.

And another version using urllib3 adapters and retry logic:

import requests
import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

url = "https://www.ohchr./sites/default/files/UDHR/Documents/UDHR_Translations/eng.pdf"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept": "application/pdf",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
}

start_time = time.time()

# Configure retries for requests
session = requests.Session()
retries = Retry(total=3, backoff_factor=0.3, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retries)
session.mount("https://", adapter)

response = session.get(url, headers=headers, timeout=5)

if response.status_code == 200:
    pdf_data = response.content
    print(f"Time taken: {time.time() - start_time:.2f} seconds")
else:
    print(f"Failed to fetch the PDF. Status code: {response.status_code}")

Time taken: 105.47 seconds

Both methods work for downloading the PDF, but the process is still too slow for production. For example, using a URL from the United Nations, my browser loads the PDF in 1–2 seconds, while the script takes much longer. My internet connection is fast and stable.

What alternative approaches, libraries, or programming strategies can I use to speed up this process (making it as fast as a browser)? I’ve read about tweaking user agents and headers, but these don’t seem to help on my end.
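For what it's worth, a rough way to see where the time goes is to separate the time until the response headers arrive from the time spent reading the body (a sketch only, reusing the url and headers defined above):

import time
import requests

start = time.time()
# stream=True returns as soon as the status line and headers are in
response = requests.get(url, headers=headers, stream=True, timeout=30)
print(f"Headers received after {time.time() - start:.2f} s")

body = b"".join(response.iter_content(chunk_size=8192))
print(f"Body ({len(body) / 1024:.1f} KB) finished after {time.time() - start:.2f} s")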

Update

I just ran the code on Colab and it is fast there.

What could possibly be wrong or missing in my configuration? I'm on Windows 10 with a fast, stable internet connection.
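One thing suggested in the comments that I haven't fully ruled out is a proxy or VPN mismatch between the browser and Python. A minimal check of which proxy settings Python actually picks up (nothing here is specific to my machine):

import urllib.request
import requests

# System/registry proxy settings that urllib (and requests, by default) will use
print(urllib.request.getproxies())

# Proxies requests would apply to this particular URL (url as defined above)
print(requests.utils.get_environ_proxies(url))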

Output generated by the code from Chaitanya Rahalkar's answer:

>>> import requests
>>> from urllib3.util.retry import Retry
>>> from requests.adapters import HTTPAdapter
>>> import io
>>> import time
>>> def download_pdf(url):
...     # Configure session with optimized settings
...     session = requests.Session()
...     retries = Retry(total=3, backoff_factor=0.1, status_forcelist=[500, 502, 503, 504])
...     adapter = HTTPAdapter(max_retries=retries, pool_connections=10, pool_maxsize=10)
...     session.mount('https://', adapter)
...     headers = {
...         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
...         'Accept': 'application/pdf',
...     }
...     # Use streaming to download in chunks
...     response = session.get(url, headers=headers, stream=True, timeout=10)
...     response.raise_for_status()
...     # Stream content into memory
...     pdf_buffer = io.BytesIO()
...     for chunk in response.iter_content(chunk_size=8192):
...         if chunk:
...             pdf_buffer.write(chunk)
...     return pdf_buffer.getvalue()
...
>>> url = "https://www.ohchr./sites/default/files/UDHR/Documents/UDHR_Translations/eng.pdf"
>>> start_time = time.time()
>>> pdf_data = download_pdf(url)
>>> print(f"Download completed in {time.time() - start_time:.2f} seconds")
Download completed in 68.41 seconds
>>> print(f"PDF size: {len(pdf_data) / 1024:.1f} KB")
PDF size: 190.6 KB
>>>
asked Nov 20, 2024 at 0:35 by R_Student; edited Nov 27, 2024 at 0:47
  • 2 Are you sure your browser isn't just loading a cached copy of the file, instead of actually fetching it from the remote server? – chepner Commented Nov 20, 2024 at 1:09
  • 3 I just ran both of your code samples, and the first took 0.43 s, the second 0.52 s. I tried this with Python 3.11.5 on Windows 11. I'm not sure what's going on on your machine, but it appears to me that your code is fine and something else might be going on. Firewall? Weird network configuration? – joanis Commented Nov 20, 2024 at 1:54
  • 2 Reboot your host and see if that helps. Using curl instead of what you've shown makes little sense; if it's slow the way you are doing it, then curl will likely exhibit the same 'problem'. I ran your code a few hundred iterations, all taking less than 0.3 of a second. – ticktalk Commented Nov 20, 2024 at 17:49
  • 1 Is it possible that your browser is using different proxy settings than your Python environment? This is quite common. As an alternative, besides using and fine-tuning other libraries, you may also consider API services like pagesnap.co. – Kenaz Chan Commented Nov 21, 2024 at 15:41
  • 1 1. Are you using any sort of VPN, or anything else configured on your system that could interfere with the network connection? Test whether it works with the VPN and everything else turned off. Also, when you run the request for the first time, Windows asks for permission for public/private network usage; what did you choose there? 2. Have you tried a different URL, hosted on a different file-hosting service, and compared it with the browser speed? Are all URLs slow? 3. Add 'Connection': 'keep-alive' to the headers. All these points are equally important; try them all. – redoc Commented Nov 26, 2024 at 12:57

2 Answers


Are you able to time the individual lines?

The most straightforward way might be to pull down the library's source and add timing/profiling lines to see which part of the process takes the longest.

The question "How do I profile a Python script?" might be helpful for you.
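A minimal sketch of that idea, wrapping a plain urllib download in cProfile so the slow calls (DNS lookup, TLS handshake, socket reads) show up by function; the URL is just the one from the question:

import cProfile
import pstats
import urllib.request

def fetch(url):
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as response:
        return response.read()

profiler = cProfile.Profile()
profiler.enable()
fetch("https://www.ohchr.org/sites/default/files/UDHR/Documents/UDHR_Translations/eng.pdf")
profiler.disable()

# Show the 15 entries with the largest cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)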

Try using stream=True in the section below so the file is downloaded in chunks rather than being read fully into memory first.

req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    pdf_data = response.read()
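Note that stream=True is a requests parameter rather than an urllib.request one, so the chunked version would look roughly like this (a sketch; "downloaded.pdf" is just a placeholder output path):

import requests

with requests.get(url, headers=headers, stream=True, timeout=30) as response:
    response.raise_for_status()
    # Write the body to disk in 8 KB chunks instead of holding it all in memory
    with open("downloaded.pdf", "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)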
