I'm trying to create a single, lightweight Python script that opens a URL guaranteed to point to a PDF file, downloads the PDF, and extracts its text.
I’ve reviewed many posts here and across the internet and settled on a combination of the requests and PyPDF2 libraries. While PyPDF2 efficiently extracts text once the PDF is in memory, the process of retrieving the PDF data using requests is quite slow. Below is my code and the time it took to fetch the PDF file (before text extraction).
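For reference, the extraction step I have in mind once the PDF bytes are in memory is roughly this PyPDF2 sketch (a helper I call extract_text here just for illustration; the timings below cover only the download, not this step):
import io
from PyPDF2 import PdfReader

def extract_text(pdf_data: bytes) -> str:
    # Wrap the downloaded bytes in a file-like object for PyPDF2
    reader = PdfReader(io.BytesIO(pdf_data))
    # Join the text of every page; extract_text() can return None for empty pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)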
This is my original code:
import urllib.request
from urllib.parse import urlparse
import time
url = "https://www.ohchr./sites/default/files/UDHR/Documents/UDHR_Translations/eng.pdf"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
"Accept": "application/pdf", # Indicating we want a PDF file
}
# Extract the base domain from the URL to set as the Referer header
parsed_url = urlparse(url)
referer = f"{parsed_url.scheme}://{parsed_urlloc}" # Extract base domain (e.g., "https://example")
# Update the headers with dynamic Referer
headers["Referer"] = referer
start_time = time.time()
# Step 1: Fetch PDF content directly from the URL with headers
req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    pdf_data = response.read()
print(time.time() - start_time)
65.53884482383728
It took more than a minute to get the data from the page, while opening this URL in my browser is lightning-fast.
And another version using urllib3 adapters and retry logic:
import requests
import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
url = "https://www.ohchr./sites/default/files/UDHR/Documents/UDHR_Translations/eng.pdf"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
"Accept": "application/pdf",
"Cache-Control": "no-cache",
"Pragma": "no-cache",
}
start_time = time.time()
# Configure retries for requests
session = requests.Session()
retries = Retry(total=3, backoff_factor=0.3, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retries)
session.mount("https://", adapter)
response = session.get(url, headers=headers, timeout=5)
if response.status_code == 200:
    pdf_data = response.content
    print(f"Time taken: {time.time() - start_time:.2f} seconds")
else:
    print(f"Failed to fetch the PDF. Status code: {response.status_code}")
Time taken: 105.47 seconds
Both methods work for downloading the PDF, but the process is still too slow for production. For example, using a URL from the United Nations, my browser loads the PDF in 1–2 seconds, while the script takes much longer. My internet connection is fast and stable.
What alternative approaches, libraries, or programming strategies can I use to speed up this process (making it as fast as a browser)? I’ve read about tweaking user agents and headers, but these don’t seem to help on my end.
Update
I just ran the same code on Colab and it is very fast there.
What could possibly be wrong or missing in my configuration? I'm on Windows 10 with a fast, stable internet connection.
Output generated by the code from Chaitanya Rahalkar's answer:
>>> import requests
>>> from urllib3.util.retry import Retry
>>> from requests.adapters import HTTPAdapter
>>> import io
>>> import time
>>> def download_pdf(url):
...     # Configure session with optimized settings
...     session = requests.Session()
...     retries = Retry(total=3, backoff_factor=0.1, status_forcelist=[500, 502, 503, 504])
...     adapter = HTTPAdapter(max_retries=retries, pool_connections=10, pool_maxsize=10)
...     session.mount('https://', adapter)
...     headers = {
...         'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
...         'Accept': 'application/pdf',
...     }
...     # Use streaming to download in chunks
...     response = session.get(url, headers=headers, stream=True, timeout=10)
...     response.raise_for_status()
...     # Stream content into memory
...     pdf_buffer = io.BytesIO()
...     for chunk in response.iter_content(chunk_size=8192):
...         if chunk:
...             pdf_buffer.write(chunk)
...     return pdf_buffer.getvalue()
...
>>> url = "https://www.ohchr./sites/default/files/UDHR/Documents/UDHR_Translations/eng.pdf"
>>> start_time = time.time()
>>> pdf_data = download_pdf(url)
>>> print(f"Download completed in {time.time() - start_time:.2f} seconds")
Download completed in 68.41 seconds
>>> print(f"PDF size: {len(pdf_data) / 1024:.1f} KB")
PDF size: 190.6 KB
>>>
asked Nov 20, 2024 at 0:35 by R_Student; edited Nov 27, 2024 at 0:47
2 Answers
Are you able to time the individual lines?
The most straightforward way might be to pull down the library source and add timing/profiling lines to see which part of the process takes the longest.
The question "How do I profile a Python script?" might be helpful for you.
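For example, a minimal profiling sketch using the standard-library cProfile module (assuming a download_pdf function and a url variable are defined at the top level, as in the question's snippet):
import cProfile
import pstats

# Profile the download, dump the stats to a file, then print the ten
# entries with the highest cumulative time.
cProfile.run("download_pdf(url)", "download.prof")
stats = pstats.Stats("download.prof")
stats.sort_stats("cumulative").print_stats(10)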
Try using stream=True in the section below so the file is downloaded in chunks instead of being read into memory all at once; a requests-based sketch follows the snippet.
req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    pdf_data = response.read()
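For example, a minimal sketch of that approach with requests (reusing the url and headers from the question; the output filename is arbitrary):
import requests

# Stream the response and write it to disk in chunks so the whole PDF
# never has to be held in memory at once.
with requests.get(url, headers=headers, stream=True, timeout=10) as response:
    response.raise_for_status()
    with open("downloaded.pdf", "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)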
Comment from redoc (Nov 26, 2024 at 12:57): also try adding 'Connection': 'keep-alive' to the headers. All these points are equally important, try them all.