I have a folder containing 600 PDF files, and each PDF has 20 pages. I need to convert each page into a high-quality PNG as quickly as possible.
I wrote the following script for this task:
import os
import multiprocessing
import fitz  # PyMuPDF
from PIL import Image

def process_pdf(pdf_path, output_folder):
    try:
        pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
        pdf_output_folder = os.path.join(output_folder, pdf_name)
        os.makedirs(pdf_output_folder, exist_ok=True)
        doc = fitz.open(pdf_path)
        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=850)  # Render page at high DPI
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            img_path = os.path.join(pdf_output_folder, f"page_{i+1}.png")
            img.save(img_path, "PNG")
        print(f"Processed: {pdf_path}")
    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

def main():
    input_folder = r"E:\Desktop\New folder (5)\New folder (4)"
    output_folder = r"E:\Desktop\New folder (5)\New folder (5)"
    pdf_files = [os.path.join(input_folder, f) for f in os.listdir(input_folder) if f.lower().endswith(".pdf")]
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        pool.starmap(process_pdf, [(pdf, output_folder) for pdf in pdf_files])
    print("All PDFs processed successfully!")

if __name__ == "__main__":
    main()
Issue:
This script is too slow, especially when processing a large number of PDFs. I tried the following optimizations, but they did not improve speed significantly (a short sketch of these variants follows the list):
- Reduced DPI slightly – Lowered from 1200 DPI to 850 DPI. (I also tested 600-800 DPI.)
- Enabled alpha=False in get_pixmap() – Reduced memory usage.
- Used ThreadPoolExecutor instead of multiprocessing.Pool – No major improvement.
- Reduced PNG compression – Set optimize=False when saving images.
- Converted images to grayscale – Helped slightly, but I need color images for my task.
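For reference, the variants I tried look roughly like this inside the page loop of the script above (a minimal sketch reusing the same page, img_path and pdf_output_folder variables; the 600 DPI value is just a placeholder):

pix = page.get_pixmap(dpi=600, alpha=False)           # render without an alpha channel
img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
img.save(img_path, "PNG", optimize=False)             # skip Pillow's extra PNG optimize pass
# grayscale variant (helped a little, but I need color):
# pix = page.get_pixmap(dpi=600, colorspace=fitz.csGRAY, alpha=False)
# img = Image.frombytes("L", (pix.width, pix.height), pix.samples)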
Possible Solutions I Considered:
- Parallel Processing of Pages Instead of Files – Instead of processing one file at a time, process each page in parallel to fully utilize CPU cores.
- Use ProcessPoolExecutor instead of ThreadPoolExecutor – Since rendering is CPU-intensive, multiprocessing should be better.
- Use JPEG Instead of PNG – JPEG is much faster to save and takes less storage, but I need high-quality images. (A sketch of this follows the list.)
- Lower DPI to 500-600 – Provides a balance between speed and quality.
- Batch Write Files Instead of Saving One by One – Reduces I/O overhead.
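For the JPEG / direct-save idea, PyMuPDF can write the rendered pixmap itself, which skips the PIL round-trip entirely; the output format follows the file extension. A minimal sketch, again reusing the variables from the script above (the 600 DPI value is a placeholder):

pix = page.get_pixmap(dpi=600, alpha=False)
pix.save(os.path.join(pdf_output_folder, f"page_{i+1}.png"))   # PNG written directly by PyMuPDF
# JPEG variant – smaller files and faster writes, at the cost of slight lossiness:
# pix.save(os.path.join(pdf_output_folder, f"page_{i+1}.jpg"))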
What I Need Help With:
- How can I significantly speed up this PDF-to-PNG conversion while maintaining high image quality?
- Are there better libraries or techniques I should use?
- Is there a way to fully utilize CPU cores efficiently?
Any suggestions would be greatly appreciated!
3 Answers
Not only is this process highly CPU intensive, it also requires significant RAM. On macOS (M2), running on just 4 CPUs (i.e., half the number available) improves performance significantly. Even so, the average time to process a page is ~1.3s.
For this test I have 80 PDFs. A maximum of 20 pages is processed per PDF.
Here's the test:
import fitz
from pathlib import Path
from multiprocessing import Pool
from PIL import Image
from time import monotonic
from os import process_cpu_count

SOURCE_DIR = Path("/Volumes/Spare/Downloads")
TARGET_DIR = Path("/Volumes/Spare/PDFs")

def cpus() -> int:
    if ncpus := process_cpu_count():
        ncpus //= 2
        return ncpus if ncpus > 1 else 2
    return 2

def process(path: Path) -> tuple[float, int]:
    print(f"Processing {path.name}")
    try:
        with fitz.open(path) as pdf:
            start = monotonic()
            for i, page in enumerate(pdf.pages(), 1):
                pix = page.get_pixmap(dpi=850)
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                img_path = TARGET_DIR / f"{path.stem}_page_{i}.png"
                img.save(img_path, "PNG")
                if i >= 20:
                    break
            return (monotonic() - start, i)
    except Exception:
        pass
    return (0.0, 0)

def main() -> None:
    TARGET_DIR.mkdir(parents=True, exist_ok=True)
    with Pool(cpus()) as pool:
        sum_d = 0.0
        sum_p = 0
        for duration, page_count in pool.map(process, SOURCE_DIR.glob("*.pdf")):
            sum_d += duration
            sum_p += page_count
    if sum_p > 0:
        print(f"Average duration per page = {sum_d/sum_p:,.4f}s")
    else:
        print("No files were processed")

if __name__ == "__main__":
    main()
Output excluding filenames:
Average duration per page = 1.2667s
Summary:
Rendering at 850 dpi with fitz / PyMuPDF is slow. Reducing the render to, for example, 300 dpi decreased the per-page timing to ~0.17s.
PDF conversion is naturally slow, so I timed just 9 of the 20-page files (life is too short); a few minutes is enough for timing 9 files, as that is only 180 pages.
These are raw executable CLI times without any Python overhead. Results will always differ depending on the graphics hardware, so a dedicated workstation with a multithreaded GPU will outperform my values; they are offered only as a relative "exe vs exe" comparison on the same 20 KB PDFs (diagrams with text, roughly compressed to 1 KB per page).
Here are some timings for PDF printout to an image/paper device, where 60 pages per minute is traditionally considered high speed.
For comparison I used both Artifex Ghostscript (GS) and MuPDF. Let us show the MuPDF end-of-run timings, as they are the easiest to report and the faster of the two.
page New folder (4)/file009.pdf 20 469ms (interpretation) 453ms (rendering) 922ms (total)
total 9188ms (0ms layout) / 20 pages for an average of 459ms
fastest page 1: 15ms (interpretation) 407ms (rendering) 422ms (total)
slowest page 19: 484ms (interpretation) 469ms (rendering) 953ms (total)
Pages 1 and 19 actually have quite similar contents; however, reading a compressed PDF takes a variable amount of time, here about half a second to a whole second per page, because the reader has to traverse backwards and forwards between (in this 20-page PDF) almost 140 discrete compressed objects. Single-page PDFs should be faster, with significantly less to-ing and fro-ing. Smaller PDFs with decompressed contents are always better than big compressed-image ones.
So what timings and pixel volumes are involved?
density 720 dpi for 2304 x 3456 imagery
page pixels = 7,962,624
per PNG file = 22.78 MB
per PDF file = 442.96 MB of memory-bus IO, plus the source to decompress.
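As a quick sanity check on those figures (my own arithmetic, not part of the timing runs): the uncompressed RGB buffer for one such page is width x height x 3 bytes, which is where the ~22.78 MB per page comes from.

# Sanity-check arithmetic for the 720 dpi figures above.
width, height = 2304, 3456          # pixels at 720 dpi
pixels = width * height             # 7,962,624 pixels per page
raw_rgb = pixels * 3                # 23,887,872 bytes of uncompressed RGB
print(pixels, raw_rgb / 2**20)      # -> 7962624  ~22.78 (MiB per page before PNG compression)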
GS start 15:18:46.12
GS end 15:21:03.76
pages 180
seconds = 137.64
600 files= 9176.00
hours = 2.55
Mu start 16:11:37.92
Mu end 16:13:00.06
pages 180
seconds = 82.14
600 files= 5476.00
hours = 1.52
Thus MuPDF should shave an hour off the time taken.
What about other common tools used in Python applications? They are naturally slower, as their output files are often compressed down much harder (roughly 40% smaller than the MuPDF output).
xpdf
PDFtoPNG start 17:40:14.63
PDFtoPNG end 17:43:14.85
9 files in seconds 180.22
pdftoppm (-png), preferred for its permissive FOSS license
pdftoppm start 18:33:47.17
pdftoppm end 18:37:22.03
9 files in seconds 214.86
Many suggestions say to use multi-threading, and the MuPDF times above were based on 4 threads. What if we change that setting?
4 Threads 82.14
3 Threads 81.81
2 Threads 71.45
1 Thread 79.38
So rather than sharing CPU time between 3 or more threads, the sequential rendering time is best improved by using 2 threads on this 2-core device.
With a little more tweaking we can get 180 pages down to 64.54 seconds.
Thus 600 of the above 20-page files = 4,302.66 seconds, an estimated 1 hr 12 minutes.
This was on a slow Windows machine, looping through the files with a batch for loop:
mutool draw -st -P -T 2 -B 32 -r 720 -F png -o "New folder (5)\%%~nc-page%%2d.png" "New folder (4)/%%~nxc"
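If you prefer to drive the same mutool approach from Python rather than a batch file, a rough sketch could look like this (my assumption, not part of the timings above; it assumes mutool is on PATH and reuses the question's folder names and the flags from the command above):

import subprocess
from concurrent.futures import ThreadPoolExecutor   # threads suffice: the real work happens inside mutool
from pathlib import Path

SRC = Path(r"E:\Desktop\New folder (5)\New folder (4)")
DST = Path(r"E:\Desktop\New folder (5)\New folder (5)")

def render(pdf: Path) -> None:
    out = DST / f"{pdf.stem}-page%2d.png"            # %2d is expanded by mutool to the page number
    subprocess.run(["mutool", "draw", "-T", "2", "-B", "32", "-r", "720",
                    "-F", "png", "-o", str(out), str(pdf)], check=True)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as ex:
        list(ex.map(render, SRC.glob("*.pdf")))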
Here is an example script that renders all pages of a PDF to images using Python multiprocessing. You can expect it to be roughly 2-4 times faster overall than linear execution:
import pymupdf
import concurrent.futures
from concurrent.futures import ProcessPoolExecutor
import time

def render_page(x):
    filename, numbers = x
    doc = pymupdf.open(filename)
    for pno in numbers:
        pix = doc[pno].get_pixmap(dpi=300)
        pix.save(f"img-{pno}.jpg")

if __name__ == "__main__":
    t0 = time.perf_counter()
    doc = pymupdf.open("adobe.pdf")
    pc = doc.page_count
    with ProcessPoolExecutor(max_workers=10) as executor:
        for i in range(0, pc, 50):
            r = range(i, i + min(50, pc - i))
            executor.submit(render_page, (doc.name, r))
    t1 = time.perf_counter()
    print(f"Duration {t1-t0}")