I have a folder containing 600 PDF files, and each PDF has 20 pages. I need to convert each page into a high-quality PNG as quickly as possible.

I wrote the following script for this task:

import os
import multiprocessing
import fitz  # PyMuPDF
from PIL import Image

def process_pdf(pdf_path, output_folder):
    try:
        pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
        pdf_output_folder = os.path.join(output_folder, pdf_name)
        os.makedirs(pdf_output_folder, exist_ok=True)

        doc = fitz.open(pdf_path)

        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=850)  # Render page at high DPI
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            
            img_path = os.path.join(pdf_output_folder, f"page_{i+1}.png")
            img.save(img_path, "PNG")

        print(f"Processed: {pdf_path}")
    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

def main():
    input_folder = r"E:\Desktop\New folder (5)\New folder (4)"
    output_folder = r"E:\Desktop\New folder (5)\New folder (5)"

    pdf_files = [os.path.join(input_folder, f) for f in os.listdir(input_folder) if f.lower().endswith(".pdf")]

    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as pool:
        pool.starmap(process_pdf, [(pdf, output_folder) for pdf in pdf_files])

    print("All PDFs processed successfully!")

if __name__ == "__main__":
    main()

Issue:

This script is too slow, especially when processing a large number of PDFs. I tried the following optimizations, but they did not improve speed significantly:

  • Reduced DPI slightly – Lowered from 1200 DPI to 850 DPI. (I also tested 600-800 DPI.)
  • Enabled alpha=False in get_pixmap() – Reduced memory usage (a sketch of this variant follows this list).
  • Used ThreadPoolExecutor instead of multiprocessing.Pool – No major improvement.
  • Reduced PNG compression – Set optimize=False when saving images.
  • Converted images to grayscale – Helped slightly, but I need color images for my task.
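
For reference, the alpha=False variant of the inner loop looked roughly like this (PyMuPDF's Pixmap.save can also write the PNG directly, skipping the Pillow round-trip):

for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=850, alpha=False)  # RGB only, less memory
    img_path = os.path.join(pdf_output_folder, f"page_{i+1}.png")
    pix.save(img_path)  # output format inferred from the .png extension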

Possible Solutions I Considered:

  • Parallel Processing of Pages Instead of Files – Instead of processing one file at a time, process each page in parallel to fully utilize CPU cores (see the sketch after this list).
  • Use ProcessPoolExecutor instead of ThreadPoolExecutor – Since rendering is CPU-intensive, multiprocessing should be better.
  • Use JPEG Instead of PNG – JPEG is much faster to save and takes less storage, but I need high-quality images.
  • Lower DPI to 500-600 – Provides a balance between speed and quality.
  • Batch Write Files Instead of Saving One by One – Reduces I/O overhead.
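
As a rough sketch of the first idea, page-level tasks could look like this (illustrative helper names; note that each task re-opens its PDF, since a fitz Document cannot be passed between processes):

from concurrent.futures import ProcessPoolExecutor
import os
import fitz  # PyMuPDF

def render_one_page(args):
    # Each task re-opens its document: a fitz Document cannot be
    # pickled, so only the path and page number cross process boundaries.
    pdf_path, page_no, out_dir = args
    with fitz.open(pdf_path) as doc:
        pix = doc[page_no].get_pixmap(dpi=850, alpha=False)
        pix.save(os.path.join(out_dir, f"page_{page_no + 1}.png"))

def render_all_pages(pdf_files, output_folder):
    tasks = []
    for pdf_path in pdf_files:
        name = os.path.splitext(os.path.basename(pdf_path))[0]
        out_dir = os.path.join(output_folder, name)
        os.makedirs(out_dir, exist_ok=True)
        with fitz.open(pdf_path) as doc:
            tasks.extend((pdf_path, n, out_dir) for n in range(doc.page_count))
    with ProcessPoolExecutor() as pool:
        list(pool.map(render_one_page, tasks, chunksize=8))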

What I Need Help With:

  • How can I significantly speed up this PDF-to-PNG conversion while maintaining high image quality?
  • Are there better libraries or techniques I should use?
  • Is there a way to fully utilize CPU cores efficiently?

Any suggestions would be greatly appreciated!

  • @AdonBilivit PyMuPDF is the official name of the library; fitz is the module name inside PyMuPDF, which you import in Python. – Pubg Mobile Commented Mar 22 at 8:23
  • @AdonBilivit Ensure pymupdf is installed (pip install pymupdf), use import fitz # PyMuPDF, check print(fitz.__doc__) for version, and reinstall if needed (pip uninstall pymupdf && pip install pymupdf). – Pubg Mobile Commented Mar 22 at 8:34
  • What is the content of the PDFs that makes you need 1200dpi and "better than JPEG" quality? – Mark Setchell Commented Mar 22 at 8:51
  • JPEG with 1x1 chroma sampling and maximum quality can reproduce line art pretty well. Running two separate instances of the software working on different files would be one way to test if there is any mileage in parallel processing here. Simplest fix for a one-off is to set it running overnight! – Martin Brown Commented Mar 22 at 9:06
  • You could use GhostScript SDK to achieve this without the need for any programming language. – user23633404 Commented Mar 22 at 13:02

3 Answers

Not only is this process highly CPU-intensive, it also requires significant RAM. On macOS (M2), running on just 4 CPUs (i.e., half the number available) improves performance significantly. Even so, the average time to process a page is ~1.3 s.

For this test I have 80 PDFs. A maximum of 20 pages is processed per PDF.

Here's the test:

import fitz
from pathlib import Path
from multiprocessing import Pool
from PIL import Image
from time import monotonic
from os import process_cpu_count  # requires Python 3.13+

SOURCE_DIR = Path("/Volumes/Spare/Downloads")
TARGET_DIR = Path("/Volumes/Spare/PDFs")

def cpus() -> int:
    if ncpus := process_cpu_count():
        ncpus //= 2
        return ncpus if ncpus > 1 else 2
    return 2
    
def process(path: Path) -> tuple[float, int]:
    print(f"Processing {path.name}")
    try:
        with fitz.open(path) as pdf:
            start = monotonic()
            for i, page in enumerate(pdf.pages(), 1):
                pix = page.get_pixmap(dpi=850)
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                img_path = TARGET_DIR / f"{path.stem}_page_{i}.png"
                img.save(img_path, "PNG")
                if i >= 20:
                    break
            return (monotonic() - start, i)
    except Exception:
        pass
    return (0.0, 0)

def main() -> None:
    TARGET_DIR.mkdir(parents=True, exist_ok=True)
    with Pool(cpus()) as pool:
        sum_d = 0.0
        sum_p = 0
        for duration, page_count in pool.map(process, SOURCE_DIR.glob("*.pdf")):
            sum_d += duration
            sum_p += page_count
        if sum_p > 0:
            print(f"Average duration per page = {sum_d/sum_p:,.4f}s")
        else:
            print("No files were processed")

if __name__ == "__main__":
    main()

Output excluding filenames:

Average duration per page = 1.2667s

Summary:

Rendering at 850 dpi with fitz / PyMuPDF is slow. Reducing the render to, for example, 300 dpi decreased the per-page timing to ~0.17 s.

PDF conversion is naturally slow, so these tests use just 9 of the 20-page files (life is too short). A few minutes is enough to time 9 files, i.e. only 180 pages.

These are raw CLI executable times, without any Python overhead. Results will always differ with the graphics hardware, so a dedicated workstation will outperform my values; they are offered only as a relative "exe vs exe" comparison on the same 20 KB PDFs (diagrams with text, compressed to roughly 1 KB per page).

Here are some timings based on PDF printouts to an image/paper device; 60 pages per minute is traditionally considered high speed.

For comparison I used both Artifex Ghostscript and MuPDF. Let's take the MuPDF timings first, as they are the easiest to report, and faster too.

page New folder (4)/file009.pdf 20 469ms (interpretation) 453ms (rendering) 922ms (total)
total 9188ms (0ms layout) / 20 pages for an average of 459ms
fastest page 1: 15ms (interpretation) 407ms (rendering) 422ms(total)
slowest page 19: 484ms (interpretation) 469ms (rendering) 953ms(total)

Pages 1 and 19 actually have quite similar contents; however, reading a compressed PDF takes a variable amount of time, here between half a second and a full second per page, because the reader has to traverse backwards and forwards between almost 140 discrete compressed objects in this 20-page PDF. Single-page PDFs should be faster, with significantly less to-ing and fro-ing. Smaller PDFs with decompressed contents always do better than big, heavily compressed ones.

So what time constraints and data volumes are involved?

density 720 dpi for 2304 x 3456 imagery
page pixels  =   7,962,624
per PNG file =  22.78 MB (raw RGB)
per PDF file = 442.96 MB of memory-bus IO, plus the source for decompression
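
As a quick check of those per-page figures (3 bytes per RGB pixel; MB here meaning MiB):

w, h = 2304, 3456              # 720 dpi render
pixels = w * h                 # 7,962,624 pixels per page
mb = pixels * 3 / 2**20        # ~22.78 MB of raw RGB per page
print(f"{pixels:,} px, {mb:.2f} MB per uncompressed page")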

GS start 15:18:46.12 
GS end   15:21:03.76 
   pages      180
   seconds =  137.64
  600 files= 9176.00
  hours    =    2.55

Mu start 16:11:37.92 
Mu end   16:13:00.06
   pages      180
   seconds =   82.14
  600 files= 5476.00
  hours    =    1.52

Thus MuPDF should shave about an hour off the time taken.

What about other common tools used in Python applications? They are naturally slower, as their output files are compressed down much further (roughly 40% smaller than the MuPDF output).

xpdf
PDFtoPNG start 17:40:14.63 
PDFtoPNG end   17:43:14.85 
9 files in seconds  180.22

pdftoppm (-png), preferred for its permissive FOSS license:
pdftoppm start 18:33:47.17 
pdftoppm end   18:37:22.03 
9 files in seconds  214.86

Many suggestions say to use multi-threading, and the MuPDF times above were based on 4 threads. What if we change that setting?

4 Threads  82.14
3 Threads  81.81
2 Threads  71.45
1 Thread   79.38 

So, rather than splitting time across 3 or more threads, the sequential rendering times are best improved by using 2 threads on this 2-core device.

With a little more tweaking we can get 180 pages down to 64.54 seconds.

Thus 600 of these 20-page files ≈ 4,302.66 seconds, an estimated 1 hr 12 min.

On a slow Windows machine, looping over the files from a batch script with:

mutool draw -st -P -T 2 -B 32 -r 720 -F png -o "New folder (5)\%%~nc-page%%2d.png" "New folder (4)/%%~nxc"
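
To drive the same CLI from Python instead of a batch file, a minimal sketch (assuming mutool is on PATH and the folder names above; two pool workers times two render threads each, matching the 2-core findings):

import subprocess
from multiprocessing import Pool
from pathlib import Path

SRC = Path(r"New folder (4)")
DST = Path(r"New folder (5)")

def render(pdf):
    # -T 2: two render threads, -B 32: band height, -r 720: resolution
    out = DST / f"{pdf.stem}-page%02d.png"
    return subprocess.call(
        ["mutool", "draw", "-st", "-T", "2", "-B", "32",
         "-r", "720", "-F", "png", "-o", str(out), str(pdf)])

if __name__ == "__main__":
    DST.mkdir(exist_ok=True)
    with Pool(2) as pool:  # 2 workers x 2 render threads each
        pool.map(render, sorted(SRC.glob("*.pdf")))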

Here is an example script that renders all pages of a PDF to images using Python multiprocessing. You can expect it to be 2-4 times faster overall than linear execution:

import pymupdf
from concurrent.futures import ProcessPoolExecutor
import time


def render_page(x):
    # Worker: re-open the file by name; a pymupdf Document cannot be
    # pickled and sent to another process, so only the path is passed.
    filename, numbers = x
    doc = pymupdf.open(filename)
    for pno in numbers:
        pix = doc[pno].get_pixmap(dpi=300)
        pix.save(f"img-{pno}.jpg")  # output format inferred from the extension


if __name__ == "__main__":
    t0 = time.perf_counter()
    doc = pymupdf.open("adobe.pdf")
    pc = doc.page_count
    with ProcessPoolExecutor(max_workers=10) as executor:
        # Hand each worker a chunk of up to 50 page numbers.
        for i in range(0, pc, 50):
            r = range(i, i + min(50, pc - i))
            executor.submit(render_page, (doc.name, r))

    t1 = time.perf_counter()
    print(f"Duration {t1-t0}")
