Splitting a PDF into a list of images (iText7 and ImageMagick)
I have code that splits a PDF into a list of JPG MemoryStreams. The split portion works: it takes under a second and produces 100 single-page PDF streams. However, once I get to the point where I'm turning those PDFs into images, performance slows to a snail's pace. ImageMagick uses Ghostscript to perform this conversion, and my theory is that every per-page call has to set Ghostscript up again, which produces overhead. I'm wondering if there is a way to make batch calls. The way I understand it, MagickImageCollection can only take one page at a time, which is why I convert each page in a separate method.

I'm open to using a different tool to split the PDF or convert the pages. I'm looking into BlackIce, but I'm waiting to hear back about our license.

namespace PDFTools;

using ImageMagick;
using iText.Kernel.Pdf;
using iText.Layout;

public class PDFUtilities(string temporaryDirectory)
{
    private readonly string TemporaryDirectory = temporaryDirectory;

    public async Task<List<byte[]>> ConvertPdfToImageAsync(Stream stream)
    {
        List<byte[]> results = new List<byte[]>();
        MagickNET.SetTempDirectory(this.TemporaryDirectory);
        List<MemoryStream> pdfPages = this.SplitPdf(stream);

        var tasks = pdfPages.Select((pdfPage, index) => new OrderedTask
        {
            Index = index,
            Task = this.ConvertPageToImageStreamAsync(pdfPage)
        }).ToList();

        _ = await Task.WhenAll(tasks.Select(static t => t.Task));
        OrderedTask[] orderedTask = tasks
            .OrderBy(static s => s.Index)
            .ToArray();

        foreach (OrderedTask task in orderedTask)
        {
            MemoryStream ms = await task.Task;
            byte[] bytes = ms.ToArray();
            results.Add(bytes);
        }
        return results;
    }


    private async Task<MemoryStream> ConvertPageToImageStreamAsync(MemoryStream file)
    {
        MemoryStream outputStream = new MemoryStream();
        using MagickImageCollection images = new MagickImageCollection();
        await images.ReadAsync(file);   // Only accepts one image at a time; when I tried multiple PDFs it only got the last image.

        foreach (MagickImage image in images)
        {
            image.Quality = 100;
            await image.WriteAsync(outputStream, MagickFormat.Jpg);
        }

        outputStream.Position = 0;
        file.Close();
        return outputStream;
    }

    private List<MemoryStream> SplitPdf(Stream stream)
    {
        List<MemoryStream> pdfPages = new List<MemoryStream>();

        using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(stream)))
        {
            for (int pageNumber = 1; pageNumber <= pdfDocument.GetNumberOfPages(); pageNumber++)
            {
                using (MemoryStream tempStream = new MemoryStream())
                {
                    using (PdfWriter writer = new PdfWriter(tempStream))
                    {
                        using (PdfDocument newPdf = new PdfDocument(writer))
                        {
                            _ = pdfDocument.CopyPagesTo(pageNumber, pageNumber, newPdf);
                        }
                    }

                    MemoryStream outputStream = new MemoryStream(tempStream.ToArray());
                    pdfPages.Add(outputStream);
                }
            }
        }

        return pdfPages;
    }
}

internal class OrderedTask
{
    public required int Index { get; set; }

    public required Task<MemoryStream> Task { get; set; }
}
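
For comparison, here is a minimal sketch of the batch read the question asks about, assuming Magick.NET: MagickImageCollection can read the whole multi-page PDF in one call, so Ghostscript runs once for the document rather than once per page. The method name ConvertWholePdfAsync, the 300 DPI density, and the JPEG output settings are illustrative assumptions, not taken from the original code.

// Sketch only: read the whole PDF in a single call so Ghostscript is
// launched once for the entire document instead of once per page.
public async Task<List<byte[]>> ConvertWholePdfAsync(Stream pdfStream)
{
    var settings = new MagickReadSettings
    {
        Density = new Density(300)   // render resolution; pick whatever DPI you need
    };

    List<byte[]> results = new List<byte[]>();
    using MagickImageCollection pages = new MagickImageCollection();
    await pages.ReadAsync(pdfStream, settings);   // one Ghostscript invocation for all pages

    foreach (var page in pages)
    {
        page.Quality = 100;
        results.Add(page.ToByteArray(MagickFormat.Jpeg));   // one JPEG per rendered page
    }

    return results;
}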

Comments:
  • Why don't you use PyMuPDF, which is a very fast PDF-to-image conversion tool? – user23633404
  • ImageMagick generates pretty low-quality PDF images; you should try GhostScript.NET. – user23633404
  • @KJ It may be that it's just "slow": the requirements are nonexistent, so I'm trying to convert a 100 MB PDF to a PNG. After splitting, each part is 5 MB and takes around 1 second. It looks like everyone posting here only knows about things I've already found, so maybe there is no good answer. I think maybe it's just slow by nature. – adc90

1 Answer


There is no reason to use three libraries whose commercial licenses conflict: Apryse (iText) and Artifex (Ghostscript, which ImageMagick calls out to) have different licensing models. It is far simpler to use a single programmer's library with a more permissive license.

While Artifex's MuPDF/mutool will usually be the fastest, the Poppler PDF utilities are usually preferred as FOSS (although slower).

Processing a PDF stream is naturally serial, because of the back-to-front way a PDF has to be scanned, so the easiest approach is to export pages in a numbered sequence such as 00, 01, 02, 03, and so on. For 100 pages that means 100 output file names, but pdftoppm numbers them automatically, and the quality of the conversion is 100% lossless PNG.

So, given a 4000-page PDF, we simply need enough file-system space for the first 100 pages, and it is done in about 7 seconds. We do need a file-system target, since 100 images cannot be a single stream; all 100 parts need to be stored one by one as files.

pdftoppm -f 1 -l 100 -png 4000.pdf 100\Page
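
If you want to drive pdftoppm from .NET like the code in the question, a minimal sketch using System.Diagnostics.Process is below; the method name RenderFirstPagesAsync and the assumption that pdftoppm is on the PATH and the output directory already exists are illustrative.

// Sketch only: shell out to Poppler's pdftoppm from .NET.
using System.Diagnostics;

public static async Task RenderFirstPagesAsync(string pdfPath, string outputPrefix, int firstPage, int lastPage)
{
    var startInfo = new ProcessStartInfo
    {
        FileName = "pdftoppm",
        // -f/-l select the page range, -png selects lossless PNG output.
        Arguments = $"-f {firstPage} -l {lastPage} -png \"{pdfPath}\" \"{outputPrefix}\"",
        UseShellExecute = false,
        RedirectStandardError = true,
        CreateNoWindow = true
    };

    using Process process = Process.Start(startInfo)!;
    string errors = await process.StandardError.ReadToEndAsync();
    await process.WaitForExitAsync();

    if (process.ExitCode != 0)
    {
        throw new InvalidOperationException($"pdftoppm failed: {errors}");
    }
}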

Considering the complexity in this case, 7 seconds is good; if there were 64 golden Brahman disks' worth of pages (18,446,744,073,709,551,616), the universe would end before it finished.
