The code I am using to extract the images is:

    import io
    import os

    import fitz  # PyMuPDF
    from PIL import Image


    def extract_images_from_pdfs(pdf_list):
        output_dir = "C:/path_to_image"
        os.makedirs(output_dir, exist_ok=True)

        for pdf_path in pdf_list:
            pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
            # Open the PDF
            pdf_document = fitz.open(pdf_path)
            # Track the count of images extracted from this PDF
            image_count = 0

            for page_num, page in enumerate(pdf_document):
                # Get the images on this page
                image_list = page.get_images(full=True)
                if not image_list:
                    print(f"No images found on page {page_num+1} of {pdf_name}")
                    continue

                # Process each image
                for img_index, img in enumerate(image_list):
                    xref = img[0]
                    base_image = pdf_document.extract_image(xref)
                    if base_image:
                        image_bytes = base_image["image"]
                        image_ext = base_image["ext"]
                        # Convert bytes to image
                        image = Image.open(io.BytesIO(image_bytes))
                        # Save the image
                        image_name = f"{pdf_name}_image_{image_count}.{image_ext}"
                        image_path = os.path.join(output_dir, image_name)
                        image.save(image_path)
                        image_count += 1

            pdf_document.close()
            print(f"Extracted {image_count} images from {pdf_name}")
The input, pdf_list, is just a list containing the file names of all my PDFs.
Extracted image 1
Extracted image 2
Expected image:
Could it be that the images in the PDF are encrypted or otherwise inaccessible, and is there a workaround for this?
Any help is greatly appreciated.
The PDF is available at testingpdfexampaper.tiiny.site
Asked Mar 31 at 19:17 by ShinyZack123; edited Apr 2 at 6:57 by cards.

1 Answer
The PDF has 78 very small pieces of imagery, of which the "largest" is the masking for O on the first page:

       1  60 image    81     62 index    1   8 image no        271  0   151   151 1996B   40%

And many are simply one single pixel.
They can be in any order, and the early ones of the 78 are generally parts of R:
    pdfimages -list chem.pdf

    page num type  width height color comp bpc enc   interp object ID x-ppi y-ppi  size ratio
    -----------------------------------------------------------------------------------------
       1   0 image     4     26 cmyk     4   8 image no        214  0   163   153   77B   19%
       1   1 image     2      2 cmyk     4   8 image no        215  0   204   245   21B  131%
       1   2 image     7     59 index    1   8 image no        226  0   306   303   53B   13%
       1   3 image    60     39 index    1   8 image no        237  0   150   153  819B   35%
       1   4 image     1      1 cmyk     4   8 image no        248  0   204   204   14B  350%
       1   5 image     9      4 cmyk     4   8 image no        259  0   162   153   74B   51%
       1   6 image    58     31 index    1   8 image no        270  0   150   154  526B   29%
       1   7 image     4      3 cmyk     4   8 image no        281  0   153   153   38B   79%
       1   8 image     2      2 cmyk     4   8 image no        290  0   153   175   24B  150%
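
A similar inventory can be taken from Python before extracting anything. This is only a minimal sketch: it assumes the tuple layout returned by PyMuPDF's page.get_images(full=True), which starts with (xref, smask, width, height, ...). The helper names and the 16-pixel threshold are hypothetical choices for illustration.

```python
def summarize_image_entries(entries, min_side=16):
    """Split get_images(full=True) tuples into (tiny, usable) lists of (xref, width, height).

    min_side is an arbitrary, hypothetical threshold: anything with a side
    shorter than this is treated as a fragment rather than a real picture.
    """
    tiny, usable = [], []
    for entry in entries:
        xref, _smask, width, height = entry[:4]
        target = tiny if min(width, height) < min_side else usable
        target.append((xref, width, height))
    return tiny, usable


def report_pdf_images(pdf_path, min_side=16):
    """Print, per page, how many embedded images are tiny fragments."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            tiny, usable = summarize_image_entries(
                page.get_images(full=True), min_side
            )
            print(
                f"page {page.number + 1}: "
                f"{len(tiny)} tiny fragments, {len(usable)} larger images"
            )
```

If nearly everything lands in the tiny list, as the listing above suggests for this file, the extracted files will look broken no matter how they are saved.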
NOTE: as is common with many PDF constructions, there is no "one to one" relationship between what you see on the page and the objects stored in the file.
One line of text can be placed in many pieces, and one visible line can be made up of multiple paths too.
Thus image extraction is of no real value here. Instead, any whole page can be rendered as a single image and then trimmed to the desired area, at any density/quality you wish.
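
The render-then-trim approach could look roughly like this with PyMuPDF. It is a sketch under assumptions: the function names, the region coordinates in the usage line, and the output file name are made up for illustration, and the region is given in PDF points (1/72 inch).

```python
def dpi_to_zoom(dpi):
    """PDF user space has 72 units per inch, so the zoom factor is dpi / 72."""
    return dpi / 72.0


def render_region(pdf_path, page_index, region, out_path, dpi=300):
    """Render only `region` (x0, y0, x1, y1 in PDF points) of one page to a file.

    Returns the (width, height) of the saved image in pixels.
    """
    import fitz  # PyMuPDF

    zoom = dpi_to_zoom(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        # Intersect with the page rectangle so the crop stays on the page.
        clip = fitz.Rect(*region) & page.rect
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=clip)
        pix.save(out_path)
        size = (pix.width, pix.height)
    return size
```

Hypothetical usage: render_region("chem.pdf", 0, (50, 100, 300, 250), "structure.png") would save that rectangle of the first page at 300 dpi. Because the page is rendered as a whole, the masks, single pixels, and vector paths all end up composited the way a viewer shows them.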
Python has PyMuPDF, which can "gather" vector "paths" and combine them into single graphical units. So if you select an area of interest (a Region of Interest), the paths inside it can possibly be reused as vectors elsewhere.
This is similar in effect to the way the MuPDF command line can, with a few well-chosen commands, export SVG areas for reuse.
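
One way to gather paths into units, sketched under assumptions: PyMuPDF's page.get_drawings() returns one dictionary per vector path with its bounding box under the "rect" key, and nearby boxes can be merged into candidate regions. The merging helper, its name, and the 3-point gap are hypothetical. Recent PyMuPDF releases may also offer a built-in page.cluster_drawings() for this, if your version has it.

```python
def merge_boxes(boxes, gap=3.0):
    """Greedily merge (x0, y0, x1, y1) boxes that overlap or lie within `gap` points."""
    merged = []
    for box in boxes:
        box = tuple(box)
        changed = True
        while changed:
            changed = False
            for other in merged:
                close_in_x = box[0] <= other[2] + gap and other[0] <= box[2] + gap
                close_in_y = box[1] <= other[3] + gap and other[1] <= box[3] + gap
                if close_in_x and close_in_y:
                    # Absorb the neighbour and re-check against the rest.
                    merged.remove(other)
                    box = (
                        min(box[0], other[0]), min(box[1], other[1]),
                        max(box[2], other[2]), max(box[3], other[3]),
                    )
                    changed = True
                    break
        merged.append(box)
    return merged


def vector_regions(pdf_path, page_index=0, gap=3.0):
    """Return merged bounding boxes of the vector paths on one page."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        boxes = [tuple(d["rect"]) for d in page.get_drawings()]
    return merge_boxes(boxes, gap)
```

Each returned box is a candidate Region of Interest that can then be rendered or exported on its own.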
Comment: ... pdfimages command to see what it gets? – Tim Roberts, Mar 31 at 19:23