I can't find a way to isolate or extract vector charts and graphs (ones that are not images) from a PDF.
I tried extracting them directly, but realized it is not that straightforward. I was using PyMuPDF. The script below extracts and saves only images, but I need to save charts that are not images. Apparently they are stored differently in a PDF.
import io
import os

import fitz  # PyMuPDF
from PIL import Image

pdf_path = 'path to pdf'
output_folder = 'your output folder'
os.makedirs(output_folder, exist_ok=True)

doc = fitz.open(pdf_path)
chart_count = 0
page = doc.load_page(0)  # first page only

# get_images() lists only embedded raster images, so vector charts never show up here
img_list = page.get_images(full=True)
for img_index, img in enumerate(img_list):
    base_image = doc.extract_image(img[0])  # img[0] is the image xref
    image_bytes = base_image["image"]
    image = Image.open(io.BytesIO(image_bytes))
    image_path = os.path.join(output_folder, f"chart_{chart_count + 1}.png")
    image.save(image_path)
    chart_count += 1
This works only for raster images embedded in the PDF, not for vector charts. Do you have any suggestions or solutions?
Sample PDF file (where you can see that not all charts are extracted)
asked Feb 3 at 12:38 by ravshanovbek
1 Answer
You are right that a PDF page is made up of different components. Some are filled areas of colour, others are text, and some may be JPEG images. When the background page colours are stripped, the first 6 pages of the sample match that description well:
floating images and floating text characters on the chart-like pages. Any page colours or linework are entirely separate objects within the page.
Now for the charts you hoped would be handled differently. These are either images or simply parts of the page content, so they are not independent graphics that can be extracted as a unit.
To extract the objects in an area, you therefore have to either gather them by their coordinates within your region of interest (ROI), or redact everything else from the page.
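Gathering by coordinates could be sketched roughly as follows. This is only an illustration, not the exact method used for the sample file: the helper names (`intersects`, `union`, `chart_bbox`, `render_chart`) and the ROI values are made up for the example, and the output is a rasterised PNG of the area rather than a vector object. Rectangles are plain `(x0, y0, x1, y1)` tuples in PDF points.

```python
def intersects(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(rects):
    """Smallest (x0, y0, x1, y1) box that covers all given rectangles."""
    xs0, ys0, xs1, ys1 = zip(*rects)
    return (min(xs0), min(ys0), max(xs1), max(ys1))


def chart_bbox(drawing_rects, roi):
    """Bounding box of the drawings that overlap the ROI, or None if none do."""
    hits = [r for r in drawing_rects if intersects(r, roi)]
    return union(hits) if hits else None


def render_chart(pdf_path, page_number, roi, out_png, dpi=200):
    """Rasterise everything inside the chart's bounding box to a PNG file."""
    import fitz  # PyMuPDF, imported here so the helpers above need no dependency

    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    # get_drawings() returns one dict per vector path; "rect" is its bounding box
    rects = [tuple(d["rect"]) for d in page.get_drawings()]
    bbox = chart_bbox(rects, roi)
    if bbox is not None:
        # clip limits rendering to the box; text inside the box is drawn as well
        page.get_pixmap(clip=fitz.Rect(bbox), dpi=dpi).save(out_png)
    doc.close()
    return bbox
```

A call might look like `render_chart("sample.pdf", 6, (50, 100, 550, 400), "chart_1.png")`, where the file name, page number and ROI are placeholders you would have to determine for each chart. Axis labels are text rather than drawings, so they do not influence the bounding box; pad the ROI if labels get cut off.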
PyMuPDF is good at redaction, so trim away everything on the page outside the ROI using redaction boxes along the X and Y directions.
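A minimal sketch of that trimming step, assuming the ROI is already known as an `(x0, y0, x1, y1)` tuple in PDF points. The function names `outside_boxes` and `redact_outside` are hypothetical; `add_redact_annot` and `apply_redactions` are the PyMuPDF calls doing the actual work. How vector graphics that merely touch a redaction box are treated depends on the PyMuPDF version and the options passed to `apply_redactions`, so it is probably wise to leave a small margin around the chart.

```python
def outside_boxes(page_rect, roi):
    """Up to four (x0, y0, x1, y1) boxes covering the page area outside roi."""
    px0, py0, px1, py1 = page_rect
    rx0, ry0, rx1, ry1 = roi
    boxes = [
        (px0, py0, px1, ry0),  # full-width strip above the ROI
        (px0, ry1, px1, py1),  # full-width strip below the ROI
        (px0, ry0, rx0, ry1),  # strip to the left of the ROI
        (rx1, ry0, px1, ry1),  # strip to the right of the ROI
    ]
    # drop empty boxes, e.g. when the ROI touches a page edge
    return [b for b in boxes if b[2] > b[0] and b[3] > b[1]]


def redact_outside(pdf_path, page_number, roi, out_pdf):
    """Remove page content outside roi and save the result as a new PDF."""
    import fitz  # PyMuPDF, imported here so outside_boxes needs no dependency

    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    for box in outside_boxes(tuple(page.rect), roi):
        page.add_redact_annot(fitz.Rect(box))
    # defaults are used here; check your version's options for images and line art
    page.apply_redactions()
    doc.save(out_pdf)
    doc.close()
```

The four boxes tile the page around the ROI without overlapping it, so anything lying fully inside the ROI is left alone.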
Once all the surrounding data has been deleted, set the remaining text to a single colour for ease of viewing.
The end result of editing with MuPDF can thus be a single-page PDF containing only the retained and edited area.
Finally, you can reduce the page size to whatever you want it to be.
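One possible way to end up with a small single-page PDF is to place the ROI of the source page onto a new page of matching size with PyMuPDF's `show_pdf_page`, which keeps the content as vectors. This is a sketch under assumptions: `padded_roi` and `crop_to_pdf` are names invented here, the margin is arbitrary, and as far as I know content outside the clip is only hidden rather than deleted, which is why redacting first still matters.

```python
def padded_roi(roi, page_rect, margin=5.0):
    """Grow roi by margin on every side, clamped to the page rectangle."""
    return (
        max(roi[0] - margin, page_rect[0]),
        max(roi[1] - margin, page_rect[1]),
        min(roi[2] + margin, page_rect[2]),
        min(roi[3] + margin, page_rect[3]),
    )


def crop_to_pdf(pdf_path, page_number, roi, out_pdf, margin=5.0):
    """Write a one-page PDF whose page is exactly the (padded) ROI."""
    import fitz  # PyMuPDF, imported here so padded_roi needs no dependency

    src = fitz.open(pdf_path)
    src_page = src.load_page(page_number)
    clip = fitz.Rect(padded_roi(roi, tuple(src_page.rect), margin))

    out = fitz.open()  # new, empty PDF
    page = out.new_page(width=clip.width, height=clip.height)
    # draw only the clipped part of the source page, filling the new page
    page.show_pdf_page(page.rect, src, page_number, clip=clip)
    out.save(out_pdf)
    out.close()
    src.close()
```

Each chart then becomes its own file, for example `crop_to_pdf("redacted.pdf", 0, (50, 100, 550, 400), "chart_1.pdf")`, with placeholder names and coordinates.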
Code for a custom editor for each page would be too long for me to write here, so I simply cut and paste using mutool and Notepad, which is far easier.