I want to process a large XML file (50 GB) that is archived in a zip file (5 GB). My idea was to process it in stream mode because of the file size.
I implemented this code:
import xml.etree.ElementTree as ET
import zipfile
import gc

def process_big_xml(path_to_zip, path_to_xml):
    with zipfile.ZipFile(path_to_zip, "r") as z:
        with z.open(path_to_xml) as file_streamed:
            context = ET.iterparse(file_streamed, events=("start", "end"))
            processed_lines = 0
            for event, elem in context:
                if event == "end" and elem.tag == "record":
                    ...
                    (processing data in XML)
                    ...
                    processed_lines += 1
                    # Clearing memory
                    elem.clear()
                    del elem
                    if processed_lines % 25000 == 0:
                        gc.collect()  # Using gc to reduce memory usage
The memory usage is increasing linearly despite using elem.clear(), del elem and gc.collect(). Where is the mistake? What can be improved? I also tried lxml - memory usage was OK, but the function was x times slower.
3 Answers
You don't extract the zip, which is good, but define a chunk size, e.g. 1 MB. With this you can fine-tune speed and memory. Test maybe 4 MB, 8 MB, etc.
import xml.etree.ElementTree as ET
import zipfile
import gc
import io

def process_big_xml(path_to_zip, path_to_xml, chunk_size=1024*1024):
    with zipfile.ZipFile(path_to_zip, "r") as z:
        with z.open(path_to_xml) as file_streamed:
            buffered_reader = io.BufferedReader(file_streamed, buffer_size=chunk_size)
            context = ET.iterparse(buffered_reader, events=("start", "end"))
            processed_lines = 0
            for event, elem in context:
                if event == "end" and elem.tag == "record":
                    # Process the data here...
                    processed_lines += 1
                    elem.clear()
                    if processed_lines % 25000 == 0:
                        gc.collect()
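To compare chunk sizes as suggested above, a small timing loop could look like the sketch below. The archive and member names are placeholders, and you would normally run this against a smaller test file first.

import time

ZIP_PATH = "archive.zip"   # placeholder path to your zip archive
XML_NAME = "data.xml"      # placeholder name of the XML member inside the zip

# Time a few buffer sizes to find a good speed/memory trade-off
for size in (1 * 1024 * 1024, 4 * 1024 * 1024, 8 * 1024 * 1024):
    start = time.perf_counter()
    process_big_xml(ZIP_PATH, XML_NAME, chunk_size=size)
    elapsed = time.perf_counter() - start
    print(f"chunk_size={size // (1024 * 1024)} MiB: {elapsed:.1f} s")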
As an alternative you can also try lxml, which may be faster. Updated code according to your example:
from lxml import etree
import zipfile
import io

def process_big_xml(path_to_zip, path_to_xml, chunk_size=512*1024):
    with zipfile.ZipFile(path_to_zip, "r") as z:
        with z.open(path_to_xml) as file_streamed:
            buffered_reader = io.BufferedReader(file_streamed, buffer_size=chunk_size)
            context = etree.iterparse(buffered_reader, events=("end",), tag="record")
            for event, elem in context:
                record_id = elem.get("id")
                record_type = elem.get("type")
                pricing_elem = elem.find("pricing")
                if pricing_elem is not None:
                    tier = pricing_elem.get("tier")
                    price = pricing_elem.get("price")
                    print(f"Record ID: {record_id}, Type: {record_type}, Tier: {tier}, Price: {price}")
                # Memory cleanup: clear and remove processed elements
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]
            # Ensure the parser releases memory
            context = None
To optimize memory usage and speed you can also have a look at the expat parser:
import zipfile
import xml.parsers.expat

current_record = None

def start_element(name, attrs):
    global current_record
    if name == "record":
        current_record = {"id": attrs.get("id", ""), "type": attrs.get("type", "")}
    elif name == "pricing" and current_record is not None:
        current_record["tier"] = attrs.get("tier", "")
        current_record["price"] = attrs.get("price", "")

def end_element(name):
    global current_record
    if name == "record" and current_record:
        print(f"Parsed Record: {current_record}")  # Process the record (e.g., save it)
        current_record = None  # Reset for next record

def char_data(data):
    pass  # Not needed for your shared structure, but kept for completeness

zip_path = "large_file.zip"
xml_filename = "large_file.xml"

with zipfile.ZipFile(zip_path, 'r') as z:
    with z.open(xml_filename) as f:
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = start_element
        parser.EndElementHandler = end_element
        parser.CharacterDataHandler = char_data
        # Read and parse in chunks
        chunk_size = 1024 * 1024
        for chunk in iter(lambda: f.read(chunk_size), b""):
            parser.Parse(chunk, False)
        parser.Parse(b"", True)
elem.clear() and del elem do nothing at all, since a new object will be bound to the variable elem as soon as the loop iteration occurs. The old elem will be garbage collected at that point. – Paul Cornelius Commented Mar 24 at 15:05
Could it be the (processing data in XML) part which retains data, despite the elem object being garbage collected? Does the process finish at all? – LMC Commented Mar 25 at 16:13
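Following up on those comments: with xml.etree.ElementTree the usual cause of this kind of growth is that every parsed record stays attached to the document root even after elem.clear(), so the partially built tree keeps growing. A minimal sketch of the commonly used workaround - keep a reference to the root from the first "start" event and clear it as you go - is shown below; the tag name "record" mirrors the question and the processing step is left as a placeholder.

import xml.etree.ElementTree as ET
import zipfile

def process_big_xml(path_to_zip, path_to_xml):
    with zipfile.ZipFile(path_to_zip, "r") as z:
        with z.open(path_to_xml) as file_streamed:
            context = ET.iterparse(file_streamed, events=("start", "end"))
            # The very first event delivers the root element; keep a handle on it
            event, root = next(context)
            processed_lines = 0
            for event, elem in context:
                if event == "end" and elem.tag == "record":
                    # ... process the record here (placeholder) ...
                    processed_lines += 1
                    elem.clear()
                    # Detach already-processed records from the root so the
                    # partially built tree cannot grow without bound
                    root.clear()

With this pattern only the element currently being parsed is kept in memory, so explicit gc.collect() calls should no longer be needed.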