Python xml iterparse consumes memory

I want to process a large XML file (50 GB) that is archived in a zip file (5 GB). My idea was to process it in streaming mode because of the file size.

I implemented this code:

import xml.etree.ElementTree as ET
import zipfile
import gc

def process_big_xml(path_to_zip, path_to_xml):
  with zipfile.ZipFile(path_to_zip, "r") as z:
    with z.open(path_to_xml) as file_streamed:
      context = ET.iterparse(file_streamed, events=("start", "end"))
      processed_lines = 0
      for event, elem in context:
        if event == "end" and elem.tag == "record":
          ... 
          (processing data in XML)
          ...
          processed_lines += 1
          
        # Clearing memory
        elem.clear()
        del elem

        if processed_lines % 25000 == 0:
          gc.collect() # Using gc to reduce memory usage

Memory usage increases linearly despite using elem.clear(), del elem and gc.collect(). Where is the mistake? What can be improved? I also tried lxml - memory usage was fine, but the function was x times slower.

asked Mar 24 at 14:48 by keyboardNoob; edited Mar 24 at 14:57 by Naveed Ahmed
  • Did you check that elem.clear(), del elem and gc.collect() are actually working in your code? – Naveed Ahmed Commented Mar 24 at 14:54
  • @NaveedAhmed how can I check this? – keyboardNoob Commented Mar 24 at 14:55
  • You can run your code in debug mode and use the watch variable window in your IDE. – Naveed Ahmed Commented Mar 24 at 15:02
  • The lines elem.clear() and del elem do nothing at all, since a new object will be bound to the variable elem as soon as the next loop iteration occurs. The old elem will be garbage collected at that point. – Paul Cornelius Commented Mar 24 at 15:05
  • Memory is probably increasing due to the (processing data in XML) part, which retains data despite the elem object being garbage collected. Does the process finish at all? – LMC Commented Mar 25 at 16:13
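
For the comment question about how to check whether memory is actually being released, one option is to sample the interpreter's own allocation counters while iterating. A minimal sketch, assuming the same zip layout and "record" tag as in the question (the function name and sampling interval are only for illustration):

import tracemalloc
import xml.etree.ElementTree as ET
import zipfile

def profile_big_xml(path_to_zip, path_to_xml, sample_every=25000):
    # Track Python-level allocations and print a sample every N records
    tracemalloc.start()
    with zipfile.ZipFile(path_to_zip, "r") as z:
        with z.open(path_to_xml) as file_streamed:
            processed = 0
            for event, elem in ET.iterparse(file_streamed, events=("end",)):
                if elem.tag == "record":
                    processed += 1
                    elem.clear()
                    if processed % sample_every == 0:
                        current, peak = tracemalloc.get_traced_memory()
                        print(f"{processed} records: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
    tracemalloc.stop()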

3 Answers


Not extracting the zip is good. In addition, define a chunk size, e.g. 1 MB, and wrap the stream in a buffered reader; this lets you fine-tune the trade-off between speed and memory. Try 4 MB, 8 MB, etc.

import xml.etree.ElementTree as ET
import zipfile
import gc
import io

def process_big_xml(path_to_zip, path_to_xml, chunk_size=1024*1024):  
    with zipfile.ZipFile(path_to_zip, "r") as z:
        with z.open(path_to_xml) as file_streamed:
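            # Buffer the decompressed zip stream so the parser reads larger chunks at a time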
            buffered_reader = io.BufferedReader(file_streamed, buffer_size=chunk_size)  
            context = ET.iterparse(buffered_reader, events=("start", "end"))
            
            processed_lines = 0
            for event, elem in context:
                if event == "end" and elem.tag == "record":
                    # Process the data here...
                    processed_lines += 1

                    # Clear only fully parsed records; clearing on "start" events
                    # would wipe the attributes before they can be read.
                    elem.clear()

                    if processed_lines % 25000 == 0:
                        gc.collect()
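
If memory still grows with the ElementTree version, a likely cause is that record elements cleared at their "end" event remain attached to the document root, so the empty elements keep accumulating. A minimal sketch of the usual workaround, capturing the root from the first event and clearing it periodically (the function name is just for illustration):

import xml.etree.ElementTree as ET
import zipfile

def process_big_xml_clearing_root(path_to_zip, path_to_xml):
    with zipfile.ZipFile(path_to_zip, "r") as z:
        with z.open(path_to_xml) as file_streamed:
            context = ET.iterparse(file_streamed, events=("start", "end"))
            _, root = next(context)  # first event is the root element's "start"
            processed_lines = 0
            for event, elem in context:
                if event == "end" and elem.tag == "record":
                    # Process the record here...
                    processed_lines += 1
                    elem.clear()
                    if processed_lines % 25000 == 0:
                        root.clear()  # drop references to already-processed children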

As an alternative, you can also try lxml, which can be fast as well. The code below is updated according to your example.

from lxml import etree
import zipfile
import io

def process_big_xml(path_to_zip, path_to_xml, chunk_size=512*1024):
    with zipfile.ZipFile(path_to_zip, "r") as z:
        with z.open(path_to_xml) as file_streamed:
            buffered_reader = io.BufferedReader(file_streamed, buffer_size=chunk_size)  
            context = etree.iterparse(buffered_reader, events=("end",), tag="record")  

            for event, elem in context:
                record_id = elem.get("id")
                record_type = elem.get("type")

                pricing_elem = elem.find("pricing")
                if pricing_elem is not None:
                    tier = pricing_elem.get("tier")
                    price = pricing_elem.get("price")

                    print(f"Record ID: {record_id}, Type: {record_type}, Tier: {tier}, Price: {price}")

                # Memory cleanup: clear and remove processed elements
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]  

            # Ensure the parser releases memory
            context = None

To optimize memory usage and speed, you can also have a look at the expat parser:

import zipfile
import xml.parsers.expat

current_record = None

def start_element(name, attrs):
    global current_record
    if name == "record":
        current_record = {"id": attrs.get("id", ""), "type": attrs.get("type", "")}
    elif name == "pricing" and current_record is not None:
        current_record["tier"] = attrs.get("tier", "")
        current_record["price"] = attrs.get("price", "")

def end_element(name):
    global current_record
    if name == "record" and current_record:
        print(f"Parsed Record: {current_record}")  # Process the record (e.g., save it)
        current_record = None  # Reset for next record

def char_data(data):
    pass  # Not needed for your shared structure, but kept for completeness


zip_path = "large_file.zip"
xml_filename = "large_file.xml"

with zipfile.ZipFile(zip_path, 'r') as z:
    with z.open(xml_filename) as f:
        parser = xml.parsers.expat.ParserCreate()
        parser.StartElementHandler = start_element
        parser.EndElementHandler = end_element
        parser.CharacterDataHandler = char_data

        # Read and parse in chunks
        chunk_size = 1024 * 1024
        for chunk in iter(lambda: f.read(chunk_size), b""):
            parser.Parse(chunk, False)

        parser.Parse(b"", True)
