
Question: What is the best way to control chunk size used by standard XML iterative parsers in Python?

If single elements* aren't the optimal chunk size for use by iterative parsers, then what is the optimal chunk size? Specifically, where are the chosen chunk sizes documented for the popular libraries lxml and the Python built-in xml.etree.ElementTree?

I seem to have a workaround for changing the default chunk size (e.g. to single lines as a proof of concept) while still using the same iterative parsers and not developing a new one, but I want to know if there is a better, widely-known solution than my somewhat hacky workaround.

* Note: In highly structured example XML documents optimized for human readability, each line usually corresponds to a single opening or closing tag of a single element, so it's conceivable that some chunk sizes might be measured in numbers of lines. Chunk sizes measured in numbers of characters or bytes seem more plausible, though.

What I have tried: I would prefer not to use SAX because it requires a lot of confusingly structured boilerplate code.

The way iterparse (both from lxml.etree and xml.etree.ElementTree) is discussed, it often sounds as if it parses XML files "iteratively", as in element-by-element / tag-by-tag (see note above).

But it appears that in practice, given a file-like object as input, both parsers call the .read method of that object and parse whatever it returns (as opposed to the output of .readline, as I had expected). If .read were called with no size limit on a file pointer to an 8 GB file, on a cluster node with 2 GB of memory, this would of course cause an OOM error.

.read has an optional size argument, which caps the number of characters or bytes read into memory (not the number of lines, as I had first assumed). But if the standard iterative parsers do pass a value for this argument when invoking .read, they don't seem to document what it is. The MWE below only shows that, if such a limit exists, it is at least as large as the entire 16-line test document.

This conclusion is based on this answer to a related question as well as my own testing. Here is a MWE:

import io
from lxml import etree
import xml.etree.ElementTree as etree2

xml_string = """<root>
        <Employee Name="Mr.ZZ" Age="30">
            <Experience TotalYears="10" StartDate="2000-01-01" EndDate="2010-12-12">
                    <Employment id = "1" EndTime="ABC" StartDate="2000-01-01" EndDate="2002-12-12">
                            <Project Name="ABC_1" Team="4">
                            </Project>
                    </Employment>
                    <Employment id = "2" EndTime="XYZ" StartDate="2003-01-01" EndDate="2010-12-12">
                        <PromotionStatus>Manager</PromotionStatus>
                            <Project Name="XYZ_1" Team="7">
                                <Award>Star Team Member</Award>
                            </Project>
                    </Employment>
            </Experience>
        </Employee>
</root>"""

### lxml output

for event, element in etree.iterparse(io.BytesIO(xml_string.encode("UTF-8")), recover=True, remove_blank_text=True,
                                      events=("start", "end",)):
    print(str((event, element, element.tag, 
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    print(f"{etree.tostring(element)}\n")

### xml.etree.ElementTree output is the same

for event, element in etree2.iterparse(io.BytesIO(xml_string.encode("UTF-8")),
                                      events=("start", "end",)):
    print(str((event, element, element.tag, 
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    print(f"{etree2.tostring(element)}\n")

Already at the very first iteration, the string representation of the root element serializes the entire XML document, which suggests that the entire output of .read has already been parsed, rather than just the first line (which is what I had originally thought the first iteration would correspond to, based on others' discussion of iterparse).
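To check empirically what the parsers actually request, one can wrap the file object and log every .read call. A minimal sketch (the LoggingReader wrapper and its print format are mine, not part of either library):

import io
from lxml import etree

class LoggingReader:

    def __init__(self, raw):
        self._raw = raw

    def read(self, size=-1):
        # Log the size the parser asks for and the size it actually receives.
        data = self._raw.read(size)
        print(f"read(size={size}) -> {len(data)} bytes")
        return data

### reuses xml_string from the MWE above

for event, element in etree.iterparse(LoggingReader(io.BytesIO(xml_string.encode("UTF-8"))),
                                      events=("start", "end",)):
    pass  # the print statements above reveal the requested chunk sizes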

I was able to come up with the following workaround, which displays the expected line-by-line parsing behavior. However, I wonder if there are better solutions. For example, would the millions of readline calls needed for an ~8 GB file cause an I/O bottleneck? (A pull-parser alternative is sketched after the note at the end of the question.)

### for the MWE

class StreamString(object):

    def __init__(self, string):
        self._io = io.StringIO(string)

    def read(self, size=-1):
        # Ignore the requested size and hand the parser one line at a time.
        return self._io.readline().encode("UTF-8")

    def close(self):
        self._io.close()

### closer to what would be used in practice

class StreamFile(object):

    def __init__(self, path):
        # Open in binary mode so no decode/re-encode round trip is needed.
        self._file = open(path, "rb")

    def read(self, size=-1):
        # Ignore the requested size and hand the parser one line at a time.
        return self._file.readline()

    def close(self):
        self._file.close()

### demonstrating the expected line-by-line parsing behavior

iterator = etree.iterparse(StreamString(xml_string), recover=True, remove_blank_text=True,
                                      events=("start", "end",))
event, root = next(iterator)
print(str((event, root, root.tag, 
           root.text.strip() if root.text is not None else root.text,
           root.tail.strip() if root.tail is not None else root.tail)) + "\n")
print(f"{etree.tostring(root)}\n")

for event, element in iterator:
    print(str((event, element, element.tag, 
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    print(f"{etree.tostring(root)}\n")

This demonstrates the expected behavior: the tree rooted at the root element grows with each iteration as new lines are parsed. This behavior is also easier to understand and easier to reconcile with the numerous suggestions on this site about clearing the memory footprint of nodes (and of their already-processed earlier siblings in document order) after parsing them, as sketched below. It is unclear to me why this is not the default behavior.
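For reference, a minimal sketch of that commonly suggested clearing idiom with plain iterparse (the path big.xml is a placeholder, and Employment is just the record tag from the MWE):

import xml.etree.ElementTree as ET

iterator = ET.iterparse("big.xml", events=("start", "end"))  # placeholder path
_, root = next(iterator)  # the first event hands back the root element

for event, element in iterator:
    if event == "end" and element.tag == "Employment":
        # ... process the fully parsed element here ...
        root.clear()  # drop already-processed subtrees so the tree stays small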

Note: although the XML string used for the MWE is small and easily fits entirely into memory, the end goal is to run this on XML files that are potentially gigabytes in size, on cluster nodes with 1-2 GB of memory. (I don't have control over the compute environment; yes, I agree it would make more sense to just scale vertically to a single node with ~64 GB of memory.)
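Aside: the one standard API where the caller explicitly controls the chunk size is the pull-parser interface (xml.etree.ElementTree.XMLPullParser; lxml offers an equivalent). A sketch, with CHUNK_SIZE and the file path as placeholders:

import xml.etree.ElementTree as ET

CHUNK_SIZE = 64 * 1024  # placeholder; choose to fit the memory budget

parser = ET.XMLPullParser(events=("start", "end"))
with open("big.xml", "rb") as f:  # placeholder path
    while True:
        chunk = f.read(CHUNK_SIZE)  # the chunk size is decided here, by the caller
        if not chunk:
            break
        parser.feed(chunk)
        for event, element in parser.read_events():
            pass  # handle events; clear elements as needed
parser.close()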

asked Jan 12 at 23:54 by hasManyStupidQuestions; edited Feb 22 at 18:44
  • Have you tried with a bigger input? It might be reading in chunks instead of by line, and your xml_string is small enough to easily fit in a reasonable chunk size. – user2357112
  • Deleting elements to keep memory low – LMC
  • You can cut huge XML files into pieces with, e.g., xml_split. – Hermann12
  • @user2357112 I agree, it appears to be reading in chunks (it is clearly not line by line) -- but where do I find the documentation about the chunk size that is being used? My concern is that the chunk size might itself be large enough to cause OOM issues. It is hard to find a series of successively larger "minimal working example" XML documents to reverse-engineer the chunk size. – hasManyStupidQuestions
  • @hasManyStupidQuestions: These kinds of chunks are usually a few kilobytes. Nowhere near enough to cause problems. – user2357112

1 Answer


Event-based parsers, unlike DOM parsers, do not have to build an in-memory representation of the parsed data and therefore are not limited to documents that can fit in memory. Furthermore, "line-by-line parsing" of an XML file makes no sense, as XML is not a line-oriented format. Parsing XML is a long-solved problem; it's better to understand the fully capable existing parsing solutions than to reinvent them poorly.

Realize that processing events via callbacks such as startElement() requires no more state than what you or your requirements impose. If you attempt to retrieve the contents of the root element as a string, of course you risk running out of memory. Don't do that; it's fighting the event framework rather than working naturally within it.
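For completeness, a minimal sketch of the callback style described above, using the standard library's SAX module and the xml_string from the question's MWE; it counts Employment elements with constant memory and no tree at all:

import xml.sax

class EmploymentCounter(xml.sax.ContentHandler):

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Only the counter is kept as state; no tree is ever built.
        if name == "Employment":
            self.count += 1

handler = EmploymentCounter()
xml.sax.parseString(xml_string.encode("UTF-8"), handler)
print(handler.count)  # 2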
