Question: What is the best way to control chunk size used by standard XML iterative parsers in Python?
If single elements* aren't the optimal chunk size for iterative parsers, then what is? Specifically, where are the chosen chunk sizes documented for the popular library lxml and the Python built-in xml.etree.ElementTree?
I seem to have a workaround for changing the default chunk size (e.g. to single lines as a proof of concept) while still using the same iterative parsers and not developing a new one, but I want to know if there is a better, widely-known solution than my somewhat hacky workaround.
* Note: In highly structured example XML documents optimized for human readability, each line usually corresponds to a single opening or closing tag of a single element, so it's conceivable that some chunk sizes are measured in lines. It's probably more plausible, though, that parsers measure chunk sizes in characters or bytes.
What I have tried: I would prefer not to use SAX because it requires a lot of confusingly structured boilerplate code.
The way iterparse (from both lxml.etree and xml.etree.ElementTree) is discussed, it often sounds as if it parses XML files "iteratively", as in element-by-element / tag-by-tag (see note above).
But it appears that in practice, given a file-like object as input, both parsers read and parse the output of that object's .read method (as opposed to the output of .readline, as I had expected). If that file-like object is a file pointer to an 8GB file and this is running on a cluster node with 2GB of memory, that will of course cause an OOM error.
.read has an optional size parameter giving the maximum amount of data to read into memory (measured in bytes or characters, not lines), but if the standard iterative parsers actually pass a value for it when invoking .read, they don't seem to document what value they use. The MWE below shows that if such a value is used, it covers at least the entire 16-line example document.
This conclusion is based on an answer to a related question as well as my own testing. Here is an MWE:
import io
from lxml import etree
import xml.etree.ElementTree as etree2
xml_string = """<root>
<Employee Name="Mr.ZZ" Age="30">
<Experience TotalYears="10" StartDate="2000-01-01" EndDate="2010-12-12">
<Employment id = "1" EndTime="ABC" StartDate="2000-01-01" EndDate="2002-12-12">
<Project Name="ABC_1" Team="4">
</Project>
</Employment>
<Employment id = "2" EndTime="XYZ" StartDate="2003-01-01" EndDate="2010-12-12">
<PromotionStatus>Manager</PromotionStatus>
<Project Name="XYZ_1" Team="7">
<Award>Star Team Member</Award>
</Project>
</Employment>
</Experience>
</Employee>
</root>"""
#### lxml output
for event, element in etree.iterparse(io.BytesIO(xml_string.encode("UTF-8")),
                                      recover=True, remove_blank_text=True,
                                      events=("start", "end")):
    print(str((event, element, element.tag,
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    print(f"{etree.tostring(element)}\n")
### xml.etree.ElementTree output is the same
for event, element in etree2.iterparse(io.BytesIO(xml_string.encode("UTF-8")),
                                       events=("start", "end")):
    print(str((event, element, element.tag,
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    print(f"{etree2.tostring(element)}\n")
Already at the very first iteration, the string representation of the root element covers the entire XML document, which suggests that the entire output of .read has already been parsed, rather than just the first line (which is what I had originally thought the first iteration would correspond to, based on others' discussion of iterparse).
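One way to check what chunk size the parsers actually request is to wrap the input and log every .read call (a minimal sketch; ChunkLogger is just a throwaway name, and it reuses xml_string and the imports from the MWE above):

class ChunkLogger:
    """Wrap a file-like object and log the size passed to every .read call."""
    def __init__(self, raw):
        self._raw = raw

    def read(self, size=-1):
        data = self._raw.read(size)
        print(f"read(size={size}) -> {len(data)} bytes")
        return data

    def close(self):
        self._raw.close()

# Drive the parser to completion; only the logging output matters here.
for _ in etree2.iterparse(ChunkLogger(io.BytesIO(xml_string.encode("UTF-8")))):
    pass

As far as I can tell, CPython's xml.etree.ElementTree.iterparse currently reads fixed blocks of 16*1024 bytes, while lxml's read sizes are driven by libxml2's internal buffering; neither documents the value as a guarantee, so a probe like this is the reliable way to see what a given version does.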
I was able to come up with the following workaround, which displays the expected line-by-line parsing behavior. However, I wonder if there are better solutions. For example, would the millions of readline calls needed for an ~8GB file cause an I/O bottleneck?
### for the MWE
class StreamString(object):
    """File-like wrapper that returns one line per .read call."""
    def __init__(self, string):
        self._io = io.StringIO(string)

    def read(self, size=None):
        # Ignore the requested size and hand back a single line;
        # the empty bytes at EOF tell the parser to stop.
        return self._io.readline().encode("UTF-8")

    def close(self):
        self._io.close()

### closer to what would be used in practice
class StreamFile(object):
    """Same idea, but backed by a file on disk."""
    def __init__(self, path):
        self._file = open(path, "r")

    def read(self, size=None):
        return self._file.readline().encode("UTF-8")

    def close(self):
        self._file.close()
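To blunt the readline-overhead concern above, here is a sketch of a variant (LineChunkFile is a made-up name, not from any library) that honors the parser's size hint but only breaks chunks at line boundaries, cutting the number of .read calls for an ~8GB file from millions to thousands:

class LineChunkFile:
    """Accumulate whole lines until roughly `size` bytes per .read call.
    The last appended line may push a chunk slightly past `size`, which
    a strict file object would not do; since the parser just consumes
    whatever bytes .read returns, that looseness is harmless here."""
    def __init__(self, path):
        self._file = open(path, "rb")

    def read(self, size=-1):
        if size is None or size < 0:
            return self._file.read()
        chunk = bytearray()
        while len(chunk) < size:
            line = self._file.readline()
            if not line:  # EOF
                break
            chunk += line
        return bytes(chunk)

    def close(self):
        self._file.close()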
### demonstrating the expected line-by-line parsing behavior
iterator = etree.iterparse(StreamString(xml_string), recover=True, remove_blank_text=True,
                           events=("start", "end"))
event, root = next(iterator)
print(str((event, root, root.tag,
           root.text.strip() if root.text is not None else root.text,
           root.tail.strip() if root.tail is not None else root.tail)) + "\n")
print(f"{etree.tostring(root)}\n")
for event, element in iterator:
    print(str((event, element, element.tag,
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    print(f"{etree.tostring(root)}\n")
This demonstrates the expected behavior, where the parsed tree rooted at the root element grows with each iteration as new lines are consumed. This behavior is also easier to understand and to reconcile with the numerous suggestions on this site about how to clear the memory footprint of nodes after parsing them, along with the already-parsed "older" depth-first siblings of the node and its ancestors (a sketch of that pattern follows below). It is unclear to me why this is not the default behavior.
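For reference, a minimal sketch of that commonly suggested clearing pattern with xml.etree.ElementTree (Employment, big.xml, and the surrounding names are placeholders matching the MWE's structure):

import xml.etree.ElementTree as ET

def stream_records(path, record_tag):
    """Yield each completed record element, clearing already-processed
    children from the root as we go so memory stays bounded."""
    context = ET.iterparse(path, events=("start", "end"))
    _, root = next(context)          # grab the root from its "start" event
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            root.clear()             # drop everything parsed so far

# Hypothetical usage against the MWE's document structure:
# for employment in stream_records("big.xml", "Employment"):
#     print(employment.get("id"))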
Note: although the XML string used for the MWE is small and easily fits entirely into memory, the end goal is to run this on XML files that are potentially gigabytes in size, on cluster nodes with 1-2 GB of memory. (I don't have control over the compute environment; yes, I agree it would make more sense to just scale vertically to a single node with ~64GB of memory.)
Answer:
Event-based parsers, unlike DOM parsers, do not have to build an in-memory representation of the parsed data and therefore are not limited to documents that fit in memory. Furthermore, "line-by-line parsing" of an XML file makes no sense, as XML is not a line-oriented format. Parsing XML is a long-solved problem; it's better to understand the fully capable existing parsing solutions than to reinvent them poorly.
Realize that processing events via callbacks such as startElement() requires no more state than what you or your requirements impose. If you attempt to retrieve the contents of the root element as a string, of course you risk running out of memory. Don't do that; it's fighting the event framework rather than working naturally within it.
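One concrete way to work within the event framework while keeping the chunk size entirely in your own hands is a pull parser: both xml.etree.ElementTree and lxml.etree provide XMLPullParser, which you feed chunks of whatever size you choose to read (a minimal sketch; big.xml and the 64 KiB chunk size are placeholders):

import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=("start", "end"))
with open("big.xml", "rb") as f:
    while True:
        chunk = f.read(64 * 1024)   # chunk size entirely under your control
        if not chunk:
            break
        parser.feed(chunk)
        for event, elem in parser.read_events():
            # handle events here; clear elements you are done with
            pass
parser.close()  # signal end of input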
Comment (user2357112, Jan 13 at 0:42): xml_string is small enough to easily fit in a reasonable chunk size.