
Question: What is the best way to control chunk size used by standard XML iterative parsers in Python?

If single elements* aren't the optimal chunk size for use by iterative parsers, then what is the optimal chunk size? Specifically, where are the chosen chunk sizes documented for the popular libraries lxml and the Python built-in xml.etree.ElementTree?

I seem to have a workaround for changing the default chunk size (e.g. to single lines as a proof of concept) while still using the same iterative parsers and not developing a new one, but I want to know if there is a better, widely-known solution than my somewhat hacky workaround.

* Note: In highly structured example XML documents optimized for human readability, each line usually corresponds to a single opening or closing tag of a single element, so it's conceivable that some chunk sizes might be measured in numbers of lines. Chunk sizes measured in numbers of characters or bytes seem more plausible, though.

What I have tried: I would prefer not to use SAX because it requires a lot of confusingly structured boilerplate code.

The way iterparse (both from lxml.etree and xml.etree.ElementTree) is discussed, it often sounds as if it parses XML files "iteratively", as in element-by-element / tag-by-tag (see note above).

But it appears that in practice, given a file-like object as input, both parsers call the .read method of that object and parse whatever it returns (as opposed to the output of .readline, as I had expected). If .read were called with no size limit on a file pointer to an 8 GB file, on a cluster node with 2 GB of memory, this would of course cause an OOM error.

.read has an optional size argument, which caps the number of characters or bytes read into memory (not the number of lines, as I had first assumed). But if the standard iterative parsers do pass a value for this argument when invoking .read, they don't seem to document what it is. The MWE below only shows that, if such a limit exists, it is at least as large as the entire 16-line test document.

This conclusion is based on this answer to a related question as well as my own testing. Here is a MWE:

import io
from lxml import etree
import xml.etree.ElementTree as etree2

xml_string = """<root>
        <Employee Name="Mr.ZZ" Age="30">
            <Experience TotalYears="10" StartDate="2000-01-01" EndDate="2010-12-12">
                    <Employment id = "1" EndTime="ABC" StartDate="2000-01-01" EndDate="2002-12-12">
                            <Project Name="ABC_1" Team="4">
                            </Project>
                    </Employment>
                    <Employment id = "2" EndTime="XYZ" StartDate="2003-01-01" EndDate="2010-12-12">
                        <PromotionStatus>Manager</PromotionStatus>
                            <Project Name="XYZ_1" Team="7">
                                <Award>Star Team Member</Award>
                            </Project>
                    </Employment>
            </Experience>
        </Employee>
</root>"""

### lxml output

for event, element in etree.iterparse(io.BytesIO(xml_string.encode("UTF-8")), recover=True, remove_blank_text=True,
                                      events=("start", "end",)):
    print(str((event, element, element.tag, 
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    print(f"{etree.tostring(element)}\n")

### xml.etree.ElementTree output is the same

for event, element in etree2.iterparse(io.BytesIO(xml_string.encode("UTF-8")),
                                      events=("start", "end",)):
    print(str((event, element, element.tag, 
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    print(f"{etree2.tostring(element)}\n")

Already at the very first iteration, the string representation of the root element serializes the entire XML document, which suggests that the entire output of .read has already been parsed, rather than just the first line (which is what I had originally thought the first iteration would correspond to, based on others' discussion of iterparse).
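To check empirically what the parsers actually request, one can wrap the file object and log every .read call. A minimal sketch (the LoggingReader wrapper and its print format are mine, not part of either library):

import io
from lxml import etree

class LoggingReader:

    def __init__(self, raw):
        self._raw = raw

    def read(self, size=-1):
        # Log the size the parser asks for and the size it actually receives.
        data = self._raw.read(size)
        print(f"read(size={size}) -> {len(data)} bytes")
        return data

### reuses xml_string from the MWE above

for event, element in etree.iterparse(LoggingReader(io.BytesIO(xml_string.encode("UTF-8"))),
                                      events=("start", "end",)):
    pass  # the print statements above reveal the requested chunk sizes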

I was able to come up with the following workaround, which displays the expected line-by-line parsing behavior. However, I wonder if there are better solutions. For example, would the millions of readline calls needed for an ~8 GB file cause an I/O bottleneck? (A pull-parser alternative is sketched after the note at the end of the question.)

### for the MWE

class StreamString(object):

    def __init__(self, string):
        self._io = io.StringIO(string)

    def read(self, size=-1):
        # Ignore the requested size and hand the parser one line at a time.
        return self._io.readline().encode("UTF-8")

    def close(self):
        self._io.close()

### closer to what would be used in practice

class StreamFile(object):

    def __init__(self, path):
        # Open in binary mode so no decode/re-encode round trip is needed.
        self._file = open(path, "rb")

    def read(self, size=-1):
        # Ignore the requested size and hand the parser one line at a time.
        return self._file.readline()

    def close(self):
        self._file.close()

### demonstrating the expected line-by-line parsing behavior

iterator = etree.iterparse(StreamString(xml_string), recover=True, remove_blank_text=True,
                                      events=("start", "end",))
event, root = next(iterator)
print(str((event, root, root.tag, 
           root.text.strip() if root.text is not None else root.text,
           root.tail.strip() if root.tail is not None else root.tail)) + "\n")
print(f"{etree.tostring(root)}\n")

for event, element in iterator:
    print(str((event, element, element.tag, 
               element.text.strip() if element.text is not None else element.text,
               element.tail.strip() if element.tail is not None else element.tail)) + "\n")
    print(f"{etree.tostring(root)}\n")

This demonstrates the expected behavior: the tree rooted at the root element grows with each iteration as new lines are parsed. This behavior is also easier to understand and easier to reconcile with the numerous suggestions on this site about clearing the memory footprint of nodes (and of their already-processed earlier siblings in document order) after parsing them, as sketched below. It is unclear to me why this is not the default behavior.
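For reference, a minimal sketch of that commonly suggested clearing idiom with plain iterparse (the path big.xml is a placeholder, and Employment is just the record tag from the MWE):

import xml.etree.ElementTree as ET

iterator = ET.iterparse("big.xml", events=("start", "end"))  # placeholder path
_, root = next(iterator)  # the first event hands back the root element

for event, element in iterator:
    if event == "end" and element.tag == "Employment":
        # ... process the fully parsed element here ...
        root.clear()  # drop already-processed subtrees so the tree stays small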

Note: although the XML string used for the MWE is small and easily fits entirely into memory, the end goal is to run this on XML files that are potentially gigabytes in size, on cluster nodes with 1-2 GB of memory. (I don't have control over the compute environment; yes, I agree it would make more sense to just scale vertically to a single node with ~64 GB of memory.)
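Aside: the one standard API where the caller explicitly controls the chunk size is the pull-parser interface (xml.etree.ElementTree.XMLPullParser; lxml offers an equivalent). A sketch, with CHUNK_SIZE and the file path as placeholders:

import xml.etree.ElementTree as ET

CHUNK_SIZE = 64 * 1024  # placeholder; choose to fit the memory budget

parser = ET.XMLPullParser(events=("start", "end"))
with open("big.xml", "rb") as f:  # placeholder path
    while True:
        chunk = f.read(CHUNK_SIZE)  # the chunk size is decided here, by the caller
        if not chunk:
            break
        parser.feed(chunk)
        for event, element in parser.read_events():
            pass  # handle events; clear elements as needed
parser.close()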

asked Jan 12 at 23:54 by hasManyStupidQuestions; edited Feb 22 at 18:44
  • Have you tried with a bigger input? It might be reading in chunks instead of by line, and your xml_string is small enough to easily fit in a reasonable chunk size. – user2357112
  • Deleting elements to keep memory low – LMC
  • You can cut huge XML files into pieces with, e.g., xml_split. – Hermann12
  • @user2357112 I agree, it appears to be reading in chunks (it is clearly not line by line) -- but where do I find the documentation about the chunk size that is being used? My concern is that the chunk size might itself be large enough to cause OOM issues. It is hard to find a series of successively larger "minimal working example" XML documents to reverse-engineer the chunk size. – hasManyStupidQuestions
  • @hasManyStupidQuestions: These kinds of chunks are usually a few kilobytes. Nowhere near enough to cause problems. – user2357112

1 Answer


Event-based parsers, unlike DOM parsers, do not have to build an in-memory representation of the parsed data and therefore are not limited to documents that can fit in memory. Furthermore, "line-by-line parsing" of an XML file makes no sense, as XML is not a line-oriented format. Parsing XML is a long-solved problem; it's better to understand the fully capable existing parsing solutions than to reinvent them poorly.

Realize that processing events via callbacks such as startElement() requires no more state than what you or your requirements impose. If you attempt to retrieve the contents of the root element as a string, of course you risk running out of memory. Don't do that; it's fighting the event framework rather than working naturally within it.
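For completeness, a minimal sketch of the callback style described above, using the standard library's SAX module and the xml_string from the question's MWE; it counts Employment elements with constant memory and no tree at all:

import xml.sax

class EmploymentCounter(xml.sax.ContentHandler):

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Only the counter is kept as state; no tree is ever built.
        if name == "Employment":
            self.count += 1

handler = EmploymentCounter()
xml.sax.parseString(xml_string.encode("UTF-8"), handler)
print(handler.count)  # 2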
