I have a large (gigabyte) file in which an S-expression appears, and I want to skip to the end of the S-expression. The depth of the S-expression is limited to 2, so I tried using a Python regexp (b'\\((?:[^()]|\\((?:[^()]|)*\\))*\\)'). This turned out to consume too much RAM, and digging deeper I found that the memory consumption of moderately complex regexps seems highly unpredictable when the match is large. For instance, the following four equivalent regexps all match a full ten-megabyte string. The fourth one (arguably the most complex) uses a reasonable amount of RAM (about 30 MB), whereas the others consume a gigabyte:
import re
dot = b'[' + b''.join(b'\\x%02x' % (c,) for c in range(256)) + b']'
assert re.match(b'(?:.|.)*', b'a'*10000000).end() > 1000000
assert re.match(b'(?:.|.|a)*', b'a'*10000000).end() > 1000000
assert re.match(b'(?:%s|%s)*' % (dot,dot), b'a'*10000000).end() > 1000000
assert re.match(b'(?:%s|%s|a)*' % (dot,dot), b'a'*10000000).end() > 1000000
(using Python 3.12.3)
Is there a reasonable way to predict if the performance of a Python regexp can scale? And in particular, are there some design principles I can follow if I want to avoid performance pitfalls?
(This question is specifically about the re module, because I prefer to use standard Python libraries; I suspect this would not be an issue if I switched to a third-party lib like regex.)
- regex101 has a tool that shows how a regexp is processed. I'm not sure if this is what you need. I don't think there's any easy way to analyze a regexp statically. There may be some heuristics to look for things like catastrophic backtracking. – Barmar Commented Mar 28 at 20:18
1 Answer
In Python 3.11 (or later), try using "possessive quantifiers" instead. Meaning, in your example, replacing instances of * with *+.
Your regexp doesn't require backtracking, and a possessive quantifier tells the match engine not to bother saving any backtracking info to begin with. This can save a lot of RAM and time.
A more capable engine could deduce on its own that backtracking isn't useful in your case, but that's beyond what Python's engine can do.
Relatedly, when backtracking isn't useful, it can also help to replace non-capturing groups ((?:...)) with atomic groups ((?>...)). Atomic groups are also non-capturing, but additionally skip saving backtracking information. Possessive quantifiers are in fact syntactic sugar for a more long-winded atomic group.
Tags: python. Original title: python - Predicting `re` regexp memory consumption - Stack Overflow