I have a large (gigabyte) file in which an S-expression appears, and I want to skip to the end of the S-expression. The depth of the S-expression is limited to 2, so I tried using a Python regexp (b'\\((?:[^()]|\\((?:[^()]|)*\\))*\\)'). This turned out to consume too much RAM, and digging deeper I found that the memory consumption of moderately complex regexps seems highly unpredictable when the match is large. For instance, the following four equivalent regexps all match a full ten-megabyte string. The fourth one (arguably the most complex) uses a reasonable amount of RAM (about 30 MB), whereas the others consume a gigabyte:
import re
dot = b'[' + b''.join(b'\\x%02x' % (c,) for c in range(256)) + b']'
assert re.match(b'(?:.|.)*', b'a'*10000000).end() > 1000000
assert re.match(b'(?:.|.|a)*', b'a'*10000000).end() > 1000000
assert re.match(b'(?:%s|%s)*' % (dot,dot), b'a'*10000000).end() > 1000000
assert re.match(b'(?:%s|%s|a)*' % (dot,dot), b'a'*10000000).end() > 1000000
(using Python 3.12.3)
Is there a reasonable way to predict if the performance of a Python regexp can scale? And in particular, are there some design principles I can follow if I want to avoid performance pitfalls?
(This question is specifically about the re module, because I prefer to use standard Python libraries; I suspect this would not be an issue if I switched to a third-party lib like regex.)
- regex101 has a tool that shows how a regexp is processed. I'm not sure if this is what you need. I don't think there's any easy way to analyze a regexp statically. There may be some heuristics to look for things like catastrophic backtracking. – Barmar Commented Mar 28 at 20:18
1 Answer
In Python 3.11 (or later), try using "possessive quantifiers" instead. Meaning, in your example, replacing instances of * with *+.
Your regexp doesn't require backtracking, and a possessive quantifier tells the match engine not to bother saving any backtracking info to begin with. This can save a lot of RAM and time.
A more capable engine could deduce on its own that backtracking isn't useful in your case, but that's beyond what Python's engine can do.
Relatedly, when backtracking isn't useful, it can also help to replace non-capturing groups ((?:...)) with atomic groups ((?>...)). Atomic groups are also non-capturing, but additionally skip saving backtracking information. Possessive quantifiers are in fact syntactic sugar for a more long-winded atomic group.
Tags: python. Original title: python - Predicting `re` regexp memory consumption - Stack Overflow