I am working with Python's re module. I am trying to create three groups in each match for text that is set up like so:
1. INTRO:
1.1 Title
1.2 Context:
1.2.1 item1
item1 continued
item1 continued
1.2.2 item2 (03-15-A-B)
1.2.3 item3
text continued
(03-35-A-B)
1.2.4 item4
(03-15
-B-4)
1.2.5 item5, (B-21-Q-2) continued
1.3 Background:
1.3.1 Not Applicable
I am working specifically within the Context (1.2) section at the moment, so everything I want to extract starts with 1.2.x, where x is any integer greater than 0.
Each match would be a list item starting with the section number and ending right before the next list item (i.e., section number).
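To make the item boundary concrete, here is a minimal sketch of one way to express "ends right before the next section number" with a lookahead. It assumes every list item begins at the start of a line and that no continuation line begins with a dotted number such as "1.5 "; the names NEXT_SECTION and ITEM are illustrative only, and the sample text is a shortened copy of the outline above.

```python
import re

# Shortened copy of the sample outline.
text = """1.2 Context:
1.2.1 item1
item1 continued
item1 continued
1.2.2 item2 (03-15-A-B)
1.2.3 item3
text continued
(03-35-A-B)
1.3 Background:
1.3.1 Not Applicable
"""

# A line that starts with a dotted section number such as "1.2.3 " or "1.3 ".
# Assumption: only list items start a line this way.
NEXT_SECTION = r"^\d+(?:\.\d+)+\s"

ITEM = re.compile(
    r"^(1\.2\.\d+)\s+"                  # group 1: section number 1.2.x
    r"(.*?)"                            # group 2: body, lazy, may span lines
    rf"(?=\s*(?:{NEXT_SECTION}|\Z))",   # stop before the next section number or the end
    re.MULTILINE | re.DOTALL,
)

items = [(m.group(1), m.group(2)) for m in ITEM.finditer(text)]
for number, body in items:
    print(number, repr(body))
```

re.DOTALL lets the lazy body cross line breaks, and the lookahead is what stops it, so the body of 1.2.1 keeps both continuation lines (still separated by newlines at this stage).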
I am trying to extract three groups in each match:
- The section number (i.e., 1.2.1 and so on)
- The text of the section number (i.e., item1)
- The occasional/optional identification code (i.e., (03-15-A-B) in 1.2.2). This identification code is sometimes found within the text (group 2) of the list item.
With the following regex pattern, I am able to extract all three groups. The only issue is, when an item's text runs onto the next line(s), only the first line is extracted into group 2.
So in this example, with the pattern I am currently using, list items 1.2.1 and 1.2.3 are not extracted the way I would like. For example, for 1.2.1, it extracts only item1 and not item1 continued item1 continued in the match or in group 2.
With this pattern, I also extract the identification code without the parentheses in group 4, which is intentional.
For clarity, with the above example, my desired matches and groups would be as follows:
Match 1:
1.2.1 item1 item1 continued item1 continued
Group 1:
1.2.1
Group 2:
item1 item1 continued item1 continued
Match 2:
1.2.2 item2 (03-15-A-B)
Group 1:
1.2.2
Group 2:
item2
Group 3:
(03-15-A-B)
Match 3:
1.2.3 item3 text continued (03-35-A-B)
Group 1:
1.2.3
Group 2:
item3 text continued
Group 3:
(03-35-A-B)
Match 4:
1.2.4 item4 (03-15 -B-4)
Group 1:
1.2.4
Group 2:
item4
Group 3:
(03-15-B-4)
Match 5:
1.2.5 item5, (B-21-Q-2) continued
Group 1:
1.2.5
Group 2:
item5, continued
Group 3:
(B-21-Q-2)
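One possible way to produce the desired groups listed above is to split the work into two steps instead of using a single pattern: first capture each item up to the next section number, then look for the identification code inside the item body. This is only a sketch under stated assumptions: the code is always wrapped in parentheses and has four hyphen-separated parts, and list items always start a line. The names ITEM, CODE, TOKEN, and parse_items are hypothetical.

```python
import re

text = """1.2 Context:
1.2.1 item1
item1 continued
item1 continued
1.2.2 item2 (03-15-A-B)
1.2.3 item3
text continued
(03-35-A-B)
1.2.4 item4
(03-15
-B-4)
1.2.5 item5, (B-21-Q-2) continued
1.3 Background:
1.3.1 Not Applicable
"""

# Step 1: section number plus everything up to the next section number.
ITEM = re.compile(
    r"^(1\.2\.\d+)\s+(.*?)(?=\s*(?:^\d+(?:\.\d+)+\s|\Z))",
    re.MULTILINE | re.DOTALL,
)

# Step 2: a parenthesised code with four parts joined by hyphens or dashes.
# Assumption: the parentheses are always present.
TOKEN = r"[A-Za-z0-9.]+"
CODE = re.compile(rf"\(\s*{TOKEN}(?:\s*[-—]\s*{TOKEN}){{3}}\s*\)")


def parse_items(text):
    """Return (section number, item text, code or None) for each 1.2.x item."""
    rows = []
    for m in ITEM.finditer(text):
        number, body = m.group(1), m.group(2)
        code = None
        found = CODE.search(body)
        if found:
            # Drop line breaks and spaces that ended up inside the code.
            code = re.sub(r"\s+", "", found.group(0))
            # Cut the code out of the item text.
            body = body[:found.start()] + " " + body[found.end():]
        # Collapse line breaks and repeated spaces in the item text.
        body = " ".join(body.split())
        rows.append((number, body, code))
    return rows


for row in parse_items(text):
    print(row)
```

Handling the code in a second step keeps each pattern small, and it covers the case in 1.2.5 where the code sits in the middle of the item text, which is awkward to express as a single contiguous group.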
Here is my pattern:
(1\.2\.\d+)\s+([\s\S].*?(?:\s*(\(?\s*([a-zA-Z0-9.]+\s*[-—]\s*[a-zA-Z0-9.]+\s*[-—]\s*[a-zA-Z0-9.]+\s*[-—]\s*[a-zA-Z0-9.]+)\)?).*?)?)?$
The pattern for the identification code is set up this way because there have been many instances of different formatting.
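For reference, here is a small script that reproduces the problem with the pattern exactly as written. The flags are an assumption on my part (re.MULTILINE, so that $ can match at the end of each line), and the sample is just the first two list items.

```python
import re

# The pattern from the question, unchanged (split across two string literals).
PATTERN = re.compile(
    r"(1\.2\.\d+)\s+([\s\S].*?(?:\s*(\(?\s*([a-zA-Z0-9.]+\s*[-—]\s*[a-zA-Z0-9.]+"
    r"\s*[-—]\s*[a-zA-Z0-9.]+\s*[-—]\s*[a-zA-Z0-9.]+)\)?).*?)?)?$",
    re.MULTILINE,  # assumption: lets $ match at each line end
)

sample = "1.2.1 item1\nitem1 continued\nitem1 continued\n1.2.2 item2 (03-15-A-B)\n"

matches = list(PATTERN.finditer(sample))
first = matches[0]
print(repr(first.group(2)))  # 'item1'
```

Without re.DOTALL, . does not match a newline, and [\s\S] consumes only a single character, so the lazy .*? reaches the first line end where $ matches and the continuation lines are never included.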