I am working with Python's re module. I am trying to capture three groups per match in text that is laid out like this:

1. INTRO:
   1.1 Title
   1.2 Context:
   1.2.1 item1
   item1 continued
   item1 continued
   1.2.2 item2 (03-15-A-B)
   1.2.3 item3
   text continued
   (03-35-A-B)
   1.2.4 item4
   (03-15
   -B-4)
   1.2.5 item5, (B-21-Q-2) continued
   1.3 Background:
   1.3.1 Not Applicable

I am working specifically within the Context (1.2) section at the moment, so everything I want to extract starts with 1.2.x, where x is any integer greater than 0.

Each match would be a list item starting with the section number and ending right before the next list item (i.e., section number).
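To make that boundary concrete, here is a minimal sketch of one way to express "from a 1.2.x number up to the next section number" with a lazy body and a lookahead. The names ITEM_RE and split_items are hypothetical, and the sketch assumes that no continuation line itself begins with something that looks like a section number (for example "3.5 mm").

```python
import re

TEXT = """\
1. INTRO:
1.1 Title
1.2 Context:
1.2.1 item1
item1 continued
item1 continued
1.2.2 item2 (03-15-A-B)
1.2.3 item3
text continued
(03-35-A-B)
1.2.4 item4
(03-15
-B-4)
1.2.5 item5, (B-21-Q-2) continued
1.3 Background:
1.3.1 Not Applicable
"""

# Hypothetical pattern: group 1 is the section number, group 2 is everything
# up to (but not including) the next line that starts with a dotted number.
ITEM_RE = re.compile(
    r"^[ \t]*(1\.2\.\d+)[ \t]+"           # section number at the start of a line
    r"(.*?)"                              # lazy body; DOTALL lets it span lines
    r"(?=^[ \t]*\d+(?:\.\d+)+[ \t]|\Z)",  # stop before the next section number or at the end
    re.MULTILINE | re.DOTALL,
)

def split_items(text):
    """Return a list of (section_number, raw_body) pairs for the 1.2.x items."""
    return [(m.group(1), m.group(2).strip()) for m in ITEM_RE.finditer(text)]

for number, body in split_items(TEXT):
    print(repr(number), repr(body))
```

The body still contains the identification code at this stage; separating it out is easier as a second step than inside the same pattern.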

I am trying to extract three groups in each match:

  1. The section number (e.g., 1.2.1 and so on)
  2. The text of the list item (e.g., item1)
  3. The optional identification code (e.g., (03-15-A-B) in 1.2.2). This identification code is sometimes found in the middle of the item's text (group 2).

With the regex pattern shown at the end of this post, I am able to extract all three groups. The only issue is that when an item's text runs onto the following line(s), only the first line is captured in group 2.

So in this example, with the pattern I am currently using, list items 1.2.1 and 1.2.3 are not extracted the way I would like. For 1.2.1, for instance, it captures only item1 and not item1 continued item1 continued, both in the match and in group 2.

With this pattern, I also extract the identification code without the parentheses in group 4, which is intentional.

For clarity, with the above example, my desired matches and groups would be as follows:

  • Match 1:

    1.2.1 item1
    item1 continued
    item1 continued
    
    • Group 1:

      1.2.1
      
    • Group 2:

      item1
      item1 continued
      item1 continued
      
  • Match 2:

    1.2.2 item2 (03-15-A-B)
    
    • Group 1:

      1.2.2
      
    • Group 2:

      item2
      
    • Group 3:

      (03-15-A-B)
      
  • Match 3:

    1.2.3 item3
    text continued
    (03-35-A-B)
    
    • Group 1:

      1.2.3
      
    • Group 2:

      item3
      text continued    
      
    • Group 3:

      (03-35-A-B)
      
  • Match 4:

    1.2.4 item4
    (03-15
    -B-4)
    
    • Group 1:

      1.2.4
      
    • Group 2:

      item4  
      
    • Group 3:

      (03-15-B-4)
      
  • Match 5:

    1.2.5 item5, (B-21-Q-2) continued
    
    • Group 1:

      1.2.5
      
    • Group 2:

      item5,  continued  
      
    • Group 3:

      (B-21-Q-2)
      

Here is my pattern:

(1\.2\.\d+)\s+([\s\S].*?(?:\s*(\(?\s*([a-zA-Z0-9.]+\s*[-—]\s*[a-zA-Z0-9.]+\s*[-—]\s*[a-zA-Z0-9.]+\s*[-—]\s*[a-zA-Z0-9.]+)\)?).*?)?)?$

The pattern for the identification code is set up this way because there have been many instances of different formatting.
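For illustration, one possible way to handle the code separately is to search each already-isolated item body for it, strip any whitespace that a line wrap introduced, and treat the rest as the item text. This is only a sketch: CODE_RE and split_text_and_code are hypothetical names, the character classes and separators are copied from the pattern above, and collapsing the remaining text to single spaces is an assumption about the desired output.

```python
import re

# Four alphanumeric chunks joined by hyphens or em dashes, optionally wrapped
# in parentheses. The \s* around each separator lets a code wrap across lines.
CODE_RE = re.compile(
    r"\(?\s*([A-Za-z0-9.]+(?:\s*[-—]\s*[A-Za-z0-9.]+){3})\s*\)?"
)

def split_text_and_code(body):
    """Return (text, code) for one item body; code is None if none is found."""
    m = CODE_RE.search(body)
    if m is None:
        return " ".join(body.split()), None
    # Remove whitespace inside the code, e.g. "03-15\n-B-4" becomes "03-15-B-4".
    code = re.sub(r"\s+", "", m.group(1))
    # Cut the code out of the body and collapse the leftover whitespace.
    text = " ".join((body[:m.start()] + " " + body[m.end():]).split())
    return text, code

for body in [
    "item1\nitem1 continued\nitem1 continued",
    "item2 (03-15-A-B)",
    "item4\n(03-15\n-B-4)",
    "item5, (B-21-Q-2) continued",
]:
    print(split_text_and_code(body))
```

Because the parentheses are optional here, any hyphenated run of four chunks in ordinary text would also be treated as a code, so the pattern may need tightening for real data.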
