I need to split the following long strings using Oracle regexp.

Row 1 has the following value: 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'

Row 2 has the following value: 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).'

After the split I need the following separate strings:

From Row 1: 'van dam is brother of Prince Charles(12345).' 'Mathew Perker is son of Prince Charles(12345).'

From Row 2: 'Madam Currie is grandmother of Albert Eistine(56789).' 'Pieer Currie is grandfather of Albert Eistine(56789).' 'CV Raman is friend of Albert Eistine(56789).'

These separate strings can be presented in separate columns. The numbers in brackets are the ID stored in the ID field of the table.

Is it possible to achieve such a split using Oracle regexp?

asked Mar 13 at 6:36 by Kaustav Nandy, edited Mar 13 at 8:14 by MT0
  • This question is similar to: Oracle: Connect by Level & regexp_substr. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. – p3consulting Commented Mar 13 at 6:45
  • Do you simply want to break the whole string in multiple parts based on period(.)? – Ankit Bajpai Commented Mar 13 at 7:06
  • In one word Yes. Not sure if we can also use the ID field as well for each row as ID would be unique for each row and it is present at the end of every part before (.) – Kaustav Nandy Commented Mar 13 at 8:55
4 Answers

Regular expressions would work but, on large data sets, string functions (such as a combination of substr and instr) would perform better. Here's how.

Sample data:

WITH
   test (col)
   AS
      (SELECT 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'
         FROM DUAL
       UNION ALL
       SELECT 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).'
         FROM DUAL)

The query begins here; it splits the source value on the dot (.) character.

SELECT trim(substr(col, 1, instr(col, '.', 1, 1))) val_1,
       --
       trim(substr(col, instr(col, '.', 1, 1) + 1,
                        instr(col, '.', 1, 2) - instr(col, '.', 1, 1))) val_2,
       --
       trim(substr(col, instr(col, '.', 1, 2) + 1,
                        instr(col, '.', 1, 3) - instr(col, '.', 1, 2))) val_3
  FROM test;

VAL_1                                                 VAL_2                                                  VAL_3
----------------------------------------------------- ------------------------------------------------------ -----------------------------------------------------
van dam is brother of Prince Charles(12345).          Mathew Perker is son of Prince Charles(12345).
Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789).  CV Raman is friend of Albert Eistine(56789).

You'd add as many val_ns as necessary.

Can it be dynamic? Not that easily, I think, because you want every value in its own column. If you just wanted to split the source value into separate rows, that would be easy, and regular expressions handle that nicely.
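For the rows variant, the same substr/instr idea can be kept instead of switching to regular expressions. The following is only a sketch under a few assumptions: the id column, the prefixed CTE and the aliases s, n and lvl are made up for illustration, and CROSS APPLY needs Oracle 12c or later. A dot is prepended to the value so that piece N always lies between dot N and dot N + 1, which avoids a special case for the first piece.

```sql
-- Sketch only: id, prefixed, s, n and lvl are illustrative names.
WITH test (id, col) AS (
  SELECT 1, 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'
    FROM DUAL
  UNION ALL
  SELECT 2, 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).'
    FROM DUAL
),
prefixed (id, s) AS (
  -- Prepend a dot so that piece N sits between dot N and dot N + 1.
  SELECT id, '.' || col FROM test
)
SELECT p.id,
       n.lvl,
       -- Take everything after dot N up to and including dot N + 1.
       TRIM(SUBSTR(p.s,
                   INSTR(p.s, '.', 1, n.lvl) + 1,
                   INSTR(p.s, '.', 1, n.lvl + 1) - INSTR(p.s, '.', 1, n.lvl))) AS val
  FROM prefixed p
       CROSS APPLY (
         -- One row per dot in the original value
         -- (dots in s minus the one that was prepended).
         SELECT LEVEL AS lvl
           FROM DUAL
        CONNECT BY LEVEL <= LENGTH(p.s) - LENGTH(REPLACE(p.s, '.')) - 1
       ) n
 ORDER BY p.id, n.lvl;
```

With the sample data this should return one row per sentence, each keeping its closing dot. Like the column version, it assumes that a dot only ever appears as a sentence terminator.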

You can use:

SELECT REGEXP_SUBSTR(
         column_name,
         '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*',
         1,
         1
       ) AS relationship1,
       REGEXP_SUBSTR(
         column_name,
         '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*',
         1,
         2
       ) AS relationship2,
       REGEXP_SUBSTR(
         column_name,
         '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*',
         1,
         3
       ) AS relationship3,
       REGEXP_SUBSTR(
         column_name,
         '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*',
         1,
         4
       ) AS relationship4
FROM   table_name

Which, for the sample data:

CREATE TABLE table_name(column_name) AS
  SELECT 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).' FROM DUAL UNION ALL
  SELECT 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).' FROM DUAL;

Outputs:

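RELATIONSHIP1                                         RELATIONSHIP2                                         RELATIONSHIP3                                RELATIONSHIP4
----------------------------------------------------- ----------------------------------------------------- -------------------------------------------- -------------
van dam is brother of Prince Charles(12345).          Mathew Perker is son of Prince Charles(12345).        null                                         null
Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789). null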
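One detail that the output does not make visible: because the pattern ends in \s*, every match except the last one in a row also includes the space that follows the dot. A minimal sketch, assuming that trailing space is unwanted, wraps each call in RTRIM (which, without a second argument, removes trailing spaces only, not tabs or newlines):

```sql
-- Sketch only: same table_name / column_name as the sample data above.
SELECT RTRIM(REGEXP_SUBSTR(column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1)) AS relationship1,
       RTRIM(REGEXP_SUBSTR(column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 2)) AS relationship2
FROM   table_name;
```

The remaining relationship columns would be wrapped the same way.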

If you want a more detailed breakdown, you can extract the sub-groups from the expression:

SELECT REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1, NULL, 1) AS from1,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1, NULL, 2) AS relationship1,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1, NULL, 3) AS to1,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1, NULL, 4) AS id1,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 2, NULL, 1) AS from2,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 2, NULL, 2) AS relationship2,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 2, NULL, 3) AS to2,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 2, NULL, 4) AS id2,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 3, NULL, 1) AS from3,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 3, NULL, 2) AS relationship3,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 3, NULL, 3) AS to3,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 3, NULL, 4) AS id3,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 4, NULL, 1) AS from4,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 4, NULL, 2) AS relationship4,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 4, NULL, 3) AS to4,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 4, NULL, 4) AS id4
FROM   table_name

Which outputs:

FROM1        RELATIONSHIP1 TO1            ID1   FROM2         RELATIONSHIP2 TO2            ID2   FROM3    RELATIONSHIP3 TO3            ID3   FROM4 RELATIONSHIP4 TO4  ID4
------------ ------------- -------------- ----- ------------- ------------- -------------- ----- -------- ------------- -------------- ----- ----- ------------- ---- ----
van dam      brother       Prince Charles 12345 Mathew Perker son           Prince Charles 12345 null     null          null           null  null  null          null null
Madam Currie grandmother   Albert Eistine 56789 Pieer Currie  grandfather   Albert Eistine 56789 CV Raman friend        Albert Eistine 56789 null  null          null null

If you want it to have a dynamic number of matches then output the data in rows, not columns:

SELECT item,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, item ) AS relationship
FROM   table_name
       CROSS APPLY (
         SELECT LEVEL AS item
         FROM   DUAL
         CONNECT BY LEVEL <= REGEXP_COUNT(column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*')
       )

or:

SELECT item,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, item, NULL, 1) AS from_name,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, item, NULL, 2) AS relationship,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, item, NULL, 3) AS to_name,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, item, NULL, 4) AS id
FROM   table_name
       CROSS APPLY (
         SELECT LEVEL AS item
         FROM   DUAL
         CONNECT BY LEVEL <= REGEXP_COUNT(column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*')
       )

The latter outputs:

ITEM FROM_NAME     RELATIONSHIP TO_NAME        ID
---- ------------- ------------ -------------- -----
   1 van dam       brother      Prince Charles 12345
   2 Mathew Perker son          Prince Charles 12345
   1 Madam Currie  grandmother  Albert Eistine 56789
   2 Pieer Currie  grandfather  Albert Eistine 56789
   3 CV Raman      friend       Albert Eistine 56789

Here's another way of thinking about it: use CONNECT BY to traverse the string. The assumption is that the substrings you want are always separated by a period followed by a space. A Common Table Expression (CTE) sets up the test data. The query handles a variable number of substrings. Since the ending period is consumed when matching, it's added back on in the select. This may cause issues if you have a null row, as it will return just the period.

with tbl(id, data) as (
  select 1, 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'
    from dual union all
  select 2, 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).'
    from dual
)
select id,
  regexp_substr(data, '(.*?)(\. |\.$)', 1, level, NULL, 1) || '.' substring
from tbl
connect by level <= regexp_count(data, '\. ')+1   
  and prior id = id
  and prior sys_guid() is not null; 


ID SUBSTRING                                                   
-- ------------------------------------------------------------
 1 van dam is brother of Prince Charles(12345).                
 1 Mathew Perker is son of Prince Charles(12345).              
 2 Madam Currie is grandmother of Albert Eistine(56789).       
 2 Pieer Currie is grandfather of Albert Eistine(56789).       
 2 CV Raman is friend of Albert Eistine(56789).                

5 rows selected.
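The null-row caveat mentioned above comes from Oracle treating NULL || '.' as '.'. A minimal sketch of one way around it, assuming a NULL row should simply yield a NULL substring: compute the piece in an inline view and let NVL2 append the period only when something was actually extracted. The second test row and the piece alias are made up for illustration.

```sql
-- Sketch only: row 2 is deliberately NULL to show the guard; piece is an illustrative alias.
with tbl(id, data) as (
  select 1, 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'
    from dual union all
  select 2, null
    from dual
)
select id,
       -- Re-append the period only when a piece was found.
       nvl2(piece, piece || '.', null) as substring
from (
  select id,
         regexp_substr(data, '(.*?)(\. |\.$)', 1, level, NULL, 1) as piece
  from tbl
  connect by level <= regexp_count(data, '\. ')+1
    and prior id = id
    and prior sys_guid() is not null
);
```

Filtering with WHERE data IS NOT NULL before the CONNECT BY would be an alternative if such rows should not appear in the result at all.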

I thought about this further: if you move the select into its own CTE, you can then select from that CTE to get more detail if you need it.

with tbl(id, data) as (
  select 1, 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'
    from dual union all
  select 2, 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).'
    from dual
),
tbl_substrings(id, sub_id, substring) as (
select id, level as sub_id,
  regexp_substr(data, '(.*?)(\. |\.$)', 1, level, NULL, 1) || '.' substring
from tbl
connect by level <= regexp_count(data, '\. ')+1   
  and prior id = id
  and prior sys_guid() is not null
)
-- Uncomment detail below if needed.
select id, sub_id, substring
--, regexp_replace(substring, '(.*?) is .*$', '\1') rel_person
--, regexp_replace(substring, '.* is (.*?) of .*$', '\1') relation
, regexp_replace(substring, '.* of (.*?)\(.*$', '\1') orig_person
, regexp_replace(substring, '.*\((.*?)\).*$', '\1') orig_person_id
from tbl_substrings
order by id, sub_id;


ID SUB_ID SUBSTRING                                               ORIG_PERSON     ORIG_PERSON_ID
-- ------ ------------------------------------------------------- --------------- --------------
 1      1 van dam is brother of Prince Charles(12345).            Prince Charles  12345         
 1      2 Mathew Perker is son of Prince Charles(12345).          Prince Charles  12345         
 2      1 Madam Currie is grandmother of Albert Eistine(56789).   Albert Eistine  56789         
 2      2 Pieer Currie is grandfather of Albert Eistine(56789).   Albert Eistine  56789         
 2      3 CV Raman is friend of Albert Eistine(56789).            Albert Eistine  56789         

5 rows selected.

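Try this. The pattern matches a run of non-period characters that ends in '(digits).', and LTRIM removes the space left over from the end of the previous sentence:

WITH data AS (
  SELECT 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).' AS row1, 
         'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).' AS row2
  FROM dual
)
SELECT 
  -- Inside a bracket expression Oracle treats a backslash literally, so the period is written unescaped.
  LTRIM(REGEXP_SUBSTR(row_value, '[^.]+(\(\d+\))\.', 1, level)) AS split_string
FROM (
  SELECT row1 AS row_value FROM data
  UNION ALL
  SELECT row2 FROM data
) t
-- Keep generating levels for the same row until no further match is found.
CONNECT BY REGEXP_SUBSTR(row_value, '[^.]+(\(\d+\))\.', 1, level) IS NOT NULL
AND PRIOR row_value = row_value
AND PRIOR dbms_random.value IS NOT NULL;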
