sql - String separation with oracle regexp - Stack Overflow

I need to split the following long strings using Oracle regular expressions.

Row 1 has the following value: 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'

Row 2: 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).'

After the split I need the following separate strings.

From Row 1: 'van dam is brother of Prince Charles(12345).' 'Mathew Perker is son of Prince Charles(12345).'

From Row 2: 'Madam Currie is grandmother of Albert Eistine(56789).' 'Pieer Currie is grandfather of Albert Eistine(56789).' 'CV Raman is friend of Albert Eistine(56789).'

The separate strings can be presented in separate columns. The numbers in parentheses are the IDs stored in the ID field of the table.

Is it possible to achieve such a split using Oracle regexp?

Asked Mar 13 at 6:36 by Kaustav Nandy; edited Mar 13 at 8:14 by MT0.

This question is similar to: Oracle: Connect by Level & regexp_substr. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. – p3consulting Commented Mar 13 at 6:45
Do you simply want to break the whole string in multiple parts based on period(.)? – Ankit Bajpai Commented Mar 13 at 7:06
In one word, yes. I am not sure whether we can also use the ID field for each row, since the ID is unique for each row and appears at the end of every part, just before the period. – Kaustav Nandy Commented Mar 13 at 8:55

4 Answers

Regular expressions would work but, on large data sets, string functions (such as a combination of substr and instr) usually perform better. Here's how.

Sample data:

WITH
   test (col)
   AS
      (SELECT 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'
         FROM DUAL
       UNION ALL
       SELECT 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).'
         FROM DUAL)

Query begins here; it splits source value on a dot (.) character.

SELECT trim(substr(col, 1, instr(col, '.', 1, 1))) val_1,
       --
       trim(substr(col, instr(col, '.', 1, 1) + 1,
                        instr(col, '.', 1, 2) - instr(col, '.', 1, 1))) val_2,
       --
       trim(substr(col, instr(col, '.', 1, 2) + 1,
                        instr(col, '.', 1, 3) - instr(col, '.', 1, 2))) val_3
  FROM test;

VAL_1                                                 VAL_2                                                  VAL_3
----------------------------------------------------- ------------------------------------------------------ -----------------------------------------------------
van dam is brother of Prince Charles(12345).          Mathew Perker is son of Prince Charles(12345).
Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789).  CV Raman is friend of Albert Eistine(56789).

Add as many val_n columns as necessary.

Can it be dynamic? Not that easy, I think, because you want every value in its own column. If you just wanted to split the source value into separate rows, that would be easy, and regular expressions handle that nicely.
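For completeness, the row-wise split can also be done with substr and instr alone. The following is a minimal sketch and not part of the original answer: it assumes a hypothetical id column to tell the source rows apart, assumes every sentence ends with a dot, and relies on a recursive WITH clause (Oracle 11gR2 or later). The names pieces, n, start_pos and end_pos are made up for illustration.

```sql
WITH test (id, col) AS (
  -- id is a hypothetical key added for this sketch
  SELECT 1, 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'
    FROM DUAL
  UNION ALL
  SELECT 2, 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).'
    FROM DUAL
),
pieces (id, col, n, start_pos, end_pos) AS (
  -- Anchor member: the first sentence runs from position 1 to the first dot
  SELECT id, col, 1, 1, instr(col, '.', 1)
    FROM test
   WHERE instr(col, '.', 1) > 0
  UNION ALL
  -- Recursive member: the next sentence starts right after the previous dot
  SELECT id, col, n + 1, end_pos + 1, instr(col, '.', end_pos + 1)
    FROM pieces
   WHERE instr(col, '.', end_pos + 1) > 0
)
SELECT id, n, trim(substr(col, start_pos, end_pos - start_pos + 1)) val
  FROM pieces
 ORDER BY id, n;
```

Each source row should yield one output row per sentence, numbered by n. Rows that contain no dot at all are skipped by the anchor member's WHERE clause.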

You can use:

SELECT REGEXP_SUBSTR(
         column_name,
         '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*',
         1,
         1
       ) AS relationship1,
       REGEXP_SUBSTR(
         column_name,
         '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*',
         1,
         2
       ) AS relationship2,
       REGEXP_SUBSTR(
         column_name,
         '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*',
         1,
         3
       ) AS relationship3,
       REGEXP_SUBSTR(
         column_name,
         '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*',
         1,
         4
       ) AS relationship4
FROM   table_name

Which, for the sample data:

CREATE TABLE table_name(column_name) AS
  SELECT 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).' FROM DUAL UNION ALL
  SELECT 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).' FROM DUAL;

Outputs:

RELATIONSHIP1	RELATIONSHIP2	RELATIONSHIP3	RELATIONSHIP4
van dam is brother of Prince Charles(12345).	Mathew Perker is son of Prince Charles(12345).	null	null
Madam Currie is grandmother of Albert Eistine(56789).	Pieer Currie is grandfather of Albert Eistine(56789).	CV Raman is friend of Albert Eistine(56789).	null
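One detail worth checking: the pattern ends in \s*, and REGEXP_SUBSTR returns the whole match when no sub-expression is requested, so any value that is followed by another sentence carries the separating whitespace at its end. If that matters, wrapping the call in RTRIM is one option (RTRIM with a single argument removes trailing spaces only, which is enough for this sample data). A small sketch, with the sample table inlined as a WITH clause so it can be run on its own; the raw_length and trimmed_length columns are only there to make the difference visible:

```sql
WITH table_name (column_name) AS (
  SELECT 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).' FROM DUAL UNION ALL
  SELECT 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).' FROM DUAL
)
SELECT -- Length of the first match as returned, including any trailing whitespace
       LENGTH(REGEXP_SUBSTR(column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1)) AS raw_length,
       -- Length after the trailing spaces have been removed
       LENGTH(RTRIM(REGEXP_SUBSTR(column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1))) AS trimmed_length,
       RTRIM(REGEXP_SUBSTR(column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1)) AS relationship1
FROM   table_name
```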

If you want a more detailed breakdown, you can extract the sub-groups from the expression:

SELECT REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1, NULL, 1) AS from1,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1, NULL, 2) AS relationship1,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1, NULL, 3) AS to1,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 1, NULL, 4) AS id1,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 2, NULL, 1) AS from2,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 2, NULL, 2) AS relationship2,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 2, NULL, 3) AS to2,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 2, NULL, 4) AS id2,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 3, NULL, 1) AS from3,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 3, NULL, 2) AS relationship3,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 3, NULL, 3) AS to3,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 3, NULL, 4) AS id3,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 4, NULL, 1) AS from4,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 4, NULL, 2) AS relationship4,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 4, NULL, 3) AS to4,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, 4, NULL, 4) AS id4
FROM   table_name

Which outputs:

FROM1	RELATIONSHIP1	TO1	ID1	FROM2	RELATIONSHIP2	TO2	ID2	FROM3	RELATIONSHIP3	TO3	ID3	FROM4	RELATIONSHIP4	TO4	ID4
van dam	brother	Prince Charles	12345	Mathew Perker	son	Prince Charles	12345	null	null	null	null	null	null	null	null
Madam Currie	grandmother	Albert Eistine	56789	Pieer Currie	grandfather	Albert Eistine	56789	CV Raman	friend	Albert Eistine	56789	null	null	null	null
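Because the same pattern is repeated in every call, one way to keep such a query maintainable is to define the pattern once in a single-row subquery and cross join it. This is only a sketch: the names regex_def and re are made up for illustration, the sample table is inlined so the query runs on its own, and only the first match is extracted.

```sql
WITH table_name (column_name) AS (
  SELECT 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).' FROM DUAL UNION ALL
  SELECT 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).' FROM DUAL
),
regex_def (re) AS (
  -- The pattern is written once here and reused by every call below
  SELECT '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*' FROM DUAL
)
SELECT REGEXP_SUBSTR(t.column_name, r.re, 1, 1, NULL, 1) AS from1,
       REGEXP_SUBSTR(t.column_name, r.re, 1, 1, NULL, 2) AS relationship1,
       REGEXP_SUBSTR(t.column_name, r.re, 1, 1, NULL, 3) AS to1,
       REGEXP_SUBSTR(t.column_name, r.re, 1, 1, NULL, 4) AS id1
FROM   table_name t
       CROSS JOIN regex_def r
```

The remaining columns (from2, relationship2 and so on) follow the same shape with the occurrence argument changed.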

If you want it to have a dynamic number of matches then output the data in rows, not columns:

SELECT item,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, item ) AS relationship
FROM   table_name
       CROSS APPLY (
         SELECT LEVEL AS item
         FROM   DUAL
         CONNECT BY LEVEL <= REGEXP_COUNT(column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*')
       )

or:

SELECT item,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, item, NULL, 1) AS from_name,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, item, NULL, 2) AS relationship,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, item, NULL, 3) AS to_name,
       REGEXP_SUBSTR( column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*', 1, item, NULL, 4) AS id
FROM   table_name
       CROSS APPLY (
         SELECT LEVEL AS item
         FROM   DUAL
         CONNECT BY LEVEL <= REGEXP_COUNT(column_name, '(.*?) is (.*?) of (.*?)\((\d+)\)\.\s*')
       )

The latter outputs:

ITEM	FROM_NAME	RELATIONSHIP	TO_NAME	ID
1	van dam	brother	Prince Charles	12345
2	Mathew Perker	son	Prince Charles	12345
1	Madam Currie	grandmother	Albert Eistine	56789
2	Pieer Currie	grandfather	Albert Eistine	56789
3	CV Raman	friend	Albert Eistine	56789

Here's another way of thinking about it: use CONNECT BY to traverse the string. The assumption is that the substrings you want are always separated by a period followed by a space. A Common Table Expression (CTE) sets up the test data. The query handles a variable number of substrings. Since the ending period is consumed when matching, it is added back on in the select. This may cause issues if you have a null row, as it will return just the period.

with tbl(id, data) as (
  select 1, 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'
    from dual union all
  select 2, 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).'
    from dual
)
select id,
  regexp_substr(data, '(.*?)(\. |\.$)', 1, level, NULL, 1) || '.' substring
from tbl
connect by level <= regexp_count(data, '\. ')+1   
  and prior id = id
  and prior sys_guid() is not null;



ID SUBSTRING                                                   
-- ------------------------------------------------------------
 1 van dam is brother of Prince Charles(12345).                
 1 Mathew Perker is son of Prince Charles(12345).              
 2 Madam Currie is grandmother of Albert Eistine(56789).       
 2 Pieer Currie is grandfather of Albert Eistine(56789).       
 2 CV Raman is friend of Albert Eistine(56789).                

5 rows selected.
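The null-row caveat mentioned above can be handled by appending the period only when a piece was actually found. The following is a sketch of one possible guard using NVL2, not something from the original answer; the row with id 3 is a hypothetical empty row added to show the effect.

```sql
with tbl(id, data) as (
  select 1, 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'
    from dual union all
  -- Hypothetical null row, not part of the original sample data
  select 3, null
    from dual
)
select id,
  -- NVL2 returns the piece plus '.' when the piece is not null, otherwise null
  nvl2(regexp_substr(data, '(.*?)(\. |\.$)', 1, level, NULL, 1),
       regexp_substr(data, '(.*?)(\. |\.$)', 1, level, NULL, 1) || '.',
       null) substring
from tbl
connect by level <= regexp_count(data, '\. ')+1
  and prior id = id
  and prior sys_guid() is not null;
```

With this guard the null row should come back with a null substring instead of a lone period. If such rows are not wanted at all, filtering them out before the split is a simpler alternative.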

I thought about this further, and if you were to put the select into its own CTE, you could select from that to get more detail if you need it.

with tbl(id, data) as (
  select 1, 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).'
    from dual union all
  select 2, 'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).'
    from dual
),
tbl_substrings(id, sub_id, substring) as (
select id, level as sub_id,
  regexp_substr(data, '(.*?)(\. |\.$)', 1, level, NULL, 1) || '.' substring
from tbl
connect by level <= regexp_count(data, '\. ')+1   
  and prior id = id
  and prior sys_guid() is not null
)
-- Uncomment detail below if needed.
select id, sub_id, substring
--, regexp_replace(substring, '(.*?) is .*$', '\1') rel_person
--, regexp_replace(substring, '.* is (.*?) of .*$', '\1') relation
, regexp_replace(substring, '.* of (.*?)\(.*$', '\1') orig_person
, regexp_replace(substring, '.*\((.*?)\).*$', '\1') orig_person_id
from tbl_substrings
order by id, sub_id;


ID SUB_ID SUBSTRING                                               ORIG_PERSON     ORIG_PERSON_ID
-- ------ ------------------------------------------------------- --------------- --------------
 1      1 van dam is brother of Prince Charles(12345).            Prince Charles  12345         
 1      2 Mathew Perker is son of Prince Charles(12345).          Prince Charles  12345         
 2      1 Madam Currie is grandmother of Albert Eistine(56789).   Albert Eistine  56789         
 2      2 Pieer Currie is grandfather of Albert Eistine(56789).   Albert Eistine  56789         
 2      3 CV Raman is friend of Albert Eistine(56789).            Albert Eistine  56789         

5 rows selected.

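Try the following. The inner query turns the two sample strings into rows, and CONNECT BY produces one output row per match. PRIOR row_value = row_value keeps each hierarchy within a single source row, and PRIOR dbms_random.value IS NOT NULL stops Oracle from reporting a CONNECT BY loop. LTRIM removes the space that would otherwise be left at the start of every match after the first.

WITH data AS (
  SELECT 'van dam is brother of Prince Charles(12345). Mathew Perker is son of Prince Charles(12345).' AS row1,
         'Madam Currie is grandmother of Albert Eistine(56789). Pieer Currie is grandfather of Albert Eistine(56789). CV Raman is friend of Albert Eistine(56789).' AS row2
  FROM dual
)
SELECT
  LTRIM(REGEXP_SUBSTR(row_value, '[^\.]+(\(\d+\))\.', 1, level)) AS split_string
FROM (
  SELECT row1 AS row_value FROM data
  UNION ALL
  SELECT row2 FROM data
) t
CONNECT BY REGEXP_SUBSTR(row_value, '[^\.]+(\(\d+\))\.', 1, level) IS NOT NULL
AND PRIOR row_value = row_value
AND PRIOR dbms_random.value IS NOT NULL;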
