string - Remove duplicate names while replacing underscores with spaces in R - Stack Overflow

I have last names (left) and first names (right) separated by a comma.
Among the last names, I often (but not always) have duplicates separated by an underscore. How can I remove the duplicates and, for compound names, replace the underscore with a space?
First names never have duplicates but always begin with an underscore, which I would like to replace with a space. Compound first names are also joined by an underscore, which I would like to replace with a space as well.
I have thousands of lines and I'm struggling to find a solution.
Thanks for your help.

Input data:

> dat0
                  name_ko
1           BLA_BLA,_BLIM
2 CLO_CLO,_SPITCH_SPLOTCH
3           BAD_BOY,_GOOD  
4      GOOD_BOY,_BAD_GIRL

Desired output:

> dat1
              name_ok
1           BLA, BLIM
2 CLO, SPITCH SPLOTCH
3       BAD BOY, GOOD
4  GOOD BOY, BAD GIRL

Data:

name_ko <- c(
  "BLA_BLA,_BLIM",
  "CLO_CLO,_SPITCH_SPLOTCH",
  "BAD_BOY,_GOOD",
  "GOOD_BOY,_BAD_GIRL")
dat0 <- data.frame(name_ko)

asked Mar 13 at 22:27 by denis

Are duplicates limited to a single pair like in your example, or could there be something like BAD_BAD_BAD_BOY_BOY,_GOOD? – margusl, Mar 13 at 23:00
Or, along the same lines, a duplicated compound name, e.g. "BAD_BOY_BAD_BOY,_GOOD"? – zephryl, Mar 13 at 23:03
Yes, duplicates are limited to a single pair. Thank you very much! – denis, Mar 13 at 23:31

1 Answer

You can try:

name_ok <- gsub("_", " ", gsub("(\\b\\w+)_(\\1)", "\\1", name_ko))

"BLA, BLIM"
"CLO, SPITCH SPLOTCH"
"BAD BOY, GOOD"
"GOOD BOY, BAD GIRL"
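
One caveat with the one-liner: the pattern has no boundary after the back-reference, and `\w` also matches the underscore, so a last name that merely starts with the previous token (say `BAD_BADGER`) would probably be collapsed as if it were a duplicate. A stricter variant is sketched below. It assumes tokens are delimited only by `_` and `,`, and the helper name `dedupe_strict` is made up for illustration. It uses `perl = TRUE` because it relies on lookarounds.

```r
# Hypothetical helper: collapse "X_X" only when X is a whole token,
# i.e. delimited by "_", "," or the start/end of the string.
dedupe_strict <- function(x) {
  # (?<![^_,])  token starts here (previous char is "_", "," or nothing)
  # ([^_,]+)    the token itself
  # _\\1        the same token repeated after an underscore
  # (?![^_,])   token ends here (next char is "_", "," or nothing)
  x <- gsub("(?<![^_,])([^_,]+)_\\1(?![^_,])", "\\1", x, perl = TRUE)
  # Every remaining underscore becomes a space
  gsub("_", " ", x, fixed = TRUE)
}

name_ko <- c(
  "BLA_BLA,_BLIM",
  "CLO_CLO,_SPITCH_SPLOTCH",
  "BAD_BOY,_GOOD",
  "GOOD_BOY,_BAD_GIRL",
  "BAD_BADGER,_GOOD")  # prefix overlap, not a duplicate

name_ok <- dedupe_strict(name_ko)
```

Like the original, this only targets a single adjacent pair, which matches what the asker confirmed in the comments.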

To handle triplets, longer runs and duplicated compound names, as margusl and zephryl suggested (thank you):

name_ko <- c(
  "BLA_BLA,_BLIM",
  "CLO_CLO,_SPITCH_SPLOTCH",
  "BAD_BOY,_GOOD",
  "GOOD_BOY,_BAD_GIRL",
  "BAD_BAD_BAD_BOY_BOY,_GOOD",
  "BAD_BOY_BAD_BOY,_GOOD"
)

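name_ok <- sapply(strsplit(name_ko, ","), function(x) {
  # Last names: split on "_" and keep each distinct token once
  last_names <- unique(unlist(strsplit(trimws(x[1]), "_")))
  # First names: turn "_" into spaces first, then trim the leading space,
  # otherwise the output ends up with two spaces after the comma
  first_names <- trimws(gsub("_", " ", x[2]))
  paste(paste(last_names, collapse = " "), first_names, sep = ", ")
})

"BLA, BLIM"
"CLO, SPITCH SPLOTCH"
"BAD BOY, GOOD"
"GOOD BOY, BAD GIRL"
"BAD BOY, GOOD"
"BAD BOY, GOOD"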
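
Since the question asks for a `dat1` data frame with a `name_ok` column, here is a minimal sketch of applying the split-and-deduplicate idea to `dat0$name_ko`. The helper name `clean_name` is hypothetical, and the sketch assumes every entry contains exactly one comma.

```r
# Hypothetical helper, assuming each entry looks like "LAST,_FIRST"
clean_name <- function(x) {
  # as.character() guards against factor columns (the default in R < 4.0)
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  vapply(parts, function(p) {
    # Last names: keep each underscore-separated token once
    last <- unique(strsplit(trimws(p[1]), "_", fixed = TRUE)[[1]])
    # First names: underscores to spaces, then drop the leading space
    first <- trimws(gsub("_", " ", p[2], fixed = TRUE))
    paste0(paste(last, collapse = " "), ", ", first)
  }, character(1))
}

dat0 <- data.frame(name_ko = c(
  "BLA_BLA,_BLIM",
  "CLO_CLO,_SPITCH_SPLOTCH",
  "BAD_BOY,_GOOD",
  "GOOD_BOY,_BAD_GIRL"))

dat1 <- data.frame(name_ok = clean_name(dat0$name_ko),
                   stringsAsFactors = FALSE)
```

Note that `unique()` removes a repeated token wherever it appears in the last name, not only when the copies are adjacent, so a name that legitimately repeats a word would be shortened too.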
