I have last names (left) and first names (right) separated by a comma.
Among the last names, I often (but not always) have duplicates separated by an underscore. How can I remove the duplicates and, for compound last names, replace the underscore with a space?
Also, first names never have duplicates but always begin with an underscore, which I would like to replace with a space. In compound first names the parts are also separated by an underscore, which I would likewise like to replace with a space.
I have thousands of lines and I'm struggling to find a solution.
Thanks for your help.

Input data:

> dat0
                  name_ko
1           BLA_BLA,_BLIM
2 CLO_CLO,_SPITCH_SPLOTCH
3           BAD_BOY,_GOOD  
4      GOOD_BOY,_BAD_GIRL  

Desired output:

> dat1
              name_ok
1           BLA, BLIM
2 CLO, SPITCH SPLOTCH
3       BAD BOY, GOOD
4  GOOD BOY, BAD GIRL

Data:

name_ko <- c(
  "BLA_BLA,_BLIM",
  "CLO_CLO,_SPITCH_SPLOTCH",
  "BAD_BOY,_GOOD",
  "GOOD_BOY,_BAD_GIRL")
dat0 <- data.frame(name_ko)

asked Mar 13 at 22:27 by denis
  • Are duplicates limited to a single pair, as in your example, or could there be something like BAD_BAD_BAD_BOY_BOY,_GOOD? – margusl, Mar 13 at 23:00
  • Or, along the same lines, a duplicated compound name, e.g. "BAD_BOY_BAD_BOY,_GOOD"? – zephryl, Mar 13 at 23:03
  • Yes, duplicates are limited to a single pair. Thank you very much! – denis, Mar 13 at 23:31

1 Answer

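You can try a nested gsub(): the inner call collapses a name that is immediately repeated (X_X becomes X), and the outer call replaces every remaining underscore with a space.

name_ok = gsub("_", " ", gsub("(\\b\\w+)_(\\1)", "\\1", name_ko))

"BLA, BLIM"
"CLO, SPITCH SPLOTCH"
"BAD BOY, GOOD"
"GOOD BOY, BAD GIRL"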
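
To make it easier to see what each gsub() call contributes, here is the same one-liner split into two steps, as a minimal sketch on the question's data. The names step1 and dat1 are only illustrative (dat1 mirrors the desired output in the question), and the sketch assumes, as the asker confirmed, that a duplicate is a single pair at the start of the last name.

```r
name_ko <- c(
  "BLA_BLA,_BLIM",
  "CLO_CLO,_SPITCH_SPLOTCH",
  "BAD_BOY,_GOOD",
  "GOOD_BOY,_BAD_GIRL")

# Step 1: collapse an immediately repeated name (X_X -> X).
# \\1 is a backreference to the text captured by the first group.
step1 <- gsub("(\\b\\w+)_(\\1)", "\\1", name_ko)

# Step 2: every remaining underscore becomes a space, including the
# one that always precedes the first name.
name_ok <- gsub("_", " ", step1)

# Store the result in a data frame, as in the desired output.
dat1 <- data.frame(name_ok, stringsAsFactors = FALSE)
```

One thing to keep in mind: \\w also matches the underscore, and the pattern does not check what follows the repeated part, so a last name whose second part merely starts with the first part (a hypothetical ANNA_ANNABEL) would also be collapsed.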

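To handle triplets and duplicated compound names, as margusl and zephryl suggested (thank you), split each string at the comma and keep only the unique parts of the last name: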

name_ko <- c(
  "BLA_BLA,_BLIM",
  "CLO_CLO,_SPITCH_SPLOTCH",
  "BAD_BOY,_GOOD",
  "GOOD_BOY,_BAD_GIRL",
  "BAD_BAD_BAD_BOY_BOY,_GOOD",
  "BAD_BOY_BAD_BOY,_GOOD"
)

name_ok = sapply(strsplit(name_ko, ","), function(x) {
  # split the last name on "_" and keep each part only once
  last_names <- unique(unlist(strsplit(trimws(x[1]), "_")))
  # drop the leading "_" of the first name, then turn the remaining "_" into spaces
  first_names <- gsub("_", " ", sub("^_", "", trimws(x[2])))
  paste(paste(last_names, collapse = " "), first_names, sep = ", ")
})

"BLA, BLIM"
"CLO, SPITCH SPLOTCH"
"BAD BOY, GOOD"
"GOOD BOY, BAD GIRL"
"BAD BOY, GOOD"
"BAD BOY, GOOD"
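
Since the question starts from a data frame with thousands of rows, the split-and-unique idea can also be wrapped in a small function and applied to the column directly. This is only a sketch: clean_names() is a hypothetical helper name, it assumes every string contains exactly one comma, and it strips the leading underscore of the first name so that a single space follows the comma.

```r
# clean_names() is a hypothetical helper, not part of the original answer.
clean_names <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)
  vapply(parts, function(p) {
    # Last name: split on "_" and keep the first occurrence of each part.
    last <- unique(strsplit(trimws(p[1]), "_", fixed = TRUE)[[1]])
    # First name: drop the leading "_", then turn the rest into spaces.
    first <- gsub("_", " ", sub("^_", "", trimws(p[2])), fixed = TRUE)
    paste(paste(last, collapse = " "), first, sep = ", ")
  }, character(1))
}

dat0 <- data.frame(
  name_ko = c(
    "BLA_BLA,_BLIM",
    "CLO_CLO,_SPITCH_SPLOTCH",
    "BAD_BOY,_GOOD",
    "GOOD_BOY,_BAD_GIRL",
    "BAD_BAD_BAD_BOY_BOY,_GOOD",
    "BAD_BOY_BAD_BOY,_GOOD"),
  stringsAsFactors = FALSE)

dat1 <- data.frame(name_ok = clean_names(dat0$name_ko),
                   stringsAsFactors = FALSE)
```

Note that unique() drops every repeated part, not only adjacent ones. That is what handles the duplicated compound name, but it would also shorten a last name that legitimately contains the same part twice.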

Tags: string; title: Remove duplicate names while replacing underscores with spaces in R (Stack Overflow)