I'm new to R, so I'm sorry if this has been asked, but I've not found a solution online. I have a data set of survey responses related to gender and sex that were typed in by 350 participants. Many of the responses mean the same thing but are typed or spelled differently. Below are some of the values I get when I run unique(df$variable). There is a lot of variation: misspellings, differences in capitalization, etc.
[1] Male Female
[3] female Female/woman
[5] Female F
[7] female Woman
[9] Cis female, she her Female cisgender
[11] Female heterosexual I identify as a trans woman!
[13] Demiboy Transwoman
[15] My sex is female and my gender identity is nonbinary male
[17] m woman
[19] Woman Nonbinary
[21] my gender doesn't exist Male/AMAB
What I've done: I have tried classifying all unique values and replacing with mutate:
f <- c("Female/woman", "female", "Female cisgender", "Female", "Woman", "woman", "Women", "women", "f", "F" )
m <- c("male", "Cis Male", "Male", "m", "M", "ma,e=]]")
gq <- c("genderqueer", "nonbinary", "genderfluid")
df |>
mutate(GenderNew = case_when(
GenderSex %in% f ~ "F",
GenderSex %in% m ~ "M",
GenderSex %in% gq ~ "Q",
)) -> df_new
But this gave me multiple NA in my GenderNew column. I've also not had success with grepl.
What I'm looking to do: Replace all occurrences of the letter "f" regardless of position in the string/participant response with the letter "F". Same with responses indicating male or genderqueer. I would like my outcome to be either "M", "F", "GQ", or whatever the participant typed as a response so I can re-code it without missing anything.
GenderSex <-
c("Male", "Female", "female", "Female/woman", "Female", "F",
"female", "Woman", "Cis female, she her", "Female cisgender",
"Female heterosexual", "I identify as a trans woman!", "Demiboy",
"Transwoman", "My sex is female and my gender identity is nonbinary",
"male", "m", "woman", "Woman", "Nonbinary", "my gender doesn't exist",
"Male/AMAB")
asked Feb 22 at 17:49 by user29756984; edited Feb 22 at 19:51 by Rui Barradas
2 Answers
Use an auxiliary function built on a vectorized form of grepl: it tests every pattern against every response, then reduces the result to one logical value per row (TRUE if any pattern matched).
Thanks to r2evans for the suggestion of having a default value.
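library(dplyr)

# Strings to search for, grouped by target code
f <- c("Female/woman", "female", "Female cisgender", "Female",
       "Woman", "woman", "Women", "women", "f", "F" )
m <- c("male", "Cis Male", "Male", "m", "M", "ma,e=]]")
gq <- c("genderqueer", "nonbinary", "genderfluid")

# TRUE for each element of x that contains at least one of the patterns.
# Note: grepl() matches substrings, so short patterns such as "f" or "m"
# also match inside longer, unrelated responses.
fun <- function(x, pattern) {
  # Vectorize over `pattern`: one column of TRUE/FALSE per pattern
  Grepl <- Vectorize(grepl, "pattern")
  out <- Grepl(pattern, x)
  rowSums(out) > 0L
}

# GenderSex is the character vector from the question
df <- data.frame(GenderSex)
df |>
  mutate(GenderNew = case_when(
    fun(GenderSex, f) ~ "F",
    fun(GenderSex, m) ~ "M",
    fun(GenderSex, gq) ~ "Q",
    .default = GenderSex  # keep the typed response when nothing matches (dplyr >= 1.1.0)
  ))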
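One caveat with substring matching: single-letter patterns such as "m" match anywhere in a response, so with the pattern lists above "Demiboy" and "my gender doesn't exist" would both be coded "M". A minimal base R sketch of a stricter variant follows, assuming you only want to recode a response when the whole trimmed string matches, ignoring case. The helper name match_whole and the shortened lists of alternatives (including "man" and "men") are illustrative assumptions, not part of the answer above.

```r
# A few responses taken from the question
GenderSex <- c("Male", "Female", "F", "m", "Demiboy", "Transwoman",
               "I identify as a trans woman!", "Nonbinary", "Male/AMAB")

# Substring matching: a single letter hits unrelated words
grepl("m", "Demiboy")                       # TRUE
grepl("f", "I identify as a trans woman!")  # TRUE ("identify")

# Hypothetical helper: TRUE only when the whole (trimmed) response
# equals one of the alternatives, ignoring case
match_whole <- function(x, alternatives) {
  pattern <- paste0("^(", paste(alternatives, collapse = "|"), ")$")
  grepl(pattern, trimws(x), ignore.case = TRUE)
}

is_f  <- match_whole(GenderSex, c("f", "female", "woman", "women"))
is_m  <- match_whole(GenderSex, c("m", "male", "man", "men"))
is_gq <- match_whole(GenderSex, c("genderqueer", "nonbinary", "genderfluid"))

# Keep the typed response when nothing matches, so it can be reviewed by hand
GenderNew <- ifelse(is_f, "F",
             ifelse(is_m, "M",
             ifelse(is_gq, "Q", GenderSex)))
```

Responses such as "Transwoman" or "Male/AMAB" are left as typed here, which matches the stated goal of re-coding the ambiguous ones manually.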
I like your initial approach better. I don't suppose your gender variable has millions of expressions, so dividing the unique and sorted values by hand should be a safe option. dput prepares a vector on the console you can copy.
> df$gender |> unique() |> sort() |> dput()
c("Cis female, she her", "Demiboy", "F", "female", "Female",
"Female cisgender", "Female heterosexual", "Female/woman", "I identify as a trans woman!",
"m", "male", "Male", "Male/AMAB", "my gender doesn't exist",
"Nonbinary", "Transwoman", "woman", "Woman")
Since we want F, M, and others (Q), we only need to define the first two vectors; everything else becomes Q.
> f <- c("Cis female, she her", "F", "female", "Female", "Female cisgender",
+ "Female heterosexual", "Female/woman", "woman", "Woman")
> m <- c("m", "male", "Male", "Male/AMAB")
Then just replace three times. It is easy to read and more efficient than ifelse or the like.
> df |>
+ transform(gender_new=replace(gender, gender %in% f, 'F')) |>
+ transform(gender_new=replace(gender_new, gender %in% m, 'M')) |>
+ transform(gender_new=replace(gender_new, !gender %in% c(f, m), 'Q'))
gender x gender_new
1 Woman -0.09465904 F
2 Female/woman 0.63286260 F
3 Female/woman 0.63286260 F
4 Male/AMAB -1.78130843 M
5 woman -2.65645542 F
6 Demiboy -1.38886070 Q
7 Female 0.40426832 F
8 Female/woman 0.63286260 F
9 Female -0.56469817 F
10 woman -2.65645542 F
11 female 0.36312841 F
12 m -0.28425292 M
13 my gender doesn't exist -0.30663859 Q
14 woman -2.65645542 F
15 F -0.10612452 F
16 F -0.10612452 F
17 Female -0.56469817 F
18 Nonbinary 1.32011335 Q
19 female 0.36312841 F
20 Male/AMAB -1.78130843 M
21 my gender doesn't exist -0.30663859 Q
22 Female -0.56469817 F
23 F -0.10612452 F
24 Female cisgender -0.06271410 F
25 Woman -0.09465904 F
26 Female 0.40426832 F
27 Male 1.37095845 M
28 m -0.28425292 M
29 female 1.51152200 F
30 Female/woman 0.63286260 F
31 Demiboy -1.38886070 Q
32 Female cisgender -0.06271410 F
33 Cis female, she her 2.01842371 F
34 I identify as a trans woman! 2.28664539 Q
35 Nonbinary 1.32011335 Q
36 Cis female, she her 2.01842371 F
37 Female heterosexual 1.30486965 F
38 female 0.36312841 F
39 male 0.63595040 M
40 Female 0.40426832 F
41 Transwoman -0.27878877 Q
42 Female 0.40426832 F
43 Male/AMAB -1.78130843 M
44 Female -0.56469817 F
45 woman -2.65645542 F
46 m -0.28425292 M
47 woman -2.65645542 F
48 Female 0.40426832 F
49 Transwoman -0.27878877 Q
50 Woman -0.09465904 F
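The question also asked to keep whatever the participant typed when a response is not clearly "M" or "F", rather than collapsing the rest into one code. A small sketch of that variant, assuming it is acceptable to compare lowercased, trimmed responses so the lists need fewer entries. The shortened f and m vectors and the toy data frame are assumptions for illustration; gender and gender_new follow the column names used above.

```r
# Toy data; stringsAsFactors = FALSE keeps the column as character in older R versions
df <- data.frame(gender = c("Woman", "Female/woman", "Male/AMAB", "m",
                            "Demiboy", "Nonbinary", "F"),
                 stringsAsFactors = FALSE)

# Lowercase lists: comparing against tolower(gender) covers case variants
f <- c("f", "female", "woman", "female/woman", "female cisgender")
m <- c("m", "male", "male/amab")

key <- tolower(trimws(df$gender))

# Start from the typed response and overwrite only the exact matches
df$gender_new <- df$gender
df$gender_new[key %in% f] <- "F"
df$gender_new[key %in% m] <- "M"

df
```

Rows that match neither list keep their original text, so they can be inspected and re-coded by hand.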
Data:
set.seed(42)
df <- data.frame(
gender=GenderSex, ## from OP
x=rnorm(length(GenderSex))
)
df <- df[sample.int(nrow(df), 50, replace=TRUE), ] |> `rownames<-`(NULL)
case_when(... , .default=GenderSex). That might show the values that don't quite match what you intend. The output showing [1] Male with no quotes suggests that your values are already factors and not strings; generally factor(.) %in% . should still work, so it's not clear what you have going on. I suggest you share dput(df$GenderSex) instead of what you assume GenderSex is, since often it's what you don't expect that is the real culprit. – r2evans Commented Feb 22 at 19:59
setdiff(df$GenderSex, c(f, m, gq)) shows nine values that are not otherwise included in your f, m, and gq vectors. – r2evans Commented Feb 22 at 20:05
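The setdiff check from the comment can be reproduced with the vectors posted in the question. This sketch uses the plain character vector GenderSex rather than df$GenderSex, which is an assumption about what the column holds.

```r
# Responses and lookup vectors as posted in the question
GenderSex <- c("Male", "Female", "female", "Female/woman", "Female", "F",
               "female", "Woman", "Cis female, she her", "Female cisgender",
               "Female heterosexual", "I identify as a trans woman!", "Demiboy",
               "Transwoman", "My sex is female and my gender identity is nonbinary",
               "male", "m", "woman", "Woman", "Nonbinary", "my gender doesn't exist",
               "Male/AMAB")
f  <- c("Female/woman", "female", "Female cisgender", "Female", "Woman",
        "woman", "Women", "women", "f", "F")
m  <- c("male", "Cis Male", "Male", "m", "M", "ma,e=]]")
gq <- c("genderqueer", "nonbinary", "genderfluid")

# Values that case_when() cannot match exactly, hence the NAs
unmatched <- setdiff(GenderSex, c(f, m, gq))
length(unmatched)  # 9
unmatched

# %in% gives the same result whether the column is a factor or character
identical(factor(GenderSex) %in% f, GenderSex %in% f)  # TRUE
```

Note that "Nonbinary" is among the unmatched values because gq only contains the lowercase "nonbinary", and %in% compares strings exactly.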