I am struggling to understand the behavior of dplyr::case_match() when dealing with missing values. Let's suppose we have a variable with unique values of 1, 2 and NA, and we want to convert missing to 9 and 2 to 0, and leave 1's alone. I have pasted some R code showing the problem below.
df <- tibble(var1 = c(NA, NA, 1, 2))
>
> df
# A tibble: 4 × 1
var1
<dbl>
1 NA
2 NA
3 1
4 2
>
> # convert missing to 9, 2 to 0 and leave 1's alone
> df %>% mutate(var1_recode = case_when(is.na(var1)~9, var1==2~0, .default=var1))
# A tibble: 4 × 2
var1 var1_recode
<dbl> <dbl>
1 NA 9
2 NA 9
3 1 1
4 2 0
>
> # now using case_match() rather than case_when(), why does this not work properly?
> df %>% mutate(var1_recode = case_match(var1, is.na(var1)~9, 2~0, .default=var1))
# A tibble: 4 × 2
var1 var1_recode
<dbl> <dbl>
1 NA NA
2 NA NA
3 1 9
4 2 0
>
> # but this does work properly
> df %>% mutate(var1_recode = case_match(var1, NA~9, 2~0, .default=var1))
# A tibble: 4 × 2
var1 var1_recode
<dbl> <dbl>
1 NA 9
2 NA 9
3 1 1
4 2 0
>
I realise that my first attempt using case_match() is an ugly way to program. I discovered this issue while using across(), which passed x from an anonymous function, so that the second argument of case_match() read is.na(x). This obviously didn't work either.
I'm wondering whether dplyr::case_match() should display a warning or even an error if coded the way I did in my first example, which clearly does not give the desired result?
asked Nov 21, 2024 at 8:48 by user167591
1 Answer
I'll demonstrate why your second attempt (the first one using case_match()) doesn't work, including why the value of 1 was converted to 9. I've changed your tibble slightly to make things clearer (to me at least :p): I added rows with 0 and -1 and changed the default to 10.
library(tibble)
library(dplyr)
df <- tibble(var1 = c(0, -1, NA, 1, 2))
df %>% mutate(var1_recode = case_match(var1, is.na(var1)~9, 2~0, .default=10))
# A tibble: 5 × 2
var1 var1_recode
<dbl> <dbl>
1 0 9
2 -1 10
3 NA 10
4 1 9
5 2 0
It helped me to understand that this case_match() call is equivalent to
case_match(var1,
c(FALSE, FALSE, TRUE, FALSE, FALSE) ~ 9,
2 ~ 0,
.default = 10)
because
> is.na(df$var1)
[1] FALSE FALSE TRUE FALSE FALSE
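One way to check this equivalence is to run both forms and compare the results. This is a minimal sketch, assuming dplyr 1.1.0 or later (the release that introduced case_match()); the names with_call and with_literal are only for illustration.

```r
library(dplyr)

var1 <- c(0, -1, NA, 1, 2)

# LHS written as a call: it is evaluated once, to a logical vector
with_call <- case_match(var1, is.na(var1) ~ 9, 2 ~ 0, .default = 10)

# LHS written out as the logical vector that is.na(var1) returns
with_literal <- case_match(
  var1,
  c(FALSE, FALSE, TRUE, FALSE, FALSE) ~ 9,
  2 ~ 0,
  .default = 10
)

# Both calls should give the same result
identical(with_call, with_literal)
```

The point is that case_match() never sees is.na(var1) as a row-wise condition; it only sees the set of values the expression evaluates to.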
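case_match() works by checking whether each value of its first argument appears among the values on the LHS of each formula. As far as I can tell, the LHS is first converted to the type of var1, so FALSE is treated as 0 and TRUE as 1. Going row by row through var1:
Row 1: 0 is equivalent to FALSE, so it is in c(FALSE, FALSE, TRUE, FALSE, FALSE), and it returns 9. (So I do respectfully disagree with @Allan Cameron that "neither TRUE nor FALSE are equal to the values in var1".)
Row 2: -1 is not in c(FALSE, FALSE, TRUE, FALSE, FALSE) and is not 2, so it goes to .default.
Row 3: NA is not in c(FALSE, FALSE, TRUE, FALSE, FALSE) either, because is.na() never returns NA, and it is not 2, so again it goes to .default. (case_match() does match a missing value against a literal NA on the LHS, which is why NA ~ 9 works in your last example.)
Row 4: similar to row 1, 1 == TRUE, so it is in c(FALSE, FALSE, TRUE, FALSE, FALSE), and it returns 9.
Row 5: 2 ~ 0 needs no explanation.
Hence, you get var1_recode == c(9, 10, 10, 9, 0).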
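The row-by-row reasoning can be reproduced by hand with base R set membership. This is only a rough sketch of the behaviour, not dplyr's actual implementation: it assumes the logical LHS is converted to double before matching, and the names lhs_as_double, hits_first, hits_second and emulated are made up for illustration.

```r
library(dplyr)

var1 <- c(0, -1, NA, 1, 2)

# Assumption: the logical LHS is converted to the type of var1 (double),
# so FALSE becomes 0 and TRUE becomes 1
lhs_as_double <- as.numeric(is.na(var1))

# Membership test for each formula, in order
hits_first <- var1 %in% lhs_as_double   # rows whose value is 0 or 1
hits_second <- var1 %in% 2              # rows whose value is 2

# First matching formula wins; unmatched rows get the default of 10
emulated <- ifelse(hits_first, 9, ifelse(hits_second, 0, 10))

actual <- case_match(var1, is.na(var1) ~ 9, 2 ~ 0, .default = 10)

emulated
actual
```

Both vectors should agree with the var1_recode column shown in the output above.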
~ and replaces it with the value on the right. In your first, non-working example, is.na(var1) resolves to a vector of TRUE and FALSE. Since neither TRUE nor FALSE are equal to the values in var1, their values are not matched and therefore we get the .default value, which is the original var1 value of NA. As for whether this should emit a warning or error in case someone misunderstands the syntax, this might be best addressed via a bug report. – Allan Cameron Commented Nov 21, 2024 at 9:06
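For the across() situation described in the question, a possible approach is to match on a literal NA inside the anonymous function instead of is.na(x). This is a hedged sketch, assuming dplyr 1.1.0 or later; the second column var2 and the _recode suffix are invented here purely for illustration.

```r
library(dplyr)

# var2 is a made-up second column so that across() has more than one input
df <- tibble(var1 = c(NA, NA, 1, 2), var2 = c(2, NA, 1, 1))

recoded <- df %>%
  mutate(across(
    c(var1, var2),
    # A literal NA on the LHS matches missing values in x
    function(x) case_match(x, NA ~ 9, 2 ~ 0, .default = x),
    .names = "{.col}_recode"
  ))

recoded
```

If a logical test such as is.na(x) is preferred, case_when() is the function that takes conditions on the left-hand side, as in the first working example in the question.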