I am struggling to understand the behavior of dplyr::case_match() in dealing with missing values. Let's suppose we have a variable with unique values of 1, 2 and NA, and we want to convert missing to 9 and 2 to 0, and leave 1s alone. I have pasted some R code with the problem below.

> df <- tibble(var1 = c(NA, NA, 1, 2))
> 
> df
# A tibble: 4 × 1
   var1
  <dbl>
1    NA
2    NA
3     1
4     2
> 
> # convert missing to 9, 2 to 0 and leave 1's alone
> df %>% mutate(var1_recode = case_when(is.na(var1)~9, var1==2~0, .default=var1))
# A tibble: 4 × 2
   var1 var1_recode
  <dbl>       <dbl>
1    NA           9
2    NA           9
3     1           1
4     2           0
> 
> # now using case_match() rather than case_when(), why does this not work properly?
> df %>% mutate(var1_recode = case_match(var1, is.na(var1)~9, 2~0, .default=var1))
# A tibble: 4 × 2
   var1 var1_recode
  <dbl>       <dbl>
1    NA          NA
2    NA          NA
3     1           9
4     2           0
> 
> # but this does work properly
> df %>% mutate(var1_recode = case_match(var1, NA~9, 2~0, .default=var1))
# A tibble: 4 × 2
   var1 var1_recode
  <dbl>       <dbl>
1    NA           9
2    NA           9
3     1           1
4     2           0
> 

I realise that my first attempt using case_match() is an ugly way to program. I discovered this issue while using across() with an anonymous function that passes x, so that the second argument to case_match() read is.na(x) ~ 9. This obviously didn't work either.

I'm wondering whether dplyr::case_match() should display a warning or even an error if coded the way I did in my first example, which clearly does not give the desired result?

asked Nov 21, 2024 at 8:48 by user167591
  • 3 The second version has the behaviour I would expect; it matches the value on the left of the ~ and replaces it with the value on the right. In your first, non-working example, is.na(var1) resolves to a vector of TRUE and FALSE. Since neither TRUE nor FALSE are equal to the values in var1, their values are not matched and therefore we get the .default value, which is the original var1 value of NA. As for whether this should emit a warning or error in case someone misunderstands the syntax, this might be best addressed via a bug report. – Allan Cameron Commented Nov 21, 2024 at 9:06
  • 1 Thanks @AllanCameron. That makes sense but in addition, it was really confusing to me how in the non-working example, the value of 1 was converted to 9 – user167591 Commented Nov 21, 2024 at 9:25

1 Answer

I'll demonstrate why your case_match() attempt with is.na(var1) doesn't work, including why the value of 1 was converted to 9. I've changed your tibble a little to make things clearer (to me at least :p): I added rows with 0 and -1, kept a single NA, and changed the default to 10.

library(tibble)
library(dplyr)
df <- tibble(var1 = c(0, -1, NA, 1, 2))
df %>% mutate(var1_recode = case_match(var1, is.na(var1)~9, 2~0, .default=10))
# A tibble: 5 × 2
   var1 var1_recode
  <dbl>       <dbl>
1     0           9
2    -1          10
3    NA          10
4     1           9
5     2           0

It helped me to understand that this case_match() call is equivalent to

case_match(var1, 
           c(FALSE, FALSE, TRUE, FALSE, FALSE) ~ 9, 
           2 ~ 0, 
           .default = 10)

because

> is.na(df$var1)
    [1] FALSE FALSE  TRUE FALSE FALSE

case_match() works by checking, for each value of its first argument, whether that value is among the values on the LHS of each formula. Here the LHS is logical and var1 is numeric, so FALSE is compared as 0 and TRUE as 1. Going row by row through var1:

0 is equivalent to FALSE so row 1 is in c(FALSE, FALSE, TRUE, FALSE, FALSE), thus it returns 9. (So I do respectfully disagree with @Allan Cameron that "neither TRUE nor FALSE are equal to the values in var1")

-1 is not in c(FALSE, FALSE, TRUE, FALSE, FALSE) and !=2 so it goes to .default

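NA is not in c(FALSE, FALSE, TRUE, FALSE, FALSE), because is.na() only ever returns TRUE or FALSE, and NA != 2, so again it goes to .default. (case_match() does match NA when NA itself appears on the LHS, which is why your NA ~ 9 version works.)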

Similar to row 1, 1 == TRUE so row 4 is in c(FALSE, FALSE, TRUE, FALSE, FALSE), thus it returns 9.

Lastly, 2 ~ 0 needs no explanation.
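The row-by-row reasoning can be checked directly. This is only a sketch: I use base R's as.numeric() and %in% as a rough stand-in for the coercion and matching that case_match() appears to do, which is my assumption and not dplyr's actual implementation. The names lhs and res are just for illustration.

```r
library(dplyr)

var1 <- c(0, -1, NA, 1, 2)
lhs  <- is.na(var1)   # logical vector: only the third element is TRUE

# Seen as numbers, the logical LHS is just a set of 0s and 1s
as.numeric(lhs)

# Which values of var1 are among those 0s and 1s?
# (assumption: %in% approximates the matching step)
var1 %in% as.numeric(lhs)

# A literal logical LHS recodes the same way as is.na(var1) did
res <- case_match(var1, c(FALSE, TRUE) ~ 9, 2 ~ 0, .default = 10)
res
```

Only rows 1 and 4 (the values 0 and 1) match the logical LHS, which is where the 9s come from.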

Hence, you get var1_recode == c(9, 10, 10, 9, 0).
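For the across() use case mentioned in the question, here is a minimal sketch that puts the values to match on the LHS (NA ~ 9) instead of a logical test like is.na(x). The column var2 and the "_recode" naming are hypothetical, added only so there is more than one column to recode.

```r
library(dplyr)
library(tibble)

# var2 is a made-up second column for illustration
df <- tibble(var1 = c(NA, NA, 1, 2), var2 = c(2, NA, 1, 1))

recoded <- df %>%
  mutate(across(
    c(var1, var2),
    # match on values: NA becomes 9, 2 becomes 0, everything else is kept
    function(x) case_match(x, NA ~ 9, 2 ~ 0, .default = x),
    .names = "{.col}_recode"
  ))
recoded
```

If you really need a logical condition such as is.na(x) inside the anonymous function, case_when() is the function that takes conditions on the LHS, as in your working first example.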
