I have an R data frame, and I need to perform a random binomial draw for each row. The n = argument of the draw should come from a value in a column of that row. Further, this operation should sit inside a case_when() that branches on a condition in the data.

Note: dplyr's rowwise() is much too slow here: the data frame is large, and the operation runs at every timestep of a simulation model. Is there a quick and efficient way to do this?

Example:

library(tidyverse)

df = data.frame(condition = c("A","B","A","B","C"),
                number = c(1000,1000,1000,1000,1))
prob1 = 0.000517143
prob2 = 0.000213472


set.seed(1)
df = df %>% 
  mutate(output = case_when(condition == "A" ~ sum(rbinom(n = number,
                                                          size = 1,
                                                          prob = prob1)),
                            condition == "B" ~ sum(rbinom(n = number,
                                                          size = 1,
                                                          prob = prob2)),
                            TRUE ~ 0))
print(df)
#>   condition number output
#> 1         A   1000      0
#> 2         B   1000      0
#> 3         A   1000      0
#> 4         B   1000      0
#> 5         C      1      0

Here, it looks like the random binomial draws are being reused and returning all zeros.

As a check, here it is sampled repeatedly. In expectation, sum(df$output) should be about 1.5 on each iteration (2 × 1000 × prob1 + 2 × 1000 × prob2 ≈ 1.46), so ten zeros in a row is implausible.

for(i in 1:10){
  df = df %>% 
    mutate(output = case_when(condition == "A" ~ sum(rbinom(n = number,
                                                            size = 1,
                                                            prob = prob1)),
                              condition == "B" ~ sum(rbinom(n = number,
                                                            size = 1,
                                                            prob = prob2)),
                              TRUE ~ 0))
  print(sum(df$output))}
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0
#> [1] 0

Unsure of the way forward.

Asked Feb 3 at 2:26 by geoscience123
  • How big is the real data frame? – Edward Commented Feb 3 at 2:42
  • Are prob1 and prob2 constant, as in your example data, or do they vary? How many unique values does number take — is it substantially less than the number of rows? – zephryl Commented Feb 3 at 3:05
  • @Edward, The real data frame varies, but somewhere around 80,000 observations. – geoscience123 Commented Feb 3 at 11:45

3 Answers

Why are you summing draws of size 1? Refer to Wikipedia:

In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p).

Thus, you can sample once per row and don't need to sum. Since rbinom is fully vectorized, you don't need a loop.

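# Attach each condition's probability; conditions without one (here "C") get NA.
# Note: merge() sorts by the key, so the row order below differs from the input.
df <- merge(df,
            data.frame(condition = c("A", "B"),
                       prob = c(0.000517143, 0.000213472)),
            by = "condition", all.x = TRUE)
df[is.na(df$prob), "prob"] <- 0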

set.seed(1)
df$output <- with(df, rbinom(length(number), size = number, prob = prob)) 

#  condition number        prob output
#1         A   1000 0.000517143      0
#2         A   1000 0.000517143      0
#3         B   1000 0.000213472      0
#4         B   1000 0.000213472      1
#5         C      1 0.000000000      0
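
A possible explanation for the all-zero output in the question, offered as a sketch rather than a definitive diagnosis: according to ?rbinom, when n has length greater than 1, the length of n is taken as the number of draws. Inside mutate(), number is the whole column, so each case_when() branch would sum only five Bernoulli draws and recycle that single value to every row. The vector below simply mirrors the question's number column.

```r
number <- c(1000, 1000, 1000, 1000, 1)
prob1 <- 0.000517143

set.seed(1)
# n has length 5, so rbinom() returns 5 draws in total,
# not 1000 draws for the first row, 1000 for the second, and so on.
draws <- rbinom(n = number, size = 1, prob = prob1)
length(draws)
#> [1] 5

# One draw per row instead: n is the number of rows,
# and size carries each row's number of trials.
per_row <- rbinom(n = length(number), size = number, prob = prob1)
length(per_row)
#> [1] 5
```

With size = number, each element of per_row is already a count of successes out of that row's trials, so no sum() is needed.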

You could use mapply:

set.seed(1)
df['output'] <- mapply(function(cond, num) sum(rbinom(n = num, 
                                                      size = 1, 
                                                      prob = ifelse(cond=="A", prob1,
                                                                    ifelse(cond=="B", prob2, 0)))),
                       cond=df$condition, num=df$number)
df

  condition number output
1         A   1000      1
2         B   1000      0
3         A   1000      1
4         B   1000      0
5         C      1      0

For a larger data frame (one with 100,000 rows), the above command takes about 5 seconds on my machine.
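
To put that timing in context, here is a rough benchmark sketch comparing the row-by-row mapply approach with a single vectorized rbinom call. The names big, n_rows, and p are hypothetical, and the data are scaled down to 10,000 rows so the comparison runs quickly; timings depend on the machine, so none are shown. The two approaches consume the random number stream differently, so they give different values under the same seed, but both draw from the same binomial distribution.

```r
prob1 <- 0.000517143
prob2 <- 0.000213472

set.seed(1)
n_rows <- 1e4  # assumed size, smaller than the 100,000 rows mentioned above
big <- data.frame(condition = sample(c("A", "B", "C"), n_rows, replace = TRUE),
                  number = 1000,
                  stringsAsFactors = FALSE)

# Per-row success probability; any condition other than A or B gets 0
p <- ifelse(big$condition == "A", prob1,
            ifelse(big$condition == "B", prob2, 0))

# Row by row: sum `number` Bernoulli draws for each row
t_loop <- system.time(
  out_loop <- mapply(function(num, pr) sum(rbinom(n = num, size = 1, prob = pr)),
                     big$number, p)
)

# Vectorized: one binomial draw per row with size = number
t_vec <- system.time(
  out_vec <- rbinom(n = n_rows, size = big$number, prob = p)
)

print(t_loop["elapsed"])
print(t_vec["elapsed"])
```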

You don't really need tidyverse for this problem.

You can avoid slow ifelse/case_when calls using a probs list. Another advantage is improved clarity.

> probs <- list(A=0.000517143, B=0.000213472, C=0)
> 
> set.seed(1)
> mapply(\(x, y) rbinom(n=x, size=1, prob=probs[[y]]) |> sum(), 
+        df$number, df$condition)
[1] 1 0 1 0 0

Altogether:

> set.seed(1)
> df |> 
+   transform(
+     out=mapply(\(x, y) rbinom(n=x, size=1, prob=probs[[y]]) |> sum(), 
+                number, condition)
+   )
  condition number out
1         A   1000   1
2         B   1000   0
3         A   1000   1
4         B   1000   0
5         C      1   0
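
The lookup idea can also be combined with a single vectorized rbinom call, which avoids mapply entirely and keeps the rows in their original order. This is a minimal sketch, assuming the probabilities are constant per condition as in the question; the intermediate name p is hypothetical, and probs is a named numeric vector here rather than a list.

```r
df <- data.frame(condition = c("A", "B", "A", "B", "C"),
                 number = c(1000, 1000, 1000, 1000, 1))

# Named vector of per-condition probabilities (values from the question)
probs <- c(A = 0.000517143, B = 0.000213472)

# Look up each row's probability by name; as.character() guards against
# factor columns, and conditions missing from probs come back as NA.
p <- unname(probs[as.character(df$condition)])
p[is.na(p)] <- 0  # rows without a listed probability never succeed

set.seed(1)
# One binomial draw per row: size is that row's number of trials
df$output <- rbinom(n = nrow(df), size = df$number, prob = p)
df
```

Because this is one call over plain vectors, it should scale to the roughly 80,000 rows mentioned in the comments without a per-row function call, though that is worth timing on the real data.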

Tags: conditional statements, Perform a random binomial draw for each row in R without rowwise(), Stack Overflow