
I'm comparing the speed difference between these two methods of updating a column on a data table (for testing, the flag column is 50% TRUE, 50% FALSE):

A[flag==TRUE,b:=b + myfunc(a,b)]
A[,b:=b + fifelse(flag,myfunc(a,b),0)]

Naively, I would expect the first case to be faster, since they perform the same number of if checks on the flag column, the same number of computations, but the second method has a number of gratuitous b+0 calculations and b:=b reassignments. However, the results I got using the full code below are:

[1] "Total time run 1: 0.281515121459961"
[1] "Total time run 2: 0.227149963378906"

The second method is almost 24% faster. Why is this the case?

Full test code:

library(data.table)

n = 10000000

# Random non-trivial function
myfunc= function(x,y) {
  return((x+y)*x/(y^2))
}


A = data.table(flag=rep(c(TRUE,FALSE),n/2),a=(1:n)/n, b = (1:n)/n)
run1Start = Sys.time()
A[flag==TRUE,b:=b + myfunc(a,b)]
run1End = Sys.time()
print(paste0("Total time run 1: ", difftime(run1End,run1Start,units="secs")))


A = data.table(flag=rep(c(TRUE,FALSE),n/2),a=(1:n)/n, b = (1:n)/n)
run2Start = Sys.time()
A[,b:=b + fifelse(flag,myfunc(a,b),0)]
run2End = Sys.time()
print(paste0("Total time run 2: ", difftime(run2End,run2Start,units="secs")))

  • Why do you care about a difference of a few ms? In a real-world application, those should not matter. Or are you running this code very often in a loop? Then you should reconsider the whole approach. – Roland Commented Feb 14 at 5:57
  • Maybe worth mentioning that which(flag==TRUE) might be faster than flag==TRUE or (flag). – s_baldur Commented Feb 14 at 13:37

1 Answer


When you do:

A[flag == TRUE, b := b + myfunc(a,b)]

data.table first has to evaluate flag == TRUE and work out which rows satisfy it (i.e. it performs a subsetting operation), and only then apply the function to those rows. Even though you're only running the function on roughly half the rows, that extra subsetting step adds overhead.
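You can see that extra step for yourself: data.table has a verbose mode that prints the internal stages of a call, including the work done to resolve the i expression. A minimal sketch, assuming the A and myfunc objects from the question are already defined (the exact messages vary by data.table version):

options(datatable.verbose = TRUE)        # print data.table's internal steps
A[flag == TRUE, b := b + myfunc(a, b)]   # output includes the row subsetting done for i
options(datatable.verbose = FALSE)       # switch verbose mode back off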

On the other hand, when you use:

A[, b := b + fifelse(flag, myfunc(a,b), 0)]

the entire operation is done in a single vectorised pass over the full column, which avoids that subsetting overhead.
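Note that fifelse(), being an ordinary R function, has its yes argument evaluated in full, so the second approach actually calls myfunc() on all 10 million rows; it wins because it skips the subsetting step, not because it does less arithmetic. A rough sketch to separate those two costs, using the A and myfunc from the question (which = TRUE returns only the matching row numbers; timings will vary by machine):

microbenchmark::microbenchmark(
  locate_rows = A[flag == TRUE, which = TRUE],  # cost of finding the TRUE rows only
  full_eval   = myfunc(A$a, A$b),               # cost of running myfunc on every row
  times = 10
)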

However, since flag is a boolean vector, the fastest approach may be:

A[(flag), b:=b + myfunc(a,b)]

This may seem counterintuitive since it looks just like the first approach, but supplying the logical column directly lets data.table use it to subset the rows without first evaluating the flag == TRUE comparison.
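Where that saving comes from can be seen at the vector level: flag == TRUE allocates and fills a second 10-million-element logical vector before any subsetting happens, while (flag) hands the existing column straight to data.table, and which(), as mentioned in the comments, converts it to integer row indices. A small sketch on the question's A (timings are illustrative only):

microbenchmark::microbenchmark(
  compare = A$flag == TRUE,   # extra pass: builds a new logical vector
  direct  = A$flag,           # the existing column, no comparison needed
  indices = which(A$flag),    # integer positions of the TRUE rows
  times = 10
)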

microbenchmark::microbenchmark(
  x1 = A[flag==TRUE,b:=b + myfunc(a,b)],
  x2 = A[,b:=b + fifelse(flag,myfunc(a,b),0)],
  y1 = A[(flag), b:=b + myfunc(a,b)]
)

Unit: milliseconds
 expr      min       lq     mean   median       uq       max neval cld
   x1 149.3406 234.1432 333.2438 300.7187 375.2823 1040.9021   100  a 
   x2  94.7926 161.1049 246.2662 218.9331 267.1994  904.3553   100   b
   y1 100.1030 153.8516 209.7753 188.3112 228.9370  866.6606   100   b

An imperfect analogy: you are a gardener who needs to water only half your plants. In the first approach, you walk through and tag the plants that need watering, and someone else follows behind and waters the tagged ones. In the second approach, you check each plant and water it on the spot. In the third approach, the plants that need watering are already grouped together.
