
I'm comparing the speed difference between these two methods of updating a column on a data table (for testing, the flag column is 50% TRUE, 50% FALSE):

A[flag==TRUE,b:=b + myfunc(a,b)]
A[,b:=b + fifelse(flag,myfunc(a,b),0)]

Naively, I would expect the first case to be faster, since they perform the same number of if checks on the flag column, the same number of computations, but the second method has a number of gratuitous b+0 calculations and b:=b reassignments. However, the results I got using the full code below are:

[1] "Total time run 1: 0.281515121459961"
[1] "Total time run 2: 0.227149963378906"

The second method is almost 24% faster. Why is this the case?

Full test code:

library(data.table)

n = 10000000

# Random non-trivial function
myfunc= function(x,y) {
  return((x+y)*x/(y^2))
}


A = data.table(flag=rep(c(TRUE,FALSE),n/2),a=(1:n)/n, b = (1:n)/n)
run1Start = Sys.time()
A[flag==TRUE,b:=b + myfunc(a,b)]
run1End = Sys.time()
print(paste0("Total time run 1: ", difftime(run1End,run1Start,units="secs")))


A = data.table(flag=rep(c(TRUE,FALSE),n/2),a=(1:n)/n, b = (1:n)/n)
run2Start = Sys.time()
A[,b:=b + fifelse(flag,myfunc(a,b),0)]
run2End = Sys.time()
print(paste0("Total time run 2: ", difftime(run2End,run2Start,units="secs")))

  • Why do you care about a difference of a few ms? In a real-world application, those should not matter. Or are you running this code very often in a loop? Then you should reconsider the whole approach. – Roland Commented Feb 14 at 5:57
  • Maybe worth mentioning that which(flag==TRUE) might be faster than flag==TRUE or (flag). – s_baldur Commented Feb 14 at 13:37

1 Answer


When you do:

A[flag == TRUE, b := b + myfunc(a,b)]

data.table first has to evaluate flag == TRUE and work out which rows satisfy it (i.e. it performs a subsetting operation), and only then apply the function to those rows. Even though you're only running the function on roughly half the rows, that extra subsetting step adds overhead.
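You can see that extra step for yourself: data.table has a verbose mode that prints the internal stages of a call, including the work done to resolve the i expression. A minimal sketch, assuming the A and myfunc objects from the question are already defined (the exact messages vary by data.table version):

options(datatable.verbose = TRUE)        # print data.table's internal steps
A[flag == TRUE, b := b + myfunc(a, b)]   # output includes the row subsetting done for i
options(datatable.verbose = FALSE)       # switch verbose mode back off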

On the other hand, when you use:

A[, b := b + fifelse(flag, myfunc(a,b), 0)]

the entire operation is done in a single vectorised pass over the full column, which avoids that subsetting overhead.
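Note that fifelse(), being an ordinary R function, has its yes argument evaluated in full, so the second approach actually calls myfunc() on all 10 million rows; it wins because it skips the subsetting step, not because it does less arithmetic. A rough sketch to separate those two costs, using the A and myfunc from the question (which = TRUE returns only the matching row numbers; timings will vary by machine):

microbenchmark::microbenchmark(
  locate_rows = A[flag == TRUE, which = TRUE],  # cost of finding the TRUE rows only
  full_eval   = myfunc(A$a, A$b),               # cost of running myfunc on every row
  times = 10
)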

However, since flag is a boolean vector, the fastest approach may be:

A[(flag), b:=b + myfunc(a,b)]

This may seem counterintuitive since it looks just like the first approach, but supplying the logical column directly lets data.table use it to subset the rows without first evaluating the flag == TRUE comparison.
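Where that saving comes from can be seen at the vector level: flag == TRUE allocates and fills a second 10-million-element logical vector before any subsetting happens, while (flag) hands the existing column straight to data.table, and which(), as mentioned in the comments, converts it to integer row indices. A small sketch on the question's A (timings are illustrative only):

microbenchmark::microbenchmark(
  compare = A$flag == TRUE,   # extra pass: builds a new logical vector
  direct  = A$flag,           # the existing column, no comparison needed
  indices = which(A$flag),    # integer positions of the TRUE rows
  times = 10
)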

microbenchmark::microbenchmark(
  x1 = A[flag==TRUE,b:=b + myfunc(a,b)],
  x2 = A[,b:=b + fifelse(flag,myfunc(a,b),0)],
  y1 = A[(flag), b:=b + myfunc(a,b)]
)

Unit: milliseconds
 expr      min       lq     mean   median       uq       max neval cld
   x1 149.3406 234.1432 333.2438 300.7187 375.2823 1040.9021   100  a 
   x2  94.7926 161.1049 246.2662 218.9331 267.1994  904.3553   100   b
   y1 100.1030 153.8516 209.7753 188.3112 228.9370  866.6606   100   b

An imperfect analogy: you are a gardener who needs to water only half your plants. In the first approach, you walk through and tag the plants that need watering, and someone else follows behind and waters the tagged ones. In the second approach, you check each plant and water it on the spot. In the third approach, the plants that need watering are already grouped together.
