admin管理员组文章数量:1290929
I'm comparing the speed difference between these two methods of updating a column on a data table (for testing, the flag column is 50% TRUE, 50% FALSE):
A[flag==TRUE,b:=b + myfunc(a,b)]
A[,b:=b + fifelse(flag,myfunc(a,b),0)]
Naively, I would expect the first case to be faster, since they perform the same number of if checks on the flag column, the same number of computations, but the second method has a number of gratuitous b+0 calculations and b:=b reassignments. However, the results I got using the full code below are:
[1] "Total time run 1: 0.281515121459961"
[1] "Total time run 2: 0.227149963378906"
The second method is almost 24% faster. Why is this the case?
Full test code:
library(data.table)
n = 10000000
# Random non-trivial function
myfunc= function(x,y) {
return((x+y)*x/(y^2))
}
A = data.table(flag=rep(c(TRUE,FALSE),n/2),a=(1:n)/n, b = (1:n)/n)
run1Start = Sys.time()
A[flag==TRUE,b:=b + myfunc(a,b)]
run1End = Sys.time()
print(paste0("Total time run 1: ", difftime(run1End,run1Start,units="secs")))
A = data.table(flag=rep(c(TRUE,FALSE),n/2),a=(1:n)/n, b = (1:n)/n)
run2Start = Sys.time()
A[,b:=b + fifelse(flag,myfunc(a,b),0)]
run2End = Sys.time()
print(paste0("Total time run 2: ", difftime(run2End,run2Start,units="secs")))
I'm comparing the speed difference between these two methods of updating a column on a data table (for testing, the flag column is 50% TRUE, 50% FALSE):
A[flag==TRUE,b:=b + myfunc(a,b)]
A[,b:=b + fifelse(flag,myfunc(a,b),0)]
Naively, I would expect the first case to be faster, since they perform the same number of if checks on the flag column, the same number of computations, but the second method has a number of gratuitous b+0 calculations and b:=b reassignments. However, the results I got using the full code below are:
[1] "Total time run 1: 0.281515121459961"
[1] "Total time run 2: 0.227149963378906"
The second method is almost 24% faster. Why is this the case?
Full test code:
library(data.table)
n = 10000000
# Random non-trivial function
myfunc= function(x,y) {
return((x+y)*x/(y^2))
}
A = data.table(flag=rep(c(TRUE,FALSE),n/2),a=(1:n)/n, b = (1:n)/n)
run1Start = Sys.time()
A[flag==TRUE,b:=b + myfunc(a,b)]
run1End = Sys.time()
print(paste0("Total time run 1: ", difftime(run1End,run1Start,units="secs")))
A = data.table(flag=rep(c(TRUE,FALSE),n/2),a=(1:n)/n, b = (1:n)/n)
run2Start = Sys.time()
A[,b:=b + fifelse(flag,myfunc(a,b),0)]
run2End = Sys.time()
print(paste0("Total time run 2: ", difftime(run2End,run2Start,units="secs")))
Share
Improve this question
edited Feb 13 at 23:08
jpsmith
17.7k6 gold badges22 silver badges45 bronze badges
asked Feb 13 at 22:53
SinnombreSinnombre
4682 silver badges9 bronze badges
2
|
1 Answer
Reset to default 5When you do:
A[flag == TRUE, b := b + myfunc(a,b)]
R first has to process which rows satisfy flag == TRUE
(i.e. it does a subsetting operation), then perform the function on those rows. Even though you're only running the function on ~1/2 the rows, that extra step of subsetting adds overhead.
On the other hand, when you use:
A[, b := b + fifelse(flag, myfunc(a,b), 0)]
the entire operation is done in one go and reduces/removes that overhead.
However, since flag
is a boolean vector, the fastest approach may be:
A[(flag), b:=b + myfunc(a,b)]
This may seem counterintuitive since it's similar to the first approach, but it takes advantage of the logical vector to subset the rows (not evaluate the rows).
microbenchmark::microbenchmark(
x1 = A[flag==TRUE,b:=b + myfunc(a,b)],
x2 = A[,b:=b + fifelse(flag,myfunc(a,b),0)],
y1 = A[(flag), b:=b + myfunc(a,b)]
)
Unit: milliseconds
expr min lq mean median uq max neval cld
x1 149.3406 234.1432 333.2438 300.7187 375.2823 1040.9021 100 a
x2 94.7926 161.1049 246.2662 218.9331 267.1994 904.3553 100 b
y1 100.1030 153.8516 209.7753 188.3112 228.9370 866.6606 100 b
An imperfect analogy would be if you were a gardener needing to only water half your plants. The first approach you would walk through and tag the ones needing watering while someone else follows behind and waters where needed. The second approach, you would identify the plants that need watering and water them on the spot. The third approach has all the plants needed for watering already grouped together.
本文标签: rWhy is fifelse faster than checking a subset in datatableStack Overflow
版权声明:本文标题:r - Why is fifelse faster than checking a subset in data.table? - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1741502978a2382154.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
which(flag==TRUE)
might be faster than flag==TRUE or (flag). – s_baldur Commented Feb 14 at 13:37