I want to understand the performance implications of elementwise transformations inside rolling window aggregations. Consider the following two versions of a rolling aggregation (of floating-point values):
I)
X = frame.rolling(index_column="date", group_by="group", period="360d").agg(
    pl.col("value").sin().sum().alias("sin(value)"),
    pl.col("value").cos().sum().alias("cos(value)"),
    pl.col("value").sum()
)
II)
Y = frame.with_columns(
    pl.col("value").sin().alias("sin(value)"),
    pl.col("value").cos().alias("cos(value)")
).rolling(index_column="date", group_by="group", period="360d").agg(
    pl.col("sin(value)").sum(),
    pl.col("cos(value)").sum(),
    pl.col("value").sum()
)
Naively I'd expect the second version to be universally faster than the first version, since by design it avoids redundant re-computation of sin(value) and cos(value) for each window (and group).
I was however surprised to find that both versions are almost identical in runtime across different sizes of the group and time dimensions. How is that possible? Is Polars automagically pushing the elementwise transformations (sin and cos) out of the rolling window aggregation?
In addition, for a large number of dates the second version can even be slower than the first version, cf. the image below.
Can anyone help me understand what is going on here?
Full code for the experiment is below:
import datetime
import itertools
import time
import numpy as np
import polars as pl
import polars.testing
def run_experiment():
    start = datetime.date.fromisoformat("1991-01-01")
    result = {"num_dates": [], "num_groups": [], "version1": [], "version2": []}
    for n_dates in [1000, 2000, 5000, 10000]:
        end = start + datetime.timedelta(days=(n_dates - 1))
        dates = pl.date_range(start, end, eager=True)
        for m_groups in [10, 20, 50, 100, 200, 500, 1000]:
            groups = [f"g_{i + 1}" for i in range(m_groups)]
            groups_, dates_ = list(zip(*itertools.product(groups, dates)))
            frame = pl.from_dict({"group": groups_, "date": dates_, "value": np.random.rand(n_dates * m_groups)})
            # version I: elementwise sin/cos evaluated inside the rolling aggregation
            t0 = time.time()
            X = frame.rolling(index_column="date", group_by="group", period="360d").agg(
                pl.col("value").sin().sum().alias("sin(value)"),
                pl.col("value").cos().sum().alias("cos(value)"),
                pl.col("value").sum()
            )
            t1 = time.time() - t0
            # version II: sin/cos precomputed once up front, then summed per window
            t0 = time.time()
            Y = frame.with_columns(
                pl.col("value").sin().alias("sin(value)"),
                pl.col("value").cos().alias("cos(value)")
            ).rolling(index_column="date", group_by="group", period="360d").agg(
                pl.col("sin(value)").sum(),
                pl.col("cos(value)").sum(),
                pl.col("value").sum()
            )
            t2 = time.time() - t0
            polars.testing.assert_frame_equal(X, Y)
            result["num_dates"].append(n_dates)
            result["num_groups"].append(m_groups)
            result["version1"].append(t1)
            result["version2"].append(t2)
    return pl.from_dict(result)
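As an aside on the measurement itself (a sketch, not part of the original experiment): a single time.time() pair is fairly coarse for runs in this range, so I also considered taking the best of several repeats with the higher-resolution, monotonic time.perf_counter(), along these lines:

```python
import time

def timed(fn, repeats=5):
    """Run fn several times and return the best wall-clock duration.

    Taking the minimum of several runs with time.perf_counter (monotonic,
    high resolution) reduces noise compared to a single time.time() pair.
    """
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

# Usage against the experiment would look like (hypothetical):
#   t1 = timed(lambda: frame.rolling(...).agg(...))
print(timed(lambda: sum(range(100_000))) > 0)  # prints True
```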
Source: python - Optimisation of window aggregations: Pushing per-element expressions out of the window aggregation - Stack Overflow