Why is this polars filtering so much slower than my pandas equivalent?
I'm trying a function in polars, and it is significantly slower than my pandas equivalent.
My pandas function is the following:
import pandas as pd
import time
import numpy as np
target_value = 0.5
data = np.random.rand(1000,100)
df = pd.DataFrame(data)
run_times = []
for i in range(100):
    st = time.perf_counter()
    df_filtered = df.loc[(df[0] - target_value).abs() == (df[0] - target_value).abs().min()]
    run_time = time.perf_counter() - st
    run_times.append(run_time)
print(f"avg pandas run: {sum(run_times)/len(run_times)}")
and my polars version is the following:
import polars as pl
import time
import numpy as np
target_value = 0.5
data = np.random.rand(1000,100)
df = pl.DataFrame(data)
run_times = []
for i in range(100):
    st = time.perf_counter()
    df = df.with_columns(abs_diff = (pl.col('column_0') - target_value).abs())
    df_filtered = df.filter(pl.col('abs_diff') == df['abs_diff'].min())
    run_time = time.perf_counter() - st
    run_times.append(run_time)
print(f"avg polars run: {sum(run_times)/len(run_times)}")
My real datasets have 1,000 to 10,000 rows and 100 columns, and I need to filter through many different datasets. For one example of df shape (1_000, 100), my pandas version is several times faster (0.0006 s for pandas vs 0.0037 s for polars), which was unexpected. Is there a more efficient way to write my polars query? Or is it simply expected for pandas to outperform polars on datasets of this size?
One thing to note: when I test with 2 columns, polars is faster, and the more columns I add, the slower polars becomes. On the other hand, polars begins to outperform pandas at about 500,000 rows by 100 columns.
Additionally, in my real use case, I would need to return multiple rows that match the closest value.
Not sure if this is important, but for additional context: I'm running Python on a Linux server.
- What version of polars are you using? I cannot reproduce the difference in speed. – Hericks Commented Feb 11 at 8:41
- I'm using polars 1.22.0 – Raymond Han Commented Feb 11 at 17:00
- Is there any difference if you use LazyFrames? – jqurious Commented Feb 13 at 15:22
3 Answers
Testing your "function" with pandas, polars, and numpy:
import pandas as pd
import time
import numpy as np
import polars as pl

def test(func, argument):
    run_times = []
    for i in range(100):
        st = time.perf_counter()
        df = func(argument)
        run_time = time.perf_counter() - st
        run_times.append(run_time)
    return np.mean(run_times)

def f_pandas(df):
    min_abs_diff = (df[0] - target_value).abs().min()
    return df.loc[(df[0] - target_value).abs() == min_abs_diff]

def f_pandas_vectorized(df):
    return df.loc[(df[0] - target_value).abs().idxmin()]

def f_polars(df):
    min_abs_diff = (df["column_0"] - target_value).abs().min()
    return df.filter((df["column_0"] - target_value).abs() == min_abs_diff)

def f_numpy(data):
    abs_diff = np.abs(data[:, 0] - target_value)
    min_idx = np.argmin(abs_diff)
    return pd.DataFrame(data[[min_idx]])

target_value = 0.5
data = np.random.rand(100000, 1000)
df = pd.DataFrame(data)
df_pl = pl.DataFrame(data)

print(f"average pandas runtime: {test(f_pandas, df)}")
print(f"average pandas runtime with idxmin(): {test(f_pandas_vectorized, df)}")
print(f"average polars runtime: {test(f_polars, df_pl)}")
print(f"average numpy runtime: {test(f_numpy, data)}")
I got these results running in a Jupyter Notebook on a Linux machine (three separate runs):
average pandas runtime: 0.00989325414002451
average pandas runtime with idxmin(): 0.005005129760029377
average polars runtime: 0.006758741329904296
average numpy runtime: 0.004175669220221607
average pandas runtime: 0.009967705049803044
average pandas runtime with idxmin(): 0.005097740050114225
average polars runtime: 0.006972378070222476
average numpy runtime: 0.004102102290034964
average pandas runtime: 0.010020545769948512
average pandas runtime with idxmin(): 0.004993948210048984
average polars runtime: 0.007027968560159934
average numpy runtime: 0.004024256040174805
You can see that polars is faster than your pandas code, but using a vectorized operation like idxmin() in pandas is, at least in this case, better than polars. numpy is often faster still for this type of numerical work.
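As a side note (not part of the answer above): f_numpy returns only a single row via argmin, while the question mentions needing every row that ties for the closest value. A minimal mask-based sketch, assuming the same target_value and benchmark harness as above (the name f_numpy_all_ties is just illustrative):
import numpy as np
import pandas as pd

target_value = 0.5

def f_numpy_all_ties(data):
    # Distance of the first column from the target value
    abs_diff = np.abs(data[:, 0] - target_value)
    # Keep every row whose distance equals the minimum, not just the first match
    return pd.DataFrame(data[abs_diff == abs_diff.min()])

# Usage with the benchmark harness above:
# print(f"average numpy runtime (all ties): {test(f_numpy_all_ties, data)}")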
You could optimize your polars query a bit, especially by using expressions instead of df["col"]. You can gain even more if you don't mind getting only one row out of the query instead of all the rows tying for the minimum.
import polars as pl
import time
import numpy as np

target_value = 0.5
data = np.random.rand(1000, 100)
df = pl.DataFrame(data)
run_times = []
for i in range(100):
    st = time.time()
    abs_diff = (pl.col('column_0') - target_value).abs()
    # Option A - keep original behaviour but just better optimized
    # df_filtered = df.filter(abs_diff == abs_diff.min())
    # Option B - only get the row at the minimum's index instead of filtering
    df_filtered = df.row(df.select(abs_diff.arg_min()).item())
    run_time = time.time() - st
    run_times.append(run_time)
print(f"avg polars run: {sum(run_times)/len(run_times)}")
As others have said, numpy (or jax, etc.) may well be better suited for this kind of work, though.
Per Polars Support:
First, you need way more iterations than just 100 for such a small time window. With 10,000 iterations I get the following:
avg polars run: 0.0005123567976988852
avg pandas run: 0.00012923809615895151
But we can rewrite the polars query to be more efficient:
df_filtered = (
    df.lazy()
    .with_columns(abs_diff=(pl.col.column_0 - target_value).abs())
    .filter(pl.col.abs_diff == pl.col.abs_diff.min())
    .collect()
)
Then we get:
avg polars run: 0.00018435594723559915
Ultimately, Polars isn't optimized for doing many tiny, horizontally wide datasets though.
Unfortunately, I didn't experience much of a performance boost when I tried the version above. It does seem the speeds are very machine dependent. I will continue with pandas for this specific use case. Thanks all for looking.
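A further sketch, not from any of the answers above: since the question mentions filtering through many different datasets, one option (assuming each dataset is already loaded as a polars DataFrame) is to build one lazy query per dataset and evaluate them together with pl.collect_all, which lets Polars parallelise across the queries. The datasets list below is a hypothetical stand-in:
import numpy as np
import polars as pl

target_value = 0.5
# Hypothetical stand-in for "many different datasets"
datasets = [pl.DataFrame(np.random.rand(1000, 100)) for _ in range(20)]

# One lazy query per dataset, keeping every row tied for the minimal abs_diff
queries = [
    df.lazy()
    .with_columns(abs_diff=(pl.col("column_0") - target_value).abs())
    .filter(pl.col("abs_diff") == pl.col("abs_diff").min())
    for df in datasets
]

# collect_all() evaluates the queries in one go, so the work can run in parallel
results = pl.collect_all(queries)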