Why is this polars filtering so much slower than my pandas equivalent?
I'm trying a function in polars, and it is significantly slower than my pandas equivalent.
My pandas function is the following:
import pandas as pd
import time
import numpy as np
target_value = 0.5
data = np.random.rand(1000,100)
df = pd.DataFrame(data)
run_times = []
for i in range(100):
    st = time.perf_counter()
    df_filtered = df.loc[(df[0] - target_value).abs() == (df[0] - target_value).abs().min()]
    run_time = time.perf_counter() - st
    run_times.append(run_time)
print(f"avg pandas run: {sum(run_times)/len(run_times)}")
and my polars version is the following:
import polars as pl
import time
import numpy as np
target_value = 0.5
data = np.random.rand(1000,100)
df = pl.DataFrame(data)
run_times = []
for i in range(100):
    st = time.perf_counter()
    df = df.with_columns(abs_diff = (pl.col('column_0') - target_value).abs())
    df_filtered = df.filter(pl.col('abs_diff') == df['abs_diff'].min())
    run_time = time.perf_counter() - st
    run_times.append(run_time)
print(f"avg polars run: {sum(run_times)/len(run_times)}")
My real datasets have 1,000 to 10,000 rows and 100 columns, and I need to filter through many different datasets. For one example of df shape (1_000, 100), my pandas version is several times faster (0.0006 s for pandas vs 0.0037 s for polars), which was unexpected. Is there a more efficient way to write my polars query? Or is it simply expected for pandas to outperform polars on datasets of this size?
One thing to note: when I test with 2 columns, polars is faster, and the more columns I add, the slower polars becomes. On the other hand, polars begins to outperform pandas at about 500,000 rows by 100 columns.
Additionally, in my real use case, I would need to return multiple rows that match the closest value.
Not sure if this is important, but for additional context: I'm running Python on a Linux server.
- What version of polars are you using? I cannot reproduce the difference in speed. – Hericks Commented Feb 11 at 8:41
- I'm using polars 1.22.0 – Raymond Han Commented Feb 11 at 17:00
- Is there any difference if you use LazyFrames? – jqurious Commented Feb 13 at 15:22
3 Answers
Testing your "function" with pandas, polars, and numpy:
import pandas as pd
import time
import numpy as np
import polars as pl

def test(func, argument):
    run_times = []
    for i in range(100):
        st = time.perf_counter()
        df = func(argument)
        run_time = time.perf_counter() - st
        run_times.append(run_time)
    return np.mean(run_times)

def f_pandas(df):
    min_abs_diff = (df[0] - target_value).abs().min()
    return df.loc[(df[0] - target_value).abs() == min_abs_diff]

def f_pandas_vectorized(df):
    return df.loc[(df[0] - target_value).abs().idxmin()]

def f_polars(df):
    min_abs_diff = (df["column_0"] - target_value).abs().min()
    return df.filter((df["column_0"] - target_value).abs() == min_abs_diff)

def f_numpy(data):
    abs_diff = np.abs(data[:, 0] - target_value)
    min_idx = np.argmin(abs_diff)
    return pd.DataFrame(data[[min_idx]])

target_value = 0.5
data = np.random.rand(100000, 1000)
df = pd.DataFrame(data)
df_pl = pl.DataFrame(data)

print(f"average pandas runtime: {test(f_pandas, df)}")
print(f"average pandas runtime with idxmin(): {test(f_pandas_vectorized, df)}")
print(f"average polars runtime: {test(f_polars, df_pl)}")
print(f"average numpy runtime: {test(f_numpy, data)}")
I got these results running in a Jupyter Notebook on a Linux machine (three separate runs):
average pandas runtime: 0.00989325414002451
average pandas runtime with idxmin(): 0.005005129760029377
average polars runtime: 0.006758741329904296
average numpy runtime: 0.004175669220221607
average pandas runtime: 0.009967705049803044
average pandas runtime with idxmin(): 0.005097740050114225
average polars runtime: 0.006972378070222476
average numpy runtime: 0.004102102290034964
average pandas runtime: 0.010020545769948512
average pandas runtime with idxmin(): 0.004993948210048984
average polars runtime: 0.007027968560159934
average numpy runtime: 0.004024256040174805
You can see that polars is faster than your pandas code, but using a vectorized operation like idxmin() in pandas is, at least in this case, better than polars. numpy is often faster still for this type of numerical work.
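As a side note (not part of the answer above): f_numpy returns only a single row via argmin, while the question mentions needing every row that ties for the closest value. A minimal mask-based sketch, assuming the same target_value and benchmark harness as above (the name f_numpy_all_ties is just illustrative):
import numpy as np
import pandas as pd

target_value = 0.5

def f_numpy_all_ties(data):
    # Distance of the first column from the target value
    abs_diff = np.abs(data[:, 0] - target_value)
    # Keep every row whose distance equals the minimum, not just the first match
    return pd.DataFrame(data[abs_diff == abs_diff.min()])

# Usage with the benchmark harness above:
# print(f"average numpy runtime (all ties): {test(f_numpy_all_ties, data)}")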
You could optimize your polars query a bit, especially by using expressions instead of df["col"]. You can gain even more if you don't mind getting only one row out of the query instead of all the rows tying for the minimum.
import polars as pl
import time
import numpy as np

target_value = 0.5
data = np.random.rand(1000, 100)
df = pl.DataFrame(data)
run_times = []
for i in range(100):
    st = time.time()
    abs_diff = (pl.col('column_0') - target_value).abs()
    # Option A - keep original behaviour but just better optimized
    # df_filtered = df.filter(abs_diff == abs_diff.min())
    # Option B - only get the row at the minimum's index instead of filtering
    df_filtered = df.row(df.select(abs_diff.arg_min()).item())
    run_time = time.time() - st
    run_times.append(run_time)
print(f"avg polars run: {sum(run_times)/len(run_times)}")
As others have said, numpy (or jax, etc.) may well be better suited for this kind of work, though.
Per Polars Support:
First, you need way more iterations than just 100 for such a small time window. With 10,000 iterations I get the following:
avg polars run: 0.0005123567976988852
avg pandas run: 0.00012923809615895151
But we can rewrite the polars query to be more efficient:
df_filtered = (
    df.lazy()
    .with_columns(abs_diff=(pl.col.column_0 - target_value).abs())
    .filter(pl.col.abs_diff == pl.col.abs_diff.min())
    .collect()
)
Then we get:
avg polars run: 0.00018435594723559915
Ultimately, Polars isn't optimized for doing many tiny, horizontally wide datasets though.
Unfortunately, I didn't experience much of a performance boost when I tried the version above. It does seem the speeds are very machine dependent. I will continue with pandas for this specific use case. Thanks all for looking.
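A further sketch, not from any of the answers above: since the question mentions filtering through many different datasets, one option (assuming each dataset is already loaded as a polars DataFrame) is to build one lazy query per dataset and evaluate them together with pl.collect_all, which lets Polars parallelise across the queries. The datasets list below is a hypothetical stand-in:
import numpy as np
import polars as pl

target_value = 0.5
# Hypothetical stand-in for "many different datasets"
datasets = [pl.DataFrame(np.random.rand(1000, 100)) for _ in range(20)]

# One lazy query per dataset, keeping every row tied for the minimal abs_diff
queries = [
    df.lazy()
    .with_columns(abs_diff=(pl.col("column_0") - target_value).abs())
    .filter(pl.col("abs_diff") == pl.col("abs_diff").min())
    for df in datasets
]

# collect_all() evaluates the queries in one go, so the work can run in parallel
results = pl.collect_all(queries)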