Polars Python: Filter list column using a boolean list column, but keeping list size - Stack Overflow


I would like to select elements from a list-dtype column using another boolean list column while keeping the original size of the list (as opposed to this solution).

Starting from this dataframe:

import polars as pl

df = pl.DataFrame({
    'identity_vector': [[True, False], [False, True]],
    'string_vector': [['name1', 'name2'], ['name3', 'name4']]
})

shape: (2, 2)
┌─────────────────┬────────────────────┐
│ identity_vector ┆ string_vector      │
│ ---             ┆ ---                │
│ list[bool]      ┆ list[str]          │
╞═════════════════╪════════════════════╡
│ [true, false]   ┆ ["name1", "name2"] │
│ [false, true]   ┆ ["name3", "name4"] │
└─────────────────┴────────────────────┘

The objective is to get this output:

shape: (2, 3)
┌─────────────────┬────────────────────┬──────────────────┐
│ identity_vector ┆ string_vector      ┆ filtered_strings │
│ ---             ┆ ---                ┆ ---              │
│ list[bool]      ┆ list[str]          ┆ list[str]        │
╞═════════════════╪════════════════════╪══════════════════╡
│ [true, false]   ┆ ["name1", "name2"] ┆ ["name1", null]  │
│ [false, true]   ┆ ["name3", "name4"] ┆ [null, "name4"]  │
└─────────────────┴────────────────────┴──────────────────┘

I can get this with the block of code below, which uses map_elements, but that solution is sub-optimal for performance reasons:

df.with_columns(
    filtered_strings=pl.struct(["string_vector", "identity_vector"]).map_elements(
        lambda row: [s if keep else None for s, keep in zip(row["string_vector"], row["identity_vector"])]
    )
)

Do you have any suggestions on how to improve the performance of this process?


asked Nov 22, 2024 at 11:05 by yz_jc

It may be worth asking if this would be considered as an API addition? I modified the plugin example as a test: marcogorelli.github.io/polars-plugins-tutorial/… (to give the index of true values and null for false) - it was ~10x faster than list.eval. Passing to list.gather was still expensive, but I imagine a proper native implementation could bypass that. – jqurious Commented Nov 23, 2024 at 16:11
Thanks! Sadly, I cannot install Rust on my machine due to limitations at the org level, but I will open the request! – yz_jc Commented Nov 25, 2024 at 12:57


1 Answer


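The fairly standard route is pl.Expr.explode(), compute on the flattened values, then pl.Expr.implode(). The .over(pl.int_range(pl.len())) call uses the row index as the window, so each row is imploded back into its own list: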

df.with_columns(
    pl.when(
        pl.col.identity_vector.explode()
    ).then(
        pl.col.string_vector.explode()
    ).otherwise(None)
    .implode()
    .over(pl.int_range(pl.len()))
    .alias("filtered_strings")
)

shape: (2, 3)
┌─────────────────┬────────────────────┬──────────────────┐
│ identity_vector ┆ string_vector      ┆ filtered_strings │
│ ---             ┆ ---                ┆ ---              │
│ list[bool]      ┆ list[str]          ┆ list[str]        │
╞═════════════════╪════════════════════╪══════════════════╡
│ [true, false]   ┆ ["name1", "name2"] ┆ ["name1", null]  │
│ [false, true]   ┆ ["name3", "name4"] ┆ [null, "name4"]  │
└─────────────────┴────────────────────┴──────────────────┘
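
To make the data flow of the explode / calculate / implode route easier to follow, here is a minimal plain-Python sketch of the same idea. It is only a model of the technique, not how Polars implements it: the helper names explode and implode below are hypothetical stand-ins, and the sketch assumes every row's list is non-empty and that both columns have matching list lengths.

```python
from itertools import groupby

def explode(rows):
    """Flatten a list column into (row_index, value) pairs."""
    return [(i, v) for i, row in enumerate(rows) for v in row]

def implode(pairs):
    """Regroup consecutive (row_index, value) pairs into one list per row."""
    return [[v for _, v in grp] for _, grp in groupby(pairs, key=lambda p: p[0])]

masks = [[True, False], [False, True]]
strings = [["name1", "name2"], ["name3", "name4"]]

flat_masks = explode(masks)
flat_strings = explode(strings)

# Element-wise "when / then / otherwise(None)" on the flattened values.
masked = [
    (i, s if keep else None)
    for (i, s), (_, keep) in zip(flat_strings, flat_masks)
]

# Grouping by the original row index plays the role of the window in .over(...).
result = implode(masked)
print(result)  # [['name1', None], [None, 'name4']]
```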

There are other possible approaches as well, for example using pl.Expr.list.eval() together with pl.Expr.list.gather():

df.with_columns(
    pl.col.string_vector.list.gather(
        pl.col.identity_vector.list.eval(
            pl.when(pl.element()).then(pl.int_range(pl.len()))
        )
    ).alias("filtered_strings")
)

shape: (2, 3)
┌─────────────────┬────────────────────┬──────────────────┐
│ identity_vector ┆ string_vector      ┆ filtered_strings │
│ ---             ┆ ---                ┆ ---              │
│ list[bool]      ┆ list[str]          ┆ list[str]        │
╞═════════════════╪════════════════════╪══════════════════╡
│ [true, false]   ┆ ["name1", "name2"] ┆ ["name1", null]  │
│ [false, true]   ┆ ["name3", "name4"] ┆ [null, "name4"]  │
└─────────────────┴────────────────────┴──────────────────┘
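
The list.eval() / list.gather() variant works in two steps: the boolean list is first turned into a list of positions (the index where the flag is true, null elsewhere), and those positions are then used to pick values, with a null position yielding a null value. A rough plain-Python sketch of those two steps follows; true_indices and gather are hypothetical helper names used for illustration only and are not Polars functions.

```python
def true_indices(mask):
    """Return the position for each True flag and None for each False flag."""
    return [i if keep else None for i, keep in enumerate(mask)]

def gather(values, indices):
    """Pick values by position; a None position produces a None value."""
    return [None if i is None else values[i] for i in indices]

masks = [[True, False], [False, True]]
strings = [["name1", "name2"], ["name3", "name4"]]

# Step 1: per-row index lists, e.g. [0, None] and [None, 1].
index_lists = [true_indices(m) for m in masks]

# Step 2: gather from each row's strings using that row's index list.
filtered = [gather(s, idx) for s, idx in zip(strings, index_lists)]
print(filtered)  # [['name1', None], [None, 'name4']]
```

Because the index list always has the same length as the mask, the output list keeps the original size.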

Or, if the lists have a known and relatively small length, you can build one expression per list index with pl.Expr.list.get() and combine them with pl.concat_list():

n = 2  # known length of every list
df.with_columns(
    filtered_strings = pl.concat_list(
        pl.when(
            pl.col.identity_vector.list.get(i)
        ).then(
            pl.col.string_vector.list.get(i)
        )
        for i in range(n)
    )
)

shape: (2, 3)
┌─────────────────┬────────────────────┬──────────────────┐
│ identity_vector ┆ string_vector      ┆ filtered_strings │
│ ---             ┆ ---                ┆ ---              │
│ list[bool]      ┆ list[str]          ┆ list[str]        │
╞═════════════════╪════════════════════╪══════════════════╡
│ [true, false]   ┆ ["name1", "name2"] ┆ ["name1", null]  │
│ [false, true]   ┆ ["name3", "name4"] ┆ [null, "name4"]  │
└─────────────────┴────────────────────┴──────────────────┘

All three solutions use pl.when() to set a value to null when the condition is not met.
