Polars, python, how to change the number of conditions inputted when making a new column - Stack Overflow

Updated: 2025-01-07

I have large datasets (ranging from 100k to 4 million rows) in which I look for relevant codes across multiple columns. For example, to flag each row where any of those columns holds a string starting with a given code (such as '302'), I would do:

import polars as pl

df = pl.DataFrame({
'Codes_1': ['302E513', '301E513', '302E512'],
'Codes_2': ['303E513', '306E510', '302E512']}).lazy()

conditions = ['302E513', '306E510']
column_names = ['Codes_1', 'Codes_2']

#create new column
df = df.with_columns(
   pl.when(pl.any_horizontal( 
       pl.col(column_names).str.starts_with(conditions[0]),
       pl.col(column_names).str.starts_with(conditions[1])))
   .then(1.0)
   .otherwise(0.0)
   .alias('Column_name')
)

This gets tedious when I am looking for, say, 4 codes instead of 2, because I have to type out one line per code to form my new column:

import polars as pl

df = pl.DataFrame({
'Codes_1': ['302E513', '301E513', '302E512'],
'Codes_2': ['303E513', '306E510', '302E512']}).lazy()

conditions = ['302E513', '306E510', '5164E23', '302E514']
column_names = ['Codes_1', 'Codes_2']

#create new column
df = df.with_columns(
   pl.when(pl.any_horizontal(
       #Tedious part 
       pl.col(column_names).str.starts_with(conditions[0]),
       pl.col(column_names).str.starts_with(conditions[1]),
       pl.col(column_names).str.starts_with(conditions[2]),
       pl.col(column_names).str.starts_with(conditions[3])
))
   .then(1.0)
   .otherwise(0.0)
   .alias('Column_name')
)

I know that this can be done in pandas by updating a mask in a for loop:

import pandas as pd

df = pd.DataFrame({
'Codes_1': ['302E513', '301E513', '302E512'],
'Codes_2': ['303E513', '306E510', '302E512']})

conditions = ['302E513', '306E510']
column_names = ['Codes_1', 'Codes_2']

#loop to create new column
mask = False
for code in conditions:
   mask |= df[column_names].eq(code).any(axis=1)

df['Column_name'] = 0.0
df.loc[mask, 'Column_name'] = 1.0
print(df['Column_name'])

With the pandas version I can change the number of conditions to any number and the code still runs. However, I would much rather use Polars, as it is faster and does not exhaust the RAM on my machine for larger datasets. Any help is appreciated.

Asked Dec 3, 2024 at 18:24 by zebra_zach_12345; edited Dec 3, 2024 at 19:43 by jqurious

1 Answer

You could replace the multiple str.starts_with calls with a single regex passed to str.contains:

df.with_columns(
   pl.when(pl.any_horizontal(
       pl.col(column_names).str.contains(f"^({'|'.join(conditions)})"),
))
   .then(1.0)
   .otherwise(0.0)
   .alias('Column_name')
)

Or loop over the conditions with a generator expression:

df.with_columns(
   pl.when(pl.any_horizontal(
       pl.col(column_names).str.starts_with(c)
        for c in conditions
))
   .then(1.0)
   .otherwise(0.0)
   .alias('Column_name')
)

Intermediate:

# f"^({'|'.join(conditions)})"
'^(302E513|306E510|5164E23|302E514)'
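One caveat with the joined pattern: it is built by plain string concatenation, so a code containing a regex metacharacter (a dot, a plus sign, a parenthesis) would change what the pattern matches. The following is a minimal sketch of one way to guard against that, assuming Python's re.escape produces escapes that the regex engine behind str.contains also accepts. The helper name build_prefix_pattern and the code '30.E5' are hypothetical and not part of the original answer. The codes in the question are purely alphanumeric, so for them the result is identical to the pattern above.

```python
import re


def build_prefix_pattern(prefixes):
    """Join prefixes into one anchored alternation, escaping regex metacharacters.

    Hypothetical helper for illustration; not from the original answer.
    """
    return "^(" + "|".join(re.escape(p) for p in prefixes) + ")"


conditions = ['302E513', '306E510', '5164E23', '302E514']
pattern = build_prefix_pattern(conditions)
print(pattern)  # ^(302E513|306E510|5164E23|302E514)

# A made-up code containing a dot: unescaped, "." would match any character.
risky = build_prefix_pattern(['30.E5'])
print(risky)  # ^(30\.E5)

# Checked here with Python's own re module only (an assumption-free sanity check).
print(bool(re.match(risky, '30.E513')))  # True
print(bool(re.match(risky, '302E513')))  # False
```

The resulting string can be passed to str.contains in place of the f-string. If the codes may contain unusual characters, the str.starts_with variant avoids the question entirely because it compares literal text.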

Output (non-lazy):

┌─────────┬─────────┬─────────────┐
│ Codes_1 ┆ Codes_2 ┆ Column_name │
│ ---     ┆ ---     ┆ ---         │
│ str     ┆ str     ┆ f64         │
╞═════════╪═════════╪═════════════╡
│ 302E513 ┆ 303E513 ┆ 1.0         │
│ 301E513 ┆ 306E510 ┆ 1.0         │
│ 302E512 ┆ 302E512 ┆ 0.0         │
└─────────┴─────────┴─────────────┘
