
I want to create a new column that contains an array of values for the column names listed in the lookup column.

Sample Input

from pyspark.sql import Row

input_df = spark.createDataFrame([
    Row(id=123, alert=1, operation=1, lookup=[]),
    Row(id=234, alert=0, operation=0, lookup=['alert']),
    Row(id=345, alert=1, operation=0, lookup=['operation']),
    Row(id=456, alert=0, operation=1, lookup=['alert', 'operation']),
])

Expected Output

id  alert operation lookup             lookup_values
123 1     1         []                 []
234 0     0         [alert]            [0]
345 1     0         [operation]        [0]
456 0     1         [alert, operation] [0, 1]

What I have tried

input_df.withColumn("lookup_values", F.transform(F.col("lookup"), lambda x: input_df[f'{x}'])).show()

Fails with the error:

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with the name Column<'x_1'> cannot be resolved. Did you mean one of the following? [id, alert, operation, lookup].

This error is surprising because the following code does not produce an error, although it does not yield the intended result:

input_df.withColumn("lookup_values", F.transform(F.col("lookup"), lambda x: input_df['alert'])).show()
id alert operation lookup lookup_values
123 1 1 [] []
234 0 0 [alert] [0]
345 1 0 [operation] [1]
456 0 1 [alert, operation] [0, 0]
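The likely reason for the error is that the lambda argument in F.transform is a Column expression for the current array element, not a Python string, so f'{x}' evaluates to the Column's repr (hence the Column<'x_1'> in the error message) rather than an actual column name. A minimal check of this, assuming the same input_df (the new column name is only for illustration, and the exact repr may vary by Spark version):

from pyspark.sql import functions as F

# The lambda receives a Column object; formatting it as a string yields its repr,
# not the element's value, so it cannot be used to index the DataFrame by name.
input_df.withColumn(
    "lambda_arg_as_string",
    F.transform(F.col("lookup"), lambda x: F.lit(f"{x}"))
).show(truncate=False)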

2 Answers

Here is an answer without a UDF, using built-in functions. It should be faster on large volumes of data:

from pyspark.sql import functions as F

# First build a map from column name to column value, then replace each name in
# `lookup` with its value taken from that map.
input_df.withColumn(
    "lookup_values",
    F.create_map(
        [F.lit("alert"), F.col("alert"), F.lit("operation"), F.col("operation")]
    ),
).withColumn(
    "lookup_values",
    F.transform(F.col("lookup"), lambda x: F.col("lookup_values")[x])
).show()

+---+-----+---------+------------------+-------------+
| id|alert|operation|            lookup|lookup_values|
+---+-----+---------+------------------+-------------+
|123|    1|        1|                []|           []|
|234|    0|        0|           [alert]|          [0]|
|345|    1|        0|       [operation]|          [0]|
|456|    0|        1|[alert, operation]|       [0, 1]|
+---+-----+---------+------------------+-------------+
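
If more columns can appear in lookup, the same idea generalizes by building the map dynamically instead of hard-coding each name/value pair. A minimal sketch, assuming every column except id and lookup is a candidate (column names taken from the example input):

from itertools import chain
from pyspark.sql import functions as F

# Candidate columns whose values may be looked up by name.
candidate_cols = [c for c in input_df.columns if c not in ("id", "lookup")]

input_df.withColumn(
    "lookup_map",
    F.create_map(*chain.from_iterable((F.lit(c), F.col(c)) for c in candidate_cols)),
).withColumn(
    "lookup_values",
    F.transform(F.col("lookup"), lambda x: F.col("lookup_map")[x]),
).drop("lookup_map").show()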

One way to do this is to pass the whole row into a UDF and build the list of values from the column names listed in the lookup column:

from pyspark.sql import functions as func
from pyspark.sql.types import ArrayType, IntegerType


@func.udf(returnType=ArrayType(IntegerType()))
def lookup_values_udf(row):
    # For each column name listed in `lookup`, pick that field's value from the row.
    return [row[field] for field in row["lookup"]]


input_df.withColumn(
    "lookup_values",
    lookup_values_udf(func.struct([func.col(col) for col in input_df.columns]))
).show(
    10, False
)

+---+-----+---------+------------------+-------------+
|id |alert|operation|lookup            |lookup_values|
+---+-----+---------+------------------+-------------+
|123|1    |1        |[]                |[]           |
|234|0    |0        |[alert]           |[0]          |
|345|1    |0        |[operation]       |[0]          |
|456|0    |1        |[alert, operation]|[0, 1]       |
+---+-----+---------+------------------+-------------+
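
If the DataFrame has many columns, the struct passed to the UDF can be limited to the columns that may appear in lookup (plus lookup itself), which keeps the serialized row small. A sketch under the same assumptions, with a hypothetical lookup_candidates list based on the example schema:

# Hypothetical list of columns that may appear in `lookup`.
lookup_candidates = ["alert", "operation"]

input_df.withColumn(
    "lookup_values",
    lookup_values_udf(func.struct(*[func.col(c) for c in lookup_candidates + ["lookup"]])),
).show(10, False)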
