I want to create a new column that contains an array of values for the column names listed in the lookup column.
Sample Input
from pyspark.sql import Row

input_df = spark.createDataFrame([
    Row(id=123, alert=1, operation=1, lookup=[]),
    Row(id=234, alert=0, operation=0, lookup=['alert']),
    Row(id=345, alert=1, operation=0, lookup=['operation']),
    Row(id=456, alert=0, operation=1, lookup=['alert', 'operation']),
])
Expected Output
id | alert | operation | lookup | lookup_values |
---|---|---|---|---|
123 | 1 | 1 | [] | [] |
234 | 0 | 0 | [alert] | [0] |
345 | 1 | 0 | [operation] | [0] |
456 | 0 | 1 | [alert, operation] | [0, 1] |
What I have tried
input_df.withColumn("lookup_values", F.transform(F.col("lookup"), lambda x: input_df[f'{x}'])).show()
Fails with the error:
AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with the name Column<'x_1'> cannot be resolved. Did you mean one of the following? [id, alert, operation, lookup].
This error is surprising because the following code does not produce an error, although it does not yield the intended result:
input_df.withColumn("lookup_values", F.transform(F.col("lookup"), lambda x: input_df['alert'])).show()
id | alert | operation | lookup | lookup_values |
---|---|---|---|---|
123 | 1 | 1 | [] | [] |
234 | 0 | 0 | [alert] | [0] |
345 | 1 | 0 | [operation] | [1] |
456 | 0 | 1 | [alert, operation] | [0, 0] |
2 Answers
Here is an answer without a UDF, using only built-in functions. It should be faster on large volumes of data:
from pyspark.sql import functions as F

input_df.withColumn(
    "lookup_values",
    # Build a column-name -> value map from the flag columns.
    F.create_map(
        [F.lit("alert"), F.col("alert"), F.lit("operation"), F.col("operation")]
    ),
).withColumn(
    "lookup_values",
    # Replace each name in `lookup` with its value from the map.
    F.transform(F.col("lookup"), lambda x: F.col("lookup_values")[x]),
).show(truncate=False)
+---+-----+---------+------------------+-------------+
| id|alert|operation| lookup|lookup_values|
+---+-----+---------+------------------+-------------+
|123| 1| 1| []| []|
|234| 0| 0| [alert]| [0]|
|345| 1| 0| [operation]| [0]|
|456| 0| 1|[alert, operation]| [0, 1]|
+---+-----+---------+------------------+-------------+
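The map above hard-codes the two flag columns. As a hedged generalization (not part of the original answer), the same map can be built from every column except id and lookup, so newly added flag columns are picked up automatically; treating all remaining columns as lookup targets is an assumption here:

from itertools import chain
from pyspark.sql import functions as F

# Assumption: every column other than `id` and `lookup` may be referenced by name in `lookup`.
value_cols = [c for c in input_df.columns if c not in ("id", "lookup")]
name_to_value = F.create_map(
    *chain.from_iterable((F.lit(c), F.col(c)) for c in value_cols)
)

input_df.withColumn(
    "lookup_values",
    F.transform(F.col("lookup"), lambda x: name_to_value[x]),
).show(truncate=False)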
One way to do that is to pass the whole row into a UDF, and put the lookup value into the list based on the lookup column:
from pyspark.sql import functions as func
from pyspark.sql.types import ArrayType, IntegerType

@func.udf(returnType=ArrayType(IntegerType()))
def lookup_values_udf(row):
    # For each column name in the row's `lookup` array, pick that field's value.
    return [row[field] for field in row["lookup"]]

input_df.withColumn(
    "lookup_values",
    # Pass the whole row into the UDF as a struct.
    lookup_values_udf(func.struct([func.col(col) for col in input_df.columns]))
).show(10, False)
+---+-----+---------+------------------+-------------+
|id |alert|operation|lookup |lookup_values|
+---+-----+---------+------------------+-------------+
|123|1 |1 |[] |[] |
|234|0 |0 |[alert] |[0] |
|345|1 |0 |[operation] |[0] |
|456|0 |1 |[alert, operation]|[0, 1] |
+---+-----+---------+------------------+-------------+
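A hedged caveat the original answer does not cover: if lookup can be null rather than an empty array, the list comprehension above raises on None. A small guard keeps the UDF from failing; mapping a null lookup to an empty list is an assumption of mine:

# Assumption: a null `lookup` array should yield an empty result list.
@func.udf(returnType=ArrayType(IntegerType()))
def lookup_values_safe_udf(row):
    return [row[field] for field in (row["lookup"] or [])]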