Suppose I have a dataframe with a column composed of arrays of the same length and another column with a numerical category to group over, like in this example:
df1 = spark.createDataFrame([
    ([1.5, 2.5], 1),
    ([2.5, 3.5], 1),
    ([3.5, 4.5], 1),
], schema='v1 array<double>, c long')
+----------+---+
| v1| c|
+----------+---+
|[1.5, 2.5]| 1|
|[2.5, 3.5]| 1|
|[3.5, 4.5]| 1|
+----------+---+
If I simply try to group by c and take the average of v1 like this:
df1.groupBy('c').avg('v1')
I get an error complaining that the values of the column in question are not numeric. The same thing happens when I first convert the arrays to vectors with pyspark.ml.functions.array_to_vector, even though vectors have a well-defined sum.
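For reference, the vector attempt looks roughly like this (a sketch; array_to_vector is only available in Spark 3.1+):
from pyspark.ml.functions import array_to_vector

# Converting to ML vectors does not help: avg() still rejects the non-numeric column
df1.withColumn('v1', array_to_vector('v1')).groupBy('c').avg('v1')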
What is the right way to do this? This is the desired result from the example:
+----------+---+
| v1| c|
+----------+---+
|[2.5, 3.5]| 1|
+----------+---+
1 Answer
A clean solution that should work for vectors of any length:
from pyspark.sql import functions as F

# Explode the array column so each element (along with its position in the array)
# ends up in a separate row
exploded = df1.select(F.col("c"), F.posexplode("v1").alias("pos", "value"))
exploded.show(truncate=False)
+---+---+-----+
|c |pos|value|
+---+---+-----+
|1 |0 |1.5 |
|1 |1 |2.5 |
|1 |0 |2.5 |
|1 |1 |3.5 |
|1 |0 |3.5 |
|1 |1 |4.5 |
+---+---+-----+
# group by both c and the position
avg = exploded.groupBy("c", "pos").agg(F.avg("value").alias("avg_value"))
avg.show(truncate=False)
+---+---+---------+
|c |pos|avg_value|
+---+---+---------+
|1 |1 |3.5 |
|1 |0 |2.5 |
+---+---+---------+
# Group by c, collect the (pos, avg_value) pairs as structs with collect_list,
# sort each list by pos, and keep just avg_value from each struct
result = avg.groupBy("c").agg(
    F.expr("transform(sort_array(collect_list(struct(pos, avg_value))), x -> x.avg_value)").alias("v1")
)
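For the example data this should give back the desired row:
result.show(truncate=False)
+---+----------+
|c  |v1        |
+---+----------+
|1  |[2.5, 3.5]|
+---+----------+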
The last step became a bit complex because I don't think collect_list guarantees any ordering, and preserving the element order is important for keeping the vector's dimensions intact.
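Putting the steps together, a self-contained sketch (assuming a SparkSession named spark, as in the question):
from pyspark.sql import functions as F

# One chained pipeline: explode, average per (c, pos), then reassemble the arrays in order
result = (
    df1.select("c", F.posexplode("v1").alias("pos", "value"))
       .groupBy("c", "pos").agg(F.avg("value").alias("avg_value"))
       .groupBy("c")
       .agg(F.expr(
           "transform(sort_array(collect_list(struct(pos, avg_value))), x -> x.avg_value)"
       ).alias("v1"))
)
result.show(truncate=False)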
df1.groupby('c').agg(F.expr("array(avg(v1[0]), avg(v1[1])) as v1")) – lihao Commented Mar 27 at 2:26