How to average over vectors/arrays in PySpark

Suppose I have a dataframe with a column composed of arrays of the same length and another column with a numerical category to group over, as in this example:

df1 = spark.createDataFrame([
    ([1.5, 2.5],1),
    ([2.5, 3.5],1),
    ([3.5, 4.5],1),
], schema='v1 array<double>, c long')
+----------+---+
|        v1|  c|
+----------+---+
|[1.5, 2.5]|  1|
|[2.5, 3.5]|  1|
|[3.5, 4.5]|  1|
+----------+---+

If I simply try to group by c and take the average of v1 like this:

df1.groupBy('c').avg('v1')

I get an error complaining that the values of the column in question are not numeric. This also seems to happen when I try to convert the arrays to vectors using pyspark.ml.functions.array_to_vector, even though vectors have a well-defined sum.
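For reference, a minimal sketch of the failing attempts (assuming an active SparkSession named spark and Spark 3.1+ for array_to_vector; the error messages are paraphrased):

from pyspark.ml.functions import array_to_vector

# Averaging the array column directly fails, since avg() only accepts numeric types
# df1.groupBy('c').avg('v1')   # AnalysisException: "v1" is not a numeric column

# Converting the arrays to ML vectors does not help either: avg() still sees a
# non-numeric (vector) column
df_vec = df1.withColumn('v1', array_to_vector('v1'))
# df_vec.groupBy('c').avg('v1')   # same kind of error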

What is the right way to do this? This is the desired result from the example:

+----------+---+
|        v1|  c|
+----------+---+
|[2.5, 3.5]|  1|
+----------+---+

asked Mar 26 at 13:51 by Ícaro Lorran, edited Mar 26 at 16:52
  • how do you want your end result to look like? – samkart Commented Mar 26 at 16:50
  • try: df1.groupby('c').agg(F.expr("array(avg(v1[0]), avg(v1[1])) as v1")). – lihao Commented Mar 27 at 2:26
  • That works, but won't work if the vectors aren't size = 2 :) – Sachin Hosmani Commented Mar 29 at 23:46
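Building on lihao's comment, here is a hedged sketch that generalizes the per-index average to any known, fixed array length (n = 2 is an assumption taken from the example; the length must be known up front):

from pyspark.sql import functions as F

n = 2  # assumed: the known, fixed length of the arrays in v1
df1.groupBy('c').agg(
    F.array(*[F.avg(F.col('v1')[i]) for i in range(n)]).alias('v1')
).show()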

1 Answer


A clean solution that should work for vectors of any length:

from pyspark.sql import functions as F

# Explode the array column so that each element, along with its position
# in the array, ends up in its own row
exploded = df1.select(F.col("c"), F.posexplode("v1").alias("pos", "value"))
exploded.show(truncate=False)

+---+---+-----+
|c  |pos|value|
+---+---+-----+
|1  |0  |1.5  |
|1  |1  |2.5  |
|1  |0  |2.5  |
|1  |1  |3.5  |
|1  |0  |3.5  |
|1  |1  |4.5  |
+---+---+-----+
# group by both c and the position
avg = exploded.groupBy("c", "pos").agg(F.avg("value").alias("avg_value"))
avg.show(truncate=False)


+---+---+---------+
|c  |pos|avg_value|
+---+---+---------+
|1  |1  |3.5      |
|1  |0  |2.5      |
+---+---+---------+
# Group by c, collect the (pos, avg_value) structs with collect_list,
# sort each list by pos, then keep just the avg_value from each struct
result = avg.groupBy("c").agg(
    F.expr("transform(sort_array(collect_list(struct(pos, avg_value))), x -> x.avg_value)").alias("v1")
)
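Displaying result should reproduce the desired output from the question:

result.show(truncate=False)

+---+----------+
|c  |v1        |
+---+----------+
|1  |[2.5, 3.5]|
+---+----------+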

The last step is a bit involved because collect_list does not guarantee any ordering, so the (pos, avg_value) structs have to be sorted by position to keep the vector's dimensions intact.
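If you do want to go through ML vectors as mentioned in the question, a possible alternative (an untested sketch, assuming Spark 3.1+ where array_to_vector and vector_to_array are available) is to aggregate with pyspark.ml.stat.Summarizer, which can average vector columns, and then convert back to an array:

from pyspark.ml.functions import array_to_vector, vector_to_array
from pyspark.ml.stat import Summarizer

result_vec = (
    df1.withColumn("v1", array_to_vector("v1"))         # array<double> -> ML vector
       .groupBy("c")
       .agg(Summarizer.mean(F.col("v1")).alias("v1"))    # element-wise mean of vectors
       .withColumn("v1", vector_to_array("v1"))          # ML vector -> array<double>
)

F here is the pyspark.sql.functions import used in the answer above.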
