The documentation for monotonically_increasing_id says:
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
So I assume there is some ordering, otherwise "increasing" has no meaning. So the question is: what does "increasing" mean here? Or is it simply a badly named unique ID?
asked Feb 20 at 14:49 by BelowZero, edited Feb 21 at 6:16 by ZygD
- Not sure what your goal is, but any window function without a partition will copy all the data to one node. – Emma Commented Feb 20 at 15:52
- What are you trying to actually do? monotonically_increasing_id will "increase", but it's meaningless because it's non-deterministic. – Andrew Commented Feb 20 at 17:49
- I don't care about it being deterministic. Let's say I have an ordering coming from different columns and I want to represent this ordering in one column. Then I don't care if the jump from one entry to the next is +1, +100 or anything else, as long as it's in the right order. – BelowZero Commented Feb 20 at 21:38
- Maybe someone can explain what "increasing" means in monotonically increasing id. Because if it's simply unique per row, "unique" fits but "increasing" doesn't make sense. – BelowZero Commented Feb 20 at 21:39
- I rephrased it to not get you guys confused. I want to know what "increasing" means in this context. – BelowZero Commented Feb 20 at 21:54
1 Answer
I have not yet come across a use case where we would make use of the "increasing" factor of monotonically_increasing_id. I just use it to generate unique row IDs. E.g., when there is no explicit ID column and I have to explode some column, I later need to be able to tell which rows came from a single original row.
You're correct that monotonically_increasing_id has an order. But you cannot really control it. The order is just the order in which rows are written into a partition.
E.g., when we explicitly create a dataframe (or read it from a single CSV file), we know the original order in which rows will be written into partitions. However, we don't necessarily know how many partitions we will have and where the cut-off between them will be.
from pyspark.sql import functions as F
df1 = spark.createDataFrame([(1,), (3,), (2,), (5,), (4,), (6,)]) \
.withColumn('mono_id_1', F.monotonically_increasing_id())
df1.show()
# +---+----------+
# | _1| mono_id_1|
# +---+----------+
# | 1| 0|
# | 3| 1|
# | 2| 2|
# | 5|8589934592|
# | 4|8589934593|
# | 6|8589934594|
# +---+----------+
Here you see that the order was preserved exactly as we specified it. You also see that the IDs increase monotonically within each partition (there are two partitions in this case, as we see a cut-off after the first 3 rows). So, the ID increases just based on the order of rows within a partition.
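The jump from 2 to 8589934592 in the output above is consistent with what the Spark documentation describes for the current implementation: the partition index sits in the upper 31 bits of the ID and the record number within the partition in the lower 33 bits. A minimal pure-Python sketch of decoding the IDs shown above follows. The helper name `decode_mono_id` is hypothetical (it is not a Spark API), and the bit layout is an implementation detail that may change between versions.

```python
# Bit layout as described in the Spark docs for the current implementation:
# upper 31 bits = partition index, lower 33 bits = record number in the partition.
ROW_BITS = 33

def decode_mono_id(mono_id):
    """Split an ID into (partition_index, row_offset). Hypothetical helper."""
    return mono_id >> ROW_BITS, mono_id & ((1 << ROW_BITS) - 1)

# IDs copied from the mono_id_1 column above.
for mono_id in [0, 1, 2, 8589934592, 8589934593, 8589934594]:
    print(mono_id, decode_mono_id(mono_id))
```

Under that assumption, 8589934592 is simply row 0 of partition 1, which is why the IDs are increasing and unique but not consecutive.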
However, after a few transformations that involve data shuffling (joins, grouping / aggregations, distinct, window, etc.), your data will move from one partition to another, and you can no longer control the order in which rows will be written into partitions. A simple example is repartition, which also shuffles the data between partitions.
df2 = df1.repartition(3) \
.withColumn('mono_id_2', F.monotonically_increasing_id())
df2.show()
# +---+----------+-----------+
# | _1| mono_id_1| mono_id_2|
# +---+----------+-----------+
# | 3| 1| 0|
# | 4|8589934593| 1|
# | 1| 0| 8589934592|
# | 5|8589934592| 8589934593|
# | 2| 2|17179869184|
# | 6|8589934594|17179869185|
# +---+----------+-----------+
In this example data, we can predict how Spark will move data. But you cannot explicitly control it. And in real-world scenarios, when you have a couple of hundred executors performing a join, it is not even predictable how rows will be moved and written. Some rows may be more difficult to process, some executors may not be performing well, and so on.
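To see why the IDs follow the partition layout rather than the data, the assignment can be imitated in plain Python. This is only a sketch, assuming each ID is `(partition_index << 33) + row_offset` as the Spark documentation describes for the current implementation. `assign_mono_ids` is a hypothetical helper, and the partition contents are read off the two outputs above.

```python
def assign_mono_ids(partitions):
    """Imitate the ID assignment: partition index in the upper bits,
    row offset within the partition in the lower 33 bits."""
    return [
        (value, (part_idx << 33) + offset)
        for part_idx, rows in enumerate(partitions)
        for offset, value in enumerate(rows)
    ]

# Layout of df1: two partitions holding the rows in their original order.
before = assign_mono_ids([[1, 3, 2], [5, 4, 6]])
# Layout after repartition(3), as read off the df2 output.
after = assign_mono_ids([[3, 4], [1, 5], [2, 6]])

print(before)
print(after)
```

The same six values receive different IDs purely because they sit in different partitions; nothing about the values themselves influences the result.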