The documentation for monotonically_increasing_id says:

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.

So I assume there is some ordering; otherwise "increasing" has no meaning. So the question is: what does "increasing" mean here? Or is it simply a badly named unique ID?

asked Feb 20 at 14:49 by BelowZero, edited Feb 21 at 6:16 by ZygD
  • Not sure what your goal is, but any window function without a partition will move all the data to one node. – Emma Commented Feb 20 at 15:52
  • What are you actually trying to do? monotonically_increasing_id will "increase", but it's meaningless because it's non-deterministic. – Andrew Commented Feb 20 at 17:49
  • I don't care about it being deterministic. Let's say I have an ordering coming from different columns and I want to represent this ordering in one column. Then I don't care if the jump from one entry to the next is +1, +100 or anything else, as long as it's in the right order. – BelowZero Commented Feb 20 at 21:38
  • Maybe someone can explain what "increasing" means in monotonically increasing ID. If it's simply unique per row, then "unique" fits but "increasing" doesn't make sense. – BelowZero Commented Feb 20 at 21:39
  • I rephrased the question to avoid confusion. I want to know what "increasing" means in this context. – BelowZero Commented Feb 20 at 21:54

1 Answer

I have not yet come across a use case that relies on the "increasing" property of monotonically_increasing_id. I just use it to generate unique row IDs. For example, when there is no explicit ID column and I have to explode some column, I later need to be able to tell which rows came from the same original row.

You're correct that monotonically_increasing_id has an order, but you cannot really control it. The order is just the order in which rows sit within each partition (the chunk of data a node processes), with partitions numbered one after another.

For example, when we explicitly create a DataFrame (or read it from a single CSV file), we know the original order in which rows will be written into partitions. However, we don't necessarily know how many partitions we will have or where the cut-offs between them will fall.

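from pyspark.sql import SparkSession, functions as F

# Reuse the active session (e.g. in a notebook or shell) or start a new one.
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1,), (3,), (2,), (5,), (4,), (6,)]) \
    .withColumn('mono_id_1', F.monotonically_increasing_id())
df1.show()  # output below: a run where the rows were split into two partitions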
# +---+----------+
# | _1| mono_id_1|
# +---+----------+
# |  1|         0|
# |  3|         1|
# |  2|         2|
# |  5|8589934592|
# |  4|8589934593|
# |  6|8589934594|
# +---+----------+

Here you can see that the order was preserved exactly as we specified it. The ID increases monotonically within each partition (there are two partitions in this case, as the ID jumps after the first 3 rows). So it increases based purely on the row order within a partition, and each later partition starts from a higher base value.
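To make the jump in the output above concrete: the Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. A minimal pure-Python sketch, assuming that layout (the helper name split_mono_id is made up for illustration), decodes the IDs shown above:

```python
# Assumed layout (from the Spark docs for the current implementation):
# upper 31 bits = partition ID, lower 33 bits = record number in the partition.
RECORD_BITS = 33


def split_mono_id(mono_id):
    """Return (partition_id, record_number) for one generated ID."""
    partition_id = mono_id >> RECORD_BITS
    record_number = mono_id & ((1 << RECORD_BITS) - 1)
    return partition_id, record_number


# The mono_id_1 values from the output above.
ids = [0, 1, 2, 8589934592, 8589934593, 8589934594]
decoded = [split_mono_id(i) for i in ids]
print(decoded)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

So 8589934592 is simply 2**33: record 0 of partition 1. The value tells you where a row sat at the moment the ID was generated, nothing more.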

However, after transformations that involve data shuffling (joins, grouping / aggregations, distinct, window functions, etc.), your data moves from one node to another, and you can no longer control the order in which rows are written into partitions. A simple example is repartition, which also shuffles the data between nodes.

df2 = df1.repartition(3) \
    .withColumn('mono_id_2', F.monotonically_increasing_id())
df2.show()
# +---+----------+-----------+
# | _1| mono_id_1|  mono_id_2|
# +---+----------+-----------+
# |  3|         1|          0|
# |  4|8589934593|          1|
# |  1|         0| 8589934592|
# |  5|8589934592| 8589934593|
# |  2|         2|17179869184|
# |  6|8589934594|17179869185|
# +---+----------+-----------+

In this small example we could still predict how Spark will move the data, but we cannot explicitly control it. In real-world scenarios, when you have a couple of hundred executors performing a join, it is not even predictable how rows will be moved and written: some rows may be harder to process, some executors may be slower than others, and so on.
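The effect of the shuffle can be modelled without Spark. The sketch below is a simplified model, not Spark's code: it assumes the documented layout (partition index in the upper bits, row position in the lower 33 bits). The helper assign_mono_ids is hypothetical, and the partition contents are reconstructed from the two outputs above.

```python
RECORD_BITS = 33  # assumed: lower 33 bits hold the position within a partition


def assign_mono_ids(partitions):
    """Model the ID assignment: (partition index << 33) + position in partition."""
    return {
        value: (part_idx << RECORD_BITS) + pos
        for part_idx, rows in enumerate(partitions)
        for pos, value in enumerate(rows)
    }


# Two partitions of three rows each, as in the first output.
before = assign_mono_ids([[1, 3, 2], [5, 4, 6]])
# Three partitions of two rows each, as in the output after repartition(3).
after = assign_mono_ids([[3, 4], [1, 5], [2, 6]])

# Sorting the values by their ID gives a different order each time.
order_before = sorted(before, key=before.get)
order_after = sorted(after, key=after.get)
print(order_before)  # [1, 3, 2, 5, 4, 6]
print(order_after)   # [3, 4, 1, 5, 2, 6]
```

The same six values get a different ID order purely because of how they were laid out across partitions, which is why the ID reflects the physical layout at generation time and not any ordering of the data itself.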
