admin管理员组

文章数量:1181894

Here is a link (editable) to a detailed example of what I am stuck with: there is the input, the expected result and comments on the reasoning.

Basically, let's say I have in a column range a list of 1000 elements belonging to three types of object (it would be better if it could also work with more than three types). Each type has an associated probability. I want to sort this input column range so that each element is randomly picked according to their type while keeping the original relative order with this type in the input column, and whenever all elements of a type have been used, then we do the same while ignoring the items from the used up types.

So if there were only two types, with a probability of 0.000001 for Birds and 0.999999 for Insects, then we would expect to find all the Insects in the same order as the original list at the top, and the Birds at the bottom in the same order too.

I'm not sure if this kind of sorting has a name. At first my idea was to use a modified weighed random sort, but I couldn't make it keep the original order of the list for each type.

I have then written a working solution using countif to know how many items from the current type have been selected so far, and then an indirect formula to know which cell value to look at. Sadly, it's incredibly slow (around 20 seconds...), and that's a big no no as I need it to work at a reasonable speed with around 2000 items.

Could anyone help me with a more efficient solution please?

Here is a link (editable) to a detailed example of what I am stuck with: there is the input, the expected result and comments on the reasoning.

Basically, let's say I have in a column range a list of 1000 elements belonging to three types of object (it would be better if it could also work with more than three types). Each type has an associated probability. I want to sort this input column range so that each element is randomly picked according to their type while keeping the original relative order with this type in the input column, and whenever all elements of a type have been used, then we do the same while ignoring the items from the used up types.

So if there were only two types, with a probability of 0.000001 for Birds and 0.999999 for Insects, then we would expect to find all the Insects in the same order as the original list at the top, and the Birds at the bottom in the same order too.

I'm not sure if this kind of sorting has a name. At first my idea was to use a modified weighed random sort, but I couldn't make it keep the original order of the list for each type.

I have then written a working solution using countif to know how many items from the current type have been selected so far, and then an indirect formula to know which cell value to look at. Sadly, it's incredibly slow (around 20 seconds...), and that's a big no no as I need it to work at a reasonable speed with around 2000 items.

Could anyone help me with a more efficient solution please?

Share Improve this question edited yesterday Dreamk33 asked yesterday Dreamk33Dreamk33 2471 gold badge3 silver badges13 bronze badges 0
Add a comment  | 

2 Answers 2

Reset to default 3

Added one approach in your test sheet(separate tab); Do test it out extensively to see if its holding up for varied scenarios and all...

=let(Ξ,tocol(map(K3:K27,lambda(Λ,let(Δ,lambda(x,index(xlookup(x,G3:G5,E3:E5,,-1))),chooserows(filter(B:B,C:C=Δ(Λ)),countif(Δ(K3:Λ),Δ(Λ)))))),3),
 vstack(Ξ,filter(B3:B27,isna(xmatch(B3:B27,Ξ)))))

Here's a possible solution

=ARRAYFORMULA(LET(
   INPUT.val, B3:B,
   INPUT.type, C3:C,
   PROB.type, E3:E,
   PROB.prob, F3:F,
   rand, TOCOL(K3:K, 1),
   NORMALIZE, LAMBDA(table, LET(
     type, INDEX(table,,1), 
     prob, INDEX(table,,2),
     HSTACK( 
       {0; IF(ROWS(table) = 1 ,TOCOL(,1), SCAN(, ARRAY_CONSTRAIN(prob, ROWS(prob)-1, 1) / SUM(prob), LAMBDA(a, c, a + c)))},
       type
     )
   )),
   GROUP, LAMBDA(t, LET(u, UNIQUE(TOCOL(INPUT.type, 1)), {u, COUNTIF(t, u)})),
   global_counts, GROUP(INPUT.type),
   REDUCE(TOCOL(,1), rand, LAMBDA(result, cur_rand, LET(
     local_counts, GROUP(VLOOKUP(result, {INPUT.val, INPUT.type}, 2, )),
     prob_table, NORMALIZE(
       FILTER(
         {PROB.type, PROB.prob},
         VLOOKUP(PROB.type, local_counts, 2, ) < VLOOKUP(PROB.type, global_counts, 2, )
       )
     ),
     cur_type, VLOOKUP(cur_rand, prob_table, 2),
     VSTACK(
       result,
       INDEX(
         FILTER(INPUT.val, INPUT.type = cur_type),
         1 + VLOOKUP(cur_type, local_counts, 2, )
       )
     )
   )))
 ))

How it works

First, we assign a readable name to the ranges using the LET function

=LET(
   INPUT.val, B3:B,
   INPUT.type, C3:C,
   PROB.type, E3:E,
   PROB.prob, F3:F,

Then we define a function NORMALIZE(table) that takes as input a table in the form {type, probability} and returns the normalized probability table

   NORMALIZE, LAMBDA(table, LET(
     type, INDEX(table,,1), 
     prob, INDEX(table,,2),
     HSTACK( 
       {0; IF(ROWS(table) = 1 ,TOCOL(,1), SCAN(, ARRAY_CONSTRAIN(prob, ROWS(prob)-1, 1) / SUM(prob), LAMBDA(a, c, a + c)))},
       type
     )
   ))

For example, if table is:

COUNTRY 0.6
COLOUR 0.3
NUMBER 0.1

NORMALIZE(table) is:

0 COUNTRY
0.6 COLOUR
0.9 NUMBER

If table is:

COUNTRY 0.6
NUMBER 0.1

NORMALIZE(table) is:

0 COUNTRY
0.857 NUMBER

The normalized table puts the cumulative probabilities on the left so it's easier to perform a vertical lookup.

Then we define a function GROUP(t) that takes as input an array of types (t) and for each unique type in C3:C, returns how many times it occurs in t

   GROUP, LAMBDA(t, LET(u, UNIQUE(TOCOL(INPUT.type, 1)), {u, COUNTIF(t, u)}))

For example, if t is:

COUNTRY
NUMBER

GROUP(t) is:

COUNTRY 1
COLOUR 0
NUMBER 1

We then define a variable global_counts that counts how many times each type occurs in C3:C using the GROUP function we've just defined

   global_counts, GROUP(INPUT.type)

Now, we use REDUCE to iterate over each random number in K3:K

   REDUCE(TOCOL(,1), rand, LAMBDA(result, cur_rand, 

Then we calculate the current counts using the GROUP function

     local_counts, GROUP(VLOOKUP(result, {INPUT.val, INPUT.type}, 2, ))

And the normalized probability table, while filtering out the types that were already used

     prob_table, NORMALIZE(
       FILTER(
         {PROB.type, PROB.prob},
         VLOOKUP(PROB.type, local_counts, 2, ) < VLOOKUP(PROB.type, global_counts, 2, )
       )
     ),

We can now calculate the current type associated to the current random number using a simple VLOOKUP

     cur_type, VLOOKUP(cur_rand, prob_table, 2),

Finally, we append the value associated to the current type to the result

     VSTACK(
       result,
       INDEX(
         FILTER(INPUT.val, INPUT.type = cur_type),
         1 + VLOOKUP(cur_type, local_counts, 2, )
       )
     )

本文标签: sortingHow to randomly select by type probability all items from a google sheet column rangeStack Overflow