admin管理员组文章数量:1181894
Here is a link (editable) to a detailed example of what I am stuck with: there is the input, the expected result and comments on the reasoning.
Basically, let's say I have in a column range a list of 1000 elements belonging to three types of object (it would be better if it could also work with more than three types). Each type has an associated probability. I want to sort this input column range so that each element is randomly picked according to their type while keeping the original relative order with this type in the input column, and whenever all elements of a type have been used, then we do the same while ignoring the items from the used up types.
So if there were only two types, with a probability of 0.000001 for Birds and 0.999999 for Insects, then we would expect to find all the Insects in the same order as the original list at the top, and the Birds at the bottom in the same order too.
I'm not sure if this kind of sorting has a name. At first my idea was to use a modified weighed random sort, but I couldn't make it keep the original order of the list for each type.
I have then written a working solution using countif to know how many items from the current type have been selected so far, and then an indirect formula to know which cell value to look at. Sadly, it's incredibly slow (around 20 seconds...), and that's a big no no as I need it to work at a reasonable speed with around 2000 items.
Could anyone help me with a more efficient solution please?
Here is a link (editable) to a detailed example of what I am stuck with: there is the input, the expected result and comments on the reasoning.
Basically, let's say I have in a column range a list of 1000 elements belonging to three types of object (it would be better if it could also work with more than three types). Each type has an associated probability. I want to sort this input column range so that each element is randomly picked according to their type while keeping the original relative order with this type in the input column, and whenever all elements of a type have been used, then we do the same while ignoring the items from the used up types.
So if there were only two types, with a probability of 0.000001 for Birds and 0.999999 for Insects, then we would expect to find all the Insects in the same order as the original list at the top, and the Birds at the bottom in the same order too.
I'm not sure if this kind of sorting has a name. At first my idea was to use a modified weighed random sort, but I couldn't make it keep the original order of the list for each type.
I have then written a working solution using countif to know how many items from the current type have been selected so far, and then an indirect formula to know which cell value to look at. Sadly, it's incredibly slow (around 20 seconds...), and that's a big no no as I need it to work at a reasonable speed with around 2000 items.
Could anyone help me with a more efficient solution please?
Share Improve this question edited yesterday Dreamk33 asked yesterday Dreamk33Dreamk33 2471 gold badge3 silver badges13 bronze badges 02 Answers
Reset to default 3Added one approach in your test sheet(separate tab); Do test it out extensively to see if its holding up for varied scenarios and all...
=let(Ξ,tocol(map(K3:K27,lambda(Λ,let(Δ,lambda(x,index(xlookup(x,G3:G5,E3:E5,,-1))),chooserows(filter(B:B,C:C=Δ(Λ)),countif(Δ(K3:Λ),Δ(Λ)))))),3),
vstack(Ξ,filter(B3:B27,isna(xmatch(B3:B27,Ξ)))))
Here's a possible solution
=ARRAYFORMULA(LET(
INPUT.val, B3:B,
INPUT.type, C3:C,
PROB.type, E3:E,
PROB.prob, F3:F,
rand, TOCOL(K3:K, 1),
NORMALIZE, LAMBDA(table, LET(
type, INDEX(table,,1),
prob, INDEX(table,,2),
HSTACK(
{0; IF(ROWS(table) = 1 ,TOCOL(,1), SCAN(, ARRAY_CONSTRAIN(prob, ROWS(prob)-1, 1) / SUM(prob), LAMBDA(a, c, a + c)))},
type
)
)),
GROUP, LAMBDA(t, LET(u, UNIQUE(TOCOL(INPUT.type, 1)), {u, COUNTIF(t, u)})),
global_counts, GROUP(INPUT.type),
REDUCE(TOCOL(,1), rand, LAMBDA(result, cur_rand, LET(
local_counts, GROUP(VLOOKUP(result, {INPUT.val, INPUT.type}, 2, )),
prob_table, NORMALIZE(
FILTER(
{PROB.type, PROB.prob},
VLOOKUP(PROB.type, local_counts, 2, ) < VLOOKUP(PROB.type, global_counts, 2, )
)
),
cur_type, VLOOKUP(cur_rand, prob_table, 2),
VSTACK(
result,
INDEX(
FILTER(INPUT.val, INPUT.type = cur_type),
1 + VLOOKUP(cur_type, local_counts, 2, )
)
)
)))
))
How it works
First, we assign a readable name to the ranges using the LET function
=LET(
INPUT.val, B3:B,
INPUT.type, C3:C,
PROB.type, E3:E,
PROB.prob, F3:F,
Then we define a function NORMALIZE(table)
that takes as input a table in the form {type, probability}
and returns the normalized probability table
NORMALIZE, LAMBDA(table, LET(
type, INDEX(table,,1),
prob, INDEX(table,,2),
HSTACK(
{0; IF(ROWS(table) = 1 ,TOCOL(,1), SCAN(, ARRAY_CONSTRAIN(prob, ROWS(prob)-1, 1) / SUM(prob), LAMBDA(a, c, a + c)))},
type
)
))
For example, if table
is:
COUNTRY | 0.6 |
COLOUR | 0.3 |
NUMBER | 0.1 |
NORMALIZE(table)
is:
0 | COUNTRY |
0.6 | COLOUR |
0.9 | NUMBER |
If table
is:
COUNTRY | 0.6 |
NUMBER | 0.1 |
NORMALIZE(table)
is:
0 | COUNTRY |
0.857 | NUMBER |
The normalized table puts the cumulative probabilities on the left so it's easier to perform a vertical lookup.
Then we define a function GROUP(t)
that takes as input an array of types (t
) and for each unique type in C3:C
, returns how many times it occurs in t
GROUP, LAMBDA(t, LET(u, UNIQUE(TOCOL(INPUT.type, 1)), {u, COUNTIF(t, u)}))
For example, if t
is:
COUNTRY |
NUMBER |
GROUP(t)
is:
COUNTRY | 1 |
COLOUR | 0 |
NUMBER | 1 |
We then define a variable global_counts
that counts how many times each type occurs in C3:C
using the GROUP
function we've just defined
global_counts, GROUP(INPUT.type)
Now, we use REDUCE
to iterate over each random number in K3:K
REDUCE(TOCOL(,1), rand, LAMBDA(result, cur_rand,
Then we calculate the current counts using the GROUP
function
local_counts, GROUP(VLOOKUP(result, {INPUT.val, INPUT.type}, 2, ))
And the normalized probability table, while filtering out the types that were already used
prob_table, NORMALIZE(
FILTER(
{PROB.type, PROB.prob},
VLOOKUP(PROB.type, local_counts, 2, ) < VLOOKUP(PROB.type, global_counts, 2, )
)
),
We can now calculate the current type associated to the current random number using a simple VLOOKUP
cur_type, VLOOKUP(cur_rand, prob_table, 2),
Finally, we append the value associated to the current type to the result
VSTACK(
result,
INDEX(
FILTER(INPUT.val, INPUT.type = cur_type),
1 + VLOOKUP(cur_type, local_counts, 2, )
)
)
版权声明:本文标题:sorting - How to randomly select by type probability all items from a google sheet column range - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1738197948a2068255.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论