How can I create a new grp variable that divides row into groups of n unique row values each, within type? For example, with n = 4, the first four distinct values of row (each possibly repeated) form grp 1, the next four distinct values form grp 2, and so on. Ideally this would be a function that lets n be changed as desired, though that is not essential.

Within type, row is always in ascending order but not necessarily consecutive, and the number of repetitions of each value varies randomly.

Edit: in addition, to guard against the case where the number of unique row values is not an exact multiple of n for a given type (see the comments on the answers), grp should ideally be NA for that entire type.

Note: this is a small example; my real data has thousands of rows and types, with many more repetitions.

Initial data:

dat0 <-
structure(list(type = c("a", "a", "a", "a", "a", "a", "a", "a", 
"a", "a", "a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", 
"b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", 
"b", "b", "b", "b", "b", "b", "b", "b", "b", "b"), row = c(5, 
5, 6, 8, 8, 8, 10, 11, 11, 11, 13, 13, 14, 14, 18, 18, 18, 3, 
4, 4, 4, 6, 6, 7, 7, 7, 9, 9, 10, 10, 10, 12, 16, 16, 21, 22, 
22, 22, 23, 23, 28, 28, 28, 28)), row.names = c(NA, -44L), class = c("tbl_df", 
"tbl", "data.frame"))

Asked Feb 11 at 18:02 by denis; edited Feb 13 at 6:01.
  • To clarify, when you say "the multiple of 4 of the row repetitions remains constant" -- do you mean the number of unique row values in the data will always be a whole multiple of 4 within each type? – zephryl Commented Feb 11 at 19:45
  • Yes if the unique row values you are referring to are unique after grouping the repetitions. – denis Commented Feb 11 at 20:58

3 Answers

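library(dplyr)

dat0 %>% 
  distinct() %>% 
  # number the unique rows in blocks of 4, restarting within each type
  mutate(id = gl(n()/4, 4), .by = type) %>% 
  # join back to restore the repeated rows
  right_join(dat0, by = c("type", "row"))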

#> # A tibble: 44 × 3
#>    type    row id   
#>    <chr> <dbl> <fct>
#>  1 a         5 1    
#>  2 a         5 1    
#>  3 a         6 1    
#>  4 a         8 1    
#>  5 a         8 1    
#>  6 a         8 1    
#>  7 a        10 1    
#>  8 a        11 2    
#>  9 a        11 2    
#> 10 a        11 2    
#> # ℹ 34 more rows

Created on 2025-02-11 with reprex v2.1.1
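
For readers unfamiliar with gl(), the small sketch below shows the labels it generates. The values 2 and 4 are illustrative only and are not taken from the question's data.

```r
# gl(n, k) builds a factor with n levels, each repeated k times in a row.
labels <- gl(2, 4)
print(labels)
as.integer(labels)  # integer codes behind the labels

# The result has n * k elements, so the number of rows being labelled
# must equal that product for the assignment to succeed.
length(labels)
```

Because the length is fixed at n * k, this approach presumably only fits data where the count of unique values is an exact multiple of the block size.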

You could convert row to a factor and relabel its levels.

> f <- \(x, n=4) {
+   xs <- factor(x)
+   ln <- length(levels(xs))
+   stopifnot(!ln %% n)  ## stops if groups would be uneven
+   levels(xs) <- rep(seq_len(ln/n), each=n)
+   xs
+ }
> 
> dat0 |> transform(g=ave(row, type, FUN=f))
   type row g
1     a   5 1
2     a   5 1
3     a   6 1
4     a   8 1
5     a   8 1
6     a   8 1
7     a  10 1
8     a  11 2
9     a  11 2
10    a  11 2
11    a  13 2
12    a  13 2
13    a  14 2
14    a  14 2
15    a  18 2
16    a  18 2
17    a  18 2
18    b   3 1
19    b   4 1
20    b   4 1
21    b   4 1
22    b   6 1
23    b   6 1
24    b   7 1
25    b   7 1
26    b   7 1
27    b   9 2
28    b   9 2
29    b  10 2
30    b  10 2
31    b  10 2
32    b  12 2
33    b  16 2
34    b  16 2
35    b  21 3
36    b  22 3
37    b  22 3
38    b  22 3
39    b  23 3
40    b  23 3
41    b  28 3
42    b  28 3
43    b  28 3
44    b  28 3

This also works if the data are not sorted:

> dat0[sample.int(nrow(dat0)), ] |> 
+   transform(g=ave(row, type, FUN=f)) |> 
+   sort_by(~list(type, row))
   type row g
18    a   5 1
27    a   5 1
10    a   6 1
26    a   8 1
36    a   8 1
44    a   8 1
17    a  10 1
14    a  11 2
15    a  11 2
38    a  11 2
8     a  13 2
12    a  13 2
1     a  14 2
43    a  14 2
16    a  18 2
30    a  18 2
41    a  18 2
29    b   3 1
13    b   4 1
19    b   4 1
23    b   4 1
35    b   6 1
37    b   6 1
2     b   7 1
28    b   7 1
40    b   7 1
7     b   9 2
25    b   9 2
4     b  10 2
9     b  10 2
22    b  10 2
31    b  12 2
21    b  16 2
32    b  16 2
33    b  21 3
3     b  22 3
20    b  22 3
24    b  22 3
6     b  23 3
39    b  23 3
5     b  28 3
11    b  28 3
34    b  28 3
42    b  28 3
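
The relabelling step inside f relies on a property of factors: assigning the same label to several levels merges them. A minimal sketch on made-up values (the vector x below is illustrative, not the question's data):

```r
x <- c(5, 5, 6, 8, 10, 11, 11, 13, 14, 18)
xs <- factor(x)              # 8 levels, one per unique value, in sorted order
levels(xs)

# Give the first four levels the label 1 and the next four the label 2;
# levels that share a label are merged into one.
levels(xs) <- rep(1:2, each = 4)
levels(xs)
as.integer(xs)               # group code for every element of x
```

Since factor() sorts its levels, the input order of x does not matter, which is why the approach tolerates shuffled rows.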

To check whether a chosen n is suitable, divide the number of unique row values per type by n and see whether the result is a whole number, e.g.

> with(unique(dat0[c('type', 'row')]), tapply(row, type, length))/4
a b 
2 3 
> with(unique(dat0[c('type', 'row')]), tapply(row, type, length))/6
       a        b 
1.333333 2.000000 
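
The question's edit asks for NA throughout any type whose number of unique row values is not a multiple of n. One way this might be done is sketched below. The name grp_or_na and the toy vectors are hypothetical, and the code is a variant of the ideas in this thread rather than something taken from any of the answers.

```r
# Hypothetical helper: group codes for x, or all NA when the number of
# unique values is not an exact multiple of n.
grp_or_na <- function(x, n = 4) {
  ux <- sort(unique(x))
  if (length(ux) %% n != 0) {
    return(rep(NA_integer_, length(x)))
  }
  # Position of each value among the sorted unique values, then
  # ceiling division by n using integer division.
  (match(x, ux) + n - 1L) %/% n
}

# Toy data: type "a" has 4 unique row values, type "b" has 3.
type <- c(rep("a", 6), rep("b", 4))
row  <- c(1, 1, 2, 3, 4, 4,   2, 5, 5, 7)

res4 <- ave(row, type, FUN = grp_or_na)                          # n = 4
res2 <- ave(row, type, FUN = function(v) grp_or_na(v, n = 2))    # n = 2
res4
res2
```

Returning NA per type instead of calling stopifnot() lets the remaining types be processed even when one of them does not divide evenly.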

Using dplyr::consecutive_id() with floor division:

library(dplyr)

grp_n_vals <- function(x, n) (consecutive_id(x) + n - 1) %/% n
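
To see why the arithmetic works: consecutive_id() numbers each run of identical values 1, 2, 3, and so on, and adding n - 1 before integer division by n rounds up, so run numbers 1 to n map to group 1, n + 1 to 2n map to group 2, etc. A base-R sketch of the same idea, where run_id is a hypothetical stand-in for consecutive_id() built on rle():

```r
# Hypothetical base-R stand-in for dplyr::consecutive_id() on a plain vector:
# number each run of identical adjacent values 1, 2, 3, ...
run_id <- function(x) {
  runs <- rle(x)
  rep(seq_along(runs$lengths), times = runs$lengths)
}

x <- c(5, 5, 6, 8, 8, 8, 10, 11, 11, 13)
ids <- run_id(x)
ids
grp <- (ids + 2 - 1) %/% 2    # n = 2: ceiling(ids / 2) via integer division
grp
```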

Although you specified that row is always in order within type, if you wanted to relax that assumption, you can sort your values and then "unsort" the result:

grp_n_vals <- function(x, n) {
  ord <- order(order(x))
  out <- (consecutive_id(sort(x)) + n - 1) %/% n
  out[ord]
}
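
The "unsort" step relies on order(order(x)) giving the position each element of x takes in the sorted vector, so indexing a result computed on sort(x) with it restores the original arrangement. A small illustration on an arbitrary vector:

```r
x <- c(8, 5, 10, 5, 6)
sorted <- sort(x)             # 5 5 6 8 10
ord <- order(order(x))        # position of each x[i] within `sorted`
ord
restored <- sorted[ord]       # back in the original arrangement
restored
```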

Result using either version:

dat0 %>%
  mutate(grp = grp_n_vals(row, n = 4), .by = type) %>% 
  print(n = Inf)
# # A tibble: 44 × 3
#    type    row   grp
#    <chr> <dbl> <dbl>
#  1 a         5     1
#  2 a         5     1
#  3 a         6     1
#  4 a         8     1
#  5 a         8     1
#  6 a         8     1
#  7 a        10     1
#  8 a        11     2
#  9 a        11     2
# 10 a        11     2
# 11 a        13     2
# 12 a        13     2
# 13 a        14     2
# 14 a        14     2
# 15 a        18     2
# 16 a        18     2
# 17 a        18     2
# 18 b         3     1
# 19 b         4     1
# 20 b         4     1
# 21 b         4     1
# 22 b         6     1
# 23 b         6     1
# 24 b         7     1
# 25 b         7     1
# 26 b         7     1
# 27 b         9     2
# 28 b         9     2
# 29 b        10     2
# 30 b        10     2
# 31 b        10     2
# 32 b        12     2
# 33 b        16     2
# 34 b        16     2
# 35 b        21     3
# 36 b        22     3
# 37 b        22     3
# 38 b        22     3
# 39 b        23     3
# 40 b        23     3
# 41 b        28     3
# 42 b        28     3
# 43 b        28     3
# 44 b        28     3

Tags: r · How to number groups of randomly repeated values into sets of n unique values · Stack Overflow