r - Add group ID per set of two observations - Stack Overflow

IT技术

更新时间：2025-04-050

admin管理员组
文章数量:1345031

I want to calculate the difference in time of vehicles passing camera 1 (at the entrance) and camera 2 (at the exit). The problem is often data from either camera is missing.

I tried to pivot the data to a wide format. This worked with seq_len() to create sets, but not when data are missing. Then the function shifted and grouped the wrong times. (In the real data, time is a posix object, but I don't think that matters for the example.)

Here is a short example of the data. As you can see, the camera of observation eight is 1 instead of 2, and hereafter the sequence lags behind.

How can I get this to work? Is there a way seq_len can continue, or another function that works? I tried consecutive_id() and group_indices(), but they work the wrong way around. Another idea I could not get to work was adding rows with missing data so camera 1 and 2 keep alternating.

licence <- as.factor(c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "D"))
camera <- as.factor(c(1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 2))
time <-   1:12

camera_data <- data.frame(licence, camera, time)

   licence camera time
1        A      1    1
2        A      2    2
3        A      1    3
4        A      2    4
5        B      1    5
6        B      2    6
7        C      1    7
8        C      1    8
9        C      2    9
10       C      1   10
11       C      2   11
12       D      2   12
13       D      1   13
14       D      2   14

camera_data_sequence <- camera_data |> 
  dplyr::group_by(camera) |> 
  dplyr::arrange(licence, time) |> 
  dplyr::mutate(sequence = seq_len(dplyr::n()))

   licence camera  time sequence
   <fct>   <fct>  <int>    <int>
 1 A       1          1        1
 2 A       2          2        1
 3 A       1          3        2
 4 A       2          4        2
 5 B       1          5        3
 6 B       2          6        3
 7 C       1          7        4
 8 C       1          8        5
 9 C       2          9        4 # this should be group 5
10 C       1         10        6
11 C       2         11        5 # this should be group 6
12 D       2         12        6 # this should be group 7
13 D       1         13        7 # this should be group 8
14 D       2         14        7 # this should be group 8

camera_data_sequence2 <- camera_data_sequence |> 
  tidyr::pivot_wider(names_from = camera,
                     values_from = time)

  licence sequence   `1`   `2`
  <fct>      <int> <int> <int>
1 A              1     1     2
2 A              2     3     4
3 B              3     5     6
4 C              4     7     9
5 C              5     8    11
6 C              6    10    NA
7 D              6    NA    12
8 D              7    13    14

I want to calculate the difference in time of vehicles passing camera 1 (at the entrance) and camera 2 (at the exit). The problem is often data from either camera is missing.

I tried to pivot the data to a wide format. This worked with seq_len() to create sets, but not when data are missing. Then the function shifted and grouped the wrong times. (In the real data, time is a posix object, but I don't think that matters for the example.)

Here is a short example of the data. As you can see, the camera of observation eight is 1 instead of 2, and hereafter the sequence lags behind.

How can I get this to work? Is there a way seq_len can continue, or another function that works? I tried consecutive_id() and group_indices(), but they work the wrong way around. Another idea I could not get to work was adding rows with missing data so camera 1 and 2 keep alternating.

licence <- as.factor(c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "D"))
camera <- as.factor(c(1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 2))
time <-   1:12

camera_data <- data.frame(licence, camera, time)

   licence camera time
1        A      1    1
2        A      2    2
3        A      1    3
4        A      2    4
5        B      1    5
6        B      2    6
7        C      1    7
8        C      1    8
9        C      2    9
10       C      1   10
11       C      2   11
12       D      2   12
13       D      1   13
14       D      2   14

camera_data_sequence <- camera_data |> 
  dplyr::group_by(camera) |> 
  dplyr::arrange(licence, time) |> 
  dplyr::mutate(sequence = seq_len(dplyr::n()))

   licence camera  time sequence
   <fct>   <fct>  <int>    <int>
 1 A       1          1        1
 2 A       2          2        1
 3 A       1          3        2
 4 A       2          4        2
 5 B       1          5        3
 6 B       2          6        3
 7 C       1          7        4
 8 C       1          8        5
 9 C       2          9        4 # this should be group 5
10 C       1         10        6
11 C       2         11        5 # this should be group 6
12 D       2         12        6 # this should be group 7
13 D       1         13        7 # this should be group 8
14 D       2         14        7 # this should be group 8

camera_data_sequence2 <- camera_data_sequence |> 
  tidyr::pivot_wider(names_from = camera,
                     values_from = time)

  licence sequence   `1`   `2`
  <fct>      <int> <int> <int>
1 A              1     1     2
2 A              2     3     4
3 B              3     5     6
4 C              4     7     9
5 C              5     8    11
6 C              6    10    NA
7 D              6    NA    12
8 D              7    13    14

Share Improve this question asked yesterday Raoul Van Oosten 2421 silver badge7 bronze badges

Would the problem be solved if we would add a row after a row like 7 which adds an NA for time? I.e. completion based on licence and camera? – Friede Commented yesterday
Your example data doesn’t match what you show. (eg, licence <- has just one "G" but the dataframe you show has three.) – zephryl Commented yesterday

Add a comment |

4 Answers 4

Sorted by: Reset to default 1

licence <- as.factor(c("A", "A", "A", "A", "B", "B", 
                       "C", "C", "C", "C", "C", "D", "D", "D"))
camera <- as.factor(c(1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2))
time <-   1:14

camera_data <- data.frame(licence, camera, time)


library(dplyr)
library(tidyr)

## doing a pivot_wider, but you could summarize/reframe as well
camera_data %>% 
   mutate(sequence = cumsum(as.numeric(camera) - 
                              lag(as.numeric(camera), default = 1) < 1)) %>% 
  pivot_wider(names_from = "camera", values_from = time)
#> # A tibble: 8 × 4
#>   licence sequence   `1`   `2`
#>   <fct>      <int> <int> <int>
#> 1 A              1     1     2
#> 2 A              2     3     4
#> 3 B              3     5     6
#> 4 C              4     7    NA
#> 5 C              5     8     9
#> 6 C              6    10    11
#> 7 D              7    NA    12
#> 8 D              8    13    14

^{Created on 2025-04-03 with reprex v2.1.1}

library(dplyr)
library(tidyr)

mutate(camera_data, id=row_number()) %>%
  complete(licence, camera) %>%
  arrange(id) %>%
  mutate(sequence=cumsum(camera==1)) # You may need .by=licence

Gives:

# A tibble: 13 × 5
   licence camera  time    id sequence
   <fct>   <fct>  <int> <int>    <int>
 1 A       1          1     1        1
 2 A       2          2     2        1
 3 A       1          3     3        2
 4 A       2          4     4        2
 5 B       1          5     5        3
 6 B       2          6     6        3
 7 C       1          7     7        4
 8 C       1          8     8        5
 9 C       2          9     9        5
10 C       1         10    10        6
11 C       2         11    11        6
12 D       2         12    12        6
13 D       1         NA    NA        7

Then use summarise or reframe to get the time differences

... %>%
  reframe(time.diff=nth(time,2) - first(time), .by=c(licence, sequence))

# A tibble: 8 × 3
  licence sequence time.diff
  <fct>      <int>     <int>
1 A              1         1
2 A              2         1
3 B              3         1
4 C              4        NA
5 C              5         1
6 C              6         1
7 D              6        NA
8 D              7         1

Within each license, use cumsum() and lag() to increment sequence when the current row is for camera 1, the previous row is for camera 2, or it’s the first row.

library(dplyr)

camera_data <- camera_data |>
  mutate(
    sequence = cumsum(camera == 1 | lag(camera) == 2 | is.na(lag(camera))),
    .by = licence
  )

   licence camera time sequence
1        A      1    1        1
2        A      2    2        1
3        A      1    3        2
4        A      2    4        2
5        B      1    5        1
6        B      2    6        1
7        C      1    7        1
8        C      1    8        2
9        C      2    9        2
10       C      1   10        3
11       C      2   11        3
12       D      2   12        1
13       D      1   13        2
14       D      2   14        2

Then to get time differences:

camera_data |>
  summarize(
    diff = if (n() == 2) diff(time) else NA, 
    .by = c(licence, sequence)
  )

  licence sequence diff
1       A        1    1
2       A        2    1
3       B        1    1
4       C        1   NA
5       C        2    1
6       C        3    1
7       D        1   NA
8       D        2    1

Or, for only sequences with both cameras:

camera_data |>
  reframe(diff = diff(time), .by = c(licence, sequence))

  licence sequence diff
1       A        1    1
2       A        2    1
3       B        1    1
4       C        2    1
5       C        3    1
6       D        2    1

Try this, which produces balanced data,

> (res <- split(cdat, ~license) |> 
+   lapply(transform, event=cumsum(diff(c(camera[1], camera)) != 1)) |> 
+   do.call(what='rbind') |> as.data.frame() |> 
+   {\(.) merge(., 
+               expand.grid(camera=unique(.$camera), 
+                           license=unique(.$license),
+                           event=unique(.$event)), 
+               all=TRUE)}() |> 
+   sort_by(~list(license, event)))
   license camera event time
1        A      1     1    1
4        A      2     1    2
2        A      1     2    3
5        A      2     2    4
3        A      1     3   NA
6        A      2     3   NA
7        B      1     1    5
10       B      2     1    6
8        B      1     2   NA
11       B      2     2   NA
9        B      1     3   NA
12       B      2     3   NA
13       C      1     1    7
16       C      2     1   NA
14       C      1     2    8
17       C      2     2    9
15       C      1     3   10
18       C      2     3   11
19       D      1     1   NA
22       D      2     1   12
20       D      1     2   13
23       D      2     2   14
21       D      1     3   NA
24       D      2     3   15

and reshape thereafter.

> reshape(res, direction='wide', idvar=c('license', 'event'), timevar='camera')
   license event time.1 time.2
1        A     1      1      2
2        A     2      3      4
3        A     3     NA     NA
7        B     1      5      6
8        B     2     NA     NA
9        B     3     NA     NA
13       C     1      7     NA
14       C     2      8      9
15       C     3     10     11
19       D     1     NA     12
20       D     2     13     14
21       D     3     NA     15

Data:

> dput(cdat)
structure(list(license = c("A", "A", "A", "A", "B", "B", "C", 
"C", "C", "C", "C", "D", "D", "D", "D"), camera = c(1L, 2L, 1L, 
2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), time = 1:15), class = "data.frame", row.names = c(NA, 
-15L))

本文标签： rAdd group ID per set of two observationsStack Overflow

版权声明：本文标题：r - Add group ID per set of two observations - Stack Overflow 内容由网友自发贡献，该文观点仅代表作者本人，转载请联系作者并注明出处：http://www.betaflare.com/web/1743784394a2538415.html，本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容，一经查实，本站将立刻删除。

编程频道|软件玩家 - 软件改变生活！

r - Add group ID per set of two observations - Stack Overflow

4 Answers 4

更多相关文章