admin管理员组

文章数量:1345031

I want to calculate the difference in time of vehicles passing camera 1 (at the entrance) and camera 2 (at the exit). The problem is often data from either camera is missing.

I tried to pivot the data to a wide format. This worked with seq_len() to create sets, but not when data are missing. Then the function shifted and grouped the wrong times. (In the real data, time is a posix object, but I don't think that matters for the example.)

Here is a short example of the data. As you can see, the camera of observation eight is 1 instead of 2, and hereafter the sequence lags behind.

How can I get this to work? Is there a way seq_len can continue, or another function that works? I tried consecutive_id() and group_indices(), but they work the wrong way around. Another idea I could not get to work was adding rows with missing data so camera 1 and 2 keep alternating.

licence <- as.factor(c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "D"))
camera <- as.factor(c(1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 2))
time <-   1:12

camera_data <- data.frame(licence, camera, time)

   licence camera time
1        A      1    1
2        A      2    2
3        A      1    3
4        A      2    4
5        B      1    5
6        B      2    6
7        C      1    7
8        C      1    8
9        C      2    9
10       C      1   10
11       C      2   11
12       D      2   12
13       D      1   13
14       D      2   14

camera_data_sequence <- camera_data |> 
  dplyr::group_by(camera) |> 
  dplyr::arrange(licence, time) |> 
  dplyr::mutate(sequence = seq_len(dplyr::n()))

   licence camera  time sequence
   <fct>   <fct>  <int>    <int>
 1 A       1          1        1
 2 A       2          2        1
 3 A       1          3        2
 4 A       2          4        2
 5 B       1          5        3
 6 B       2          6        3
 7 C       1          7        4
 8 C       1          8        5
 9 C       2          9        4 # this should be group 5
10 C       1         10        6
11 C       2         11        5 # this should be group 6
12 D       2         12        6 # this should be group 7
13 D       1         13        7 # this should be group 8
14 D       2         14        7 # this should be group 8

camera_data_sequence2 <- camera_data_sequence |> 
  tidyr::pivot_wider(names_from = camera,
                     values_from = time)

  licence sequence   `1`   `2`
  <fct>      <int> <int> <int>
1 A              1     1     2
2 A              2     3     4
3 B              3     5     6
4 C              4     7     9
5 C              5     8    11
6 C              6    10    NA
7 D              6    NA    12
8 D              7    13    14

I want to calculate the difference in time of vehicles passing camera 1 (at the entrance) and camera 2 (at the exit). The problem is often data from either camera is missing.

I tried to pivot the data to a wide format. This worked with seq_len() to create sets, but not when data are missing. Then the function shifted and grouped the wrong times. (In the real data, time is a posix object, but I don't think that matters for the example.)

Here is a short example of the data. As you can see, the camera of observation eight is 1 instead of 2, and hereafter the sequence lags behind.

How can I get this to work? Is there a way seq_len can continue, or another function that works? I tried consecutive_id() and group_indices(), but they work the wrong way around. Another idea I could not get to work was adding rows with missing data so camera 1 and 2 keep alternating.

licence <- as.factor(c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C", "C", "D"))
camera <- as.factor(c(1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 2))
time <-   1:12

camera_data <- data.frame(licence, camera, time)

   licence camera time
1        A      1    1
2        A      2    2
3        A      1    3
4        A      2    4
5        B      1    5
6        B      2    6
7        C      1    7
8        C      1    8
9        C      2    9
10       C      1   10
11       C      2   11
12       D      2   12
13       D      1   13
14       D      2   14

camera_data_sequence <- camera_data |> 
  dplyr::group_by(camera) |> 
  dplyr::arrange(licence, time) |> 
  dplyr::mutate(sequence = seq_len(dplyr::n()))

   licence camera  time sequence
   <fct>   <fct>  <int>    <int>
 1 A       1          1        1
 2 A       2          2        1
 3 A       1          3        2
 4 A       2          4        2
 5 B       1          5        3
 6 B       2          6        3
 7 C       1          7        4
 8 C       1          8        5
 9 C       2          9        4 # this should be group 5
10 C       1         10        6
11 C       2         11        5 # this should be group 6
12 D       2         12        6 # this should be group 7
13 D       1         13        7 # this should be group 8
14 D       2         14        7 # this should be group 8

camera_data_sequence2 <- camera_data_sequence |> 
  tidyr::pivot_wider(names_from = camera,
                     values_from = time)

  licence sequence   `1`   `2`
  <fct>      <int> <int> <int>
1 A              1     1     2
2 A              2     3     4
3 B              3     5     6
4 C              4     7     9
5 C              5     8    11
6 C              6    10    NA
7 D              6    NA    12
8 D              7    13    14
Share Improve this question asked yesterday Raoul Van OostenRaoul Van Oosten 2421 silver badge7 bronze badges 2
  • Would the problem be solved if we would add a row after a row like 7 which adds an NA for time? I.e. completion based on licence and camera? – Friede Commented yesterday
  • Your example data doesn’t match what you show. (eg, licence <- has just one "G" but the dataframe you show has three.) – zephryl Commented yesterday
Add a comment  | 

4 Answers 4

Reset to default 1
licence <- as.factor(c("A", "A", "A", "A", "B", "B", 
                       "C", "C", "C", "C", "C", "D", "D", "D"))
camera <- as.factor(c(1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 2, 1, 2))
time <-   1:14

camera_data <- data.frame(licence, camera, time)


library(dplyr)
library(tidyr)

## doing a pivot_wider, but you could summarize/reframe as well
camera_data %>% 
   mutate(sequence = cumsum(as.numeric(camera) - 
                              lag(as.numeric(camera), default = 1) < 1)) %>% 
  pivot_wider(names_from = "camera", values_from = time)
#> # A tibble: 8 × 4
#>   licence sequence   `1`   `2`
#>   <fct>      <int> <int> <int>
#> 1 A              1     1     2
#> 2 A              2     3     4
#> 3 B              3     5     6
#> 4 C              4     7    NA
#> 5 C              5     8     9
#> 6 C              6    10    11
#> 7 D              7    NA    12
#> 8 D              8    13    14

Created on 2025-04-03 with reprex v2.1.1

library(dplyr)
library(tidyr)

mutate(camera_data, id=row_number()) %>%
  complete(licence, camera) %>%
  arrange(id) %>%
  mutate(sequence=cumsum(camera==1)) # You may need .by=licence

Gives:


# A tibble: 13 × 5
   licence camera  time    id sequence
   <fct>   <fct>  <int> <int>    <int>
 1 A       1          1     1        1
 2 A       2          2     2        1
 3 A       1          3     3        2
 4 A       2          4     4        2
 5 B       1          5     5        3
 6 B       2          6     6        3
 7 C       1          7     7        4
 8 C       1          8     8        5
 9 C       2          9     9        5
10 C       1         10    10        6
11 C       2         11    11        6
12 D       2         12    12        6
13 D       1         NA    NA        7

Then use summarise or reframe to get the time differences

... %>%
  reframe(time.diff=nth(time,2) - first(time), .by=c(licence, sequence))

# A tibble: 8 × 3
  licence sequence time.diff
  <fct>      <int>     <int>
1 A              1         1
2 A              2         1
3 B              3         1
4 C              4        NA
5 C              5         1
6 C              6         1
7 D              6        NA
8 D              7         1

Within each license, use cumsum() and lag() to increment sequence when the current row is for camera 1, the previous row is for camera 2, or it’s the first row.

library(dplyr)

camera_data <- camera_data |>
  mutate(
    sequence = cumsum(camera == 1 | lag(camera) == 2 | is.na(lag(camera))),
    .by = licence
  )
   licence camera time sequence
1        A      1    1        1
2        A      2    2        1
3        A      1    3        2
4        A      2    4        2
5        B      1    5        1
6        B      2    6        1
7        C      1    7        1
8        C      1    8        2
9        C      2    9        2
10       C      1   10        3
11       C      2   11        3
12       D      2   12        1
13       D      1   13        2
14       D      2   14        2

Then to get time differences:

camera_data |>
  summarize(
    diff = if (n() == 2) diff(time) else NA, 
    .by = c(licence, sequence)
  )
  licence sequence diff
1       A        1    1
2       A        2    1
3       B        1    1
4       C        1   NA
5       C        2    1
6       C        3    1
7       D        1   NA
8       D        2    1

Or, for only sequences with both cameras:

camera_data |>
  reframe(diff = diff(time), .by = c(licence, sequence))
  licence sequence diff
1       A        1    1
2       A        2    1
3       B        1    1
4       C        2    1
5       C        3    1
6       D        2    1

Try this, which produces balanced data,

> (res <- split(cdat, ~license) |> 
+   lapply(transform, event=cumsum(diff(c(camera[1], camera)) != 1)) |> 
+   do.call(what='rbind') |> as.data.frame() |> 
+   {\(.) merge(., 
+               expand.grid(camera=unique(.$camera), 
+                           license=unique(.$license),
+                           event=unique(.$event)), 
+               all=TRUE)}() |> 
+   sort_by(~list(license, event)))
   license camera event time
1        A      1     1    1
4        A      2     1    2
2        A      1     2    3
5        A      2     2    4
3        A      1     3   NA
6        A      2     3   NA
7        B      1     1    5
10       B      2     1    6
8        B      1     2   NA
11       B      2     2   NA
9        B      1     3   NA
12       B      2     3   NA
13       C      1     1    7
16       C      2     1   NA
14       C      1     2    8
17       C      2     2    9
15       C      1     3   10
18       C      2     3   11
19       D      1     1   NA
22       D      2     1   12
20       D      1     2   13
23       D      2     2   14
21       D      1     3   NA
24       D      2     3   15

and reshape thereafter.

> reshape(res, direction='wide', idvar=c('license', 'event'), timevar='camera')
   license event time.1 time.2
1        A     1      1      2
2        A     2      3      4
3        A     3     NA     NA
7        B     1      5      6
8        B     2     NA     NA
9        B     3     NA     NA
13       C     1      7     NA
14       C     2      8      9
15       C     3     10     11
19       D     1     NA     12
20       D     2     13     14
21       D     3     NA     15

Data:

> dput(cdat)
structure(list(license = c("A", "A", "A", "A", "B", "B", "C", 
"C", "C", "C", "C", "D", "D", "D", "D"), camera = c(1L, 2L, 1L, 
2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), time = 1:15), class = "data.frame", row.names = c(NA, 
-15L))

本文标签: rAdd group ID per set of two observationsStack Overflow