admin管理员组文章数量:1123376
I have a dataset where each row is a cluster, and I want to know which clusters fall within the previous cluster, or 12h after the previous cluster ended.
My dataset looks like this:
cluster_id period_id clus_start clus_end
36 SAT119_1 2011-12-08 14:19 2011-12-10 14:18
37 SAT119_1 2011-12-08 18:18 2011-12-10 06:20
40 SAT119_1 2011-12-14 22:19 2011-12-16 22:18
41 SAT119_1 2011-12-17 22:18 2011-12-18 18:18
42 SAT119_1 2011-12-19 05:19 2011-12-30 14:18
47 SAT119_1 2011-12-30 18:19 2012-01-04 22:19
49 SAT119_1 2012-01-06 06:18 2012-01-07 22:19
59 SAT119_1 2012-01-20 02:19 2012-01-21 22:18
61 SAT119_1 2012-01-22 18:18 2012-01-24 14:18
63 SAT119_1 2012-01-26 06:18 2012-01-28 14:20
What I want is to check, sequentially, if the start of cluster n+1
falls within the end (or 12h later) of cluster n
, and I want to create a new column with a yes/no option. The first cluster of each period id would always have "no", and then it would be either "yes" or "no".
But, at the same time, IF cluster n+1
falls within cluster n
, I would like cluster n+2
to be checked in relation to cluster n
.
In order words, if overlap (see table below) is yes, then that cluster should be ignored to calculate the time difference for the next cluster.
So, the output in this table would be:
cluster_id period_id clus_start clus_end overlap time_diff (days)
36 SAT119_1 2011-12-08 14:19 2011-12-10 14:18 no NA
37 SAT119_1 2011-12-08 18:18 2011-12-10 06:20 yes start_37 - end_36
40 SAT119_1 2011-12-14 22:19 2011-12-16 22:18 no start_40 - end_36
41 SAT119_1 2011-12-17 22:18 2011-12-18 18:18 no start_41 - end_40
42 SAT119_1 2011-12-19 05:19 2011-12-30 14:18 yes start_42 - end_41
47 SAT119_1 2011-12-30 18:19 2012-01-04 22:19 no start_47 - end_41
49 SAT119_1 2012-01-06 06:18 2012-01-07 22:19 no start_49 - end_47
59 SAT119_1 2012-01-20 02:19 2012-01-21 22:18 no start_59 - end_49
61 SAT119_1 2012-01-22 18:18 2012-01-24 14:18 no start_61 - end_59
63 SAT119_1 2012-01-26 06:18 2012-01-28 14:20 no start_63 - end_61
(Where start_37 and so on indicates the "clus_start" of cluster id 37)
I believe that, at first, the time has to be calculated for the first two rows, then the value in the column "within" is filled, and then the next calculations will be made considering the value (yes/no) on that column.
But I am a bit stuck, so far I could only apply the difftime()
function to get the times between the clusters:
df$time_diff <- as.numeric(with(df, difftime(clus_start, ave(clus_end, period_id,FUN=lag), units='days')))
But this does not account for the yes/no column... Any idea how to incorporate this step?
Here is an example of the dataset:
structure(list(clus_start = structure(c(1323353941, 1323368281,
1323901141, 1324160281, 1324275541, 1325269141, 1325830681, 1327025941,
1327256281, 1327558681), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
clus_end = structure(c(1323526681, 1323498001, 1324073881,
1324232281, 1325254681, 1325715541, 1325974741, 1327184281,
1327414681, 1327760401), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
cluster_id = c("36", "37", "40",
"41", "42", "47", "49", "59",
"61", "63"), period_id = c("SAT119_1", "SAT119_1",
"SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1",
"SAT119_1", "SAT119_1", "SAT119_1")), row.names = c(2L, 3L,
4L, 5L, 6L, 8L, 9L, 13L, 14L, 15L), class = "data.frame")
I have a dataset where each row is a cluster, and I want to know which clusters fall within the previous cluster, or 12h after the previous cluster ended.
My dataset looks like this:
cluster_id period_id clus_start clus_end
36 SAT119_1 2011-12-08 14:19 2011-12-10 14:18
37 SAT119_1 2011-12-08 18:18 2011-12-10 06:20
40 SAT119_1 2011-12-14 22:19 2011-12-16 22:18
41 SAT119_1 2011-12-17 22:18 2011-12-18 18:18
42 SAT119_1 2011-12-19 05:19 2011-12-30 14:18
47 SAT119_1 2011-12-30 18:19 2012-01-04 22:19
49 SAT119_1 2012-01-06 06:18 2012-01-07 22:19
59 SAT119_1 2012-01-20 02:19 2012-01-21 22:18
61 SAT119_1 2012-01-22 18:18 2012-01-24 14:18
63 SAT119_1 2012-01-26 06:18 2012-01-28 14:20
What I want is to check, sequentially, if the start of cluster n+1
falls within the end (or 12h later) of cluster n
, and I want to create a new column with a yes/no option. The first cluster of each period id would always have "no", and then it would be either "yes" or "no".
But, at the same time, IF cluster n+1
falls within cluster n
, I would like cluster n+2
to be checked in relation to cluster n
.
In order words, if overlap (see table below) is yes, then that cluster should be ignored to calculate the time difference for the next cluster.
So, the output in this table would be:
cluster_id period_id clus_start clus_end overlap time_diff (days)
36 SAT119_1 2011-12-08 14:19 2011-12-10 14:18 no NA
37 SAT119_1 2011-12-08 18:18 2011-12-10 06:20 yes start_37 - end_36
40 SAT119_1 2011-12-14 22:19 2011-12-16 22:18 no start_40 - end_36
41 SAT119_1 2011-12-17 22:18 2011-12-18 18:18 no start_41 - end_40
42 SAT119_1 2011-12-19 05:19 2011-12-30 14:18 yes start_42 - end_41
47 SAT119_1 2011-12-30 18:19 2012-01-04 22:19 no start_47 - end_41
49 SAT119_1 2012-01-06 06:18 2012-01-07 22:19 no start_49 - end_47
59 SAT119_1 2012-01-20 02:19 2012-01-21 22:18 no start_59 - end_49
61 SAT119_1 2012-01-22 18:18 2012-01-24 14:18 no start_61 - end_59
63 SAT119_1 2012-01-26 06:18 2012-01-28 14:20 no start_63 - end_61
(Where start_37 and so on indicates the "clus_start" of cluster id 37)
I believe that, at first, the time has to be calculated for the first two rows, then the value in the column "within" is filled, and then the next calculations will be made considering the value (yes/no) on that column.
But I am a bit stuck, so far I could only apply the difftime()
function to get the times between the clusters:
df$time_diff <- as.numeric(with(df, difftime(clus_start, ave(clus_end, period_id,FUN=lag), units='days')))
But this does not account for the yes/no column... Any idea how to incorporate this step?
Here is an example of the dataset:
structure(list(clus_start = structure(c(1323353941, 1323368281,
1323901141, 1324160281, 1324275541, 1325269141, 1325830681, 1327025941,
1327256281, 1327558681), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
clus_end = structure(c(1323526681, 1323498001, 1324073881,
1324232281, 1325254681, 1325715541, 1325974741, 1327184281,
1327414681, 1327760401), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
cluster_id = c("36", "37", "40",
"41", "42", "47", "49", "59",
"61", "63"), period_id = c("SAT119_1", "SAT119_1",
"SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1",
"SAT119_1", "SAT119_1", "SAT119_1")), row.names = c(2L, 3L,
4L, 5L, 6L, 8L, 9L, 13L, 14L, 15L), class = "data.frame")
Share
Improve this question
asked 13 hours ago
mto23mto23
4651 gold badge6 silver badges16 bronze badges
2
|
1 Answer
Reset to default 0You can produce an output matching your desired format, where:
- First cluster in each period has no time difference (NA)
- Overlapping clusters are marked with "yes"
- Time differences are calculated relative to the appropriate reference cluster
df <- structure(list(clus_start = structure(c(1323353941, 1323368281,
1323901141, 1324160281, 1324275541, 1325269141, 1325830681, 1327025941,
1327256281, 1327558681), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
clus_end = structure(c(1323526681, 1323498001, 1324073881,
1324232281, 1325254681, 1325715541, 1325974741, 1327184281,
1327414681, 1327760401), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
cluster_id = c("36", "37", "40",
"41", "42", "47", "49", "59",
"61", "63"), period_id = c("SAT119_1", "SAT119_1",
"SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1", "SAT119_1",
"SAT119_1", "SAT119_1", "SAT119_1")), row.names = c(2L, 3L,
4L, 5L, 6L, 8L, 9L, 13L, 14L, 15L), class = "data.frame")#
analyze_clusters <- function(df) {
# Convert time columns to POSIXct if they aren't already
df$clus_start <- as.POSIXct(df$clus_start, tz = "GMT")
df$clus_end <- as.POSIXct(df$clus_end, tz = "GMT")
# Initialize result columns
df$overlap <- "no"
df$time_diff <- NA
# Process each period_id separately
result <- df %>%
group_by(period_id) %>%
group_modify(~{
dat <- .x
n <- nrow(dat)
# Reference cluster index (the one we're comparing against)
ref_idx <- 1
for(i in 2:n) {
# Add 12 hours to the reference end time
ref_end_plus_12h <- dat$clus_end[ref_idx] + 12 * 3600
print(paste0("Ref: ",ref_end_plus_12h, " for i: ", i))
# Check if current cluster starts within reference cluster (including 12h buffer)
if(dat$clus_start[i] <= ref_end_plus_12h) {
print(paste0(dat$clus_start[i], " <= ", ref_end_plus_12h, "!"))
print("Overlap-> yes")
dat$overlap[i] <- "yes"
# Calculate time difference from reference cluster
dat$time_diff[i] <- as.numeric(
difftime(dat$clus_start[i], dat$clus_end[ref_idx], units = "days")
)
} else {
print("Overlap-> no")
# If no overlap, calculate time difference and update reference
dat$time_diff[i] <- as.numeric(
difftime(dat$clus_start[i], dat$clus_end[ref_idx], units = "days")
)
ref_idx <- i
print(paste0("New ref index: ", i))
}
}
dat
}) %>%
ungroup()
return(result)
}
res <- analyze_clusters(df)
But then my result looks different from yours:
period_id clus_start clus_end cluster_id overlap time_diff
<chr> <dttm> <dttm> <chr> <chr> <dbl>
1 SAT119_1 2011-12-08 14:19:01 2011-12-10 14:18:01 36 no NA
2 SAT119_1 2011-12-08 18:18:01 2011-12-10 06:20:01 37 yes -1.83
3 SAT119_1 2011-12-14 22:19:01 2011-12-16 22:18:01 40 no 4.33
4 SAT119_1 2011-12-17 22:18:01 2011-12-18 18:18:01 41 no 1
5 SAT119_1 2011-12-19 06:19:01 2011-12-30 14:18:01 42 no 0.501
6 SAT119_1 2011-12-30 18:19:01 2012-01-04 22:19:01 47 yes 0.167
7 SAT119_1 2012-01-06 06:18:01 2012-01-07 22:19:01 49 no 6.67
8 SAT119_1 2012-01-20 02:19:01 2012-01-21 22:18:01 59 no 12.2
9 SAT119_1 2012-01-22 18:18:01 2012-01-24 14:18:01 61 no 0.833
10 SAT119_1 2012-01-26 06:18:01 2012-01-28 14:20:01 63 no 1.67
本文标签: R difftime with lag to create another columnStack Overflow
版权声明:本文标题:R: difftime with lag to create another column - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1736565618a1944698.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
clus_end
inclusive, exclusive, or it doesn't really matter for your application? – margusl Commented 13 hours ago