I have a number of chromatograms saved as .csv files in a folder that look something like

time <- c(0.001575, 0.008775, 0.015975, 0.023175, 0.030375, 0.037575, 0.044775, 0.051975, 0.059175, 0.066375, 0.073575, 0.080776, 0.087976, 0.095176, 0.102376, 0.109576, 0.116776, 0.123976, 0.131176, 0.138376, 0.145576, 0.152776, 0.159976, 0.167176, 0.174376, 0.181576, 0.188776, 0.195976, 0.203176)

RID <- c(67.36, 66.39, 65.39, 64.41, 63.52, 62.76, 62.16,61.76, 61.54,61.53,61.7,62.05,62.52, 63.09, 63.71, 64.33, 64.92, 65.46, 65.93, 66.32, 66.63, 66.87, 67.05, 67.18, 67.27, 67.32, 67.35, 67.37, 67.38)

dd<- data.frame(time, RID)

What I need to do, and am struggling to work out how, is to "reduce" the resolution of the dataset by averaging the data in bins of a fixed width, e.g. 0.05, turning that data frame into something like

time RID
0 average of the RID data between time 0 and 0.05
0.05 average of the RID data between time 0.05 and 0.1
0.1 average of the RID data between time 0.1 and 0.15

and so on.

The only thing I know that comes close to what I want is aggregate, but that would mean first creating a dummy time column with the time data truncated to the desired resolution. It feels like there should be a more readily available solution, especially because I have an entire folder of .csv files and need to automate the whole process for future studies.

Asked Mar 17 at 9:48 by Raffaello; edited Mar 17 at 9:58 by zx8754.

Comment: If aggregate is not suitable, maybe use rollmean. – zx8754, Mar 17 at 10:10

5 Answers


Create groups, then get mean per group:

aggregate(RID ~ cut(time, seq(0, 1, 0.05)), data = dd, mean)
#   cut(time, seq(0, 1, 0.05))      RID
# 1                   (0,0.05] 64.57000
# 2                 (0.05,0.1] 62.02714
# 3                 (0.1,0.15] 65.32857
# 4                 (0.15,0.2] 67.20143
# 5                 (0.2,0.25] 67.38000
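If a numeric time column is wanted instead of interval labels, one possible variation is to ask cut() for integer bin codes and look up the left edge of each bin. This is only a sketch: the names bin_width, breaks, bin_start and binned are made up for illustration, and the breaks are sized to the data rather than fixed at 1.

```r
# Sample data from the question
time <- c(0.001575, 0.008775, 0.015975, 0.023175, 0.030375, 0.037575, 0.044775,
          0.051975, 0.059175, 0.066375, 0.073575, 0.080776, 0.087976, 0.095176,
          0.102376, 0.109576, 0.116776, 0.123976, 0.131176, 0.138376, 0.145576,
          0.152776, 0.159976, 0.167176, 0.174376, 0.181576, 0.188776, 0.195976,
          0.203176)
RID <- c(67.36, 66.39, 65.39, 64.41, 63.52, 62.76, 62.16, 61.76, 61.54, 61.53,
         61.7, 62.05, 62.52, 63.09, 63.71, 64.33, 64.92, 65.46, 65.93, 66.32,
         66.63, 66.87, 67.05, 67.18, 67.27, 67.32, 67.35, 67.37, 67.38)
dd <- data.frame(time, RID)

bin_width <- 0.05  # assumed bin width, as in the question

# Breaks starting at 0 and extending one full bin past the largest time
breaks <- seq(0, by = bin_width,
              length.out = ceiling(max(dd$time) / bin_width) + 1)

# labels = FALSE returns the integer bin code; map it to the bin's left edge
code <- cut(dd$time, breaks, labels = FALSE, include.lowest = TRUE)
dd$bin_start <- breaks[code]

# Mean RID per bin, with the bin start as the new time column
binned <- aggregate(RID ~ bin_start, data = dd, FUN = mean)
names(binned)[1] <- "time"
binned
```

Note that cut() uses right-closed intervals by default, so a sample that falls exactly on a break is counted in the bin to its left; pass right = FALSE if the opposite convention is preferred.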

Use findInterval():

n = seq(0, max(dd$time), .05)
tapply(dd$RID, findInterval(dd$time, n), mean) |> setNames(n)
#        0     0.05      0.1     0.15      0.2 
# 64.57000 62.02714 65.32857 67.20143 67.38000 
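One thing to watch with setNames(n): it appears to rely on every bin containing at least one sample, since tapply() only returns the groups that actually occur. A small sketch of a variation that labels each mean with the start of its own bin, using made-up data with a gap (the values below are illustrative, not from the question):

```r
# Illustrative data with no samples between 0.05 and 0.15
time <- c(0.01, 0.02, 0.03, 0.16, 0.17)
RID  <- c(1, 2, 3, 10, 20)

n   <- seq(0, max(time), 0.05)  # bin starts: 0, 0.05, 0.10, 0.15
idx <- findInterval(time, n)    # bin index of each sample: 1 1 1 4 4

# Group by the bin start itself, so empty bins cannot shift the labels
means <- tapply(RID, n[idx], mean)
means
```

Here only the bins starting at 0 and 0.15 appear in the result, each under its own label.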

EDIT

For a solution on a bunch of files

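l = lapply(list.files(pattern='\\.csv$'), read.csv) # pattern is a regular expression, not a glob
# if possible, merge into one data frame; otherwise compute the global maximum time
m = sapply(l, \(i) max(i[['time']], na.rm=TRUE)) |> max()
# then build the breaks once from the global m(aximum) and reuse them for every file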

We could reduce the number of lapply.
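To make the "global maximum" idea concrete, here is one way it might be finished. The helpers bin_means() and reduce_folder() are hypothetical names introduced for this sketch, and it assumes every file has time and RID columns with non-negative times.

```r
# Hypothetical helper: mean RID per bin for one data frame, given shared breaks
bin_means <- function(d, breaks) {
  idx <- findInterval(d$time, breaks)  # assumes all times are >= breaks[1]
  out <- aggregate(d$RID, by = list(time = breaks[idx]), FUN = mean)
  names(out)[2] <- "RID"
  out
}

# Hypothetical helper: read every .csv in a folder and bin all of them
# on the same breaks, derived from the largest time found in any file
reduce_folder <- function(dir = ".", width = 0.05) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  if (length(files) == 0) return(list())
  l <- lapply(files, read.csv)
  m <- max(sapply(l, function(d) max(d$time, na.rm = TRUE)))
  breaks <- seq(0, m, by = width)
  setNames(lapply(l, bin_means, breaks = breaks), basename(files))
}

# Usage (not run here): reduced <- reduce_folder("path/to/your/folder")
```

The result is a named list with one reduced data frame per file. Sharing the breaks keeps the bins aligned across chromatograms, which may matter if they are compared or merged later.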

Probably you can try

> aggregate(RID ~ cbind(time = as.character(cut(time, seq(0, 0.5, 0.05)))), dd, mean)
        time      RID
1   (0,0.05] 64.57000
2 (0.05,0.1] 62.02714
3 (0.1,0.15] 65.32857
4 (0.15,0.2] 67.20143
5 (0.2,0.25] 67.38000

or

> with(dd, by(RID, list(time = cut(time, seq(0, 0.5, 0.05))), mean))
time: (0,0.05]
[1] 64.57
------------------------------------------------------------ 
time: (0.05,0.1]
[1] 62.02714
------------------------------------------------------------
time: (0.1,0.15]
[1] 65.32857
------------------------------------------------------------
time: (0.15,0.2]
[1] 67.20143
------------------------------------------------------------
time: (0.2,0.25]
[1] 67.38
------------------------------------------------------------
time: (0.25,0.3]
[1] NA
------------------------------------------------------------
time: (0.3,0.35]
[1] NA
------------------------------------------------------------
time: (0.35,0.4]
[1] NA
------------------------------------------------------------
time: (0.4,0.45]
[1] NA
------------------------------------------------------------
time: (0.45,0.5]
[1] NA

For a single file

# Define bin width
bin_width <- 0.05

# Create bins using cut()
dd$bin <- cut(dd$time, breaks = seq(0, ceiling(max(dd$time)/bin_width)*bin_width, bin_width), 
              include.lowest = TRUE, right = FALSE)

# Calculate means for each bin
result <- aggregate(RID ~ bin, data = dd, mean)

# Extract the lower bound of each bin as the new time
result$time <- as.numeric(sub("\\[([^,]*),.*", "\\1", result$bin))
result <- result[, c("time", "RID")]

For multiple files

# Set working directory to your folder (or specify full path)
setwd("path/to/your/folder")

# Define bin width
bin_width <- 0.05

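# List all CSV files (pattern is a regular expression, not a glob),
# skipping outputs written by a previous run
files <- list.files(pattern = "\\.csv$")
files <- files[!grepl("_reduced\\.csv$", files)]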

# Process each file
for (file in files) {
  # Read the CSV
  dd <- read.csv(file)
  
  # Create bins
  dd$bin <- cut(dd$time, breaks = seq(0, ceiling(max(dd$time)/bin_width)*bin_width, bin_width), 
                include.lowest = TRUE, right = FALSE)
  
  # Calculate means
  result <- aggregate(RID ~ bin, data = dd, mean)
  result$time <- as.numeric(sub("\\[([^,]*),.*", "\\1", result$bin))
  result <- result[, c("time", "RID")]
  
  # Save the result (e.g., append "_reduced" to the filename)
  output_file <- sub("\\.csv$", "_reduced.csv", file)
  write.csv(result, output_file, row.names = FALSE)
  
  cat("Processed:", file, "\n")
}

timeplyr now has fixed-width time intervals which can help here

library(dplyr)
library(timeplyr)

dd |> 
  mutate(intv = time_cut_width(time, 0.05, from = 0)) |> 
  summarise(mean = mean(RID), .by = intv)
#> # A tibble: 5 × 2
#>   intv         mean
#>   <tm_ntrvl>  <dbl>
#> 1 [0, 0.05)    64.6
#> 2 [0.05, 0.1)  62.0
#> 3 [0.1, 0.15)  65.3
#> 4 [0.15, 0.2)  67.2
#> 5 [0.2, 0.25)  67.4

Created on 2025-03-17 with reprex v2.1.1

Article tags: automation, Averaging temporal series with fixed resolution in R, Stack Overflow