I want to change a data frame value in my 'type' column to 'H', based on a row-wise condition where any value in a row is greater than or equal to 340.

# make data frame
df <- data.frame(
  cell_1 = c(50,10,10,125,110,300,75),
  cell_2 = c(0,75,10,70,35,70,85),
  cell_3 = c(340,230,10,10,110,10,80),
  cell_4 = c(10,75,70,70,35,10,85),
  cell_5 = c(0,10,300,125,110,10,75),
  type = c('A','A','J','U','S','L','F'),
  uniq_Id = c(1,2,3,4,5,6,7)
)

# change 'type' to 'H' if any of the cell values in each row are greater than or equal to 340
apply(df[,1:5], 1, function(i) {
  if (any(i >= 340)) {
    df$type = 'H'
  }
})

Output to the console suggests it's working. However, there's no change in the data frame after the apply function. I want 'type' in row 1 to be 'H'.

  cell_1 cell_2 cell_3 cell_4 cell_5 type uniq_Id
1     50      0    340     10      0    A       1
2     10     75    230     75     10    A       2
3     10     10     10     70    300    J       3
4    125     70     10     70    125    U       4
5    110     35    110     35    110    S       5
6    300     70     10     10     10    L       6
7     75     85     80     85     75    F       7

Asked Mar 27 at 18:40 by Ray J; edited Mar 29 at 10:08 by ThomasIsCoding.

  • There are also more "R like" ways of doing this. See, for example, this Q&A. – Limey, commented Mar 27 at 18:58
  • Don't try to do the assignment inside apply(). Instead, try df$type[apply(df[,1:5], 1, function(i) any(i >= 340))] <- 'H' or df$type <- ifelse(apply(df[, 1:5], 1, function(i) any(i >= 340)), "H", df$type). – zephryl, commented Mar 27 at 18:59
  • This is an answer too... I think I was trying to use apply like a for loop. Will look into R-like methods further... – Ray J, commented Mar 27 at 19:17
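
A minimal sketch of what zephryl's comment suggests, assuming the df from the question (the name hit is introduced here only for illustration). Inside the function passed to apply(), df$type = 'H' modifies a copy of df that is local to that function call, so the data frame in the calling environment never changes. Returning one logical value per row and doing the assignment outside apply() avoids this:

```r
# Data frame from the question (stringsAsFactors = FALSE is the default since R 4.0)
df <- data.frame(
  cell_1 = c(50, 10, 10, 125, 110, 300, 75),
  cell_2 = c(0, 75, 10, 70, 35, 70, 85),
  cell_3 = c(340, 230, 10, 10, 110, 10, 80),
  cell_4 = c(10, 75, 70, 70, 35, 10, 85),
  cell_5 = c(0, 10, 300, 125, 110, 10, 75),
  type = c("A", "A", "J", "U", "S", "L", "F"),
  uniq_Id = c(1, 2, 3, 4, 5, 6, 7),
  stringsAsFactors = FALSE
)

# apply() only computes the condition: one TRUE/FALSE per row
hit <- apply(df[, 1:5], 1, function(i) any(i >= 340))

# The assignment happens outside the function, on the real df
df$type[hit] <- "H"
df$type
```

Only row 1 contains a value of at least 340 (cell_3), so only its type becomes "H".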

3 Answers

Try any of these. They are non-destructive, i.e. they preserve the input df.

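# any() over each row of the logical matrix
transform(df, type = replace(type, apply(df[1:5] >= 340, 1, any), "H"))

# row-wise maximum via apply()
transform(df, type = replace(type, apply(df[1:5], 1, max) >= 340, "H"))

# vectorised row-wise maximum via pmax()
transform(df, type = replace(type, do.call("pmax", df[1:5]) >= 340, "H"))

# the same, folding pmax() over the columns
transform(df, type = replace(type, Reduce(pmax, df[1:5]) >= 340, "H"))

# dplyr (pick() and .by need dplyr >= 1.1.0); one group per uniq_Id, i.e. per row
library(dplyr)
df %>%
  mutate(
    type = replace(type, any(pick(starts_with("cell")) >= 340), "H"),
    .by = uniq_Id
  )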
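
To make "non-destructive" concrete, here is a small sketch using the pmax() variant, assuming the df from the question (the name out is hypothetical). transform() returns a modified copy, so the result has to be assigned if it should be kept:

```r
# Data frame from the question (stringsAsFactors = FALSE is the default since R 4.0)
df <- data.frame(
  cell_1 = c(50, 10, 10, 125, 110, 300, 75),
  cell_2 = c(0, 75, 10, 70, 35, 70, 85),
  cell_3 = c(340, 230, 10, 10, 110, 10, 80),
  cell_4 = c(10, 75, 70, 70, 35, 10, 85),
  cell_5 = c(0, 10, 300, 125, 110, 10, 75),
  type = c("A", "A", "J", "U", "S", "L", "F"),
  uniq_Id = c(1, 2, 3, 4, 5, 6, 7),
  stringsAsFactors = FALSE
)

# transform() builds a new data frame; df itself is not modified
out <- transform(df, type = replace(type, do.call("pmax", df[1:5]) >= 340, "H"))

out$type  # updated copy: row 1 is now "H"
df$type   # original input: row 1 is still "A"
```

Writing df <- transform(...) instead overwrites the input with the updated copy.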

Hard-coding column selection is almost always bad practice; instead we can identify such columns with a pattern: 'cell' followed by _ and a digit.

We create a Boolean matrix that flags every element greater than or equal to 340 (the threshold). We then apply rowSums(): if at least one value in a row is at or above the threshold, the row-wise sum is strictly greater than 0, since TRUE coerces to 1 and FALSE to 0.

We end up with a Boolean vector whose length equals the number of rows in df. This allows us to overwrite type with 'H' where the condition is met.

df$type[rowSums(df[grep('cell_\\d{1}', names(df))] >= 340) > 0] = 'H'
> df
  cell_1 cell_2 cell_3 cell_4 cell_5 type uniq_Id
1     50      0    340     10      0    H       1
2     10     75    230     75     10    A       2
3     10     10     10     70    300    J       3
4    125     70     10     70    125    U       4
5    110     35    110     35    110    S       5
6    300     70     10     10     10    L       6
7     75     85     80     85     75    F       7
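
The one-liner above can be unpacked into its intermediate steps. This is only an illustrative sketch, assuming the df from the question; the names cols, m and n_hits are introduced here for readability:

```r
# Data frame from the question (stringsAsFactors = FALSE is the default since R 4.0)
df <- data.frame(
  cell_1 = c(50, 10, 10, 125, 110, 300, 75),
  cell_2 = c(0, 75, 10, 70, 35, 70, 85),
  cell_3 = c(340, 230, 10, 10, 110, 10, 80),
  cell_4 = c(10, 75, 70, 70, 35, 10, 85),
  cell_5 = c(0, 10, 300, 125, 110, 10, 75),
  type = c("A", "A", "J", "U", "S", "L", "F"),
  uniq_Id = c(1, 2, 3, 4, 5, 6, 7),
  stringsAsFactors = FALSE
)

# Step 1: positions of the columns whose names match the pattern
cols <- grep("cell_\\d{1}", names(df))

# Step 2: comparing a data frame with a number yields a logical matrix
m <- df[cols] >= 340

# Step 3: TRUE counts as 1, so rowSums() counts the hits in each row
n_hits <- rowSums(m)

# Step 4: rows with at least one hit get the new type
df$type[n_hits > 0] <- "H"
df$type
```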

Impressively slow on 1e6 rows:

df$type[Rfast::rowAny(df[grep('cell_\\d{1}', names(df))] >= 340)] = 'H'

Using matrixStats::rowAnys()

cols <- grep('^cell_\\d+$', names(df))

replace(df$type, matrixStats::rowAnys(df[cols] >= 340), "H")

df$type[matrixStats::rowAnys(df[cols] >= 340)] <- "H"

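Note that with the code above, NAs in the data are effectively ignored: matrixStats::rowAnys(), like base::any(), returns NA for a row that contains an NA but no value >= 340, and such rows keep their type. That may or may not be what you want.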
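
A small base R sketch of the NA behaviour, using made-up vectors (x, y and type below are not from the question). It only exercises base::any() and subscripted assignment; matrixStats::rowAnys() is not run here:

```r
x <- c(NA, 10, 20)    # an NA and no value >= 340
y <- c(NA, 400, 20)   # an NA and one value >= 340

r1 <- any(x >= 340)                # NA: the missing value might have been >= 340
r2 <- any(x >= 340, na.rm = TRUE)  # FALSE: the NA is dropped before checking
r3 <- any(y >= 340)                # TRUE: one hit is enough, whatever the NA is

# In a subscripted assignment with a length-one value, NA subscripts are skipped
type <- c("A", "J", "U")
type[c(NA, TRUE, FALSE)] <- "H"
type  # "A" "H" "U"
```

If rows with missing values should be handled differently, the NA results can be tested for explicitly with is.na() before assigning.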

Benchmark on 1e6 rows

df <- df[sample.int(nrow(df), 1e6, replace=TRUE), ]
options(width=200)
microbenchmark::microbenchmark(
  GG1=replace(df$type, apply(df[cols] >= 340, 1, any), "H"),
  GG2=replace(df$type, apply(df[cols], 1, max) >= 340, "H"),
  GG3=replace(df$type, do.call("pmax", df[cols]) >= 340, "H"),
  GG4=replace(df$type, Reduce(pmax, df[cols]) >= 340, "H"),
  FRI={df$type[rowMeans(df[cols] >= 340) > 0] = 'H'},
  FR2=replace(df$type, rowSums(df[cols] >= 340) > 0, "H"),
  JAY={df$type[matrixStats::rowAnys(df[cols] >= 340)] <- "H"},
  JY2=replace(df$type, matrixStats::rowAnys(df[cols] >= 340), "H"),
  times=10L
)

$ Rscript --vanilla foo.R
Unit: milliseconds
 expr       min        lq       mean     median         uq       max neval   cld
  GG1 733.31394 787.81932  891.00432  886.00872  962.07881 1152.0256   100 a    
  GG2 923.65633 988.94131 1077.83096 1066.62161 1143.83696 1569.8646   100  b   
  GG3  12.38325  12.55271   21.71313   12.67726   14.81317  155.7037   100   c  
  GG4  13.49660  13.75258   32.01618   25.36118   35.68346  150.1286   100   cd 
  FRI  22.15796  37.56721   59.07614   45.35287   59.47257  223.6276   100     e
  FR2  21.38604  23.57986   45.93177   40.56488   47.35564  252.3459   100   c e
  JAY  15.19671  34.01841   51.37185   38.70390   54.58358  265.5255   100    de
  JY2  16.16235  18.64446   41.94715   35.59563   41.59243  153.6724   100   c e
