admin管理员组

文章数量:1129621

I have a dataset with multiple columns and more than 60k rows. I am using na.approx() to replace NAs in each column. I would like to identify the rows which had NA and were replaced, something like "Dataset - After removing NAs". I would appreciate if you could please help me with that.

Dataset - Before removing NAs
A    B    C
1.0  2.2  3.3
NA   5.7  8.8
6.2  7.2  NA
3.0  NA   1.1

Dataset - After removing NAs
A    A_NA   B     B_NA    C    C_NA
1.0  No     2.2   No      3.3  No
3.6  Yes    5.7   No      8.8  No
6.2  No     7.2   No      4.6  Yes
3.0  No     7.2   Yes     1.1  No 

Sorry, I cannot share the original dataset, but if it is necessary I can prepare a better example.

Thanks

UPDATE

Sorry, I forgot to mention that in my original dataset, I have some columns in my dataset which are not numeric, also my dataset is a data.table.

Dataset - Before removing NAs
ID        Trial     Weight    Height    Depth     
1         Wheat     1.0       2.2       3.3
1         Wheat     NA        5.7       8.8
2         Maize     6.2       7.2       NA
4         Maize     3.0       NA        1.1

I have a dataset with multiple columns and more than 60k rows. I am using na.approx() to replace NAs in each column. I would like to identify the rows which had NA and were replaced, something like "Dataset - After removing NAs". I would appreciate if you could please help me with that.

Dataset - Before removing NAs
A    B    C
1.0  2.2  3.3
NA   5.7  8.8
6.2  7.2  NA
3.0  NA   1.1

Dataset - After removing NAs
A    A_NA   B     B_NA    C    C_NA
1.0  No     2.2   No      3.3  No
3.6  Yes    5.7   No      8.8  No
6.2  No     7.2   No      4.6  Yes
3.0  No     7.2   Yes     1.1  No 

Sorry, I cannot share the original dataset, but if it is necessary I can prepare a better example.

Thanks

UPDATE

Sorry, I forgot to mention that in my original dataset, I have some columns in my dataset which are not numeric, also my dataset is a data.table.

Dataset - Before removing NAs
ID        Trial     Weight    Height    Depth     
1         Wheat     1.0       2.2       3.3
1         Wheat     NA        5.7       8.8
2         Maize     6.2       7.2       NA
4         Maize     3.0       NA        1.1
Share Improve this question edited Jan 9 at 1:21 Mee asked Jan 8 at 12:17 MeeMee 1417 bronze badges 1
  • is.na(df) ? It will return a data.frame where cells with NA values will be TRUE, and the others FALSE. – Billy34 Commented Jan 8 at 12:48
Add a comment  | 

6 Answers 6

Reset to default 4

You could do this in base R by indexing:

#new columns if it is NA or not
df[paste0(names(df),"_NA")] <- is.na(df)

# use zoo::na.approx
df[-grep("_NA", names(df))] <- zoo::na.approx(df[-grep("_NA", names(df))])

# reorder (if desired)
df <- df[,sort(names(df))]

Output:

    A  A_NA   B  B_NA    C  C_NA
1 1.0 FALSE 2.2 FALSE 3.30 FALSE
2 3.6  TRUE 5.7 FALSE 8.80 FALSE
3 6.2 FALSE 7.2 FALSE 4.95  TRUE
4 3.0 FALSE  NA  TRUE 1.10 FALSE

If you wanted "Yes" or "No" instead of T/F, you could just replace is.na(df) with ifelse(is.na(df), "Yes", "No")

Data:

df <- read.table(text = "A    B    C
1.0  2.2  3.3
NA   5.7  8.8
6.2  7.2  NA
3.0  NA   1.1", header = TRUE)

EDIT Using data table, you can tweak the code by:

library(data.table)

# Identify numeric columns
idxcols <- sapply(dt, is.numeric)

dt[, paste0(names(idxcols)[idxcols], "_NA") := lapply(.SD, is.na), .SDcols = idxcols]

Data table data:

dt <- data.table::data.table(read.table(text = "ID        Trial     Weight    Height    Depth     
1         Wheat     1.0       2.2       3.3
1         Wheat     NA        5.7       8.8
2         Maize     6.2       7.2       NA
4         Maize     3.0       NA        1.1", header = TRUE))

Probably you should specify rule=2 when applying na.approx

cbind(
  zoo::na.approx(df, rule = 2),
  as.data.frame(
    ifelse(is.na(df), "Yes", "No"),
    col.names = paste0(names(df), "_NA")
  )
)[order(rep(seq_along(df), 2))]

which gives

    A A_NA   B B_NA    C C_NA
1 1.0   No 2.2   No 3.30   No
2 3.6  Yes 5.7   No 8.80   No
3 6.2   No 7.2   No 4.95  Yes
4 3.0   No 7.2  Yes 1.10   No

data

> dput(df)
structure(list(A = c(1, NA, 6.2, 3), B = c(2.2, 5.7, 7.2, NA),
    C = c(3.3, 8.8, NA, 1.1)), class = "data.frame", row.names = c(NA, 
-4L))

Write a function to take care of getting the NA's before interpolation and of putting the two tables together.
The code comments should explain how the function does it.

Dataset <- read.table(text = "
A    B    C
1.0  2.2  3.3
NA   5.7  8.8
6.2  7.2  NA
3.0  NA   1.1
", header = TRUE)


fun <- function(x) {
  # auxiliary variable
  cn <- colnames(x)
  # get the NA positions as a logical matrix
  na <- sapply(x, is.na)
  na_char <- matrix("", nrow = nrow(na), ncol = ncol(na))
  # assign the values "No" and "Yes"
  na_char[na] <- "Yes"
  na_char[!na] <- "No"
  # interpolate
  y <- sapply(x, zoo::na.approx, na.rm = FALSE) |> as.data.frame()
  # put the old and new columns in wanted order
  colnames(na_char) <- paste(colnames(na), "NA", sep = "_")
  i <- order(c(cn, colnames(na_char)))
  new_names <- c(cn, colnames(na_char))[i]
  # assign the new names and exit
  setNames(cbind(y, na_char)[, i], new_names)
}

fun(Dataset)
#>     A A_NA   B B_NA    C C_NA
#> 1 1.0   No 2.2   No 3.30   No
#> 2 3.6  Yes 5.7   No 8.80   No
#> 3 6.2   No 7.2   No 4.95  Yes
#> 4 3.0   No  NA  Yes 1.10   No

Created on 2025-01-08 with reprex v2.1.1

Alternative answer using only base R:

You can just create the *_NA columns before running na.approx. You can e.g. loop over all columns with is.na and assign the result as a new column. The ordering of the columns in the way you show in your example could be done by sorting the columns alphabetically afterwards.


df <- data.frame(
  A = c(1.0, NA, 6.2, 3.0),
  B = c(2.2, 5.7, 7.2, NA),
  C = c(3.3, 8.8, NA, 1.1)
)
df_isna = lapply(df, is.na)
df_isna = lapply(df_isna, factor, labels = c("No","Yes"))
names(df_isna) = paste0(names(df_isna),"_NA")
df_new = as.data.frame(zoo::na.approx(df, rule = 2))
df_new = cbind(df_new, df_isna)
df_new = df_new[,sort(names(df_new))]

I'll add a dplyr flavor to already existing answers.

library(dplyr)

df %>%
  transmute(across(everything(), is.na, .names = "{col}_NA")) %>%
  bind_cols(zoo::na.approx(df, rule = 2)) %>%
  select(order(names(.)))

#    A  A_NA   B  B_NA    C  C_NA
#1 1.0 FALSE 2.2 FALSE 3.30 FALSE
#2 3.6  TRUE 5.7 FALSE 8.80 FALSE
#3 6.2 FALSE 7.2 FALSE 4.95  TRUE
#4 3.0 FALSE 7.2  TRUE 1.10 FALSE

1) Use na.approx and then bind it to the Yes/No data frame. Note that the pipe to setNames is within the second argument to cbind whereas the final pipe is taking the result of cbind.

library(zoo)

cbind(na.approx(DF), as.data.frame(ifelse(is.na(DF), "Yes", "No")) |>
             setNames(paste0(names(DF), "_NA")) ) |>
  ( \(x) x[order(names(x))] )()
##     A A_NA   B B_NA    C C_NA
## 1 1.0   No 2.2   No 3.30   No
## 2 3.6  Yes 5.7   No 8.80   No
## 3 6.2   No 7.2   No 4.95  Yes
## 4 3.0   No  NA  Yes 1.10   No

2) Translating that to tidyverse we have the following. This has the advantage that it goes strictly left to right and only explicitly references DF once at the beginning so that if it changes to some other name it only has to be replaced in one spot.

library(dplyr, exclude = c("filter", "lag"))
library(rlang)
library(zoo)

DF %>%
  { bind_cols(na.approx(.), data.frame(ifelse(is.na(.), "Yes", "No")) %>%
       set_names(~ paste0(.x, "_NA")))

  } %>%
  select(sort(names(.)))

##     A A_NA   B B_NA    C C_NA
## 1 1.0   No 2.2   No 3.30   No
## 2 3.6  Yes 5.7   No 8.80   No
## 3 6.2   No 7.2   No 4.95  Yes
## 4 3.0   No  NA  Yes 1.10   No

Note

DF <- data.frame(
  A = c(1, NA, 6.2, 3),
  B = c(2.2, 5.7, 7.2, NA),
  C = c(3.3, 8.8, NA, 1.1)
)

本文标签: rHow to mark rows with NAs which were replaced by naapprox()Stack Overflow