admin管理员组文章数量:1129621
I have a dataset with multiple columns and more than 60k rows. I am using na.approx()
to replace NAs in each column. I would like to identify the rows which had NA and were replaced, something like "Dataset - After removing NAs". I would appreciate if you could please help me with that.
Dataset - Before removing NAs
A B C
1.0 2.2 3.3
NA 5.7 8.8
6.2 7.2 NA
3.0 NA 1.1
Dataset - After removing NAs
A A_NA B B_NA C C_NA
1.0 No 2.2 No 3.3 No
3.6 Yes 5.7 No 8.8 No
6.2 No 7.2 No 4.6 Yes
3.0 No 7.2 Yes 1.1 No
Sorry, I cannot share the original dataset, but if it is necessary I can prepare a better example.
Thanks
UPDATE
Sorry, I forgot to mention that in my original dataset, I have some columns in my dataset which are not numeric, also my dataset is a data.table.
Dataset - Before removing NAs
ID Trial Weight Height Depth
1 Wheat 1.0 2.2 3.3
1 Wheat NA 5.7 8.8
2 Maize 6.2 7.2 NA
4 Maize 3.0 NA 1.1
I have a dataset with multiple columns and more than 60k rows. I am using na.approx()
to replace NAs in each column. I would like to identify the rows which had NA and were replaced, something like "Dataset - After removing NAs". I would appreciate if you could please help me with that.
Dataset - Before removing NAs
A B C
1.0 2.2 3.3
NA 5.7 8.8
6.2 7.2 NA
3.0 NA 1.1
Dataset - After removing NAs
A A_NA B B_NA C C_NA
1.0 No 2.2 No 3.3 No
3.6 Yes 5.7 No 8.8 No
6.2 No 7.2 No 4.6 Yes
3.0 No 7.2 Yes 1.1 No
Sorry, I cannot share the original dataset, but if it is necessary I can prepare a better example.
Thanks
UPDATE
Sorry, I forgot to mention that in my original dataset, I have some columns in my dataset which are not numeric, also my dataset is a data.table.
Dataset - Before removing NAs
ID Trial Weight Height Depth
1 Wheat 1.0 2.2 3.3
1 Wheat NA 5.7 8.8
2 Maize 6.2 7.2 NA
4 Maize 3.0 NA 1.1
Share
Improve this question
edited Jan 9 at 1:21
Mee
asked Jan 8 at 12:17
MeeMee
1417 bronze badges
1
|
6 Answers
Reset to default 4You could do this in base R by indexing:
#new columns if it is NA or not
df[paste0(names(df),"_NA")] <- is.na(df)
# use zoo::na.approx
df[-grep("_NA", names(df))] <- zoo::na.approx(df[-grep("_NA", names(df))])
# reorder (if desired)
df <- df[,sort(names(df))]
Output:
A A_NA B B_NA C C_NA
1 1.0 FALSE 2.2 FALSE 3.30 FALSE
2 3.6 TRUE 5.7 FALSE 8.80 FALSE
3 6.2 FALSE 7.2 FALSE 4.95 TRUE
4 3.0 FALSE NA TRUE 1.10 FALSE
If you wanted "Yes" or "No" instead of T/F, you could just replace is.na(df)
with ifelse(is.na(df), "Yes", "No")
Data:
df <- read.table(text = "A B C
1.0 2.2 3.3
NA 5.7 8.8
6.2 7.2 NA
3.0 NA 1.1", header = TRUE)
EDIT Using data table, you can tweak the code by:
library(data.table)
# Identify numeric columns
idxcols <- sapply(dt, is.numeric)
dt[, paste0(names(idxcols)[idxcols], "_NA") := lapply(.SD, is.na), .SDcols = idxcols]
Data table data:
dt <- data.table::data.table(read.table(text = "ID Trial Weight Height Depth
1 Wheat 1.0 2.2 3.3
1 Wheat NA 5.7 8.8
2 Maize 6.2 7.2 NA
4 Maize 3.0 NA 1.1", header = TRUE))
Probably you should specify rule=2
when applying na.approx
cbind(
zoo::na.approx(df, rule = 2),
as.data.frame(
ifelse(is.na(df), "Yes", "No"),
col.names = paste0(names(df), "_NA")
)
)[order(rep(seq_along(df), 2))]
which gives
A A_NA B B_NA C C_NA
1 1.0 No 2.2 No 3.30 No
2 3.6 Yes 5.7 No 8.80 No
3 6.2 No 7.2 No 4.95 Yes
4 3.0 No 7.2 Yes 1.10 No
data
> dput(df)
structure(list(A = c(1, NA, 6.2, 3), B = c(2.2, 5.7, 7.2, NA),
C = c(3.3, 8.8, NA, 1.1)), class = "data.frame", row.names = c(NA,
-4L))
Write a function to take care of getting the NA
's before interpolation and of putting the two tables together.
The code comments should explain how the function does it.
Dataset <- read.table(text = "
A B C
1.0 2.2 3.3
NA 5.7 8.8
6.2 7.2 NA
3.0 NA 1.1
", header = TRUE)
fun <- function(x) {
# auxiliary variable
cn <- colnames(x)
# get the NA positions as a logical matrix
na <- sapply(x, is.na)
na_char <- matrix("", nrow = nrow(na), ncol = ncol(na))
# assign the values "No" and "Yes"
na_char[na] <- "Yes"
na_char[!na] <- "No"
# interpolate
y <- sapply(x, zoo::na.approx, na.rm = FALSE) |> as.data.frame()
# put the old and new columns in wanted order
colnames(na_char) <- paste(colnames(na), "NA", sep = "_")
i <- order(c(cn, colnames(na_char)))
new_names <- c(cn, colnames(na_char))[i]
# assign the new names and exit
setNames(cbind(y, na_char)[, i], new_names)
}
fun(Dataset)
#> A A_NA B B_NA C C_NA
#> 1 1.0 No 2.2 No 3.30 No
#> 2 3.6 Yes 5.7 No 8.80 No
#> 3 6.2 No 7.2 No 4.95 Yes
#> 4 3.0 No NA Yes 1.10 No
Created on 2025-01-08 with reprex v2.1.1
Alternative answer using only base R:
You can just create the *_NA columns before running na.approx. You can e.g. loop over all columns with is.na and assign the result as a new column. The ordering of the columns in the way you show in your example could be done by sorting the columns alphabetically afterwards.
df <- data.frame(
A = c(1.0, NA, 6.2, 3.0),
B = c(2.2, 5.7, 7.2, NA),
C = c(3.3, 8.8, NA, 1.1)
)
df_isna = lapply(df, is.na)
df_isna = lapply(df_isna, factor, labels = c("No","Yes"))
names(df_isna) = paste0(names(df_isna),"_NA")
df_new = as.data.frame(zoo::na.approx(df, rule = 2))
df_new = cbind(df_new, df_isna)
df_new = df_new[,sort(names(df_new))]
I'll add a dplyr
flavor to already existing answers.
library(dplyr)
df %>%
transmute(across(everything(), is.na, .names = "{col}_NA")) %>%
bind_cols(zoo::na.approx(df, rule = 2)) %>%
select(order(names(.)))
# A A_NA B B_NA C C_NA
#1 1.0 FALSE 2.2 FALSE 3.30 FALSE
#2 3.6 TRUE 5.7 FALSE 8.80 FALSE
#3 6.2 FALSE 7.2 FALSE 4.95 TRUE
#4 3.0 FALSE 7.2 TRUE 1.10 FALSE
1) Use na.approx
and then bind it to the Yes/No data frame. Note that the pipe to setNames
is within the second argument to cbind
whereas the final pipe is taking the result of cbind
.
library(zoo)
cbind(na.approx(DF), as.data.frame(ifelse(is.na(DF), "Yes", "No")) |>
setNames(paste0(names(DF), "_NA")) ) |>
( \(x) x[order(names(x))] )()
## A A_NA B B_NA C C_NA
## 1 1.0 No 2.2 No 3.30 No
## 2 3.6 Yes 5.7 No 8.80 No
## 3 6.2 No 7.2 No 4.95 Yes
## 4 3.0 No NA Yes 1.10 No
2) Translating that to tidyverse we have the following. This has the advantage that it goes strictly left to right and only explicitly references DF
once at the beginning so that if it changes to some other name it only has to be replaced in one spot.
library(dplyr, exclude = c("filter", "lag"))
library(rlang)
library(zoo)
DF %>%
{ bind_cols(na.approx(.), data.frame(ifelse(is.na(.), "Yes", "No")) %>%
set_names(~ paste0(.x, "_NA")))
} %>%
select(sort(names(.)))
## A A_NA B B_NA C C_NA
## 1 1.0 No 2.2 No 3.30 No
## 2 3.6 Yes 5.7 No 8.80 No
## 3 6.2 No 7.2 No 4.95 Yes
## 4 3.0 No NA Yes 1.10 No
Note
DF <- data.frame(
A = c(1, NA, 6.2, 3),
B = c(2.2, 5.7, 7.2, NA),
C = c(3.3, 8.8, NA, 1.1)
)
本文标签: rHow to mark rows with NAs which were replaced by naapprox()Stack Overflow
版权声明:本文标题:r - How to mark rows with NAs which were replaced by na.approx()? - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1736730295a1949953.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
is.na(df)
? It will return a data.frame where cells withNA
values will beTRUE
, and the othersFALSE
. – Billy34 Commented Jan 8 at 12:48