I have a df with multiple columns that look like this:

mydata <- data.frame("Code_1" = c("x000", "x001", "x002", "y003"), "Code_2" = c("y000", "y001", "y002", "y003"), "Code_3" = c("z000", "z001", "z002", "z003"))

I need to check the number of characters in each column (I have 24 in total) and would like to do this efficiently. Ideally I'd like the output in tabular form.

I have tried the following:

fun_nchar <- function(mydata, x) { mydata[[nchar(x, type = "chars", allowNA=TRUE, keepNA=NA]] <= table(x) }

When I try to run this with the first column, I keep getting an error message saying: object 'Code_1' not found.

Ideally, I would like a table to check that all 24 columns contain either 3 or 4 characters (this is part of data cleaning).

Asked Feb 3 at 10:53 by Kerrie (edited Feb 3 at 11:03). 3 comments:
  • What exactly does your expected output look like? A table for each column? In your example, all columns have four observations of character length 4. Have you tried something like lapply(mydata, \(x) table(nchar(x)))? – Maël Commented Feb 3 at 10:59
  • Thanks! I would like to get a table to check that all columns contain either 3 or 4 characters (this is part of data cleaning) - I will update the question to clarify. I haven't tried this lapply approach, how would I do that for all 24 columns? – Kerrie Commented Feb 3 at 11:02
  • The *apply functions run the same function over a set of elements (for example, the columns of a data.frame), which is useful when you need to repeat a function over multiple columns. sapply returns a simplified version of the output (e.g., a vector, if possible), while lapply always returns a list. – Maël Commented Feb 3 at 11:17
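
To make the lapply/sapply distinction from the comments concrete, here is a small sketch on the example data from the question. The names len_tables and len_matrix are made up for illustration, and the \(x) shorthand assumes R 4.1 or later.

```r
mydata <- data.frame(
  Code_1 = c("x000", "x001", "x002", "y003"),
  Code_2 = c("y000", "y001", "y002", "y003"),
  Code_3 = c("z000", "z001", "z002", "z003")
)

# lapply always returns a list: here, one frequency table of
# string lengths per column, named after the columns
len_tables <- lapply(mydata, \(x) table(nchar(x)))

# sapply simplifies when it can: every column yields 4 lengths,
# so the result is a 4 x 3 integer matrix with one column per variable
len_matrix <- sapply(mydata, nchar)
```

Neither call names the columns explicitly, so the same two lines should work unchanged on a data frame with 24 columns.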

3 Answers

You could use a combination of all + nchar:

sapply(mydata, \(x) all(nchar(na.omit(x)) %in% c(3, 4)))
# Code_1 Code_2 Code_3 
# TRUE   TRUE   TRUE 

Or, to get a single TRUE/FALSE for the whole data frame, wrap the call in another all so that every column must meet the criterion:

all(sapply(mydata, \(x) all(nchar(na.omit(x)) %in% c(3, 4))))
#[1]  TRUE
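
Since the question also asks for tabular output, here is one possible sketch that cross-tabulates column name against string length. This is an illustration rather than part of the original answer; lens and len_table are hypothetical names.

```r
mydata <- data.frame(
  Code_1 = c("x000", "x001", "x002", "y003"),
  Code_2 = c("y000", "y001", "y002", "y003"),
  Code_3 = c("z000", "z001", "z002", "z003")
)

# Integer matrix of string lengths, one column per variable
lens <- sapply(mydata, nchar)

# One row per data-frame column, one table column per observed length.
# The factor levels keep the original column order (otherwise Code_10
# would sort before Code_2); useNA = "ifany" adds an NA column only
# when missing values are present.
len_table <- table(
  column = factor(rep(colnames(lens), each = nrow(lens)),
                  levels = colnames(lens)),
  nchar  = as.vector(lens),
  useNA  = "ifany"
)

# For the example data: a single length, 4, with a count of 4 in every row
len_table
```

Any length other than 3 or 4 would show up as an extra column in the table, which makes stray values easy to spot during cleaning.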

You can try colMeans + nchar, as below:

> colMeans(`dim<-`(nchar(as.matrix(mydata)) %in% c(3, 4), dim(mydata))) == 1
[1] TRUE TRUE TRUE
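
The one-liner is dense, so here is the same idea unpacked step by step as a sketch. The intermediate names (lens, ok, res) are invented for illustration, and copying the dimnames is an addition so that the result keeps the column names.

```r
mydata <- data.frame(
  Code_1 = c("x000", "x001", "x002", "y003"),
  Code_2 = c("y000", "y001", "y002", "y003"),
  Code_3 = c("z000", "z001", "z002", "z003")
)

lens <- nchar(as.matrix(mydata))  # integer matrix of string lengths
ok   <- lens %in% c(3, 4)         # %in% returns a plain logical vector (no dim)
dim(ok) <- dim(lens)              # restore the matrix shape, as `dim<-` does
dimnames(ok) <- dimnames(lens)    # addition: keep the column names

# A column passes only if the share of valid entries is exactly 1
res <- colMeans(ok) == 1
```

Note that NA %in% c(3, 4) is FALSE, so in this variant a missing value counts as a failure, whereas the na.omit version ignores missing values.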

Try this.

> sapply(mydata, nchar) |> {\(x) !is.na(x) & x >= 3 & x <= 4}() |> apply(2, var) == 0
Code_1 Code_2 Code_3 
  TRUE   TRUE   TRUE 

Test:

> mydata[1, 1] <- 'x0'
> sapply(mydata, nchar) |> {\(x) !is.na(x) & x >= 3 & x <= 4}() |> apply(2, var) == 0
Code_1 Code_2 Code_3 
 FALSE   TRUE   TRUE 

You can easily incorporate a check into the script:

> stopifnot(sapply(mydata, nchar) |> {\(x) !is.na(x) & x >= 3 & x <= 4}() |> apply(2, var) == 0)
Error: apply({ .... are not all TRUE
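
One caveat with the var == 0 test: it checks that a column's flags are all identical, not that they are all TRUE. A column in which every value fails therefore also has zero variance. The sketch below uses a made-up data frame (bad) to show this, and compares it with all as one possible replacement.

```r
# Hypothetical data: every value in Code_1 is too short
bad <- data.frame(
  Code_1 = c("x0", "x1", "x2"),
  Code_2 = c("y000", "y001", "y002")
)

lens <- sapply(bad, nchar)
ok   <- !is.na(lens) & lens >= 3 & lens <= 4  # logical matrix of valid entries

by_var <- apply(ok, 2, var) == 0  # TRUE for Code_1 too: all FALSE is constant
by_all <- apply(ok, 2, all)       # FALSE for Code_1, TRUE for Code_2
```

With all, the check reports a column as valid only when every entry is valid, so stopifnot(apply(ok, 2, all)) would stop on this data.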
