I have a custom S3 class that is, for all intents and purposes, a data.frame with an extra attribute holding one value per data frame row. For example:

my_df <- data.frame("a" = c(1, 2, 3), "b" = c(4, 5, 6))

class(my_df) <- c("foo", "data.frame")

attributes(my_df)$"row_notes" <- list("good", "bad", "good")

I have defined a subsetting function for my custom class as follows:

`[.foo` <- function(x, i, j, ...) {
    result <- NextMethod()
    if (!missing(i)) {
        attributes(result)$"row_notes" <- attributes(x)$"row_notes"[i]
    } else {
        attributes(result)$"row_notes" <- attributes(x)$"row_notes"
    }
    return(result)
}

This seems to work fine so far. If I do something like my_df[1:2, 2], the "row_notes" attribute gets subsetted down to just list("good", "bad").

I would like to be able to make use of dplyr::select() and functions like dplyr::everything() and dplyr::all_of() on my custom data frame, but I can't seem to use these functions without destroying the "row_notes" attribute.

Intriguingly, the result of:

attributes(dplyr::select(my_df, dplyr::everything()))$"row_notes"

is a two-element list. I tried updating my subsetting function to debug what may be happening:

`[.foo` <- function(x, i, j, ...) {
    if (!missing(i)) {
        cat("i:", i, "\n")
    }
    if (!missing(j)) {
        cat("j:", j, "\n")
    }
    cat("-------\n")
    result <- NextMethod()
    if (!missing(i)) {
        attributes(result)$"row_notes" <- attributes(x)$"row_notes"[i]
    } else {
        attributes(result)$"row_notes" <- attributes(x)$"row_notes"
    }
    return(result)
}

It turns out that calling dplyr::select(my_df, dplyr::everything()) will at some point call [.foo with an i value equal to the number of columns of the data frame, not the number of rows.

Is there a way to work around this that doesn't require telling users not to use dplyr::select() on this essentially data-frame-like object?

To be clear, I'd simply like a subsetting function where whatever subsetting happens to the rows of the data frame is correspondingly applied to the "row_notes" attribute. I will also add that this is a toy example. If it were not, I would probably just keep "row_notes" as another column in the data frame. I cannot do this in my actual work, as "row_notes" represents a variety of object types, including a list of matrices.

Any help would be greatly appreciated, thank you!

Asked Jan 29 at 20:02 by pvelayudhan
  • If it helps, note that a tibble can contain columns of type list which can store different data types including nested lists and would probably allow a place to store your attribute data: dplyr::tibble(x = list(1, "a", data.frame(y = 1), matrix(1:4, nrow = 2), list(a = 1, b = 2))) – the-mad-statter Commented Jan 29 at 20:15
  • Thanks, I did not know that and it is helpful. For time-related reasons I might not be able to work that in right away to my current R package, but perhaps during a later rework or a subsequent package I will make use of that fact to minimize some of the attribute hassle I'm experiencing here. – pvelayudhan Commented Jan 29 at 20:24

1 Answer

The main issue is that your [.foo method needs to handle one-index (list-style) subsetting, e.g. my_df[1] and my_df["a"], where i selects columns rather than rows. Your method currently fails in these cases, which is also why dplyr::select() fails:

library(dplyr)

my_df <- data.frame("a" = c(1, 2, 3), "b" = c(4, 5, 6))
class(my_df) <- c("foo", "data.frame")
# changed below to make unique and to a vector for prettier printing
attributes(my_df)$"row_notes" <- c("good", "bad", "worst") 

my_df[1] |> attr("row_notes")
# [1] "good"
my_df["a"] |> attr("row_notes")
# NULL

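It will work if it's rewritten to check the number of arguments and that i is not missing. When exactly two arguments are supplied (x and i), i selects columns, so "row_notes" is kept whole; otherwise i, if supplied, selects rows: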

`[.foo` <- function(x, i, j, ...) {
  result <- NextMethod()
  if (nargs() == 2 && !missing(i)) {
    attr(result, "row_notes") <- attr(x, "row_notes")
  }  else {
    attr(result, "row_notes") <- attr(x, "row_notes")[i]
  } 
  return(result)
}
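To see why nargs() is the right test, here is a minimal sketch using a throwaway class. The class name "probe" and its method are hypothetical and exist only for illustration; the method just reports how `[` was called. nargs() counts the arguments supplied, including positions left blank, which is what separates x[i] from x[i, ].

```r
# Hypothetical class used only to inspect how `[` is called
probe <- structure(list(a = 1:3, b = 4:6), class = "probe")

`[.probe` <- function(x, i, j, ...) {
  # Report the call shape instead of subsetting anything
  list(nargs = nargs(), missing_i = missing(i), missing_j = missing(j))
}

probe[1]$nargs        # 2: one index, list-style column selection
probe[1, ]$nargs      # 3: the blank j still counts as an argument
probe[, 1]$missing_i  # TRUE: columns selected, no row index supplied
```

So nargs() == 2 with i present means that i refers to columns, and "row_notes" should be left untouched.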

Now it should work for base subsetting and with dplyr::select():

my_df[1] |> attr("row_notes")
# [1] "good"  "bad"   "worst"
my_df["a"] |> attr("row_notes")
# [1] "good"  "bad"   "worst"
my_df[1:2, ] |> attr("row_notes")
# [1] "good" "bad" 
my_df |> select(a) |> attr("row_notes")
# [1] "good"  "bad"   "worst"
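As a sanity check, the same method can be exercised with base R only, including two cases not shown above: a column selection with an empty row index, and a logical row index. This is a self-contained sketch that repeats the definitions from this answer and uses stopifnot() so that it fails loudly if an assumption is wrong.

```r
my_df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
class(my_df) <- c("foo", "data.frame")
attr(my_df, "row_notes") <- c("good", "bad", "worst")

`[.foo` <- function(x, i, j, ...) {
  result <- NextMethod()
  if (nargs() == 2 && !missing(i)) {
    # x[i]: i selects columns, so keep every note
    attr(result, "row_notes") <- attr(x, "row_notes")
  } else {
    # x[i, j]: i selects rows; a missing i keeps every note
    attr(result, "row_notes") <- attr(x, "row_notes")[i]
  }
  result
}

all_notes <- c("good", "bad", "worst")
stopifnot(
  identical(attr(my_df["a"], "row_notes"), all_notes),
  identical(attr(my_df[1:2, ], "row_notes"), c("good", "bad")),
  identical(attr(my_df[, "a", drop = FALSE], "row_notes"), all_notes),
  identical(attr(my_df[my_df$a > 1, ], "row_notes"), c("bad", "worst"))
)
```

Because the notes are indexed with the same i that was passed for the rows, logical and negative row indices are handled the same way as positive ones.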

However, because many dplyr functions strip off custom classes and attributes, if you want to use other dplyr verbs you need to provide methods for them. See help("dplyr_extending"). Providing a foo method for dplyr_row_slice will get arrange(), filter(), slice() (and the rest of the slice_*() family), semi_join(), and anti_join() working. You also need to provide methods for group_by() and ungroup() if you want to use these.

restore_foo <- function(data) {
  class(data) <- union("foo", class(data))
  data
}

dplyr_row_slice.foo <- function(data, i, ...) {
  result <- NextMethod()
  attr(result, "row_notes") <- attr(data, "row_notes")[i]
  restore_foo(result)
}

group_by.foo <- function(.data, ...) {
  restore_foo(NextMethod())
}

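ungroup.foo <- function(x, ...) {
  result <- NextMethod()
  # copy only "row_notes"; copying every attribute of x would also
  # restore its grouping attribute and grouped class
  attr(result, "row_notes") <- attr(x, "row_notes")
  restore_foo(result)
}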

Testing:

my_df |> slice(c(1, 3)) |> attr("row_notes")
# [1] "good"  "worst"
my_df |> mutate(grp = a == 2) |> filter(row_number() == 1, .by = grp) |> attr("row_notes")
# [1] "good" "bad" 
my_df |> group_by(a == 2) |> filter(row_number() == 1) |> ungroup() |> attr("row_notes")
# [1] "good" "bad" 
my_df |> arrange(desc(a)) |>  attr("row_notes")
# [1] "worst" "bad"   "good" 
my_df |> group_split(a == 2) |> lapply(attr, "row_notes")
# [[1]]
# [1] "good"  "worst"
# 
# [[2]]
# [1] "bad"

If you need more dplyr functions to work with your class you may need to provide further methods.

Tags: r, Subsetting attributes of custom S3 class during dplyr::select, Stack Overflow