admin管理员组

文章数量:1397117

I have two data.frames, nests, and nest_days, which contains more days for one nest. I want to do a join, and group by nests (i.e. group by the rows of the first data.frame). I want to keep all columns from the nest data.frame (there are many) and include some grouped summaries from the nest_days data.frame.

This is fairly easy and elegant in SQL:

sqldf("
select nests.*, min(Day), max(Day)
from nests join nest_days using (NestID)
group by NestID
")

But I am struggling to find solution in tidyverse (i.e. dplyr). I tried:

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    summarize(minDay = min(Day), maxDay = max(Day))

However, this doesn't preserve the columns from the nest data.frame. So, I tried several methods of how to preserve those columns, like using mutate(), or summarize(across(...)):

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    mutate(minDay = min(Day), maxDay = max(Day))

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    summarize(across(), minDay = min(Day), maxDay = max(Day))

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    summarize(across(everything()), minDay = min(Day), maxDay = max(Day))

However, these also fail, because they also preserve all columns from the second data.frame (nest_days), which means, they will ungroup the grouping.

Is there some easy and elegant way, in tidyverse/dplyr, how to preserve columns only of the first data.frame? Instantly, I see it would be possible to do in two steps, first do a group by in the nest_days and then join. However, this is much less elegant than in SQL. I expect tidyverse is here to provide us with means to write an elegant and simple, so I am looking for such solution.

EDIT: I found this solution using nest_join, but I don't like it much, I still expect the tidyverse must have something more concise:

nests %>%
    nest_join(nest_days, join_by(NestID)) %>% 
    mutate(
        minDay = map_dbl(nest_days, ~min(.$Day)), 
        maxDay = map_dbl(nest_days, ~max(.$Day))
    ) %>% 
    select(-nest_days)

EDIT: Here are some phony data to play with:

nests <- data.frame(NestID = 1:1000, Year = sample(2013:2017, 1000, TRUE), a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
nest_days <- expand.grid(NestID = 1:1000, Day = 1:12)
nest_days$a <- rnorm(nrow(nest_days))
nest_days$b <- rnorm(nrow(nest_days))
nest_days$c <- rnorm(nrow(nest_days))

I have two data.frames, nests, and nest_days, which contains more days for one nest. I want to do a join, and group by nests (i.e. group by the rows of the first data.frame). I want to keep all columns from the nest data.frame (there are many) and include some grouped summaries from the nest_days data.frame.

This is fairly easy and elegant in SQL:

sqldf("
select nests.*, min(Day), max(Day)
from nests join nest_days using (NestID)
group by NestID
")

But I am struggling to find solution in tidyverse (i.e. dplyr). I tried:

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    summarize(minDay = min(Day), maxDay = max(Day))

However, this doesn't preserve the columns from the nest data.frame. So, I tried several methods of how to preserve those columns, like using mutate(), or summarize(across(...)):

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    mutate(minDay = min(Day), maxDay = max(Day))

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    summarize(across(), minDay = min(Day), maxDay = max(Day))

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    summarize(across(everything()), minDay = min(Day), maxDay = max(Day))

However, these also fail, because they also preserve all columns from the second data.frame (nest_days), which means, they will ungroup the grouping.

Is there some easy and elegant way, in tidyverse/dplyr, how to preserve columns only of the first data.frame? Instantly, I see it would be possible to do in two steps, first do a group by in the nest_days and then join. However, this is much less elegant than in SQL. I expect tidyverse is here to provide us with means to write an elegant and simple, so I am looking for such solution.

EDIT: I found this solution using nest_join, but I don't like it much, I still expect the tidyverse must have something more concise:

nests %>%
    nest_join(nest_days, join_by(NestID)) %>% 
    mutate(
        minDay = map_dbl(nest_days, ~min(.$Day)), 
        maxDay = map_dbl(nest_days, ~max(.$Day))
    ) %>% 
    select(-nest_days)

EDIT: Here are some phony data to play with:

nests <- data.frame(NestID = 1:1000, Year = sample(2013:2017, 1000, TRUE), a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
nest_days <- expand.grid(NestID = 1:1000, Day = 1:12)
nest_days$a <- rnorm(nrow(nest_days))
nest_days$b <- rnorm(nrow(nest_days))
nest_days$c <- rnorm(nrow(nest_days))
Share Improve this question edited Mar 27 at 10:37 Tomas asked Mar 26 at 11:43 TomasTomas 59.7k54 gold badges250 silver badges382 bronze badges 8
  • Did you try grouping within the left_join like nests %>% left_join( nest_days %>% group_by(NestID) %>% summarize(minDay = min(Day),maxDay = max(Day)), by = "NestID") also can you please provide nests & nest_days? – Tim G Commented Mar 26 at 12:04
  • @TimG this is what I meant with the two step solution. I would like to avoid that - I suspect tidyverse must have more simple and elegant solution than that, as it is in the SQL example? Sorry I can't provide the data, it is not open. – Tomas Commented Mar 26 at 12:18
  • It probably has, some sample data would still be great - it does not have to sensitive data. I believe base R or data.table solutions can often also proove to be concise. – Tim G Commented Mar 26 at 12:23
  • You can. Just create some sample data. – Friede Commented Mar 26 at 12:26
  • 1 No, you cannot. – Friede Commented Mar 26 at 12:55
 |  Show 3 more comments

2 Answers 2

Reset to default 2

tidy-select - ?dplyr::dplyr_tidy_select - we use for grouping is quite flexible, for simpler cases we can just exclude a column or two, (e.g. .by = !Day, .by = !c(Day, c), but we might as well use ranges, name patterns, positions and set operations to build sets that either include or exclude any number columns.

library(dplyr, warn.conflicts = FALSE)

set.seed(123)
nests <- data.frame(NestID = 1:1000, Year = sample(2013:2017, 1000, TRUE), a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
nest_days <- expand.grid(NestID = 1:1000, Day = 1:12)
nest_days$a <- rnorm(nrow(nest_days))
nest_days$b <- rnorm(nrow(nest_days))
nest_days$c <- rnorm(nrow(nest_days))

inner_join(nests, nest_days, by = join_by(NestID), suffix = c("", ".y")) |> 
  summarise(
    minDay = min(Day), 
    maxDay = max(Day), 
    # group by everything except range a.y:c.y
    .by = !(a.y:c.y)
    # or everything except columns ending with ".y"
    # .by = !ends_with(".y")
    # or include columns from nests + everything that starts with "b" + 2 last columns
    # .by = c(all_of(names(nests)), starts_with("b"), last_col(1):last_col())
  ) |> 
  head()
#>   NestID Year        a          b           c Day minDay maxDay
#> 1      1 2015 1.148447 -0.7596014 -0.05709456   1      1      1
#> 2      1 2015 1.148447 -0.7596014 -0.05709456   2      2      2
#> 3      1 2015 1.148447 -0.7596014 -0.05709456   3      3      3
#> 4      1 2015 1.148447 -0.7596014 -0.05709456   4      4      4
#> 5      1 2015 1.148447 -0.7596014 -0.05709456   5      5      5
#> 6      1 2015 1.148447 -0.7596014 -0.05709456   6      6      6

Post-join state & col. names for reference:

#> Rows: 12,000
#> Columns: 9
#> $ NestID <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
#> $ Year   <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201…
#> $ a      <dbl> 1.1484466, 1.1484466, 1.1484466, 1.1484466, 1.1484466, 1.148446…
#> $ b      <dbl> -0.7596014, -0.7596014, -0.7596014, -0.7596014, -0.7596014, -0.…
#> $ c      <dbl> -0.05709456, -0.05709456, -0.05709456, -0.05709456, -0.05709456…
#> $ Day    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, …
#> $ a.y    <dbl> -1.0492540, -1.8711882, -0.4101328, 0.2652801, 0.3414629, -0.94…
#> $ b.y    <dbl> 0.47593189, 0.60484758, -0.41593670, 0.12649302, -1.59760340, -…
#> $ c.y    <dbl> -1.4843105, 0.7043585, -0.2848173, 1.5588798, -2.1694400, 1.142…

You can use a tidyselect helper inside group_by(pick()), like so:

library(dplyr)
set.seed(13)

nests %>%
  inner_join(nest_days, join_by(NestID)) %>% 
  group_by(pick(!Day)) %>% 
  summarize(
    minDay = min(Day), 
    maxDay = max(Day),
    .groups = "drop"
  )

Or for a more generic approach, use

  # ... %>%
  group_by(pick(all_of(names(nests)))) %>%
  # ... 

Result:

# A tibble: 1,000 × 7
   NestID  Year       a        b      c minDay maxDay
    <int> <int>   <dbl>    <dbl>  <dbl>  <int>  <int>
 1      1  2015  0.0794 -0.00966 -0.662      1     12
 2      2  2017  0.313  -0.648    0.912      1     12
 3      3  2014  1.39    1.45     0.367      1     12
 4      4  2017  0.560   0.0942   1.47       1     12
 5      5  2016 -0.233   0.414    0.235      1     12
 6      6  2017 -1.02    0.176   -0.610      1     12
 7      7  2016 -0.324  -1.12    -0.950      1     12
 8      8  2015  2.44   -1.06    -0.161      1     12
 9      9  2013 -0.249   2.84    -1.58       1     12
10     10  2014  1.09   -2.28     2.16       1     12
# ℹ 990 more rows

本文标签: