admin管理员组文章数量:1397117
I have two data.frames, nests
, and nest_days
, which contains more days for one nest. I want to do a join, and group by nests (i.e. group by the rows of the first data.frame). I want to keep all columns from the nest
data.frame
(there are many) and include some grouped summaries from the nest_days
data.frame.
This is fairly easy and elegant in SQL:
sqldf("
select nests.*, min(Day), max(Day)
from nests join nest_days using (NestID)
group by NestID
")
But I am struggling to find solution in tidyverse (i.e. dplyr). I tried:
nests %>%
inner_join(nest_days, join_by(NestID)) %>%
group_by(NestID) %>%
summarize(minDay = min(Day), maxDay = max(Day))
However, this doesn't preserve the columns from the nest
data.frame. So, I tried several methods of how to preserve those columns, like using mutate()
, or summarize(across(...))
:
nests %>%
inner_join(nest_days, join_by(NestID)) %>%
group_by(NestID) %>%
mutate(minDay = min(Day), maxDay = max(Day))
nests %>%
inner_join(nest_days, join_by(NestID)) %>%
group_by(NestID) %>%
summarize(across(), minDay = min(Day), maxDay = max(Day))
nests %>%
inner_join(nest_days, join_by(NestID)) %>%
group_by(NestID) %>%
summarize(across(everything()), minDay = min(Day), maxDay = max(Day))
However, these also fail, because they also preserve all columns from the second data.frame (nest_days), which means, they will ungroup the grouping.
Is there some easy and elegant way, in tidyverse/dplyr, how to preserve columns only of the first data.frame? Instantly, I see it would be possible to do in two steps, first do a group by in the nest_days and then join. However, this is much less elegant than in SQL. I expect tidyverse
is here to provide us with means to write an elegant and simple, so I am looking for such solution.
EDIT: I found this solution using nest_join
, but I don't like it much, I still expect the tidyverse must have something more concise:
nests %>%
nest_join(nest_days, join_by(NestID)) %>%
mutate(
minDay = map_dbl(nest_days, ~min(.$Day)),
maxDay = map_dbl(nest_days, ~max(.$Day))
) %>%
select(-nest_days)
EDIT: Here are some phony data to play with:
nests <- data.frame(NestID = 1:1000, Year = sample(2013:2017, 1000, TRUE), a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
nest_days <- expand.grid(NestID = 1:1000, Day = 1:12)
nest_days$a <- rnorm(nrow(nest_days))
nest_days$b <- rnorm(nrow(nest_days))
nest_days$c <- rnorm(nrow(nest_days))
I have two data.frames, nests
, and nest_days
, which contains more days for one nest. I want to do a join, and group by nests (i.e. group by the rows of the first data.frame). I want to keep all columns from the nest
data.frame
(there are many) and include some grouped summaries from the nest_days
data.frame.
This is fairly easy and elegant in SQL:
sqldf("
select nests.*, min(Day), max(Day)
from nests join nest_days using (NestID)
group by NestID
")
But I am struggling to find solution in tidyverse (i.e. dplyr). I tried:
nests %>%
inner_join(nest_days, join_by(NestID)) %>%
group_by(NestID) %>%
summarize(minDay = min(Day), maxDay = max(Day))
However, this doesn't preserve the columns from the nest
data.frame. So, I tried several methods of how to preserve those columns, like using mutate()
, or summarize(across(...))
:
nests %>%
inner_join(nest_days, join_by(NestID)) %>%
group_by(NestID) %>%
mutate(minDay = min(Day), maxDay = max(Day))
nests %>%
inner_join(nest_days, join_by(NestID)) %>%
group_by(NestID) %>%
summarize(across(), minDay = min(Day), maxDay = max(Day))
nests %>%
inner_join(nest_days, join_by(NestID)) %>%
group_by(NestID) %>%
summarize(across(everything()), minDay = min(Day), maxDay = max(Day))
However, these also fail, because they also preserve all columns from the second data.frame (nest_days), which means, they will ungroup the grouping.
Is there some easy and elegant way, in tidyverse/dplyr, how to preserve columns only of the first data.frame? Instantly, I see it would be possible to do in two steps, first do a group by in the nest_days and then join. However, this is much less elegant than in SQL. I expect tidyverse
is here to provide us with means to write an elegant and simple, so I am looking for such solution.
EDIT: I found this solution using nest_join
, but I don't like it much, I still expect the tidyverse must have something more concise:
nests %>%
nest_join(nest_days, join_by(NestID)) %>%
mutate(
minDay = map_dbl(nest_days, ~min(.$Day)),
maxDay = map_dbl(nest_days, ~max(.$Day))
) %>%
select(-nest_days)
EDIT: Here are some phony data to play with:
nests <- data.frame(NestID = 1:1000, Year = sample(2013:2017, 1000, TRUE), a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
nest_days <- expand.grid(NestID = 1:1000, Day = 1:12)
nest_days$a <- rnorm(nrow(nest_days))
nest_days$b <- rnorm(nrow(nest_days))
nest_days$c <- rnorm(nrow(nest_days))
Share
Improve this question
edited Mar 27 at 10:37
Tomas
asked Mar 26 at 11:43
TomasTomas
59.7k54 gold badges250 silver badges382 bronze badges
8
|
Show 3 more comments
2 Answers
Reset to default 2tidy-select - ?dplyr::dplyr_tidy_select
- we use for grouping is quite flexible, for simpler cases we can just exclude a column or two, (e.g. .by = !Day
, .by = !c(Day, c)
, but we might as well use ranges, name patterns, positions and set operations to build sets that either include or exclude any number columns.
library(dplyr, warn.conflicts = FALSE)
set.seed(123)
nests <- data.frame(NestID = 1:1000, Year = sample(2013:2017, 1000, TRUE), a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
nest_days <- expand.grid(NestID = 1:1000, Day = 1:12)
nest_days$a <- rnorm(nrow(nest_days))
nest_days$b <- rnorm(nrow(nest_days))
nest_days$c <- rnorm(nrow(nest_days))
inner_join(nests, nest_days, by = join_by(NestID), suffix = c("", ".y")) |>
summarise(
minDay = min(Day),
maxDay = max(Day),
# group by everything except range a.y:c.y
.by = !(a.y:c.y)
# or everything except columns ending with ".y"
# .by = !ends_with(".y")
# or include columns from nests + everything that starts with "b" + 2 last columns
# .by = c(all_of(names(nests)), starts_with("b"), last_col(1):last_col())
) |>
head()
#> NestID Year a b c Day minDay maxDay
#> 1 1 2015 1.148447 -0.7596014 -0.05709456 1 1 1
#> 2 1 2015 1.148447 -0.7596014 -0.05709456 2 2 2
#> 3 1 2015 1.148447 -0.7596014 -0.05709456 3 3 3
#> 4 1 2015 1.148447 -0.7596014 -0.05709456 4 4 4
#> 5 1 2015 1.148447 -0.7596014 -0.05709456 5 5 5
#> 6 1 2015 1.148447 -0.7596014 -0.05709456 6 6 6
Post-join state & col. names for reference:
#> Rows: 12,000
#> Columns: 9
#> $ NestID <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
#> $ Year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201…
#> $ a <dbl> 1.1484466, 1.1484466, 1.1484466, 1.1484466, 1.1484466, 1.148446…
#> $ b <dbl> -0.7596014, -0.7596014, -0.7596014, -0.7596014, -0.7596014, -0.…
#> $ c <dbl> -0.05709456, -0.05709456, -0.05709456, -0.05709456, -0.05709456…
#> $ Day <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, …
#> $ a.y <dbl> -1.0492540, -1.8711882, -0.4101328, 0.2652801, 0.3414629, -0.94…
#> $ b.y <dbl> 0.47593189, 0.60484758, -0.41593670, 0.12649302, -1.59760340, -…
#> $ c.y <dbl> -1.4843105, 0.7043585, -0.2848173, 1.5588798, -2.1694400, 1.142…
You can use a tidyselect helper inside group_by(pick())
, like so:
library(dplyr)
set.seed(13)
nests %>%
inner_join(nest_days, join_by(NestID)) %>%
group_by(pick(!Day)) %>%
summarize(
minDay = min(Day),
maxDay = max(Day),
.groups = "drop"
)
Or for a more generic approach, use
# ... %>%
group_by(pick(all_of(names(nests)))) %>%
# ...
Result:
# A tibble: 1,000 × 7
NestID Year a b c minDay maxDay
<int> <int> <dbl> <dbl> <dbl> <int> <int>
1 1 2015 0.0794 -0.00966 -0.662 1 12
2 2 2017 0.313 -0.648 0.912 1 12
3 3 2014 1.39 1.45 0.367 1 12
4 4 2017 0.560 0.0942 1.47 1 12
5 5 2016 -0.233 0.414 0.235 1 12
6 6 2017 -1.02 0.176 -0.610 1 12
7 7 2016 -0.324 -1.12 -0.950 1 12
8 8 2015 2.44 -1.06 -0.161 1 12
9 9 2013 -0.249 2.84 -1.58 1 12
10 10 2014 1.09 -2.28 2.16 1 12
# ℹ 990 more rows
本文标签:
版权声明:本文标题:r - Join two dataframes, group by the first keeping all its columns, summarize the second - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1744147847a2592920.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
nests %>% left_join( nest_days %>% group_by(NestID) %>% summarize(minDay = min(Day),maxDay = max(Day)), by = "NestID")
also can you please providenests
&nest_days
? – Tim G Commented Mar 26 at 12:04