r - Join two dataframes, group by the first keeping all its columns, summarize the second - Stack Overflow

I have two data.frames, nests and nest_days; nest_days contains several days (rows) for each nest. I want to join them and group by nest (i.e. by the rows of the first data.frame), keeping all columns of the nests data.frame (there are many) and adding some grouped summaries computed from the nest_days data.frame.

This is fairly easy and elegant in SQL:

sqldf("
select nests.*, min(Day), max(Day)
from nests join nest_days using (NestID)
group by NestID
")

But I am struggling to find a solution in the tidyverse (i.e. dplyr). I tried:

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    summarize(minDay = min(Day), maxDay = max(Day))

However, this doesn't preserve the columns from the nests data.frame. So I tried several ways of preserving those columns, such as mutate() or summarize(across(...)):

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    mutate(minDay = min(Day), maxDay = max(Day))

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    summarize(across(), minDay = min(Day), maxDay = max(Day))

nests %>%
    inner_join(nest_days, join_by(NestID)) %>% 
    group_by(NestID) %>% 
    summarize(across(everything()), minDay = min(Day), maxDay = max(Day))

However, these also fail: they keep all columns of the second data.frame (nest_days) as well, which effectively undoes the grouping (the result again has one row per nest-day rather than one row per nest).

Is there an easy and elegant way in tidyverse/dplyr to preserve only the columns of the first data.frame? I can see that it could be done in two steps: first group and summarize nest_days, then join. However, this is much less elegant than the SQL. I expect the tidyverse to offer a simple, elegant way to write this, so that is the solution I am looking for.

EDIT: I found this solution using nest_join, but I don't like it much; I still expect the tidyverse to have something more concise:

# map_dbl() comes from purrr: library(purrr) or library(tidyverse)
nests %>%
    nest_join(nest_days, join_by(NestID)) %>% 
    mutate(
        minDay = map_dbl(nest_days, ~min(.$Day)), 
        maxDay = map_dbl(nest_days, ~max(.$Day))
    ) %>% 
    select(-nest_days)

EDIT: Here is some phony data to play with:

nests <- data.frame(NestID = 1:1000, Year = sample(2013:2017, 1000, TRUE), a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
nest_days <- expand.grid(NestID = 1:1000, Day = 1:12)
nest_days$a <- rnorm(nrow(nest_days))
nest_days$b <- rnorm(nrow(nest_days))
nest_days$c <- rnorm(nrow(nest_days))

asked Mar 26 at 11:43, edited Mar 27 at 10:37, by Tomas

Did you try summarizing inside the left_join, like nests %>% left_join(nest_days %>% group_by(NestID) %>% summarize(minDay = min(Day), maxDay = max(Day)), by = "NestID")? Also, can you please provide nests and nest_days? – Tim G Commented Mar 26 at 12:04
@TimG This is what I meant by the two-step solution. I would like to avoid it; I suspect the tidyverse must have a simpler and more elegant solution, as in the SQL example. Sorry, I can't provide the data, it is not open. – Tomas Commented Mar 26 at 12:18
It probably has. Some sample data would still be great; it does not have to be the sensitive data. I believe base R or data.table solutions can often also prove to be concise. – Tim G Commented Mar 26 at 12:23
You can. Just create some sample data. – Friede Commented Mar 26 at 12:26
No, you cannot. – Friede Commented Mar 26 at 12:55

2 Answers

The tidy-select syntax (see ?dplyr::dplyr_tidy_select) that we use for grouping is quite flexible. For simpler cases we can just exclude a column or two (e.g. .by = !Day or .by = !c(Day, c)), but we might as well use ranges, name patterns, positions and set operations to build sets that include or exclude any number of columns.

library(dplyr, warn.conflicts = FALSE)

set.seed(123)
nests <- data.frame(NestID = 1:1000, Year = sample(2013:2017, 1000, TRUE), a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
nest_days <- expand.grid(NestID = 1:1000, Day = 1:12)
nest_days$a <- rnorm(nrow(nest_days))
nest_days$b <- rnorm(nrow(nest_days))
nest_days$c <- rnorm(nrow(nest_days))

inner_join(nests, nest_days, by = join_by(NestID), suffix = c("", ".y")) |> 
  summarise(
    minDay = min(Day), 
    maxDay = max(Day), 
    # group by everything except range a.y:c.y (note: Day stays a grouping column)
    .by = !(a.y:c.y)
    # or everything except columns ending with ".y"
    # .by = !ends_with(".y")
    # or include columns from nests + everything that starts with "b" + 2 last columns
    # .by = c(all_of(names(nests)), starts_with("b"), last_col(1):last_col())
  ) |> 
  head()
#>   NestID Year        a          b           c Day minDay maxDay
#> 1      1 2015 1.148447 -0.7596014 -0.05709456   1      1      1
#> 2      1 2015 1.148447 -0.7596014 -0.05709456   2      2      2
#> 3      1 2015 1.148447 -0.7596014 -0.05709456   3      3      3
#> 4      1 2015 1.148447 -0.7596014 -0.05709456   4      4      4
#> 5      1 2015 1.148447 -0.7596014 -0.05709456   5      5      5
#> 6      1 2015 1.148447 -0.7596014 -0.05709456   6      6      6

Post-join state & col. names for reference:

#> Rows: 12,000
#> Columns: 9
#> $ NestID <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
#> $ Year   <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201…
#> $ a      <dbl> 1.1484466, 1.1484466, 1.1484466, 1.1484466, 1.1484466, 1.148446…
#> $ b      <dbl> -0.7596014, -0.7596014, -0.7596014, -0.7596014, -0.7596014, -0.…
#> $ c      <dbl> -0.05709456, -0.05709456, -0.05709456, -0.05709456, -0.05709456…
#> $ Day    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, …
#> $ a.y    <dbl> -1.0492540, -1.8711882, -0.4101328, 0.2652801, 0.3414629, -0.94…
#> $ b.y    <dbl> 0.47593189, 0.60484758, -0.41593670, 0.12649302, -1.59760340, -…
#> $ c.y    <dbl> -1.4843105, 0.7043585, -0.2848173, 1.5588798, -2.1694400, 1.142…
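
In the head() output above, Day is still among the grouping columns, so every group holds a single (NestID, Day) row and minDay equals maxDay. A minimal sketch of a variant that returns one row per nest, assuming the same phony data and the same suffix = c("", ".y") join, is to group by exactly the columns of nests (the name per_nest is only illustrative):

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(123)
nests <- data.frame(NestID = 1:1000, Year = sample(2013:2017, 1000, TRUE),
                    a = rnorm(1000), b = rnorm(1000), c = rnorm(1000))
nest_days <- expand.grid(NestID = 1:1000, Day = 1:12)
nest_days$a <- rnorm(nrow(nest_days))
nest_days$b <- rnorm(nrow(nest_days))
nest_days$c <- rnorm(nrow(nest_days))

# The empty suffix keeps the nests column names unchanged after the join,
# so all_of(names(nests)) selects exactly the columns that came from nests.
per_nest <- inner_join(nests, nest_days, by = join_by(NestID), suffix = c("", ".y")) |>
  summarise(
    minDay = min(Day),
    maxDay = max(Day),
    .by = all_of(names(nests))
  )

dim(per_nest)  # one row per nest; columns of nests plus minDay and maxDay
```

An equivalent exclusion-style spelling would be .by = !c(Day, ends_with(".y")); which one reads better probably depends on how many columns each table has.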

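You can use a tidyselect helper inside group_by(pick()), like so (this assumes Day is the only column from nest_days that varies within a nest, as in the result shown below):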

library(dplyr)
set.seed(13)

nests %>%
  inner_join(nest_days, join_by(NestID)) %>% 
  group_by(pick(!Day)) %>% 
  summarize(
    minDay = min(Day), 
    maxDay = max(Day),
    .groups = "drop"
  )

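Or, for a more generic approach that does not depend on which columns nest_days has, use the following (if the two tables share non-key column names, pass suffix = c("", ".y") to the join so that the names from nests survive unchanged):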

  # ... %>%
  group_by(pick(all_of(names(nests)))) %>%
  # ...

Result:

# A tibble: 1,000 × 7
   NestID  Year       a        b      c minDay maxDay
    <int> <int>   <dbl>    <dbl>  <dbl>  <int>  <int>
 1      1  2015  0.0794 -0.00966 -0.662      1     12
 2      2  2017  0.313  -0.648    0.912      1     12
 3      3  2014  1.39    1.45     0.367      1     12
 4      4  2017  0.560   0.0942   1.47       1     12
 5      5  2016 -0.233   0.414    0.235      1     12
 6      6  2017 -1.02    0.176   -0.610      1     12
 7      7  2016 -0.324  -1.12    -0.950      1     12
 8      8  2015  2.44   -1.06    -0.161      1     12
 9      9  2013 -0.249   2.84    -1.58       1     12
10     10  2014  1.09   -2.28     2.16       1     12
# ℹ 990 more rows
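
For comparison, the two-step form suggested in the comments (summarize nest_days first, then join) sidesteps column-name clashes entirely, because the extra columns of nest_days are dropped before the join. A small sketch on made-up data of the same shape as the question's; the object names day_range and result are illustrative, not from the original post:

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)
nests <- data.frame(NestID = 1:5, Year = sample(2013:2017, 5, TRUE), a = rnorm(5))
nest_days <- expand.grid(NestID = 1:5, Day = 1:12)
nest_days$a <- rnorm(nrow(nest_days))  # deliberately clashes with nests$a

# Step 1: collapse nest_days to one row per nest, keeping only the summaries.
day_range <- nest_days |>
  summarise(minDay = min(Day), maxDay = max(Day), .by = NestID)

# Step 2: attach the summaries to nests; no suffix handling is needed.
result <- inner_join(nests, day_range, by = join_by(NestID))

names(result)
```

This is longer than the single-pipeline versions, but it never materializes the full joined table, which may matter when nest_days is large.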
