admin管理员组文章数量:1125772
I have a dataset that looks something like this:
name | party | count | year | likes | retweet |
---|---|---|---|---|---|
Tom | R | 1 | 2016 | 1357 | 23 |
Dave | R | 1 | 2016 | 1881 | 34 |
Larry | D | 1 | 2016 | 324 | 45 |
Tim | D | 1 | 2016 | 5587 | 56 |
Rob | R | 1 | 2016 | 9847 | 67 |
Sam | D | 1 | 2016 | 4466 | 78 |
Tom | R | 1 | 2017 | ### | ## |
Dave | R | 1 | 2017 | ### | ## |
Larry | D | 1 | 2017 | ### | ## |
Tim | D | 1 | 2017 | ### | ## |
Rob | R | 1 | 2017 | ### | ## |
Sam | D | 1 | 2017 | ### | ## |
Tom | R | 1 | 2018 | ### | ## |
Dave | R | 1 | 2018 | ### | ## |
Larry | D | 1 | 2018 | ### | ## |
Tim | D | 1 | 2018 | ### | ## |
Rob | R | 1 | 2018 | ### | ## |
Sam | D | 1 | 2018 | ### | ## |
I have a dataset that looks something like this:
name | party | count | year | likes | retweet |
---|---|---|---|---|---|
Tom | R | 1 | 2016 | 1357 | 23 |
Dave | R | 1 | 2016 | 1881 | 34 |
Larry | D | 1 | 2016 | 324 | 45 |
Tim | D | 1 | 2016 | 5587 | 56 |
Rob | R | 1 | 2016 | 9847 | 67 |
Sam | D | 1 | 2016 | 4466 | 78 |
Tom | R | 1 | 2017 | ### | ## |
Dave | R | 1 | 2017 | ### | ## |
Larry | D | 1 | 2017 | ### | ## |
Tim | D | 1 | 2017 | ### | ## |
Rob | R | 1 | 2017 | ### | ## |
Sam | D | 1 | 2017 | ### | ## |
Tom | R | 1 | 2018 | ### | ## |
Dave | R | 1 | 2018 | ### | ## |
Larry | D | 1 | 2018 | ### | ## |
Tim | D | 1 | 2018 | ### | ## |
Rob | R | 1 | 2018 | ### | ## |
Sam | D | 1 | 2018 | ### | ## |
I have several columns that I want to collapse and sum, so that I can have a simpler output of a long dataset. I you'll notice that there are a bunch of "###" symbols, but that was just because I didn't think it was necessary to add additional information, as I believe what I provided is enough to solve my problem.
But I want to reformat the data, so it's easier to create a plot. Basically, I want the variables below. So, I want the first column to be the year (one row for each year in the dataset). The second column should be the sum of rows that were collapsed to create that row (I created the "count" variable in Table 1, so that I could sum the row value.) I then want four new columns, each representing the sum total of likes and retweets for Democrats and Republicans for that year. It should look something like this:
year | count | likes_republicans | likes_democrats | retweets_republicans | retweets_democrats |
---|---|---|---|---|---|
2016 | 6 | 13085 | 10377 | 124 | 179 |
2017 | 6 | ### | ### | ### | ### |
2018 | 6 | ### | ### | ### | ### |
I feel like there is a simple command in dplyr to do this type of problem, but I don't remember what it is and I feel like I really struggle to understand it and how it works. Any help would be greatly appreciated.
Share Improve this question asked Jan 9 at 3:19 lwelwe 3811 silver badge10 bronze badges 2- Df |> summarise(count = n(), likes_R = sum(likes[party == “R”]), .by = year) etc. – Jon Spring Commented Jan 9 at 3:44
- Or df |> summarise(across(likes:retweet, sum), .by = c(year, party)) and tidy::pivot_wider both values by party if you like. – Jon Spring Commented Jan 9 at 3:57
2 Answers
Reset to default 1Sticking to data.table
.
> library(data.table)
> dt <- melt(dt, c(1, 2, 4), c(3, 5:6)) |>
+ dcast(year ~ party + variable, fun=sum)
> dt[, count := D_count + R_count]
> dt
Key: <year>
year D_count D_likes D_retweet R_count R_likes R_retweet count
<int> <int> <int> <int> <int> <int> <int> <int>
1: 2016 3 10377 6859 3 13085 4733 6
2: 2017 3 3766 5346 3 5944 7332 6
3: 2018 3 7426 2255 3 56041 6109 6
Data:
dt <- data.table::fread('
name party count year likes retweet
Tom R 1 2016 1357 23
Dave R 1 2016 1881 34
Larry D 1 2016 324 45
Tim D 1 2016 5587 56
Rob R 1 2016 9847 4676
Sam D 1 2016 4466 6758
Tom R 1 2017 3456 2543
Dave R 1 2017 2345 3546
Larry D 1 2017 675 1423
Tim D 1 2017 2345 3567
Rob R 1 2017 143 1243
Sam D 1 2017 746 356
Tom R 1 2018 134 1423
Dave R 1 2018 54673 563
Larry D 1 2018 5467 1243
Tim D 1 2018 1423 365
Rob R 1 2018 1234 4123
Sam D 1 2018 536 647
')
The following solution should work with dplyr
and tidyr
.
It would be worth testing this with missing data and other edge cases e.g. voters of other parties.
library(dplyr)
library(tidyr) #Allows pivot_wider
# Create a minimal reprex
data = tibble(name = c("Tom","Dave","Larry","Tim","Rob","Sam"),
party = c("R", "R", "D", "D", "R", "D"),
count = rep(1, 6),
year = c(rep(2016, 3), rep(2017, 3)),
likes = c(1357, 1881, 324, 5587, 9847, 4466),
retweet = c(23, 34, 45, 56, 67, 78))
# Create the likes data using summarise and group_by before pivot_wider
likes_data = data %>%
mutate(party = case_when(
party == "D" ~ "democrats",
party == "R" ~ "republicans"
)) %>%
group_by(year) %>%
mutate(count = n()) %>%
group_by(year, count, party) %>%
summarise(likes = sum(likes, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "party",
names_prefix = "likes_",
values_from = c(likes))
# Create the retweets data using summarise and group_by before pivot_wider
retweets_data = data %>%
mutate(party = case_when(
party == "D" ~ "democrats",
party == "R" ~ "republicans"
)) %>%
group_by(year) %>%
mutate(count = n()) %>%
group_by(year, count, party) %>%
summarise(retweets = sum(retweet, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "party",
names_prefix = "retweets_",
values_from = c(retweets))
# Join the data
summary_data = left_join(likes_data, retweets_data)
summary_data
Joining with `by = join_by(year, count)`
# A tibble: 2 × 6
year count likes_democrats likes_republicans retweets_democrats retweets_republicans
<dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 2016 3 324 3238 45 57
2 2017 3 10053 9847 134 67
本文标签: rReformat data to summarize and collapse rows into simple tableStack Overflow
版权声明:本文标题:r - Reformat data to summarize and collapse rows into simple table - Stack Overflow 内容由网友自发贡献,该文观点仅代表作者本人, 转载请联系作者并注明出处:http://www.betaflare.com/web/1736674845a1947124.html, 本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌抄袭侵权/违法违规的内容,一经查实,本站将立刻删除。
发表评论