admin管理员组

文章数量:1125772

I have a dataset that looks something like this:

name party count year likes retweet
Tom R 1 2016 1357 23
Dave R 1 2016 1881 34
Larry D 1 2016 324 45
Tim D 1 2016 5587 56
Rob R 1 2016 9847 67
Sam D 1 2016 4466 78
Tom R 1 2017 ### ##
Dave R 1 2017 ### ##
Larry D 1 2017 ### ##
Tim D 1 2017 ### ##
Rob R 1 2017 ### ##
Sam D 1 2017 ### ##
Tom R 1 2018 ### ##
Dave R 1 2018 ### ##
Larry D 1 2018 ### ##
Tim D 1 2018 ### ##
Rob R 1 2018 ### ##
Sam D 1 2018 ### ##

I have a dataset that looks something like this:

name party count year likes retweet
Tom R 1 2016 1357 23
Dave R 1 2016 1881 34
Larry D 1 2016 324 45
Tim D 1 2016 5587 56
Rob R 1 2016 9847 67
Sam D 1 2016 4466 78
Tom R 1 2017 ### ##
Dave R 1 2017 ### ##
Larry D 1 2017 ### ##
Tim D 1 2017 ### ##
Rob R 1 2017 ### ##
Sam D 1 2017 ### ##
Tom R 1 2018 ### ##
Dave R 1 2018 ### ##
Larry D 1 2018 ### ##
Tim D 1 2018 ### ##
Rob R 1 2018 ### ##
Sam D 1 2018 ### ##

I have several columns that I want to collapse and sum, so that I can have a simpler output of a long dataset. I you'll notice that there are a bunch of "###" symbols, but that was just because I didn't think it was necessary to add additional information, as I believe what I provided is enough to solve my problem.

But I want to reformat the data, so it's easier to create a plot. Basically, I want the variables below. So, I want the first column to be the year (one row for each year in the dataset). The second column should be the sum of rows that were collapsed to create that row (I created the "count" variable in Table 1, so that I could sum the row value.) I then want four new columns, each representing the sum total of likes and retweets for Democrats and Republicans for that year. It should look something like this:

year count likes_republicans likes_democrats retweets_republicans retweets_democrats
2016 6 13085 10377 124 179
2017 6 ### ### ### ###
2018 6 ### ### ### ###

I feel like there is a simple command in dplyr to do this type of problem, but I don't remember what it is and I feel like I really struggle to understand it and how it works. Any help would be greatly appreciated.

Share Improve this question asked Jan 9 at 3:19 lwelwe 3811 silver badge10 bronze badges 2
  • Df |> summarise(count = n(), likes_R = sum(likes[party == “R”]), .by = year) etc. – Jon Spring Commented Jan 9 at 3:44
  • Or df |> summarise(across(likes:retweet, sum), .by = c(year, party)) and tidy::pivot_wider both values by party if you like. – Jon Spring Commented Jan 9 at 3:57
Add a comment  | 

2 Answers 2

Reset to default 1

Sticking to data.table.

> library(data.table)
> dt <- melt(dt, c(1, 2, 4), c(3, 5:6)) |> 
+   dcast(year ~ party + variable, fun=sum)
> dt[, count := D_count + R_count]
> dt
Key: <year>
    year D_count D_likes D_retweet R_count R_likes R_retweet count
   <int>   <int>   <int>     <int>   <int>   <int>     <int> <int>
1:  2016       3   10377      6859       3   13085      4733     6
2:  2017       3    3766      5346       3    5944      7332     6
3:  2018       3    7426      2255       3   56041      6109     6

Data:

dt <- data.table::fread('
  name  party   count   year    likes   retweet
Tom R   1   2016    1357    23
Dave    R   1   2016    1881    34
Larry   D   1   2016    324 45
Tim D   1   2016    5587    56
Rob R   1   2016    9847    4676
Sam D   1   2016    4466    6758
Tom R   1   2017    3456    2543
Dave    R   1   2017    2345    3546
Larry   D   1   2017    675 1423
Tim D   1   2017    2345    3567
Rob R   1   2017    143 1243
Sam D   1   2017    746 356
Tom R   1   2018    134 1423
Dave    R   1   2018    54673   563
Larry   D   1   2018    5467    1243
Tim D   1   2018    1423    365
Rob R   1   2018    1234    4123
Sam D   1   2018    536 647
')

The following solution should work with dplyr and tidyr.

It would be worth testing this with missing data and other edge cases e.g. voters of other parties.

library(dplyr)
library(tidyr) #Allows pivot_wider

# Create a minimal reprex
data = tibble(name = c("Tom","Dave","Larry","Tim","Rob","Sam"),
              party = c("R", "R", "D", "D", "R", "D"),
              count = rep(1, 6),
              year = c(rep(2016, 3), rep(2017, 3)),
              likes = c(1357, 1881, 324, 5587, 9847, 4466),
              retweet = c(23, 34, 45, 56, 67, 78))

# Create the likes data using summarise and group_by before pivot_wider
likes_data = data %>% 
  mutate(party = case_when(
    party == "D" ~ "democrats",
    party == "R" ~ "republicans"
  )) %>% 
  group_by(year) %>% 
  mutate(count = n()) %>% 
  group_by(year, count, party) %>% 
  summarise(likes = sum(likes, na.rm = TRUE)) %>% 
  ungroup() %>%
  pivot_wider(names_from = "party",
              names_prefix = "likes_",
              values_from = c(likes))
 
  
# Create the retweets data using summarise and group_by before pivot_wider
retweets_data = data %>% 
  mutate(party = case_when(
    party == "D" ~ "democrats",
    party == "R" ~ "republicans"
  )) %>% 
  group_by(year) %>% 
  mutate(count = n()) %>% 
  group_by(year, count, party) %>% 
  summarise(retweets = sum(retweet, na.rm = TRUE)) %>% 
  ungroup() %>%
  pivot_wider(names_from = "party",
              names_prefix = "retweets_",
              values_from = c(retweets))


# Join the data
summary_data = left_join(likes_data, retweets_data)
summary_data

Joining with `by = join_by(year, count)`
# A tibble: 2 × 6
   year count likes_democrats likes_republicans retweets_democrats retweets_republicans
  <dbl> <int>           <dbl>             <dbl>              <dbl>                <dbl>
1  2016     3             324              3238                 45                   57
2  2017     3           10053              9847                134                   67

本文标签: rReformat data to summarize and collapse rows into simple tableStack Overflow