I've found ways to create groupings from variables but not how to create a single sub-group.

df <- data.frame(days = c(0,1,2,3,4,5,6,8,10), n = c(408,51,103,112,35,17,7,6,1))

What I want to do is have an output like the one below in order to create a histogram with "days" on the x-axis. I can't figure out how to create a "6+" group that also sums the corresponding n values.

days n
0 408
1 51
2 103
3 112
4 35
5 17
6+ 14

I would then use ggplot2 and geom_bar() to ideally make a histogram.

This is what I tried but it gave me an error. I'm not sure if there is an easier way to group the numbers 6 and up together?

df$days <- ifelse(df$days > 5, "6+", df$days) %>%
  summarise(df$n = sum(n), .groups = 'drop')

Asked Feb 12 at 1:14 by user27294105

4 Answers

library(dplyr) # I presume you're using its "summarise"
df |>
  mutate(grp = if_else(days >= 6, "6+", paste(days))) |>
  summarize(n = sum(n), .by = grp)

or shorter:

df |>
  count(grp = if_else(days >= 6, "6+", paste(days)), wt = n)

Result

  grp   n
1   0 408
2   1  51
3   2 103
4   3 112
5   4  35
6   5  17
7  6+  14

Either way, I'm making a group called "6+" and making the other groups into text by using paste (so 1 becomes "1"), then combining the n's for each grp.
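
For the bar chart mentioned in the question, the summarised counts could be plotted roughly as follows. This is only a sketch: it assumes dplyr and ggplot2 are installed, and the names counts and p are illustrative. It uses geom_col() rather than geom_bar() because the heights are already computed, whereas geom_bar() counts rows by default.

```r
library(dplyr)
library(ggplot2)

df <- data.frame(days = c(0, 1, 2, 3, 4, 5, 6, 8, 10),
                 n    = c(408, 51, 103, 112, 35, 17, 7, 6, 1))

# Collapse days >= 6 into a single "6+" label and sum n within each label
counts <- df |>
  count(grp = if_else(days >= 6, "6+", paste(days)), wt = n)

# One bar per label; bar height is the precomputed total in n
p <- ggplot(counts, aes(x = grp, y = n)) +
  geom_col() +
  labs(x = "days", y = "n")

print(p)
```

The labels "0" to "5" and "6+" happen to sort correctly as text here; with labels such as "10" a factor with explicit levels would be needed to keep the x-axis in numeric order.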

In some cases you might want to make the grouping into a factor to make it sort as you want. For instance, if you wanted a group of "up to 4" you could use

df |>
  count(grp = if_else(days <= 4, "up to 4", paste(days)) |>
          forcats::fct_reorder(days), wt = n)

to make the "up to 4" category appear first; as plain text it would otherwise sort alphabetically last.

      grp   n
1 up to 4 709
2       5  17
3       6   7
4       8   6
5      10   1
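
If forcats is not available, a similar ordering can probably be obtained in base R by setting the factor levels explicitly. A minimal sketch, assuming df as defined in the question; the names grp and out are only illustrative:

```r
df <- data.frame(days = c(0, 1, 2, 3, 4, 5, 6, 8, 10),
                 n    = c(408, 51, 103, 112, 35, 17, 7, 6, 1))

# Text labels: "up to 4" for days 0-4, the day itself otherwise
grp <- ifelse(df$days <= 4, "up to 4", as.character(df$days))

# Order the levels by the days they come from, so "up to 4" is first
# and "10" follows "8" instead of sorting as text
grp <- factor(grp, levels = unique(grp[order(df$days)]))

# Sum n within each level; the rows come back in level order
out <- aggregate(n ~ grp, data = data.frame(grp = grp, n = df$n), FUN = sum)
out
```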

In base R, you can similarly use an ifelse call inside aggregate. I wrapped it in setNames to rename the columns, but that's just cosmetic:

setNames(
  aggregate(n ~ ifelse(days >= 6, "6+", days), df, sum),
         c("days", "n"))

Output:

  days   n
1    0 408
2    1  51
3    2 103
4    3 112
5    4  35
6    5  17
7   6+  14
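
Another base R option, not part of the answer above, is cut(), which bins the values and returns a factor whose levels are already in the intended order. A hedged sketch, assuming days only takes whole-number values; the column name grp is illustrative:

```r
df <- data.frame(days = c(0, 1, 2, 3, 4, 5, 6, 8, 10),
                 n    = c(408, 51, 103, 112, 35, 17, 7, 6, 1))

# Intervals (-Inf,0], (0,1], ..., (4,5], (5,Inf] labelled "0", "1", ..., "5", "6+"
df$grp <- cut(df$days, breaks = c(-Inf, 0:5, Inf), labels = c(0:5, "6+"))

# Sum n within each bin
out <- aggregate(n ~ grp, data = df, FUN = sum)
out
```

Because grp is a factor, the bins keep their order when passed on to a plotting function.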

It's great that you provided your attempted code and it may be worthwhile to walk through why your attempt returned an error, as there may be some confusion. Here is your attempt, reposted for convenience:

df$days <- ifelse(df$days > 5, "6+", df$days) %>%   
  summarise(df$n = sum(n), .groups = 'drop')

There are a few different types of problems. First, the pipe operator (%>%) takes the result of the left-hand side expression and passes it as the first argument to the function on the right-hand side (here, summarize). Because %>% binds more tightly than <-, the left-hand side is ifelse(df$days > 5, "6+", df$days), which returns a plain character vector ([1] "0" "1" "2" "3" "4" "5" "6+" "6+" "6+"), and that vector is what gets piped into the next step. But summarize (1) expects a data frame, not a vector, and so (2) can't find n in its input, since all it receives is the "0", "1", ..., "6+" vector. Second, using $ notation inside a pipe is discouraged, because you can refer to the column by name directly (i.e., n, not df$n). In fact, df$n = sum(n) is not a valid argument name, so R rejects the line with a syntax error before anything runs.
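
Both points can be checked in base R without running the pipe. The sketch below is only illustrative (the names v and res are made up here): it inspects what the left-hand side produces and asks the parser about the summarise call.

```r
df <- data.frame(days = c(0, 1, 2, 3, 4, 5, 6, 8, 10),
                 n    = c(408, 51, 103, 112, 35, 17, 7, 6, 1))

# What the left-hand side of the pipe evaluates to: a character vector,
# not a data frame, so there is no n column for summarise to find
v <- ifelse(df$days > 5, "6+", df$days)
class(v)
is.data.frame(v)

# df$n on the left of "=" inside a call cannot be parsed;
# tryCatch() captures the parser's message instead of stopping
res <- tryCatch(
  parse(text = "summarise(df$n = sum(n))"),
  error = function(e) conditionMessage(e)
)
res
```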

A corrected version of your attempted code may be:

df %>%
  mutate(days = ifelse(days > 5, "6+", days)) %>%
  summarise(n = sum(n), .by = days)

However, I would use @JonSpring's elegant solution if you wanted a dplyr approach. Hope this helps clarify, and happy coding!
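
One difference between the two dplyr-style versions that may be worth knowing: base ifelse silently converts the numbers to text, while dplyr's if_else refuses to mix a character value with a numeric one, which is why paste(days) is used with it. A small sketch, assuming dplyr is installed; strict and loose are illustrative names:

```r
library(dplyr)

df <- data.frame(days = c(0, 1, 2, 3, 4, 5, 6, 8, 10),
                 n    = c(408, 51, 103, 112, 35, 17, 7, 6, 1))

# if_else() requires both branches to have compatible types,
# so mixing "6+" with numeric days raises an error
strict <- tryCatch(
  if_else(df$days > 5, "6+", df$days),
  error = function(e) "error"
)

# ifelse() coerces the numeric branch to character without complaint
loose <- ifelse(df$days > 5, "6+", df$days)

strict
loose
```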

Here comes a fast base R way.

> Map(c, df[df$days < 6, ], replace(lapply(df[df$days > 5, ], sum), 1, '6+')) |> 
+   as.data.frame()
  days   n
1    0 408
2    1  51
3    2 103
4    3 112
5    4  35
6    5  17
7   6+  14
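
The one-liner above can be unpacked into steps, which may make it easier to follow. This is the same logic with intermediate names (head_rows, tail_sums, tail_row, out) that are introduced here purely for illustration:

```r
df <- data.frame(days = c(0, 1, 2, 3, 4, 5, 6, 8, 10),
                 n    = c(408, 51, 103, 112, 35, 17, 7, 6, 1))

head_rows <- df[df$days < 6, ]               # rows kept unchanged
tail_sums <- lapply(df[df$days > 5, ], sum)  # column sums of the remaining rows
tail_row  <- replace(tail_sums, 1, "6+")     # swap the summed days for the label

# Append the collapsed row column by column, then rebuild the data frame;
# days becomes character because "6+" is appended to it, n stays numeric
out <- as.data.frame(Map(c, head_rows, tail_row))
out
```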

Or if you like data.table.

> library(data.table)
> setDT(df)
> 
> df[, days := fifelse(days > 5, '6+', as.character(days))][, .(n = sum(n)), by = days] 
     days     n
   <char> <num>
1:      0   408
2:      1    51
3:      2   103
4:      3   112
5:      4    35
6:      5    17
7:     6+    14
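
Note that setDT() and := both work by reference, so after the call above df itself is a data.table and its days column is character. If the original data frame should stay untouched, a variant along these lines may be preferable; it is a sketch assuming data.table is installed, with dt and out as illustrative names:

```r
library(data.table)

df <- data.frame(days = c(0, 1, 2, 3, 4, 5, 6, 8, 10),
                 n    = c(408, 51, 103, 112, 35, 17, 7, 6, 1))

# as.data.table() makes a copy, so df is left as it was
dt <- as.data.table(df)

# Compute the grouping label inside by= instead of overwriting days
out <- dt[, .(n = sum(n)),
          by = .(days = fifelse(days > 5, "6+", as.character(days)))]
out
```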

You can do

local({
  i = df$days > 5
  rbind(df[!i, ], c('days' = '6+', 'n' = sum(df$n[i])))
})

or

with(df, {
  i = days > 5
  rbind(df[!i, ], c('days' = '6+', 'n' = sum(n[i])))
})
  days   n
1    0 408
2    1  51
3    2 103
4    3 112
5    4  35
6    5  17
7   6+  14

where the local({ .. })/with({ .. }) wrapper is not needed, but it keeps the index variable i from being created in the global environment.

Even simpler, but slightly less efficient, since the comparison is evaluated twice:

rbind(df[df$days < 6, ], c('days' = '6+', 'n' = sum(df$n[df$days > 5])))

or

with(df, rbind(df[days < 6, ], c('days' = '6+', 'n' = sum(n[days > 5]))))
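
One thing to watch with the rbind approaches: c('days' = '6+', 'n' = ...) is a character vector, so binding it onto the data frame turns the n column into text as well, even though the printed result looks the same. If n needs to stay numeric (for plotting, say), binding a one-row data frame instead should avoid that. A sketch with illustrative names (bad, out):

```r
df <- data.frame(days = c(0, 1, 2, 3, 4, 5, 6, 8, 10),
                 n    = c(408, 51, 103, 112, 35, 17, 7, 6, 1))

i <- df$days > 5

# Binding a character vector coerces both columns to character
bad <- rbind(df[!i, ], c(days = "6+", n = sum(df$n[i])))
class(bad$n)

# Binding a one-row data frame keeps n numeric; days is converted
# to character up front so both pieces have matching column types
out <- rbind(
  transform(df[!i, ], days = as.character(days)),
  data.frame(days = "6+", n = sum(df$n[i]))
)
class(out$n)
out
```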

Tags: group | Grouping only specific numbers and finding the sum of a separate column using R (Stack Overflow)