I've been struggling to find a way to bin continuous variables as discrete without the ranges overlapping. I've looked at cut_interval/cut_number/cut_width from ggplot2, as well as other custom functions from StackOverflow, but what drives me nuts is when bins have overlapping values, e.g. 1-5, 5-10, 10-15. For my purposes each bin needs to be exclusive, e.g. 1-4, 5-10, 11-15.

The code chunk below is what I've worked out so far. It lets me set a chosen number of bins, roughly equal in value range, but I still get bins with overlapping values. I would end up using this when making a choropleth map, for example.

make_bins <- function(x, bins){
  
  maxVal <- max(x, na.rm = TRUE)
  minVal <- min(x, na.rm = TRUE)
  
  if(maxVal %% 2 == 0){
    maxVal
  } else {
    maxVal <- maxVal + 1
  }
  
  width <- maxVal / bins
  
  breaks <- seq(minVal, maxVal, by = width)
  breaks <- c(breaks, Inf)
  
  labels <- sapply(1:bins, function(i) {
    sprintf("%s-%s", round(breaks[i], 2), round(breaks[i + 1], 2))
  })
  last_one <- max(length(labels))
  labels[last_one] <- sprintf(">%s", breaks[last_one])
  
  out <- cut(x, breaks = breaks, labels = labels, right = TRUE, include.lowest = TRUE)

  return(out)
}

Here's an example of how it could be used:

values <- runif(100, min = 0, max = 1)

values_binned <- make_bins(values, bins = 4)

table(values_binned)

asked Nov 22, 2024 at 17:17, edited Nov 22, 2024 at 17:28, by SeniorAthlete
  • If you use cut the intervals do not overlap, can you post an example where they do? See the help page on open and closed intervals' end points. – Rui Barradas Commented Nov 22, 2024 at 17:43
  • Related question, which also defends the default practice without giving you what you are asking for: stackoverflow.com/questions/41304960/… – Jon Spring Commented Nov 22, 2024 at 18:27

2 Answers

I think what you want is offered by the Hmisc::cut2 function using the g ("g" for number of groups) parameter:

# install.packages('Hmisc')  # run once if Hmisc is not installed
library(Hmisc)
values_binned <- cut2(values, g = 4)  # g = number of quantile groups
table(values_binned)
#values_binned
#[0.0103,0.214) [0.2143,0.437) [0.4374,0.694) [0.6943,0.984] 
#            25             25             25             25 
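
If the data are whole numbers and the goal is labels like 1-5, 6-10, 11-15, one possible approach is to keep cut()'s half-open intervals and only rewrite the labels so each shows the smallest and largest integer a bin can hold. The sketch below uses base R only and assumes integer-valued input; make_int_bins is a hypothetical helper written for illustration, not a function from Hmisc or ggplot2.

```r
# Hypothetical helper: bin whole-number data into at most `bins` groups and
# label each group by the smallest and largest integer it can contain.
make_int_bins <- function(x, bins) {
  # Assumption: x holds whole numbers (continuous data has no "next" value).
  stopifnot(all(x == round(x), na.rm = TRUE))

  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)

  # Integer upper edge of each bin; the last edge is the maximum itself.
  upper <- unique(floor(seq(lo, hi, length.out = bins + 1))[-1])
  # Each bin starts one above the previous bin's upper edge.
  lower <- c(lo, head(upper, -1) + 1)

  labels <- sprintf("%d-%d", as.integer(lower), as.integer(upper))

  # Intervals are (lo - 1, upper1], (upper1, upper2], ...; starting one
  # below the minimum keeps the minimum inside the first bin.
  cut(x, breaks = c(lo - 1, upper), labels = labels, right = TRUE)
}

x <- 1:15
binned <- make_int_bins(x, bins = 3)
levels(binned)
# "1-5"   "6-10"  "11-15"
table(binned)  # five values in each of the three bins
```

The underlying intervals are still the ordinary (a, b] ones produced by cut(); only the labels change. For truly continuous values such as runif() output there is no gap between neighbouring bins to show, so bracket-style labels remain the unambiguous choice.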

The cut functions do not make overlapping bins. In the following example, the bins are defined by

0.000625 <= value <= 0.332

0.332 < value <= 0.663

0.663 < value <= 0.994

That is, the ( symbol means >, the [ symbol means >=, and the ] symbol means <=.

Does that meet your needs?

set.seed(123)
values <- runif(100, min = 0, max = 1)
min(values)
#> [1] 0.0006247733
max(values)
#> [1] 0.9942698
CutValues <- ggplot2::cut_interval(values, n = 3)
table(CutValues)
#> CutValues
#> [0.000625,0.332]    (0.332,0.663]    (0.663,0.994] 
#>               33               33               34

Created on 2024-11-22 with reprex v2.1.1
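
To check the endpoint rules described above on values that sit exactly on a break, here is a small base R sketch. The vector and breaks are made up for illustration; it only uses cut() with its documented right and include.lowest arguments.

```r
# Values chosen to sit exactly on the break points.
x <- c(0, 5, 10, 15)
breaks <- c(0, 5, 10, 15)

# right = TRUE (the default): bins are [0,5], (5,10], (10,15].
# include.lowest = TRUE closes the first bin so that 0 is not dropped.
right_closed <- cut(x, breaks = breaks, include.lowest = TRUE)
as.character(right_closed)
# "[0,5]"   "[0,5]"   "(5,10]"  "(10,15]"

# right = FALSE: bins are [0,5), [5,10), [10,15]; 5 and 10 move up a bin.
left_closed <- cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)
as.character(left_closed)
# "[0,5)"   "[5,10)"  "[10,15]" "[10,15]"
```

In both cases each boundary value lands in exactly one bin, so the bins share an endpoint in their labels but never share a value.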

Tags: r · Binning Continuous Variable to Discrete Without Overlapping Values (Stack Overflow)