I've been struggling to find a way to bin a continuous variable into discrete bins without the ranges overlapping. I've looked at cut_interval/cut_number/cut_width from ggplot2, as well as other custom functions from StackOverflow, but what drives me nuts is when bins share an endpoint, e.g. 1-5, 5-10, 10-15. For my purposes each bin needs to be exclusive, e.g. 1-4, 5-10, 11-15.
The code chunk below is what I've worked out so far. It lets me set the number of bins, roughly equal in value range, but I still get bins with overlapping values. I would end up using this when making a choropleth map, for example.
make_bins <- function(x, bins){
  maxVal <- max(x, na.rm = TRUE)
  minVal <- min(x, na.rm = TRUE)
  if(maxVal %% 2 == 0){
    maxVal
  } else {
    maxVal <- maxVal + 1
  }
  width <- maxVal / bins
  breaks <- seq(minVal, maxVal, by = width)
  breaks <- c(breaks, Inf)
  labels <- sapply(1:bins, function(i) {
    sprintf("%s-%s", round(breaks[i], 2), round(breaks[i + 1], 2))
  })
  last_one <- max(length(labels))
  labels[last_one] <- sprintf(">%s", breaks[last_one])
  out <- cut(x, breaks = breaks, labels = labels, right = TRUE, include.lowest = TRUE)
  return(out)
}
Here's an example of how it could be used:
values <- runif(100, min = 0, max = 1)
values_binned <- make_bins(values, bins = 4)
table(values_binned)
asked Nov 22, 2024 at 17:17 by SeniorAthlete (edited Nov 22, 2024 at 17:28)
2 Answers
I think what you want is offered by the Hmisc::cut2 function using the g ("g" for number of groups) parameter:
install.packages('Hmisc')
library(Hmisc)
values_binned <- cut2(values, g = 4)
table(values_binned)
#values_binned
#[0.0103,0.214) [0.2143,0.437) [0.4374,0.694) [0.6943,0.984]
#            25             25             25             25
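If installing Hmisc is not an option, a similar equal-count split can be sketched in base R by passing quantile break points to cut. This is an alternative technique, not what cut2 does internally, and it assumes the data have no ties, so the quantile breaks are unique. The seed and sample are only for illustration.

```r
# Base-R sketch: equal-count bins from quantile breaks (no extra packages).
set.seed(123)
values <- runif(100, min = 0, max = 1)

# Five break points give four bins; names = FALSE drops the "25%"-style names.
breaks <- quantile(values, probs = seq(0, 1, length.out = 5), names = FALSE)

# include.lowest = TRUE closes the first bin on the left so min(values) is kept.
values_binned <- cut(values, breaks = breaks, include.lowest = TRUE)
table(values_binned)
```

With 100 distinct values each of the four bins holds 25 observations, and every value lands in exactly one bin.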
The cut functions do not make overlapping bins. In the following example, the bins are defined by
0.000625 <= value <= 0.332
0.332 < value <= 0.663
0.663 < value <= 0.994
That is, the ( symbol means >, the [ symbol means >=, and the ] symbol means <=.
Does that meet your needs?
set.seed(123)
values <- runif(100, min = 0, max = 1)
min(values)
#> [1] 0.0006247733
max(values)
#> [1] 0.9942698
CutValues <- ggplot2::cut_interval(values, n = 3)
table(CutValues)
#> CutValues
#> [0.000625,0.332] (0.332,0.663] (0.663,0.994]
#> 33 33 34
Created on 2024-11-22 with reprex v2.1.1
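If the concern is mainly how the labels read, and the data are whole numbers, the labels can be rewritten so that no endpoint appears twice. The sketch below is one way to do that. bin_integers is a hypothetical helper name, and the approach assumes integer-valued data and integer breaks. For continuous data the (a, b] notation is still needed, because a value such as 5.5 would otherwise fall between "1-5" and "6-10".

```r
# Hypothetical helper: bin whole numbers and label each bin by the
# integers it actually contains, so adjacent labels never share an endpoint.
bin_integers <- function(x, breaks) {
  # cut() with right = TRUE uses (a, b]; on integers that is a + 1 through b.
  lower <- as.integer(breaks[-length(breaks)] + 1)
  upper <- as.integer(breaks[-1])
  labels <- sprintf("%d-%d", lower, upper)
  cut(x, breaks = breaks, labels = labels, right = TRUE)
}

# Boundary values 5 and 10 each fall in exactly one bin.
x <- c(1, 5, 6, 10, 11, 15)
binned <- bin_integers(x, breaks = c(0, 5, 10, 15))
as.character(binned)
#> [1] "1-5"   "1-5"   "6-10"  "6-10"  "11-15" "11-15"
```

The bins themselves are the same ones cut already produces; only the labels change.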
With cut, the intervals do not overlap. Can you post an example where they do? See the help page on open and closed intervals' end points. – Rui Barradas Commented Nov 22, 2024 at 17:43