Plotting a histogram with a density curve that sums to 1 for non-standardized data is ridiculously difficult. There are many existing questions about this, but none of their solutions work for my data. There ought to be a simple solution that just works, but I can't find one.
Some examples:
ggplot2: Overlay histogram with density curve (solution only works with standardized normal data)
ggplot2 density histogram with width=.5, vline and centered bar positions (discrete data and no density curve)
Overlay density and histogram plot with ggplot2 using custom bins (no answer)
Creating a density histogram in ggplot2? (densities do not sum to 1 on my data)
ggplot2 density histogram with custom bin edges (does not sum to 1 on my data)
"Density" curve overlay on histogram where vertical axis is frequency (aka count) or relative frequency? (long explanation with examples, but density is not 1 with my data)
--
Some example code:
# Example code
library(ggplot2)
set.seed(1)
t = data.frame(r = runif(100))
# first we try the obvious simple solution that should work
ggplot(t, aes(r)) +
  geom_histogram() +
  geom_density()
So, clearly the density does not sum to 1.
# maybe geom_histogram needs a ..density.. ?
ggplot(t, aes(r)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()
It did change something, but not correctly.
# maybe geom_density needs a ..density.. too ?
ggplot(t, aes(r)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density(aes(y = ..density..))
No change there.
# maybe binwidth = 1?
ggplot(t, aes(r)) +
  geom_histogram(aes(y = ..density..), binwidth = 1) +
  geom_density(aes(y = ..density..))
Still wrong density curve, but now the histogram is wrong too.
To be sure, I did spend 4 hours trying all kinds of combinations of ..count.. and ..sum.. and ..density.., but since I can't find any documentation about how these are supposed to work, it's semi-blind trial and error.
So I gave up and avoided using ggplot2 to summarize the data.
So first we need to get the right proportions data.frame, and that wasn't so simple:
library(magrittr)
library(plyr)
library(stringr)  # provides str_extract_all()

get_prop_table = function(x, breaks_ = 20) {
  # bin x and convert the bin counts to proportions (these sum to 1)
  x_prop_table = cut(x, breaks_) %>% table(.) %>% prop.table %>% data.frame
  colnames(x_prop_table) = c("interval", "density")
  # recover each bin's midpoint from its interval label
  # (the regex assumes non-negative values below 10)
  intervals = x_prop_table$interval %>% as.character
  fetch_numbers = str_extract_all(intervals, "\\d\\.\\d*")
  x_prop_table$means = laply(fetch_numbers, function(x) {
    x %>% as.numeric %>% mean
  })
  return(x_prop_table)
}
t_df = get_prop_table(t$r)
This gives the kind of summary data we want:
> head(t_df)
          interval density    means
1 (0.00859,0.0585]    0.06 0.033545
2   (0.0585,0.107]    0.09 0.082750
3    (0.107,0.156]    0.07 0.131500
4    (0.156,0.205]    0.10 0.180500
5    (0.205,0.254]    0.08 0.229500
6    (0.254,0.303]    0.03 0.278500
Now we just have to plot it. Should be easy...
ggplot(t_df, aes(means, density)) +
  geom_histogram(stat = "identity") +
  geom_density(stat = "identity")
Umm, not quite what I wanted. To be sure, I did try without stat = "identity" in geom_density, at which point it complained about not having a y.
# let's try adding ..density.. then
ggplot(t_df, aes(means, density)) +
  geom_histogram(stat = "identity") +
  geom_density(aes(y = ..density..))
Even more strange.
Okay, maybe let's give up on getting the density curve from summary data. Maybe we need to mix the approaches a bit...
# adding together
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data = t, aes(r, y = ..density..), stat = 'density')
Ok, at least the shape is right now. Now, we need to somehow scale it down.
# let's try dividing by the number of bins
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data = t, aes(r, y = ..density../20), stat = 'density')
Looks like we have a winner. Except that the number is hardcoded.
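A rough sketch of why dividing by 20 happens to work here, assuming the data span roughly 0 to 1 (the variable names are only illustrative): the share of observations in a bin is approximately the density times the bin width, and with 20 equal-width bins over a range of about 1 the bin width is about 1/20.

```r
set.seed(1)
r <- runif(100)
n_bins <- 20
# width of each of the 20 equal-width bins over the observed range
bin_width <- diff(range(r)) / n_bins
# compare with the hardcoded factor 1/20
print(c(bin_width = bin_width, one_over_n_bins = 1 / n_bins))
```

For data on another scale the two numbers would differ, so multiplying the density by the bin width is the more general scaling than dividing by the number of bins.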
# removing the hardcoding?
divisor = nrow(t_df)
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data = t, aes(r, y = ..density../divisor), stat = 'density')
Error in eval(expr, envir, enclos) : object 'divisor' not found
Well, I almost expected it to work. Next I tried adding some ..'s here and there, as well as ..count.. and ..sum..; the first gave another wrong result, and the second threw an error. I also tried using a multiplier (1/20) instead, with no luck.
# salvation with get()
divisor = nrow(t_df)
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data = t, aes(r, y = ..density../get("divisor", pos = 1)), stat = 'density')
So, I finally got the right figure (I think; I hope).
Please tell me there is an easier way of doing this.
PS. The get() trick apparently does not work within a function. I would have put a working function here for future use, but that wasn't so easy either.
1 solution
First, read Wickham on densities in R, noting the foibles and features of each package/function.
The densities sum to 1, but that doesn't mean the curve line/points will not go above 1.
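A minimal illustration in base R (not part of the original answer): a uniform distribution on [0, 0.5] has a density height of 2 everywhere on its support, yet the area under the curve is still 1.

```r
# density height of Uniform(0, 0.5) at a point inside the support
height <- dunif(0.25, min = 0, max = 0.5)
# area under that density over its support
area <- integrate(dunif, lower = 0, upper = 0.5, min = 0, max = 0.5)$value
print(c(height = height, area = area))
```

The y-axis shows height per unit of x, not a probability, so values above 1 are expected whenever the data are concentrated on an interval shorter than 1.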
The following shows both this and the inaccuracy of (at least) the defaults of density when compared to, say, KernSmooth::bkde (using base plots for brevity of typing):
library(KernSmooth)
library(flux)
library(sfsmisc)
# uniform dist
set.seed(1)
dat <- runif(100)
d1 <- density(dat)
d1_ks <- bkde(dat)
par(mfrow=c(2,1))
plot(d1)
plot(d1_ks, type="l")
auc(d1$x, d1$y)
## [1] 1.000921
integrate.xy(d1$x, d1$y)
## [1] 1.000921
auc(d1_ks$x, d1_ks$y)
## [1] 1
integrate.xy(d1_ks$x, d1_ks$y)
## [1] 1
Do the same for the beta distribution:
# beta dist
set.seed(1)
dat <- rbeta(100, 0.5, 0.1)
d2 <- density(dat)
d2_ks <- bkde(dat)
par(mfrow=c(2,1))
plot(d2)
plot(d2_ks, type="l")
auc(d2$x, d2$y)
## [1] 1.000187
integrate.xy(d2$x, d2$y)
## [1] 1.000188
auc(d2_ks$x, d2_ks$y)
## [1] 1
integrate.xy(d2_ks$x, d2_ks$y)
## [1] 1
auc and integrate.xy both use the trapezoid rule, but I ran both to show that and to show the results from two different functions.
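For reference, the trapezoid rule itself fits in one line of base R. The helper name trapz below is hypothetical (it is not the implementation used by flux or sfsmisc), and the sketch is only meant to show what is being computed:

```r
# trapezoid rule: sum of (step in x) * (average of the two neighbouring heights)
trapz <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

set.seed(1)
dat <- runif(100)
d <- density(dat)
area <- trapz(d$x, d$y)
print(area)
```

The result should be close to 1, because density() returns curve heights on a grid of x values and the area between neighbouring grid points is what gets summed.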
The point is that the densities do in fact sum to 1, despite the y-axis values leading you to believe that they do not. I'm not sure what you are trying to solve with your manipulations.
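If the goal is a histogram whose bar heights are per-bin proportions (so the heights sum to 1) with a density curve on the same scale, one possible approach is to multiply the density by the bin width. This is a sketch under assumptions, not something taken from the question or the answer: the function name plot_prop_hist and its arguments are hypothetical, and it relies on after_stat(), which needs ggplot2 3.3.0 or later.

```r
library(ggplot2)

# hypothetical helper: proportion histogram with a rescaled density curve
plot_prop_hist <- function(x, bin_width) {
  df <- data.frame(x = x)
  ggplot(df, aes(x)) +
    # bar height = share of observations falling in the bin
    geom_histogram(aes(y = after_stat(count / sum(count))),
                   binwidth = bin_width, boundary = 0) +
    # density * bin width approximates the share of observations per bin
    geom_density(aes(y = after_stat(density * bin_width)))
}

set.seed(1)
p <- plot_prop_hist(runif(100), bin_width = 0.05)
```

Because the bin width is an ordinary function argument, nothing is hardcoded, and in recent ggplot2 versions no get() workaround should be needed. Printing p draws the plot.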