ggplot2 histogram with a density curve that sums to 1

Time: 2021-08-01 14:58:45

Plotting a histogram with a density curve that sums to 1 for non-standardized data is ridiculously difficult. There are many questions about this already, but none of their solutions work for my data. There ought to be a simple solution that just works, but I can't find one.

Some examples:

solution only works with standardized normal data: "ggplot2: Overlay histogram with density curve"

with discrete data and no density curve: "ggplot2 density histogram with width=.5, vline and centered bar positions"

no answer: "Overlay density and histogram plot with ggplot2 using custom bins"

densities do not sum to 1 on my data: "Creating a density histogram in ggplot2?"

does not sum to 1 on my data: "ggplot2 density histogram with custom bin edges"

long explanation here with examples, but density is not 1 with my data: "'Density' curve overlay on histogram where vertical axis is frequency (aka count) or relative frequency?"

- - -

Some example code:

#Example code
library(ggplot2)
set.seed(1)
t = data.frame(r = runif(100))

#first we try the obvious simple solution that should work
ggplot(t, aes(r)) + 
  geom_histogram() + 
  geom_density()

So, clearly the density does not sum to 1.

#maybe geom_histogram needs a ..density.. ?
ggplot(t, aes(r)) + 
  geom_histogram(aes(y = ..density..)) + 
  geom_density()

It did change something, but not correctly.
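
A numeric check may help here (a sketch using base R's hist() rather than ggplot2, with an arbitrary breaks = 30; neither is from the original post). On the density scale each bar's height is the bin's proportion divided by the bin width, so it is the bar areas that add up to 1, not the bar heights:

```r
# Same data as in the question.
set.seed(1)
r <- runif(100)

# Bin the data without drawing anything.
h <- hist(r, breaks = 30, plot = FALSE)

bin_width   <- diff(h$breaks)            # width of each bin
proportions <- h$counts / sum(h$counts)  # share of observations per bin

# Density-scale bar height = proportion / bin width.
area_total   <- sum(h$density * bin_width)  # total area of the bars
height_total <- sum(h$density)              # sum of the bar heights

print(area_total)    # the areas add up to 1
print(height_total)  # the heights do not, unless the bin width happens to be 1
```

So a y-axis that runs above 1 is not by itself a sign that the density is wrong.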

#maybe geom_density needs a ..density.. too ?
ggplot(t, aes(r)) + 
  geom_histogram(aes(y = ..density..)) + 
  geom_density(aes(y = ..density..))

No change there.

#maybe binwidth = 1?
ggplot(t, aes(r)) + 
  geom_histogram(aes(y = ..density..), binwidth=1) + 
  geom_density(aes(y = ..density..))

Still wrong density curve, but now the histogram is wrong too.

To be sure, I did spend 4 hours trying all kinds of combinations of ..count.. and ..sum.. and ..density.., but since I can't find any documentation about how these are supposed to work, it's semi-blind trial and error.
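
For reference, the ..name.. variables are the values each layer's stat computes; ggplot2 3.3.0 and later spell them after_stat(name), and the "Computed variables" sections of ?geom_histogram and ?geom_density list them. One way to see what is available is to pull the computed data out of a plot with layer_data(). A sketch, assuming ggplot2 is installed and using bins = 20 as an arbitrary choice:

```r
library(ggplot2)

set.seed(1)
t <- data.frame(r = runif(100))

p <- ggplot(t, aes(r)) +
  geom_histogram(bins = 20) +
  geom_density()

hist_data <- layer_data(p, 1)   # what the histogram's stat computed
dens_data <- layer_data(p, 2)   # what the density's stat computed

print(names(hist_data))         # includes count and density
print(names(dens_data))         # includes density

# The histogram's density column is count / (n * bin width),
# so the bar areas add up to 1.
bar_area <- sum(hist_data$density * (hist_data$xmax - hist_data$xmin))
print(bar_area)
```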

So I gave up and avoided using ggplot2 to summarize the data.

So first we need to get the right proportions data.frame, and that wasn't so simple:

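get_prop_table = function(x, breaks_ = 20){
  library(magrittr)
  library(plyr)
  library(stringr)  # provides str_extract_all()
  # bin the data and convert the counts to proportions
  x_prop_table = cut(x, breaks_) %>% table(.) %>% prop.table %>% data.frame
  colnames(x_prop_table) = c("interval", "density")
  intervals = x_prop_table$interval %>% as.character
  # pull the two endpoints out of labels such as "(0.107,0.156]"
  # (this pattern only matches non-negative numbers with a single leading digit)
  fetch_numbers = str_extract_all(intervals, "\\d\\.\\d*")
  # midpoint of each interval
  x_prop_table$means = laply(fetch_numbers, function(x) {
    x %>% as.numeric %>% mean
  })
  return(x_prop_table)
}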

t_df = get_prop_table(t$r)

This gives the kind of summary data we want:

> head(t_df)
          interval density    means
1 (0.00859,0.0585]    0.06 0.033545
2   (0.0585,0.107]    0.09 0.082750
3    (0.107,0.156]    0.07 0.131500
4    (0.156,0.205]    0.10 0.180500
5    (0.205,0.254]    0.08 0.229500
6    (0.254,0.303]    0.03 0.278500

Now we just have to plot it. Should be easy...

ggplot(t_df, aes(means, density)) + 
  geom_histogram(stat = "identity") +
  geom_density(stat = "identity")

Umm, not quite what I wanted. To be sure, I did try without stat = "identity" in geom_density, at which point it complained about not having a y.

#lets try adding ..density.. then
ggplot(t_df, aes(means, density)) + 
  geom_histogram(stat = "identity") +
  geom_density(aes(y = ..density..))

Even more strange.

Okay, maybe let's give up on getting the density curve from summary data. Maybe we need to mix the approaches a bit...

#adding together
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density..), stat = 'density')

Ok, at least the shape is right now. Now, we need to somehow scale it down.

#lets try dividing by the number of bins
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density../20), stat = 'density')

Looks like we have a winner. Except that the number is hardcoded.
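
Dividing by the number of bins works here only because runif() data span roughly 0 to 1, so each of the 20 bins is about 1/20 wide. In general the proportion of observations in a bin is approximately the density times the bin width, so the bin width is the scale factor. A base-R sketch with the same data stretched to a range of about 10 (the stretching and the explicit breaks are assumptions for illustration):

```r
set.seed(1)
x <- runif(100) * 10           # same shape as the question's data, range ~10

n_bins    <- 20
breaks    <- seq(min(x), max(x), length.out = n_bins + 1)
bin_width <- diff(breaks)[1]   # about 0.5 here, not 1/20

# Proportion of observations falling in each bin.
props <- as.numeric(prop.table(table(cut(x, breaks, include.lowest = TRUE))))

d <- density(x)
curve_by_width <- d$y * bin_width  # density rescaled by the bin width
curve_by_bins  <- d$y / n_bins     # density rescaled by the number of bins

print(mean(props))             # 0.05, i.e. 1 / n_bins
print(median(curve_by_width))  # on the same scale as the bar heights
print(median(curve_by_bins))   # too small by a factor of about range(x)
```

The two rescalings differ by a factor of bin_width * n_bins, which is the range of the data, so they only agree when that range is 1.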

#removing the hardcoding?
divisor = nrow(t_df)
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density../divisor), stat = 'density')

Error in eval(expr, envir, enclos) : object 'divisor' not found

Well, I almost expected it to work. Next I tried adding some ..'s here and there, and also ..count.. and ..sum..; the first gave another wrong result and the second threw an error. I also tried using a multiplier (with 1/20), no luck.

#salvation with get()
divisor = nrow(t_df)
ggplot(t_df, aes(means, density)) +
  geom_bar(stat = "identity") +
  geom_density(data=t, aes(r, y = ..density../get("divisor", pos = 1)), stat = 'density')

So, I finally got the right figure (I think; I hope).

Please tell me there is an easier way of doing this.

PS. The get() trick apparently does not work within a function. I would have put a working function here for future use, but that wasn't so easy either.
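
In recent ggplot2 versions, aes() expressions should be evaluated in the environment where they were written, so a local variable can be used directly and get() should not be needed, inside a function or otherwise. A sketch of such a function, assuming ggplot2 3.3.0 or later for after_stat(); plot_prop_hist and its arguments are hypothetical names, not from the original post:

```r
library(ggplot2)

# Hypothetical helper: proportion histogram with a density curve on the same scale.
plot_prop_hist <- function(x, n_bins = 20) {
  df <- data.frame(x = x)
  bin_width <- diff(range(x)) / n_bins   # local variable used inside aes()

  ggplot(df, aes(x)) +
    # bar height = share of observations in the bin
    geom_histogram(aes(y = after_stat(count / sum(count))),
                   binwidth = bin_width, boundary = min(x)) +
    # density * bin width puts the curve on the same proportion scale
    geom_density(aes(y = after_stat(density * bin_width))) +
    labs(y = "proportion")
}

set.seed(1)
p <- plot_prop_hist(runif(100))
print(p)
```

With this scaling the bar heights sum to 1 and the curve tracks them, whatever the range of the data.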

1 solution

#1


First, read Wickham on densities in R, noting the foibles and features of each package/function.

The densities sum to 1, but that doesn't mean the curve line/points will not go above 1.
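
A small example of a density that is above 1 wherever it is non-zero yet still integrates to 1 (a sketch; the uniform distribution on [0, 0.5] is chosen only for illustration):

```r
# Uniform density on [0, 0.5]: constant height 1 / 0.5 = 2 on that interval.
f <- function(x) dunif(x, min = 0, max = 0.5)

height <- f(0.25)
area   <- integrate(f, lower = 0, upper = 0.5)$value

print(height)  # 2
print(area)    # 1, up to numerical error
```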

The following shows both this and the inaccuracy of (at least) the defaults of density when compared to, say, KernSmooth::bkde (using base plots for brevity of typing):

library(KernSmooth)
library(flux)
library(sfsmisc)

# uniform dist
set.seed(1)
dat <- runif(100)

d1 <- density(dat)
d1_ks <- bkde(dat)

par(mfrow=c(2,1))
plot(d1)
plot(d1_ks, type="l")

auc(d1$x, d1$y)
## [1] 1.000921

integrate.xy(d1$x, d1$y)
## [1] 1.000921

auc(d1_ks$x, d1_ks$y)
## [1] 1

integrate.xy(d1_ks$x, d1_ks$y)
## [1] 1

Do the same for the beta distribution:

# beta dist
set.seed(1)
dat <- rbeta(100, 0.5, 0.1)

d2 <- density(dat)
d2_ks <- bkde(dat)

par(mfrow=c(2,1))
plot(d2)
plot(d2_ks, type="l")

auc(d2$x, d2$y)
## [1] 1.000187

integrate.xy(d2$x, d2$y)
## [1] 1.000188

auc(d2_ks$x, d2_ks$y)
## [1] 1

integrate.xy(d2_ks$x, d2_ks$y)
## [1] 1

auc and integrate.xy both use the trapezoid rule but I ran them to both show that and to show the results from two different functions.
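
For readers without flux or sfsmisc installed, the trapezoid rule is short enough to write by hand. A sketch in base R (trapezoid_area is a hypothetical helper, not part of either package):

```r
# Trapezoid rule: sum over intervals of width * average of the two end heights.
trapezoid_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

set.seed(1)
dat <- runif(100)
d1  <- density(dat)

area <- trapezoid_area(d1$x, d1$y)
print(area)   # close to 1
```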

The point is that the densities do in fact sum to 1, despite the y-axis values leading you to believe that they do not. I'm not sure what you are trying to solve with your manipulations.
