如何绘制轮廓线，显示95％的值落在R和ggplot2中的位置

Say we have:

说我们有：

x <- rnorm(1000)
y <- rnorm(1000)

How do I use ggplot2 to produce a plot containing the two following geoms:

如何使用ggplot2生成包含以下两个geom的图：

The bivariate expectation of the two series of values
两个系列值的双变量期望
A contour line showing where 95% of the estimates fall within?
一条等高线显示95％的估算值在哪里？

I know how to do the first part:

我知道如何做第一部分：

 df <- data.frame(x=x, y=y)
 p <- ggplot(df, aes(x=x, y=y))
 p <- p + xlim(-10, 10) + ylim(-10, 10) # say
 p <- p + geom_point(x=mean(x), y=mean(y))

And I also know about the stat_contour() and stat_density2d() functions within ggplot2.

我也知道ggplot2中的stat_contour（）和stat_density2d（）函数。

And I also know that there are 'bins' options within stat_contour.

我也知道stat_contour中有'bins'选项。

However, I guess what I need is something like the probs argument within quantile, but over two dimensions rather than one.

但是，我想我需要的是像分位数中的probs参数，但是超过两个维度而不是一个维度。

I have also seen a solution within the graphics package. However, I would like to do this within ggplot.

我也在图形包中看到了一个解决方案。但是，我想在ggplot中这样做。

Help much appreciated,

非常感谢，

Jon

乔恩

4 个解决方案

#1

Unfortunately, the accepted answer currently fails with Error: Unknown parameters: breaks on ggplot2 2.1.0. I cobbled together an alternative approach based on the code in this answer, which uses the ks package for computing the kernel density estimate:

不幸的是，接受的答案目前失败，错误：未知参数：在ggplot2 2.1.0上中断。我根据本答案中的代码拼凑了一种替代方法，该方法使用ks包来计算内核密度估计：

library(ggplot2)

set.seed(1001)
d <- data.frame(x=rnorm(1000),y=rnorm(1000))

kd <- ks::kde(d, compute.cont=TRUE)
contour_95 <- with(kd, contourLines(x=eval.points[[1]], y=eval.points[[2]],
                                    z=estimate, levels=cont["5%"])[[1]])
contour_95 <- data.frame(contour_95)

ggplot(data=d, aes(x, y)) +
  geom_point() +
  geom_path(aes(x, y), data=contour_95) +
  theme_bw()

Here's the result:

这是结果：

TIP: The ks package depends on the rgl package, which can be a pain to compile manually. Even if you're on Linux, it's much easier to get a precompiled version, e.g. sudo apt install r-cran-rgl on Ubuntu if you have the appropriate CRAN repositories set up.

提示：ks包依赖于rgl包，这可能很难手动编译。即使您使用的是Linux，也可以更轻松地获得预编译版本，例如：如果你有相应的CRAN存储库设置，sudo apt在Ubuntu上安装r-cran-rgl。

#2

This works, but is quite inefficient because you actually have to compute the kernel density estimate three times.

这可行，但效率很低，因为您实际上必须计算三次内核密度估计值。

set.seed(1001)
d <- data.frame(x=rnorm(1000),y=rnorm(1000))
getLevel <- function(x,y,prob=0.95) {
    kk <- MASS::kde2d(x,y)
    dx <- diff(kk$x[1:2])
    dy <- diff(kk$y[1:2])
    sz <- sort(kk$z)
    c1 <- cumsum(sz) * dx * dy
    approx(c1, sz, xout = 1 - prob)$y
}
L95 <- getLevel(d$x,d$y)
library(ggplot2); theme_set(theme_bw())
ggplot(d,aes(x,y)) +
   stat_density2d(geom="tile", aes(fill = ..density..),
                  contour = FALSE)+
   stat_density2d(colour="red",breaks=L95)

(with help from http://comments.gmane.org/gmane.comp.lang.r.ggplot2/303)

（在http://comments.gmane.org/gmane.comp.lang.r.ggplot2/303的帮助下）

update: with a recent version of ggplot2 (2.1.0) it doesn't seem possible to pass breaks to stat_density2d (or at least I don't know how), but the method below with geom_contour still seems to work ...

更新：使用最新版本的ggplot2（2.1.0）似乎无法将中断传递给stat_density2d（或者至少我不知道如何），但下面使用geom_contour的方法似乎仍然可行...

You can make things a little more efficient by computing the kernel density estimate once and plotting the tiles and contours from the same grid:

您可以通过计算一次核密度估计并绘制来自同一网格的切片和轮廓来提高效率：

kk <- with(dd,MASS::kde2d(x,y))
library(reshape2)
dimnames(kk$z) <- list(kk$x,kk$y)
dc <- melt(kk$z)
ggplot(dc,aes(x=Var1,y=Var2))+
   geom_tile(aes(fill=value))+
   geom_contour(aes(z=value),breaks=L95,colour="red")

doing the 95% level computation from the kk grid (to reduce the number of kernel computations to 1) is left as an exercise
从kk网格执行95％级别的计算（以将内核计算的数量减少到1）留作练习
I'm not sure why stat_density2d(geom="tile") and geom_tile give slightly different results (the former is smoothed)
我不确定为什么stat_density2d（geom =“tile”）和geom_tile给出的结果略有不同（前者是平滑的）
I haven't added the bivariate mean, but something like annotate("point",x=mean(d$x),y=mean(d$y),colour="red") should work.
我没有添加双变量均值，但是注释（“点”，x =平均值（d $ x），y =平均值（d $ y），颜色=“红色”）应该有效。

#3

I had an example where the MASS::kde2d() bandwidth specifications were not flexible enough, so I ended up using the ks package and the ks::kde() function and, as an example, the ks::Hscv() function to estimate flexible bandwidths that captured the smoothness better. This computation can be a bit slow, but it has much better performance in some situations. Here is a version of the above code for that example:

我有一个例子，其中MASS :: kde2d（）带宽规范不够灵活，所以我最终使用了ks包和ks :: kde（）函数，例如，ks :: Hscv（）函数估计更好地捕获平滑度的灵活带宽。这种计算可能有点慢，但在某些情况下它具有更好的性能。以下是该示例的上述代码的一个版本：

set.seed(1001)
d <- data.frame(x=rnorm(1000),y=rnorm(1000))
getLevel <- function(x,y,prob=0.95) {
    kk <- MASS::kde2d(x,y)
    dx <- diff(kk$x[1:2])
    dy <- diff(kk$y[1:2])
    sz <- sort(kk$z)
    c1 <- cumsum(sz) * dx * dy
    approx(c1, sz, xout = 1 - prob)$y
}
L95 <- getLevel(d$x,d$y)
library(ggplot2); theme_set(theme_bw())
ggplot(d,aes(x,y)) +
    stat_density2d(geom="tile", aes(fill = ..density..),
                   contour = FALSE)+
    stat_density2d(colour="red",breaks=L95)

## using ks::kde
hscv1 <- Hscv(d)
fhat <- ks::kde(d, H=hscv1, compute.cont=TRUE)

dimnames(fhat[['estimate']]) <- list(fhat[["eval.points"]][[1]], 
                                     fhat[["eval.points"]][[2]])
library(reshape2)
aa <- melt(fhat[['estimate']])

ggplot(aa, aes(x=Var1, y=Var2)) +
    geom_tile(aes(fill=value)) +
    geom_contour(aes(z=value), breaks=fhat[["cont"]]["50%"], color="red") +
    geom_contour(aes(z=value), breaks=fhat[["cont"]]["5%"], color="purple")

For this particular example, the differences are minimal, but in an example where the bandwidth specification requires more flexibility, this modification may be important. Note that the 95% contour is specified using the breaks=fhat[["cont"]]["5%"], which I found a little bit counter-intuitive, because it is called here the "5% contour".

对于此特定示例，差异很小，但在带宽规范需要更多灵活性的示例中，此修改可能很重要。请注意，95％的轮廓是使用breaks = fhat [[“cont”]] [“5％”]指定的，我发现它有点反直觉，因为它在这里被称为“5％轮廓”。

#4

Riffing off of Ben Bolker's answer, a solution that can handle multiple levels and works with ggplot 2.2.1:

重复一下Ben Bolker的答案，一个可以处理多个级别并与ggplot 2.2.1一起使用的解决方案：

library(ggplot2)
library(MASS)
library(reshape2)
# create data:
set.seed(8675309)
Sigma <- matrix(c(0.1,0.3,0.3,4),2,2)
mv <- data.frame(mvrnorm(4000,c(1.5,16),Sigma))

# get the kde2d information: 
mv.kde <- kde2d(mv[,1], mv[,2], n = 400)
dx <- diff(mv.kde$x[1:2])  # lifted from emdbook::HPDregionplot()
dy <- diff(mv.kde$y[1:2])
sz <- sort(mv.kde$z)
c1 <- cumsum(sz) * dx * dy

# specify desired contour levels:
prob <- c(0.95,0.90,0.5)

# plot:
dimnames(mv.kde$z) <- list(mv.kde$x,mv.kde$y)
dc <- melt(mv.kde$z)
dc$prob <- approx(sz,1-c1,dc$value)$y
p <- ggplot(dc,aes(x=Var1,y=Var2))+
  geom_contour(aes(z=prob,color=..level..),breaks=prob)+
  geom_point(aes(x=X1,y=X2),data=mv,alpha=0.1,size=1)
print(p)

The result:

结果：

#1

library(ggplot2)

set.seed(1001)
d <- data.frame(x=rnorm(1000),y=rnorm(1000))

kd <- ks::kde(d, compute.cont=TRUE)
contour_95 <- with(kd, contourLines(x=eval.points[[1]], y=eval.points[[2]],
                                    z=estimate, levels=cont["5%"])[[1]])
contour_95 <- data.frame(contour_95)

ggplot(data=d, aes(x, y)) +
  geom_point() +
  geom_path(aes(x, y), data=contour_95) +
  theme_bw()