How to calculate the 95th percentile of values with a grouping variable in R or Excel

I'm trying to calculate the 95th percentile for multiple water quality values grouped by watershed. For example:

Watershed   WQ
50500101    62.370661
50500101    65.505046
50500101    58.741477
50500105    71.220034
50500105    57.917249

I reviewed this question: Percentile for Each Observation w/r/t Grouping Variable. It seems very close to what I want to do, but it computes a percentile for EACH observation. I need one value for each level of the grouping variable. So ideally:

Watershed   WQ - 95th
50500101    x
50500105    y

thanks

6 answers

#1

This can be achieved using the plyr library. We specify the grouping variable Watershed and ask for the 95% quantile of WQ.

library(plyr)
#Random seed
set.seed(42)
#Sample data
dat <- data.frame(Watershed = sample(letters[1:2], 100, TRUE), WQ = rnorm(100))
#plyr call
ddply(dat, "Watershed", summarise, WQ95 = quantile(WQ, .95))

and the results

  Watershed     WQ95
1         a 1.353993
2         b 1.461711

#2

I hope I understand your question correctly. Is this what you're looking for?

my.df <- data.frame(group = gl(3, 5), var = runif(15))
aggregate(my.df$var, by = list(my.df$group), FUN = function(x) quantile(x, probs = 0.95))

  Group.1         x
1       1 0.6913747
2       2 0.8067847
3       3 0.9643744

EDIT

Based on Vincent's answer,

aggregate(my.df$var, by = list(my.df$group), FUN = quantile, probs  = 0.95)

also works (there are 1001 ways to skin a cat, or so I've been told). As a side note, you can specify a vector of desired quantiles, say c(0.1, 0.2, 0.3...) for deciles. Or you can try the summary function for some predefined statistics.

aggregate(my.df$var, by = list(my.df$group), FUN = summary)
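
To illustrate the side note about passing several probabilities at once, here is a minimal sketch using the five sample rows from the question. The names wq, probs and res are made up for this example, and the formula interface of aggregate is used instead of the by = list(...) form shown above.

```r
# Sample rows copied from the question
wq <- data.frame(
  Watershed = c(50500101, 50500101, 50500101, 50500105, 50500105),
  WQ = c(62.370661, 65.505046, 58.741477, 71.220034, 57.917249)
)

# Hypothetical choice of probabilities: median and 95th percentile
probs <- c(0.5, 0.95)

# Extra arguments after FUN are passed on to quantile()
res <- aggregate(WQ ~ Watershed, data = wq, FUN = quantile, probs = probs)
print(res)
```

When FUN returns more than one value, the WQ column of the result is a matrix with one column per requested probability, so res$WQ[, 2] holds the 95th percentiles.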

#3

Use a combination of the tapply and quantile functions. For example, if your dataset looks like this:

DF <- data.frame(watershed = sample(c('a', 'b', 'c', 'd'), 1000, replace = TRUE), wq = rnorm(1000))

Use this:

with(DF, tapply(wq, watershed, quantile, probs=0.95))
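
tapply returns a vector named by group rather than a table. If you want the two-column layout asked for in the question, a sketch along these lines should work. It uses the question's five sample rows; the names wq, q95 and result are invented for the example.

```r
# Sample rows copied from the question
wq <- data.frame(
  Watershed = c(50500101, 50500101, 50500101, 50500105, 50500105),
  WQ = c(62.370661, 65.505046, 58.741477, 71.220034, 57.917249)
)

# One 95th percentile per watershed, named by watershed id
q95 <- with(wq, tapply(WQ, Watershed, quantile, probs = 0.95))

# Reshape into one row per watershed
result <- data.frame(Watershed = names(q95), WQ95 = as.vector(q95))
print(result)
```

Note that the watershed ids come back as text here, because tapply treats the grouping variable as a factor.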

#4

In Excel, you're going to want to use an array formula to make this easy. I suggest the following:

{=PERCENTILE(IF($A$2:$A$6 = Watershed ID, $B$2:$B$6), 0.95)}

Column A would be the Watershed ids, and Column B would be the WQ values.

Also, be sure to enter the formula as an array formula. Do so by pressing Ctrl+Shift+Enter when entering the formula.
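
If you want to cross-check the Excel result in R: as far as I know, Excel's PERCENTILE interpolates linearly between the sorted values in the same way as R's default quantile type (type = 7), so the two should agree. The sketch below works through that interpolation by hand for the first watershed in the question; the variable names are made up for the example.

```r
# WQ values for watershed 50500101, copied from the question
wq <- c(62.370661, 65.505046, 58.741477)

# Linear interpolation between order statistics (assumed to match PERCENTILE)
x <- sort(wq)
h <- (length(x) - 1) * 0.95 + 1          # fractional position in the sorted data
lo <- floor(h)
hi <- ceiling(h)
manual <- x[lo] + (h - lo) * (x[hi] - x[lo])

# R's default quantile type is 7
r_default <- unname(quantile(wq, 0.95, type = 7))

print(c(manual = manual, r_default = r_default))
```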

#5

Using the data.table package you can do:

library(data.table)
#Random seed
set.seed(42)
#Sample data
dt <- data.table(Watershed = sample(letters[1:2], 100, TRUE), WQ = rnorm(100))

#95th percentile of WQ per Watershed
dt[ ,
    j = .(WQ95 = quantile(WQ, .95, na.rm = TRUE)),
    by = Watershed]

#6

Based on Chase's answer, here is a solution using the dplyr package. Which solution to use is of course a matter of preference; I like the relative clarity (for me) of the "piping" (%>%) method used in dplyr:

library(dplyr)
#Random seed
set.seed(42)
#Sample data
dat <- data.frame(Watershed = sample(letters[1:2], 100, TRUE), WQ = rnorm(100))
#dplyr call
dat %>% group_by(Watershed) %>% summarise(WQ95 = quantile(WQ, 0.95))
