I have a data frame and I'm trying to add a new variable to it that holds the quantile group of a continuous variable, var1, computed separately within each level of a factor, strata.
# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
                  )
# function to get quantiles
qfun <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
                    include.lowest = TRUE, labels = 1:q)
    quantile
}
I tried two methods, neither of which produces a usable result. First, I tried using aggregate to apply qfun to each level of strata:
qdat <- with(dat, aggregate(var1, list(strata), FUN = qfun))
This returns the quantiles by factor level, but the output is hard to coerce back into a data frame (e.g., using unlist does not line the new variable's values up with the correct rows in the data frame).
A second approach was to do this in steps:
tmp1 <- with(dat, split(var1, strata))
tmp2 <- lapply(tmp1, qfun)
tmp3 <- unlist(tmp2)
dat$quintiles <- tmp3
Again, this calculates the quantiles correctly for each factor level but, as with aggregate, the values aren't in the same order as the rows of the data frame. We can check this by putting the quantile "bins" into the data frame.
# get quantile bins
qfun2 <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
                    include.lowest = TRUE)
    quantile
}
tmp11 <- with(dat, split(var1, strata))
tmp22 <- lapply(tmp11, qfun2)
tmp33 <- unlist(tmp22)
dat$quintiles2 <- tmp33
Many of the values of var1 fall outside the bins shown in quintiles2. I feel like I'm missing something simple. Any suggestions would be greatly appreciated.
2 solutions
#1
8
I think your issue is that you don't really want to aggregate; you want to use ave instead (or data.table or plyr).
# using base R
qdat <- transform(dat, qq = ave(var1, strata, FUN = qfun))
# using plyr
library(plyr)
qdat <- ddply(dat, .(strata), mutate, qq = qfun(var1))
# using data.table (my preference); dat has to be a data.table first
library(data.table)
dat <- as.data.table(dat)
dat[, qq := qfun(var1), by = strata]
Aggregating usually implies returning an object that is smaller than the original. (In this case you were getting a data.frame where x was a list with one element for each stratum.) ave, by contrast, returns a vector of the same length as its input, with each value placed back in its original position.
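To make the difference in shape concrete, here is a minimal sketch on toy data. The names toy, qbin and half are made up for illustration, and qbin is a variant of qfun that returns plain integer bin numbers and splits into halves (q = 2) so the result is easy to check by eye. It is not the asker's data, just an assumption-laden demo.

```r
# Toy data (hypothetical): two strata, interleaved row by row
toy <- data.frame(
  var1   = c(10, 1, 20, 2, 30, 3, 40, 4),
  strata = factor(c("A", "B", "A", "B", "A", "B", "A", "B"))
)

# Same idea as qfun, but returning integer bin numbers instead of a factor
qbin <- function(x, q = 2) {
  as.integer(cut(x, breaks = quantile(x, probs = 0:q / q),
                 include.lowest = TRUE, labels = 1:q))
}

# aggregate() collapses the data to one row per stratum
agg <- with(toy, aggregate(var1, list(strata), FUN = qbin))
nrow(agg)

# ave() returns one value per original row, in the original row order
toy$half <- with(toy, ave(var1, strata, FUN = qbin))
toy
```

Within each stratum the two smaller values should land in bin 1 and the two larger in bin 2, whatever row they sit in.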
#2
1
Use ave on your dat data frame. Full example with your simulated data and qfun function:
# some data
set.seed(472)
dat <- data.frame(var1 = rnorm(50, 10, 3)^2,
                  strata = factor(sample(LETTERS[1:5], size = 50, replace = TRUE))
                  )
# function to get quantiles
qfun <- function(x, q = 5) {
    quantile <- cut(x, breaks = quantile(x, probs = 0:q/q),
                    include.lowest = TRUE, labels = 1:q)
    quantile
}
And my addition...
dat$q <- ave(dat$var1,dat$strata,FUN=qfun)
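For what it's worth, the split/lapply steps from the question can probably be kept as they are if the last step uses unsplit instead of unlist. unlist just stacks the strata one after another, while unsplit puts each value back in the row it came from. A small sketch on made-up data (toy and qbin are hypothetical names; qbin is an integer-returning, two-bin variant of qfun so the result can be checked by hand):

```r
# Toy data (hypothetical): two strata, interleaved row by row
toy <- data.frame(
  var1   = c(10, 1, 20, 2, 30, 3, 40, 4),
  strata = factor(c("A", "B", "A", "B", "A", "B", "A", "B"))
)

# Integer-returning variant of qfun, splitting into halves
qbin <- function(x, q = 2) {
  as.integer(cut(x, breaks = quantile(x, probs = 0:q / q),
                 include.lowest = TRUE, labels = 1:q))
}

# Bin numbers computed separately within each stratum
pieces <- lapply(split(toy$var1, toy$strata), qbin)

# unlist() concatenates stratum by stratum: all of A, then all of B
stacked <- unlist(pieces, use.names = FALSE)

# unsplit() reverses split(), restoring the original row order
restored <- unsplit(pieces, toy$strata)

# ave() does the split / apply / put-back in a single call
via_ave <- ave(toy$var1, toy$strata, FUN = qbin)

cbind(toy, stacked, restored, via_ave)
```

The restored and via_ave columns should agree row for row, while stacked only lines up if the data frame happens to be sorted by strata already.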