I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.
Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets with a large number (400 or 500) of variables?
set.seed(1)
dat <- data.frame(
    A = factor(rep("X", 10), levels = c('X', 'Y')),
    B = round(runif(10) * 10),
    C = rep(10, 10),
    D = c(rep(10, 9), 1),
    E = factor(rep("A", 10)),
    F = factor(rep(c("I", "J"), 5)),
    G = c(rep(10, 9), NA)
)
zeroVar <- function(data, useNA = 'ifany') {
    # count the distinct values in each column (optionally including NA)
    out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
    # columns with a single distinct value have zero variance
    which(out == 1)
}
And here's the result of the process:
> dat
A B C D E F G
1 X 3 10 10 A I 10
2 X 4 10 10 A J 10
3 X 6 10 10 A I 10
4 X 9 10 10 A J 10
5 X 2 10 10 A I 10
6 X 9 10 10 A J 10
7 X 9 10 10 A I 10
8 X 7 10 10 A J 10
9 X 6 10 10 A I 10
10 X 1 10 1 A J NA
> dat[,-zeroVar(dat)]
B D F G
1 3 10 I 10
2 4 10 J 10
3 6 10 I 10
4 9 10 J 10
5 2 10 I 10
6 9 10 J 10
7 9 10 I 10
8 7 10 J 10
9 6 10 I 10
10 1 1 J NA
> dat[,-zeroVar(dat, useNA = 'no')]
B D F
1 3 10 I
2 4 10 J
3 6 10 I
4 9 10 J
5 2 10 I
6 9 10 J
7 9 10 I
8 7 10 J
9 6 10 I
10 1 1 J
7 Answers
#1
Don't use table() - very slow for such things. One option is length(unique(x)):
foo <- function(dat) {
    # number of distinct values in each column (NA counts as a value)
    out <- lapply(dat, function(x) length(unique(x)))
    # !out > 1 parses as !(out > 1): columns with at most one distinct value
    want <- which(!out > 1)
    unlist(want)
}
system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))
Which is an order of magnitude faster than yours on the example data set whilst giving similar output:
> system.time(replicate(1000, zeroVar(dat)))
user system elapsed
3.334 0.000 3.335
> system.time(replicate(1000, foo(dat)))
user system elapsed
0.324 0.000 0.324
Simon's solution here is similarly quick on this example:
> system.time(replicate(1000, which(!unlist(lapply(dat,
+ function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
user system elapsed
0.392 0.000 0.395
but you'll have to see if they scale similarly to real problem sizes.
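For completeness, here is a small usage sketch (my addition, not part of the original answer); the guard matters because dat[, -integer(0)] would select zero columns rather than all of them:

drop_idx <- foo(dat)                              # indices of constant columns
dat_clean <- if (length(drop_idx) > 0) dat[, -drop_idx] else dat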
#2
You may also want to look into the nearZeroVar() function in the caret package.
If you have one event out of 1000, it might be a good idea to discard these data (but this depends on the model). nearZeroVar() can do that.
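A minimal sketch of what that might look like (assuming the caret package is installed); with the default saveMetrics = FALSE, nearZeroVar() returns the indices of the offending columns:

library(caret)
nzv_cols <- nearZeroVar(dat)                      # indices of (near-)zero-variance columns
dat_clean <- if (length(nzv_cols) > 0) dat[, -nzv_cols] else dat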
#3
Simply don't use table - it's extremely slow on numeric vectors since it converts them to strings. I would probably use something like
var0 <- unlist(lapply(df, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))
It will be TRUE for 0-variance, NA for columns with NAs and FALSE for non-zero variance.
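One way to apply that vector (a sketch, using the answer's hypothetical data frame df): %in% maps the NA entries to FALSE, so columns containing NAs are kept here, which you may or may not want.

df_clean <- df[, !(var0 %in% TRUE)]               # drop only confirmed zero-variance columns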
#4
Use the caret package and the function nearZeroVar:
require(caret)
# saveMetrics = TRUE returns one row per column, with logical flags
# zeroVar (strictly constant) and nzv (near-zero variance)
NZV <- nearZeroVar(dataset, saveMetrics = TRUE)
NZV[NZV[,"zeroVar"] > 0, ]                  # constant columns
NZV[NZV[,"zeroVar"] + NZV[,"nzv"] > 0, ]    # constant or near-constant columns
#5
Well, save yourself some coding time:
Rgames: foo
[,1] [,2] [,3]
[1,] 1 1e+00 1
[2,] 1 2e+00 1
[3,] 1 3e+00 1
[4,] 1 4e+00 1
[5,] 1 5e+00 1
[6,] 1 6e+00 2
[7,] 1 7e+00 3
[8,] 1 8e+00 1
[9,] 1 9e+00 1
[10,] 1 1e+01 1
Rgames: sd(foo)
[1] 0.000000e+00 3.027650e+00 6.749486e-01
Warning message:
sd(<matrix>) is deprecated.
Use apply(*, 2, sd) instead.
To avoid nasty floating-point roundoffs, take that output vector, which I'll call "bar", and do something like bar[bar < 2*.Machine$double.eps] <- 0, and then finally your data frame dat[,as.logical(bar)] should do the trick.
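Putting that together with the non-deprecated apply() form, a sketch for a numeric matrix such as foo above:

bar <- apply(foo, 2, sd)                    # column standard deviations
bar[bar < 2 * .Machine$double.eps] <- 0     # treat tiny values as exactly zero
foo_clean <- foo[, as.logical(bar)]         # keep columns with non-zero sd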
#6
How about using factor to count the number of unique elements and looping with sapply:
dat[sapply(dat, function(x) length(levels(factor(x)))>1)]
B D F
1 3 10 I
2 4 10 J
3 6 10 I
4 9 10 J
5 2 10 I
6 9 10 J
7 9 10 I
8 7 10 J
9 6 10 I
10 1 1 J
NAs are excluded by default, but this can be changed with the exclude parameter of factor:
dat[sapply(dat, function(x) length(levels(factor(x,exclude=NULL)))>1)]
B D F G
1 3 10 I 10
2 4 10 J 10
3 6 10 I 10
4 9 10 J 10
5 2 10 I 10
6 9 10 J 10
7 9 10 I 10
8 7 10 J 10
9 6 10 I 10
10 1 1 J NA
#7
I think having zero variance is equivalent to being constant, so one can get by without doing any arithmetic operations at all. I would expect range() to outperform var(), but I have not verified this:
removeConstantColumns <- function(a_dataframe, verbose = FALSE) {
    # TRUE if a column takes more than one (non-NA) value
    notConstant <- function(x) {
        if (is.factor(x)) x <- as.integer(x)
        return(0 != diff(range(x, na.rm = TRUE)))
    }
    bkeep <- sapply(a_dataframe, notConstant)
    if (verbose) {
        cat('removeConstantColumns: '
            , ifelse(all(bkeep)
                     , 'nothing'
                     , paste(names(a_dataframe)[!bkeep], collapse = ','))
            , ' removed', '\n')
    }
    return(a_dataframe[, bkeep])
}
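A quick usage sketch on the question's dat: columns A, C, E and G are dropped (G because its only non-NA value is constant), leaving B, D and F.

removeConstantColumns(dat, verbose = TRUE)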