Quickly remove zero-variance variables from a data.frame

Date: 2021-06-20 20:07:39

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.

Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets, with a large number (400 or 500) of variables?

set.seed(1)
# Example data: A, C and E are constant; G is constant apart from one NA
dat <- data.frame(
    A=factor(rep("X",10),levels=c('X','Y')),
    B=round(runif(10)*10),
    C=rep(10,10),
    D=c(rep(10,9),1),
    E=factor(rep("A",10)),
    F=factor(rep(c("I","J"),5)),
    G=c(rep(10,9),NA)
)

# Return the indices of columns with a single distinct value; note that
# apply() first coerces the data.frame to a (character) matrix
zeroVar <- function(data, useNA = 'ifany') {
    out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
    which(out==1)
}

And here's the result of the process:

> dat
   A B  C  D E F  G
1  X 3 10 10 A I 10
2  X 4 10 10 A J 10
3  X 6 10 10 A I 10
4  X 9 10 10 A J 10
5  X 2 10 10 A I 10
6  X 9 10 10 A J 10
7  X 9 10 10 A I 10
8  X 7 10 10 A J 10
9  X 6 10 10 A I 10
10 X 1 10  1 A J NA

> dat[,-zeroVar(dat)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA

> dat[,-zeroVar(dat, useNA = 'no')]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J

7 Answers

#1 (14 votes)

Don't use table() - very slow for such things. One option is length(unique(x)):

foo <- function(dat) {
    # Count distinct values per column (NAs count as a value)
    out <- lapply(dat, function(x) length(unique(x)))
    # Constant columns have exactly one distinct value
    which(unlist(out) <= 1)
}

system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))

This is an order of magnitude faster than yours on the example data set, whilst giving similar output:

> system.time(replicate(1000, zeroVar(dat)))
   user  system elapsed 
  3.334   0.000   3.335 
> system.time(replicate(1000, foo(dat)))
   user  system elapsed 
  0.324   0.000   0.324

Simon's solution here is similarly quick on this example:

> system.time(replicate(1000, which(!unlist(lapply(dat, 
+             function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
   user  system elapsed 
  0.392   0.000   0.395

but you'll have to check whether they scale similarly at realistic problem sizes.

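As a hypothetical usage sketch, mirroring the question's dat[,-zeroVar(dat)] call, guard the empty case: if no column is constant, dat[, -integer(0)] would select zero columns rather than all of them.

zv <- foo(dat)                                   # indices of constant columns
dat_clean <- if (length(zv) > 0) dat[, -zv] else dat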

#2 (20 votes)

You may also want to look into the nearZeroVar() function in the caret package.

If a variable shows only one event out of 1000 observations, it might be a good idea to discard it (though this depends on the model); nearZeroVar() can flag such near-zero-variance predictors as well.

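A minimal sketch on the example data above, assuming caret's defaults (freqCut = 95/5, uniqueCut = 10); with saveMetrics = FALSE the function returns the positions of the flagged columns:

require(caret)
nzv_cols <- nearZeroVar(dat)                     # zero- and near-zero-variance columns
dat_clean <- if (length(nzv_cols) > 0) dat[, -nzv_cols] else dat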

#3 (9 votes)

Simply don't use table - it's extremely slow on numeric vectors since it converts them to strings. I would probably use something like

var0 <- unlist(lapply(df, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))

It returns TRUE for zero-variance columns, NA for columns containing NAs, and FALSE for columns with non-zero variance.

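A short sketch of acting on that three-way result; the policy for NA columns is an assumption (here they are kept, since the NA only signals missing values, not zero variance):

keep <- is.na(var0) | !var0                      # keep NA-variance and non-constant columns
df_clean <- df[, keep, drop = FALSE]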

#4 (3 votes)

Use the caret package and its nearZeroVar function:

require(caret)
NZV <- nearZeroVar(dataset, saveMetrics = TRUE)   # per-column diagnostics
NZV[NZV[,"zeroVar"] > 0, ]                        # zero-variance columns only
NZV[NZV[,"zeroVar"] + NZV[,"nzv"] > 0, ]          # zero- plus near-zero-variance columns
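
A hypothetical follow-up to actually drop the flagged columns; it assumes only that the rownames of the metrics table are the column names, which is how nearZeroVar(saveMetrics = TRUE) labels its output:

drop_cols <- rownames(NZV)[NZV$zeroVar]          # names of zero-variance columns
dataset   <- dataset[, setdiff(names(dataset), drop_cols), drop = FALSE]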

#5 (2 votes)

Well, save yourself some coding time:

Rgames: foo
      [,1]  [,2] [,3]
 [1,]    1 1e+00    1
 [2,]    1 2e+00    1
 [3,]    1 3e+00    1
 [4,]    1 4e+00    1
 [5,]    1 5e+00    1
 [6,]    1 6e+00    2
 [7,]    1 7e+00    3
 [8,]    1 8e+00    1
 [9,]    1 9e+00    1
[10,]    1 1e+01    1
Rgames: sd(foo)
[1] 0.000000e+00 3.027650e+00 6.749486e-01
Warning message:
sd(<matrix>) is deprecated.
 Use apply(*, 2, sd) instead.   

To avoid nasty floating-point round-off, take that output vector, which I'll call bar, and do something like bar[bar < 2*.Machine$double.eps] <- 0; then finally your data frame dat[,as.logical(bar)] should do the trick.

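Putting that together as a minimal sketch for a numeric matrix foo, using apply() as the deprecation warning suggests:

bar <- apply(foo, 2, sd)                         # per-column standard deviations
bar[bar < 2 * .Machine$double.eps] <- 0          # squash floating-point round-off
foo_clean <- foo[, as.logical(bar), drop = FALSE]  # keep columns with non-zero sd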

#6 (2 votes)

How about using factor to count the number of unique elements and looping with sapply:

dat[sapply(dat, function(x) length(levels(factor(x)))>1)]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J

NAs are excluded by default, but this can be changed with the exclude parameter of factor:

dat[sapply(dat, function(x) length(levels(factor(x,exclude=NULL)))>1)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA

#7 (0 votes)

I think having zero variance is equivalent to being constant, so one can get by without doing any arithmetic at all. I would expect range() to outperform var(), but I have not verified this:

removeConstantColumns <- function(a_dataframe, verbose=FALSE) {
  # TRUE if a column takes more than one distinct value (NAs ignored)
  notConstant <- function(x) {
    if (is.factor(x)) x <- as.integer(x)
    return (0 != diff(range(x, na.rm=TRUE)))
  }
  bkeep <- sapply(a_dataframe, notConstant)
  if (verbose) {
    cat('removeConstantColumns: '
      , ifelse(all(bkeep)
        , 'nothing'
        , paste(names(a_dataframe)[!bkeep], collapse=','))
      , ' removed', '\n')
  }
  return (a_dataframe[, bkeep, drop=FALSE])
}
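
On the question's data, removeConstantColumns(dat, verbose = TRUE) reports and drops A, C, E and G (G as well, because na.rm = TRUE makes its range degenerate), matching zeroVar(dat, useNA = 'no').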
