This question already has an answer here:
- Remove rows with all or some NAs (missing values) in data.frame (15 answers)
I'd like to remove all rows of a data.table that contain Inf in any of its columns. So far, I've been using this approach:
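library(data.table)
DT <- data.table(col1 = c(1, 2, 3), col2 = c(4, Inf, 5))
# Flag each row that contains an infinite value, one row at a time
DT[, drop := apply(.SD, 1, function(x) any(is.infinite(x))), by = 1:nrow(DT)]
# Keep the unflagged rows, then remove the helper column
DT <- DT[(!drop)][, drop := NULL]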
which comes from this question. However, this approach does not scale well to large amounts of data. Is there a better way to remove the rows that contain Inf?
1 solution
#1
17
You can use rowSums to check whether any element of a row is not finite.
DT[is.finite(rowSums(DT))]
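As a small illustration of the rowSums idea, the sketch below uses a made-up four-row table (not the asker's data). It assumes every column is numeric, since rowSums stops with an error on character columns. It also shows a side effect worth knowing: rows containing NA are dropped along with the Inf rows, because is.finite is FALSE for NA as well.

```r
library(data.table)

# Illustrative data: one Inf row and one NA row
DT <- data.table(col1 = c(1, 2, 3, NA), col2 = c(4, Inf, 5, 6))

# A row sum is Inf, -Inf, NaN or NA whenever the row holds a non-finite
# value, and is.finite() returns FALSE for all of those.
keep <- is.finite(rowSums(DT))
print(keep)

# Logical subsetting in data.table keeps the rows where keep is TRUE
res <- DT[keep]
print(res)
```

If NA rows should survive, this approach is not a drop-in fit and a check based on is.infinite is probably the safer choice.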
Or you can use the fact that Inf * 0 is NaN, which complete.cases treats as a missing value:
DT[complete.cases(DT*0)]
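To make the multiplication trick concrete, here is a minimal sketch on an invented three-row table. It only illustrates why the one-liner works: finite numbers times zero give 0, while Inf and -Inf times zero give NaN, which complete.cases counts as missing. Rows that already contain NA are removed as well, since NA * 0 stays NA.

```r
library(data.table)

# Illustrative data with both signs of infinity
DT <- data.table(col1 = c(1, 2, 3), col2 = c(4, Inf, -Inf))

# The arithmetic fact the approach relies on
print(Inf * 0)         # NaN
print(is.na(Inf * 0))  # TRUE: NaN counts as missing

# Finite cells become 0, infinite cells become NaN
zeroed <- DT * 0
print(zeroed)

# complete.cases() is TRUE only for rows without NA/NaN
keep <- complete.cases(zeroed)
res <- DT[keep]
print(res)
```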
Some benchmarking shows that rowSums is fastest for smaller datasets and complete.cases is the fastest solution for larger datasets.
library(microbenchmark)
microbenchmark(
  DT[is.finite(rowSums(DT))],
  DT[complete.cases(DT * 0)],
  DT[DT[, Reduce('&', lapply(.SD, is.finite))]]
)
##
## nrow(DT) = 3000
## Unit: microseconds
## expr min lq mean median uq max neval cld
## DT[is.finite(rowSums(DT))] 786.797 839.235 864.0215 852.8465 884.756 1021.988 100 a
## DT[complete.cases(DT * 0)] 1265.658 1326.575 1363.3985 1350.0055 1386.377 1898.040 100 c
## DT[DT[, Reduce("&", lapply(.SD, is.finite))]] 1220.137 1275.030 1319.6226 1308.0555 1348.443 1624.023 100 b
##
## nrow(DT) = 300000
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## DT[is.finite(rowSums(DT))] 21.617935 22.687452 26.698070 25.75765 26.07942 87.56290 100 c
## DT[complete.cases(DT * 0)] 7.209252 7.567393 9.908503 10.17569 10.37473 71.31375 100 a
## DT[DT[, Reduce("&", lapply(.SD, is.finite))]] 11.786773 12.647652 14.128624 14.78512 15.05089 15.39542 100 b
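The third benchmarked expression is not explained above, so here is a hedged sketch of what it appears to do: it applies is.finite to each column and combines the per-column results with a logical AND. The variant using !is.infinite is my own addition, not part of the answer; it may be useful when only Inf rows should go and NA rows should stay. The data is invented for illustration.

```r
library(data.table)

# Illustrative data: one NA row and one Inf row
DT <- data.table(col1 = c(1, NA, 3), col2 = c(4, 5, Inf))

# Per-column finiteness checks, AND-ed together across columns
keep_finite <- DT[, Reduce(`&`, lapply(.SD, is.finite))]

# Assumed variant: reject only +/-Inf, so NA cells do not disqualify a row
keep_not_inf <- DT[, Reduce(`&`, lapply(.SD, function(x) !is.infinite(x)))]

res_finite  <- DT[keep_finite]   # drops the NA row and the Inf row
res_not_inf <- DT[keep_not_inf]  # drops only the Inf row

print(res_finite)
print(res_not_inf)
```

Working column by column avoids the matrix conversion that rowSums performs, which may be part of why it overtakes rowSums at 300000 rows in the timings above, though the benchmark itself does not establish the cause.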