I'd like to remove the lines in this data frame that:
a) contain NAs across all columns. Below is my example data frame.
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA NA
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA NA NA
4 ENSG00000207604 0 NA NA 1 2
5 ENSG00000207431 0 NA NA NA NA
6 ENSG00000221312 0 1 2 3 2
Basically, I'd like to get a data frame such as the following.
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
b) contain NAs in only some columns, so I can also get this result:
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
15 answers
#1
831
Also check complete.cases:
> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
na.omit is nicer for just removing all NAs. complete.cases allows partial selection by including only certain columns of the data frame:
> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Your solution can't work. If you insist on using is.na, then you have to do something like:
> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
but using complete.cases is quite a lot clearer, and faster.
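As a self-contained sketch of the calls above, the following rebuilds the question's example data and applies both variants. The name final follows this answer; selecting the columns by name rather than by position (5:6) is a choice made here for readability, not something the answer prescribes.

```r
# Rebuild the example data frame from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Part a): keep only rows with no NA in any column.
all_complete <- final[complete.cases(final), ]
omitted <- na.omit(final)

# Part b): keep rows with no NA in rnor and cfam only.
partial <- final[complete.cases(final[, c("rnor", "cfam")]), ]

print(all_complete)
print(partial)
```

Selecting by name keeps working if the column order changes, which positional indices such as 5:6 do not.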
#2
200
Try na.omit(your.data.frame). As for the second question, try posting it as another question (for clarity).
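One detail that may be useful with na.omit is that it records which rows it dropped in an attribute named "na.action" on the result. A minimal sketch, assuming the question's example data is stored in a data frame called final (a name used here only for illustration):

```r
# Rebuild the example data frame from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Drop every row that contains at least one NA.
cleaned <- na.omit(final)

# The positions of the dropped rows are kept on the result.
dropped <- as.integer(attr(cleaned, "na.action"))
print(dropped)
```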
#3
78
I prefer the following way to check whether rows contain any NAs:
row.has.na <- apply(final, 1, function(x){any(is.na(x))})
This returns a logical vector whose values denote whether there is any NA in each row. You can use it to see how many rows you'll have to drop:
sum(row.has.na)
and eventually drop them:
final.filtered <- final[!row.has.na,]
For filtering rows with NAs in only certain columns it becomes a little trickier (for example, you can feed final[, 5:6] to apply). Generally, Joris Meys' solution seems to be more elegant.
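The same per-row flag can also be computed without apply. The sketch below compares the two on the question's example data; the variable names are illustrative, and the rowSums form is offered as one possible alternative rather than as this answer's method.

```r
# Rebuild the example data frame from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Row-by-row check, as in the answer above.
row_has_na_apply <- apply(final, 1, function(x) any(is.na(x)))

# Vectorised check: count the NAs per row and test for at least one.
row_has_na_sums <- rowSums(is.na(final)) > 0

# Number of rows that would be dropped, then the filtered result.
n_dropped <- sum(row_has_na_sums)
final_filtered <- final[!row_has_na_sums, ]
print(final_filtered)
```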
#4
53
If you like pipes (%>%), tidyr's new drop_na is your friend:
library(tidyr)
df %>% drop_na()
# gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674 0 2 2 2 2
# 6 ENSG00000221312 0 1 2 3 2
df %>% drop_na(rnor, cfam)
# gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674 0 2 2 2 2
# 4 ENSG00000207604 0 NA NA 1 2
# 6 ENSG00000221312 0 1 2 3 2
#5
36
Another option, if you want greater control over how rows are deemed to be invalid, is:
final <- final[!(is.na(final$rnor)) | !(is.na(final$cfam)),]
Using the above, this:
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA 2
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA 2 NA
4 ENSG00000207604 0 NA NA 1 2
5 ENSG00000207431 0 NA NA NA NA
6 ENSG00000221312 0 1 2 3 2
Becomes:
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA 2
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA 2 NA
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
...where only row 5 is removed, since it is the only row containing NAs for both rnor AND cfam. The boolean logic can then be changed to fit specific requirements.
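To illustrate how changing the boolean logic changes the result, here is a small sketch on the modified data shown in this answer. The names final, keep_unless_both_na and keep_only_if_neither_na are chosen here for illustration.

```r
# The modified example data used in this answer
# (row 1 has a value for cfam, row 3 has a value for rnor).
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, 2, 1, NA, 3),
  cfam = c(2, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# OR: a row survives if at least one of the two columns has a value,
# so it is dropped only when both are NA.
keep_unless_both_na <- !is.na(final$rnor) | !is.na(final$cfam)

# AND: a row survives only if both columns have a value,
# so it is dropped when either one is NA.
keep_only_if_neither_na <- !is.na(final$rnor) & !is.na(final$cfam)

print(final[keep_unless_both_na, ])
print(final[keep_only_if_neither_na, ])
```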
#6
30
If you want control over how many NAs are valid for each row, try this function. For many survey data sets, too many blank question responses can ruin the results, so rows are deleted once they exceed a certain threshold. This function lets you choose how many NAs a row can have before it is deleted:
delete.na <- function(DF, n=0) {
DF[rowSums(is.na(DF)) <= n,]
}
By default, it will eliminate every row that contains any NA:
delete.na(final)
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
Or specify the maximum number of NAs allowed:
delete.na(final, 2)
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
#7
14
This will return the rows that have at least ONE non-NA value.
final[rowSums(is.na(final))<length(final),]
This will return the rows that have at least TWO non-NA values.
final[rowSums(is.na(final))<(length(final)-1),]
#8
12
We can also use the subset function for this.
finalData<-subset(data,!(is.na(data["mmul"]) | is.na(data["rnor"])))
This keeps only the rows in which neither mmul nor rnor is NA.
#9
12
Using the dplyr package, we can filter out NAs as follows:
dplyr::filter(df, !is.na(columnname))
#10
11
For your first question, I have some code that I am comfortable with to get rid of all NAs. Thanks to @Gregor for making it simpler.
final[!(rowSums(is.na(final))),]
For the second question, the code is just an alteration of the previous solution.
final[as.logical((rowSums(is.na(final))-5)),]
Notice that the 5 being subtracted is the number of columns in your data. This will eliminate rows in which every column is NA, since their rowSums add up to 5 and become zero after subtraction. This time, as.logical is necessary.
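The subtraction trick depends on the constant matching the number of columns that can actually be NA. In the question's example data, gene and hsap are never NA, so no row reaches a count of 5. A hedged sketch that avoids the hard-coded constant by naming the columns to check (value_cols is a name introduced here for illustration):

```r
# Rebuild the example data frame from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2),
  stringsAsFactors = FALSE
)

# Assumption: these are the columns in which missing values matter.
value_cols <- c("mmul", "mmus", "rnor", "cfam")

# A row is "all NA" when its NA count equals the number of checked columns.
all_na <- rowSums(is.na(final[, value_cols])) == length(value_cols)

# Keep every row that has at least one value in the checked columns.
kept <- final[!all_na, ]
print(kept)
```

Comparing against length(value_cols) keeps the threshold in step with the column selection if that selection changes.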
#11
10
If performance is a priority, use data.table and na.omit() with the optional param cols=.
na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).
If you don't want to use data.table, use complete.cases().
On a vanilla data.frame, complete.cases is faster than na.omit() or tidyr::drop_na(). Notice that na.omit.data.frame does not support cols=.
Benchmark result
Here is a comparison of base (blue), dplyr (pink), and data.table (yellow) methods for dropping either all or select missing observations, on a notional dataset of 1 million observations of 20 numeric variables with an independent 5% likelihood of being missing, and a subset of 4 variables for part 2.
Your results may vary based on the length, width, and sparsity of your particular dataset.
Note the log scale on the y axis.
Benchmark script
#------- Adjust these assumptions for your own use case ------------
row_size <- 1e6L
col_size <- 20 # not including ID column
p_missing <- 0.05 # likelihood of missing observation (except ID col)
col_subset <- 18:21 # second part of question: filter on select columns
#------- System info for benchmark ----------------------------------
R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
library(data.table); packageVersion('data.table') # 1.10.4.3
library(dplyr); packageVersion('dplyr') # 0.7.4
library(tidyr); packageVersion('tidyr') # 0.8.0
library(microbenchmark)
#------- Example dataset using above assumptions --------------------
fakeData <- function(m, n, p){
set.seed(123)
m <- matrix(runif(m*n), nrow=m, ncol=n)
m[m<p] <- NA
return(m)
}
df <- cbind( data.frame(id = paste0('ID',seq(row_size)),
stringsAsFactors = FALSE),
data.frame(fakeData(row_size, col_size, p_missing) )
)
dt <- data.table(df)
par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
boxplot(
microbenchmark(
df[complete.cases(df), ],
na.omit(df),
df %>% drop_na,
dt[complete.cases(dt), ],
na.omit(dt)
), xlab='',
main = 'Performance: Drop any NA observation',
col=c(rep('lightblue',2),'salmon',rep('beige',2))
)
boxplot(
microbenchmark(
df[complete.cases(df[,col_subset]), ],
#na.omit(df), # col subset not supported in na.omit.data.frame
df %>% drop_na(col_subset),
dt[complete.cases(dt[,col_subset,with=FALSE]), ],
na.omit(dt, cols=col_subset) # see ?na.omit.data.table
), xlab='',
main = 'Performance: Drop NA obs. in select cols',
col=c('lightblue','salmon',rep('beige',2))
)
#12
8
I am a synthesizer :). Here I combined the answers into one function:
#' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
#' @param df a data frame
#' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
#' \cr default is NULL, search for all columns
#' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
#' \cr If a number, the exact number of NAs kept
#' \cr Range includes both ends 3<=n<=5
#' \cr Range could be -Inf, Inf
#' @return returns a new df with rows that have NA(s) removed
#' @export
ez.na.keep = function(df, col=NULL, n=0){
if (!is.null(col)) {
# R converts a single row/col to a vector if the parameter col has only one col
# see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
df.temp = df[,col,drop=FALSE]
} else {
df.temp = df
}
if (length(n)==1){
if (n==0) {
# simply call complete.cases which might be faster
result = df[complete.cases(df.temp),]
} else {
# credit: http://*.com/a/30461945/2292993
log <- apply(df.temp, 2, is.na)
logindex <- apply(log, 1, function(x) sum(x) == n)
result = df[logindex, ]
}
}
if (length(n)==2){
min = n[1]; max = n[2]
log <- apply(df.temp, 2, is.na)
logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
result = df[logindex, ]
}
return(result)
}
#13
5
Assuming dat is your data frame, the expected output can be achieved using:
1. rowSums
> dat[!rowSums((is.na(dat))),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
2. lapply
> dat[!Reduce('|',lapply(dat,is.na)),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
#14
1
delete.dirt <- function(DF, dart=c('NA')) {
  # TRUE for rows in which no cell matches any value in dart
  clean_rows <- apply(DF, 1, function(r) !any(r %in% dart))
  DF[clean_rows, ]
}
mydata <- delete.dirt(mydata)
The function above deletes every row of the data frame that has 'NA' in any column and returns the resulting data. Note that this matches the literal string 'NA', not R's missing value NA. If you want to check for multiple values, such as 'NA' and '?', change dart=c('NA') in the function parameters to dart=c('NA', '?').
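Because %in% compares against literal strings, one possible alternative is to first convert such placeholder strings into real NA values and then use complete.cases. The sketch below is an illustration under assumptions: the data frame mydata, its columns, and the placeholder set are all made up for the example.

```r
# Hypothetical data in which missing values were recorded as strings.
mydata <- data.frame(
  id = 1:4,
  score = c("3", "?", "NA", "5"),
  stringsAsFactors = FALSE
)

# Placeholder strings that should count as missing (an assumption).
placeholders <- c("NA", "?")

# Replace placeholders with real NA, column by column.
# %in% never returns NA, so the replacement index is always well defined.
mydata[] <- lapply(mydata, function(col) replace(col, col %in% placeholders, NA))

# Now the standard tools apply.
cleaned <- mydata[complete.cases(mydata), ]
print(cleaned)
```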
#15
0
My guess is that this could be solved more elegantly in this way:
m <- matrix(1:25, ncol = 5)
m[c(1, 6, 13, 25)] <- NA
df <- data.frame(m)
library(dplyr)
df %>%
filter_all(any_vars(is.na(.)))
#> X1 X2 X3 X4 X5
#> 1 NA NA 11 16 21
#> 2 3 8 NA 18 23
#> 3 5 10 15 20 NA
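Note that the call above keeps the rows that contain at least one NA. To drop those rows instead, as the question asks, one option is to negate the condition. The sketch below uses if_all, which assumes dplyr 1.0.4 or later; it is one possible way to express the complement, not the approach this answer proposes.

```r
library(dplyr)

# Same toy data as in the answer above.
m <- matrix(1:25, ncol = 5)
m[c(1, 6, 13, 25)] <- NA
df <- data.frame(m)

# Keep only the rows in which every column is non-missing.
complete_rows <- df %>%
  filter(if_all(everything(), ~ !is.na(.x)))

print(complete_rows)
```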