Remove rows with NAs (missing values) in a data.frame

Time: 2021-11-26 09:17:59

I'd like to remove the rows in this data frame that contain NAs, looking across all columns. Below is my example data frame.

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   NA
2 ENSG00000199674    0    2    2    2    2
3 ENSG00000221622    0   NA   NA   NA   NA
4 ENSG00000207604    0   NA   NA    1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0    1    2    3    2

Basically, I'd like to get a data frame such as the following.

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

Also, I'd like to know how to only filter for some columns, so I can also get a data frame like this:

             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

14 answers

#1


774  

Also check complete.cases:

> final[complete.cases(final), ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

na.omit is nicer for just removing all NA's. complete.cases allows partial selection by including only certain columns of the dataframe:

> final[complete.cases(final[ , 5:6]),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2
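
The same partial selection can be written with column names instead of positions, which may be easier to read and does not depend on column order. A minimal sketch, assuming the example data frame from the question is rebuilt by hand as final (the name by_name is only for illustration):

```r
# Rebuild the example data frame from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# Keep rows that are complete in rnor and cfam, selected by name.
keep <- complete.cases(final[, c("rnor", "cfam")])
by_name <- final[keep, ]
rownames(by_name)  # rows 2, 4 and 6 remain
```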

Your solution can't work. If you insist on using is.na, then you have to do something like:

> final[rowSums(is.na(final[ , 5:6])) == 0, ]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

but using complete.cases is considerably clearer, and faster.

#2


191  

Try na.omit(your.data.frame). As for the second question, try posting it as another question (for clarity).
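
As a rough illustration of what na.omit does on the example data: it drops every row with at least one NA and records which rows were dropped in the "na.action" attribute of the result. This is only a sketch, with the data frame from the question rebuilt by hand as final; cleaned and dropped are names made up for the example.

```r
# Rebuild the example data frame from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

cleaned <- na.omit(final)               # only complete rows remain
dropped <- attr(cleaned, "na.action")   # positions of the removed rows
rownames(cleaned)                       # rows 2 and 6
```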

#3


74  

I prefer the following way to check whether rows contain any NAs:

row.has.na <- apply(final, 1, function(x){any(is.na(x))})

This returns a logical vector whose values indicate whether each row contains any NA. You can use it to see how many rows you'll have to drop:

sum(row.has.na)

and finally drop them:

final.filtered <- final[!row.has.na,]

Filtering on NAs in only some of the columns is a little trickier (for example, you can pass final[, 5:6] to apply). In general, Joris Meys' solution seems more elegant.
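
For what it's worth, the column-restricted variant mentioned above might look like the sketch below. It assumes the question's data frame is rebuilt by hand as final and that columns 5 and 6 are rnor and cfam.

```r
# Rebuild the example data frame from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# Look for NAs only in columns 5 and 6 (rnor and cfam), row by row.
row.has.na <- apply(final[, 5:6], 1, function(x) any(is.na(x)))
sum(row.has.na)                          # number of rows that will be dropped
final.filtered <- final[!row.has.na, ]   # rows 2, 4 and 6 remain
```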

#4


38  

If you like pipes (%>%), tidyr's new drop_na is your friend:

library(tidyr)
df %>% drop_na()
#              gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674    0    2    2    2    2
# 6 ENSG00000221312    0    1    2    3    2
df %>% drop_na(rnor, cfam)
#              gene hsap mmul mmus rnor cfam
# 2 ENSG00000199674    0    2    2    2    2
# 4 ENSG00000207604    0   NA   NA    1    2
# 6 ENSG00000221312    0    1    2    3    2

#5


34  

Another option, if you want greater control over which rows are deemed invalid, is:

final <- final[!is.na(final$rnor) | !is.na(final$cfam), ]

Using the above, this:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA    2
2 ENSG00000199674    0    2    2    2    2
3 ENSG00000221622    0   NA   NA    2   NA
4 ENSG00000207604    0   NA   NA    1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0    1    2    3    2

Becomes:

             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA    2
2 ENSG00000199674    0    2    2    2    2
3 ENSG00000221622    0   NA   NA    2   NA
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2

...where only row 5 is removed since it is the only row containing NAs for both rnor AND cfam. The boolean logic can then be changed to fit specific requirements.
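
To illustrate how changing the boolean logic changes the result, here is a small sketch on the modified data shown in this answer (rebuilt by hand as final). The names either and both are made up for the example.

```r
# The modified example data used in this answer.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, 2, 1, NA, 3),
  cfam = c(2, 2, NA, 2, NA, 2)
)

# OR: keep a row if at least one of rnor, cfam is present (drops only row 5).
either <- final[!is.na(final$rnor) | !is.na(final$cfam), ]

# AND: keep a row only if both rnor and cfam are present (keeps rows 2, 4, 6).
both <- final[!is.na(final$rnor) & !is.na(final$cfam), ]
```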

#6


27  

If you want control over how many NAs are acceptable in each row, try this function. In many survey data sets, too many blank responses can ruin the results, so rows are deleted once they pass a certain threshold. This function lets you choose how many NAs a row can have before it is deleted:

delete.na <- function(DF, n=0) {
  DF[rowSums(is.na(DF)) <= n,]
}

By default, it removes every row that contains any NA:

delete.na(final)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

Or specify the maximum number of NAs allowed:

delete.na(final, 2)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2
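
If the threshold is more natural as a share of the columns than as a count, the same idea can be sketched with rowMeans. Note that delete.na.frac and max_frac are hypothetical names introduced here, not part of the answer above, and final is the question's data frame rebuilt by hand.

```r
# Hypothetical variant of delete.na: the threshold is a fraction of the columns.
delete.na.frac <- function(DF, max_frac = 0) {
  # rowMeans of the logical NA matrix is the share of NA cells in each row
  DF[rowMeans(is.na(DF)) <= max_frac, ]
}

# Rebuild the example data frame from the question.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# Allow up to half of the cells in a row to be NA: rows 2, 4 and 6 remain.
kept <- delete.na.frac(final, 0.5)
```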

#7


12  

This will return the rows that have at least ONE non-NA value.

final[rowSums(is.na(final))<length(final),]

This will return the rows that have at least TWO non-NA values.

final[rowSums(is.na(final))<(length(final)-1),]
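
An equivalent way to express the same filters is to count the non-NA values directly, which avoids the subtraction. The sketch below uses a small made-up data frame (toy is not from the question) because every row of the question's data already has at least two non-NA values.

```r
# Made-up data: row 1 has two non-NA values, row 2 has none, row 3 has one.
toy <- data.frame(x = c(1, NA, NA), y = c(2, NA, 5), z = c(NA, NA, NA))

at_least_one <- toy[rowSums(!is.na(toy)) >= 1, ]   # rows 1 and 3
at_least_two <- toy[rowSums(!is.na(toy)) >= 2, ]   # row 1 only

# Same rows as the length()-based form used above.
same_one <- toy[rowSums(is.na(toy)) < length(toy), ]
same_two <- toy[rowSums(is.na(toy)) < (length(toy) - 1), ]
```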

#8


10  

We can also use the subset function for this.

finalData <- subset(data, !(is.na(mmul) | is.na(rnor)))

This keeps only the rows in which neither mmul nor rnor is NA.

#9


9  

Using the dplyr package, we can filter out NAs as follows:

dplyr::filter(df,  !is.na(columnname))

#10


9  

For your first question, here is code I am comfortable with for getting rid of all NAs. Thanks to @Gregor for making it simpler.

final[!(rowSums(is.na(final))),]

For the second question, the code is just a variation on the previous solution.

final[as.logical((rowSums(is.na(final))-5)),]

Note that 5 is the number of columns in your data (use ncol(final) for your own data frame). This eliminates rows that are entirely NA: their rowSums equal 5, so they become zero after the subtraction. This time, as.logical is necessary.
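
To see why as.logical is needed in the second snippet but not in the first, here is a small sketch with made-up NA counts (na_count is an assumption for illustration, not taken from the data above). The ! operator coerces numbers to logical on its own, whereas a bare numeric vector inside [ ] would be read as row positions, and negative values would drop rows by position.

```r
na_count <- c(0, 5, 4)           # made-up NA counts for three rows of a 5-column frame
shifted  <- na_count - 5         # -5  0 -1
keep     <- as.logical(shifted)  # zero becomes FALSE, any other value TRUE

# The first snippet relies on ! doing the same coercion implicitly.
no_na <- !na_count               # TRUE only where the count is zero

# A comparison states the intent of the second snippet more directly.
keep_cmp <- na_count != 5
```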

#11


7  

I am a synthesizer:). Here I combined the answers into one function:

#' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
#' @param df a data frame
#' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
#' \cr default is NULL, search for all columns
#' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
#' \cr If a number, the exact number of NAs kept
#' \cr Range includes both ends 3<=n<=5
#' \cr Range could be -Inf, Inf
#' @return returns a new df with rows that have NA(s) removed
#' @export
ez.na.keep = function(df, col=NULL, n=0){
    if (!is.null(col)) {
        # R converts a single row/col to a vector if the parameter col has only one col
        # see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
        df.temp = df[,col,drop=FALSE]
    } else {
        df.temp = df
    }

    if (length(n)==1){
        if (n==0) {
            # simply call complete.cases which might be faster
            result = df[complete.cases(df.temp),]
        } else {
            # credit: http://*.com/a/30461945/2292993
            log <- apply(df.temp, 2, is.na)
            logindex <- apply(log, 1, function(x) sum(x) == n)
            result = df[logindex, ]
        }
    }

    if (length(n)==2){
        min = n[1]; max = n[2]
        log <- apply(df.temp, 2, is.na)
        logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
        result = df[logindex, ]
    }

    return(result)
}

#12


4  

Assuming dat is your data frame, the expected output can be achieved using:

1. rowSums

> dat[!rowSums((is.na(dat))),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2

2. lapply

> dat[!Reduce('|',lapply(dat,is.na)),]
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2
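
In case the Reduce/lapply form looks opaque, the sketch below breaks it into steps. It assumes dat holds the question's data frame, rebuilt by hand here; col_na and any_na are names made up for the example.

```r
# Rebuild the example data frame from the question as dat.
dat <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = c(0, 0, 0, 0, 0, 0),
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

col_na <- lapply(dat, is.na)     # one logical vector per column
any_na <- Reduce(`|`, col_na)    # element-wise OR across the columns
result <- dat[!any_na, ]         # rows 2 and 6 remain

# The rowSums form flags the same rows.
same <- unname(rowSums(is.na(dat)) > 0)
```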

#13


3  

If performance is a priority, use data.table and na.omit() with optional param cols=.

na.omit.data.table is the fastest on my benchmark (see below), whether for all columns or for select columns (OP question part 2).

If you don't want to use data.table, use complete.cases().

On a vanilla data.frame, complete.cases is faster than na.omit() or tidyr::drop_na(). Notice that na.omit.data.frame does not support cols=.

Benchmark result

Here is a comparison of base (blue), tidyr (pink), and data.table (yellow) methods for dropping observations with missing values in either all columns or selected columns, on a notional dataset of 1 million observations of 20 numeric variables, each value having an independent 5% likelihood of being missing, and a subset of 4 variables for part 2.

Your results may vary based on length, width, and sparsity of your particular dataset.

Note the log scale on the y axis.

[Figure: box plots of the benchmark timings; image not available]

Benchmark script

#-------  Adjust these assumptions for your own use case  ------------
row_size   <- 1e6L 
col_size   <- 20    # not including ID column
p_missing  <- 0.05   # likelihood of missing observation (except ID col)
col_subset <- 18:21  # second part of question: filter on select columns

#-------  System info for benchmark  ----------------------------------
R.version # R version 3.4.3 (2017-11-30), platform = x86_64-w64-mingw32
library(data.table); packageVersion('data.table') # 1.10.4.3
library(dplyr);      packageVersion('dplyr')      # 0.7.4
library(tidyr);      packageVersion('tidyr')      # 0.8.0
library(microbenchmark)

#-------  Example dataset using above assumptions  --------------------
fakeData <- function(m, n, p){
  set.seed(123)
  m <-  matrix(runif(m*n), nrow=m, ncol=n)
  m[m<p] <- NA
  return(m)
}
df <- cbind( data.frame(id = paste0('ID',seq(row_size)), 
                        stringsAsFactors = FALSE),
             data.frame(fakeData(row_size, col_size, p_missing) )
             )
dt <- data.table(df)

par(las=3, mfcol=c(1,2), mar=c(22,4,1,1)+0.1)
boxplot(
  microbenchmark(
    df[complete.cases(df), ],
    na.omit(df),
    df %>% drop_na,
    dt[complete.cases(dt), ],
    na.omit(dt)
  ), xlab='', 
  main = 'Performance: Drop any NA observation',
  col=c(rep('lightblue',2),'salmon',rep('beige',2))
)
boxplot(
  microbenchmark(
    df[complete.cases(df[,col_subset]), ],
    #na.omit(df), # col subset not supported in na.omit.data.frame
    df %>% drop_na(col_subset),
    dt[complete.cases(dt[,col_subset,with=FALSE]), ],
    na.omit(dt, cols=col_subset) # see ?na.omit.data.table
  ), xlab='', 
  main = 'Performance: Drop NA obs. in select cols',
  col=c('lightblue','salmon',rep('beige',2))
)

#14


1  

delete.dirt <- function(DF, dart=c('NA')) {
  # TRUE for rows that contain none of the values listed in dart
  clean_rows <- apply(DF, 1, function(r) !any(r %in% dart))
  DF[clean_rows, ]
}

mydata <- delete.dirt(mydata)

The function above deletes every row of the data frame that has the string 'NA' in any column and returns the result. To check for several values, such as NA and ?, change dart=c('NA') in the function parameters to dart=c('NA', '?'). Note that %in% matches the literal string 'NA' here; to also catch true missing values, include NA itself, e.g. dart=c(NA, 'NA').
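
The difference between the string 'NA' and a true missing value can be checked on a small made-up vector (r below is only an illustration, standing in for one row as apply would pass it to the function):

```r
# One made-up row: a regular value, a true NA, the string "NA", and "?".
r <- c("ENSG1", NA, "NA", "?")

string_only <- r %in% c("NA")       # only the literal string matches
with_na     <- r %in% c(NA, "NA")   # the true missing value matches too
with_qmark  <- r %in% c("NA", "?")  # several marker strings at once
```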
