Replace NAs in a data frame with values from a different column

I would like to replace NAs in my data frame with values from another column. For example:

a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)
df <- as.data.frame(cbind(a1, b1, c1, a2, b2, c2))
df
> df
  a1 b1 c1 a2 b2 c2
1  1  3 NA  2  1  3
2  2 NA  3  3  2  3
3  4  4  3  5  4  2
4 NA  4  4  5  5  3
5  2  4  2  3  6  4
6 NA  3  3  4  3  3

I would like to replace the NAs in df$a1 with the values from the corresponding rows of df$a2, the NAs in df$b1 with those from df$b2, and the NAs in df$c1 with those from df$c2, so that a1, b1, and c1 contain no NAs (for example, a1 becomes 1, 2, 4, 5, 2, 4).

How can I do this? I have a large data frame with many columns, so an efficient approach would be ideal (I've already seen "Replace missing values with a value from another column"). Thank you!

4 solutions

#1

An extensible option:

df2 <- df[c('a1','b1','c1')]
df2[] <- mapply(function(x,y) ifelse(is.na(x), y, x),
                df[c('a1','b1','c1')], df[c('a2','b2','c2')],
                SIMPLIFY=FALSE)
df2
#   a1 b1 c1
# 1  1  3  3
# 2  2  2  3
# 3  4  4  3
# 4  5  4  4
# 5  2  4  2
# 6  4  3  3

It's easy to extend this to arbitrary column pairs: the first column of the first subset (df[c('a1','b1','c1')]) is paired with the first column of the second subset, the second with the second, and so on. It can even be generalized with df[grepl('1$',colnames(df))] and df[grepl('2$',colnames(df))], provided both patterns select the same number of columns in the same order.
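The pattern-based pairing could be sketched as below. This is a minimal base R sketch, assuming every column ending in 1 has a partner with the same prefix ending in 2; the names `targets`, `sources`, and `filled` are illustrative and not part of the original answer.

```r
# Example data from the question
df <- data.frame(
  a1 = c(1, 2, 4, NA, 2, NA), b1 = c(3, NA, 4, 4, 4, 3), c1 = c(NA, 3, 3, 4, 2, 3),
  a2 = c(2, 3, 5, 5, 3, 4),   b2 = c(1, 2, 4, 5, 6, 3),  c2 = c(3, 3, 2, 3, 4, 3)
)

# Columns to fill end in "1"; partner names are derived by swapping the suffix,
# so the pairing does not depend on column order
targets <- grep("1$", colnames(df), value = TRUE)
sources <- sub("1$", "2", targets)

# Stop early if any partner column is missing (the mismatch case)
stopifnot(all(sources %in% colnames(df)))

# Fill each target column from its partner wherever the target is NA
filled <- df[targets]
filled[] <- Map(function(x, y) ifelse(is.na(x), y, x), df[targets], df[sources])
filled
```

Deriving the partner names with sub() rather than running a second grepl() avoids relying on both sets of columns appearing in the same order.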

#2

coalesce in dplyr is meant to do exactly this: it replaces each NA in the first vector with the non-NA value at the same position in a later one. For example:

library(dplyr)
coalesce(df$a1, df$a2)
[1] 1 2 4 5 2 4

It can be combined with sapply to process every column pair in a concise, easily extended way (note that sapply returns a matrix here, not a data frame):

sapply(c("a","b","c"),function(x) coalesce(df[,paste0(x,1)],df[,paste0(x,2)]))
     a b c
[1,] 1 3 3
[2,] 2 2 3
[3,] 4 4 3
[4,] 5 4 4
[5,] 2 4 2
[6,] 4 3 3
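If a data frame is wanted rather than a matrix, one possibility is to assign the coalesced columns back into a copy of df. The sketch below is only an illustration: `coalesce2`, `prefixes`, and `res` are made-up names, and the base R fallback is an assumption added so the example also runs where dplyr is not installed.

```r
# Example data from the question
df <- data.frame(
  a1 = c(1, 2, 4, NA, 2, NA), b1 = c(3, NA, 4, 4, 4, 3), c1 = c(NA, 3, 3, 4, 2, 3),
  a2 = c(2, 3, 5, 5, 3, 4),   b2 = c(1, 2, 4, 5, 6, 3),  c2 = c(3, 3, 2, 3, 4, 3)
)

# Use dplyr::coalesce when dplyr is installed; otherwise a base R stand-in
# (assumption: two equal-length vectors of the same type)
coalesce2 <- if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::coalesce
} else {
  function(x, y) ifelse(is.na(x), y, x)
}

prefixes <- c("a", "b", "c")
res <- df

# lapply returns a list of columns, which can be assigned straight back
res[paste0(prefixes, 1)] <- lapply(prefixes, function(p) {
  coalesce2(df[[paste0(p, 1)]], df[[paste0(p, 2)]])
})
res
```

Using lapply instead of sapply keeps each column as its own vector, so column types are preserved when the pairs are not all numeric.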

#3

dfnew <- ifelse(is.na(df$a1), df$a2, df$a1)

as.data.frame(dfnew)

This handles only the a1 column; you'll have to run it for each of a, b, and c and cbind the results. If there are many columns, running a loop is probably the best option, in my opinion.
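A loop over the column pairs might look like the sketch below. It assumes the pairs share a prefix and differ only in the 1/2 suffix; the names `res`, `target`, `src`, and `na_rows` are illustrative.

```r
# Example data from the question
df <- data.frame(
  a1 = c(1, 2, 4, NA, 2, NA), b1 = c(3, NA, 4, 4, 4, 3), c1 = c(NA, 3, 3, 4, 2, 3),
  a2 = c(2, 3, 5, 5, 3, 4),   b2 = c(1, 2, 4, 5, 6, 3),  c2 = c(3, 3, 2, 3, 4, 3)
)

res <- df
for (p in c("a", "b", "c")) {
  target <- paste0(p, 1)  # column whose NAs get filled
  src    <- paste0(p, 2)  # column supplying the replacements

  # Overwrite only the rows where the target is NA
  na_rows <- is.na(res[[target]])
  res[[target]][na_rows] <- res[[src]][na_rows]
}
res[c("a1", "b1", "c1")]
```

Because only the NA positions are overwritten, no cbind step is needed afterwards.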

#4

You can use hutils::coalesce. It should be slightly faster, especially when it can 'cheat': if a column has no NAs and so doesn't need to change, coalesce skips it.

a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)

s <- function(x) {
  sample(x, size = 1e6, replace = TRUE)
}
df <- as.data.frame(cbind(a1 = s(a1), b1 = s(b1), c1 = s(c1),
                          a2 = s(a2), b2 = s(b2), c2 = s(c2)))

library(microbenchmark)
library(hutils)
library(data.table)

dt <- as.data.table(df)
old <- paste0(letters[1:3], "1") # columns to fill; specify these for your data
new <- paste0(letters[1:3], "2") # columns supplying the replacement values

dplyr_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- dplyr::coalesce(ans[[o]], df[[n]])
  }
  ans
}

hutils_coalesce <- function(df) {
  ans <- df
  for (j in seq_along(old)) {
    o <- old[j]
    n <- new[j]
    ans[[o]] <- hutils::coalesce(ans[[o]], df[[n]])
  }
  ans
}

microbenchmark(dplyr = dplyr_coalesce(df),
               hutils = hutils_coalesce(df))
#> Unit: milliseconds
#>    expr      min       lq     mean   median       uq       max neval cld
#>   dplyr 45.78123 61.76857 95.10870 69.21561 87.84774 1452.0800   100   b
#>  hutils 36.48602 46.76336 63.46643 52.95736 64.53066  252.5608   100  a

Created on 2018-03-29 by the reprex package (v0.2.0).
