I would like to replace NAs in my data frame with values from another column. For example:
a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)
df <- as.data.frame(cbind(a1, b1, c1, a2, b2, c2))
df
> df
a1 b1 c1 a2 b2 c2
1 1 3 NA 2 1 3
2 2 NA 3 3 2 3
3 4 4 3 5 4 2
4 NA 4 4 5 5 3
5 2 4 2 3 6 4
6 NA 3 3 4 3 3
I would like to replace the NAs in df$a1 with the values from the corresponding row in df$a2, the NAs in df$b1 with the values from the corresponding row in df$b2, and the NAs in df$c1 with the values from the corresponding row in df$c2, so that the new data frame looks like:
> df
a1 b1 c1
1 1 3 3
2 2 2 3
3 4 4 3
4 5 4 4
5 2 4 2
6 4 3 3
How can I do this? I have a large data frame with many columns, so it would be great to find an efficient way to do this (I've already seen Replace missing values with a value from another column). Thank you!
4 Solutions
#1
1
An extensible option:
df2 <- df[c('a1','b1','c1')]
df2[] <- mapply(function(x,y) ifelse(is.na(x), y, x),
df[c('a1','b1','c1')], df[c('a2','b2','c2')],
SIMPLIFY=FALSE)
df2
# a1 b1 c1
# 1 1 3 3
# 2 2 2 3
# 3 4 4 3
# 4 5 4 4
# 5 2 4 2
# 6 4 3 3
It's easy enough to extend this to arbitrary column pairs: the first column in the first subset (df[c('a1','b1','c1')]) is paired with the first column of the second subset; second column first subset, second column second subset; etc. It can even be generalized with df[grepl('1$',colnames(df))] and df[grepl('2$',colnames(df))], assuming they don't mis-match.
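As a rough sketch of that generalization, the suffix-based selection can be combined with a check that the two sets of columns really do pair up. The names targets and fallbacks are only illustrative, and sorting the column names is an assumption about how the pairs should be aligned:

```r
# Rebuild the example data from the question.
df <- data.frame(
  a1 = c(1, 2, 4, NA, 2, NA),
  b1 = c(3, NA, 4, 4, 4, 3),
  c1 = c(NA, 3, 3, 4, 2, 3),
  a2 = c(2, 3, 5, 5, 3, 4),
  b2 = c(1, 2, 4, 5, 6, 3),
  c2 = c(3, 3, 2, 3, 4, 3)
)

# Pick target and fallback columns by suffix, then sort the names so that
# a1 lines up with a2, b1 with b2, and so on (assumed pairing rule).
targets   <- sort(grep("1$", colnames(df), value = TRUE))
fallbacks <- sort(grep("2$", colnames(df), value = TRUE))

# Guard against mismatched pairs: the prefixes must agree one-to-one.
stopifnot(identical(sub("1$", "", targets), sub("2$", "", fallbacks)))

# Same mapply/ifelse approach as above, applied to the selected columns.
df2 <- df[targets]
df2[] <- mapply(function(x, y) ifelse(is.na(x), y, x),
                df[targets], df[fallbacks],
                SIMPLIFY = FALSE)
df2
```

The stopifnot() call stops early if, say, a column d1 exists without a matching d2, rather than silently pairing the wrong columns.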
#2
1
coalesce in dplyr is meant to do exactly this (replace the NAs in a first vector with the non-NA elements of a later one), e.g.
coalesce(df$a1,df$a2)
[1] 1 2 4 5 2 4
It can be used with sapply to do the whole dataset in an efficient and easily extensible manner:
sapply(c("a","b","c"),function(x) coalesce(df[,paste0(x,1)],df[,paste0(x,2)]))
a b c
[1,] 1 3 3
[2,] 2 2 3
[3,] 4 4 3
[4,] 5 4 4
[5,] 2 4 2
[6,] 4 3 3
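Note that sapply returns a matrix whose columns are named a, b and c. If a data frame with the original column names is preferred, one possible variant (a sketch, assuming the columns follow the prefix-plus-1/2 naming used in the question; prefixes and df_filled are illustrative names) is to use lapply and convert the result:

```r
library(dplyr)

# Rebuild the example data from the question.
df <- data.frame(
  a1 = c(1, 2, 4, NA, 2, NA),
  b1 = c(3, NA, 4, 4, 4, 3),
  c1 = c(NA, 3, 3, 4, 2, 3),
  a2 = c(2, 3, 5, 5, 3, 4),
  b2 = c(1, 2, 4, 5, 6, 3),
  c2 = c(3, 3, 2, 3, 4, 3)
)

prefixes <- c("a", "b", "c")

# Coalesce each "<prefix>1" column with its "<prefix>2" counterpart.
filled <- lapply(prefixes, function(p) {
  coalesce(df[[paste0(p, 1)]], df[[paste0(p, 2)]])
})

# Name the results after the original target columns and convert the
# list to a data frame.
names(filled) <- paste0(prefixes, 1)
df_filled <- as.data.frame(filled)
df_filled
```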
#3
0
dfnew <- ifelse(is.na(df$a1), df$a2, df$a1)
as.data.frame(dfnew)
This handles only the a1 column; you will have to run it for each of a, b and c and cbind the results. If there are too many columns, running a loop will be the best option, in my opinion.
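The loop suggested here could look roughly like the following. This is only a sketch: it assumes every target column "<prefix>1" has a fallback column "<prefix>2", and prefixes, target and fallback are illustrative names:

```r
# Rebuild the example data from the question.
df <- data.frame(
  a1 = c(1, 2, 4, NA, 2, NA),
  b1 = c(3, NA, 4, 4, 4, 3),
  c1 = c(NA, 3, 3, 4, 2, 3),
  a2 = c(2, 3, 5, 5, 3, 4),
  b2 = c(1, 2, 4, 5, 6, 3),
  c2 = c(3, 3, 2, 3, 4, 3)
)

prefixes <- c("a", "b", "c")

# Start from the target columns and fill them in one pair at a time.
dfnew <- df[paste0(prefixes, 1)]
for (p in prefixes) {
  target   <- paste0(p, 1)
  fallback <- paste0(p, 2)
  # Take the fallback value wherever the target is NA.
  dfnew[[target]] <- ifelse(is.na(df[[target]]), df[[fallback]], df[[target]])
}
dfnew
```

Assigning into dfnew inside the loop avoids the separate cbind step.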
#4
0
You can use hutils::coalesce. It should be slightly faster, especially if it can 'cheat' -- if any columns have no NAs and so don't need to change, coalesce will skip them:
a1 <- c(1, 2, 4, NA, 2, NA)
b1 <- c(3, NA, 4, 4, 4, 3)
c1 <- c(NA, 3, 3, 4, 2, 3)
a2 <- c(2, 3, 5, 5, 3, 4)
b2 <- c(1, 2, 4, 5, 6, 3)
c2 <- c(3, 3, 2, 3, 4, 3)
s <- function(x) {
sample(x, size = 1e6, replace = TRUE)
}
df <- as.data.frame(cbind(a1 = s(a1), b1 = s(b1), c1 = s(c1),
a2 = s(a2), b2 = s(b2), c2 = s(c2)))
library(microbenchmark)
library(hutils)
library(data.table)
dt <- as.data.table(df)
old <- paste0(letters[1:3], "1") # you will need to specify
new <- paste0(letters[1:3], "2")
dplyr_coalesce <- function(df) {
ans <- df
for (j in seq_along(old)) {
o <- old[j]
n <- new[j]
ans[[o]] <- dplyr::coalesce(ans[[o]], df[[n]])
}
ans
}
hutils_coalesce <- function(df) {
ans <- df
for (j in seq_along(old)) {
o <- old[j]
n <- new[j]
ans[[o]] <- hutils::coalesce(ans[[o]], df[[n]])
}
ans
}
microbenchmark(dplyr = dplyr_coalesce(df),
hutils = hutils_coalesce(df))
#> Unit: milliseconds
#> expr min lq mean median uq max neval cld
#> dplyr 45.78123 61.76857 95.10870 69.21561 87.84774 1452.0800 100 b
#> hutils 36.48602 46.76336 63.46643 52.95736 64.53066 252.5608 100 a
Created on 2018-03-29 by the reprex package (v0.2.0).