Replace all missing values in an R data.table with a value

If you have an R data.table that has missing values, how do you replace all of them with, say, the value 0? For example:

aa = data.table(V1=1:10,V2=c(1,2,2,3,3,3,4,4,4,4))
bb = data.table(V1=3:6,X=letters[1:4])
setkey(aa,V1)
setkey(bb,V1)
tt = bb[aa]

    V1  X V2
 1:  1 NA  1
 2:  2 NA  2
 3:  3  a  2
 4:  4  b  3
 5:  5  c  3
 6:  6  d  3
 7:  7 NA  4
 8:  8 NA  4
 9:  9 NA  4
10: 10 NA  4

Is there any way to do this in one line? If it were just a matrix, you could do:

tt[is.na(tt)] = 0

3 Solutions

#1

is.na (being a primitive) has relatively little overhead and is usually quite fast. So you can just loop through the columns and use set to replace NA with 0.

Using <- to assign will result in a copy of all the columns, and this is not the idiomatic way of using data.table.

First I'll illustrate how to do it, and then show how slow the alternative can get on huge data (due to the copy):

One way to do this efficiently:

for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)

You'll get a warning here that 0 is being coerced to character to match the type of the column (X is a character column). You can ignore it.
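If the coercion warning is unwelcome, one option is to pick a fill value that already matches each column's type. The following is a minimal sketch on the table from the question; using "0" for character columns is an assumption made for illustration, and factor or logical columns are not handled.

```r
library(data.table)

# Rebuild the example table from the question
aa <- data.table(V1 = 1:10, V2 = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4))
bb <- data.table(V1 = 3:6, X = letters[1:4])
setkey(aa, V1)
setkey(bb, V1)
tt <- bb[aa]

for (j in seq_along(tt)) {
  na_rows <- which(is.na(tt[[j]]))
  if (length(na_rows) == 0L) next      # nothing to replace in this column

  # Choose a fill value of the same type as the column (assumed choices)
  fill <- if (is.character(tt[[j]])) {
    "0"
  } else if (is.integer(tt[[j]])) {
    0L
  } else {
    0
  }

  set(tt, i = na_rows, j = j, value = fill)   # update by reference
}
```

Because the fill value has the column's own type, set() has nothing to coerce, and the table is still updated in place.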

Why shouldn't you use `<-` here:

# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
for (i in seq_along(tt)) 
    set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
#   user  system elapsed 
#  0.284   0.083   0.386 

# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
#   user  system elapsed 
#  4.110   0.976   5.187
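The by-reference behaviour can also be checked without a large benchmark. This small sketch compares the memory address of a toy table (the table `dt` and its columns are made up for illustration) before and after the set() loop, using data.table's address() helper.

```r
library(data.table)

# Toy table with a few missing values (hypothetical data)
dt <- data.table(a = c(1, NA, 3), b = c(NA, 5, 6))

addr_before <- address(dt)   # address of the table before the update

for (j in seq_along(dt)) {
  set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
}

addr_after <- address(dt)    # address of the table after the update

# If set() worked in place, both addresses refer to the same object
same_object <- identical(addr_before, addr_after)
print(same_object)
```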

#2

Nothing unusual here:

tt[is.na(tt)] = 0

...will work.

This is somewhat confusing, however, given that:

tt[is.na(tt)]

...currently returns:

Error in [.data.table(tt, is.na(tt)) : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please let datatable-help know if you'd like this, or add your comments to FR #1611.

#3

I would make use of data.table and lapply, namely:

tt[,lapply(.SD,function(kkk) ifelse(is.na(kkk),-666,kkk)),.SDcols=names(tt)]

yielding:

    V1    X V2
 1:  1 -666  1
 2:  2 -666  2
 3:  3    a  2
 4:  4    b  3
 5:  5    c  3
 6:  6    d  3
 7:  7 -666  4
 8:  8 -666  4
 9:  9 -666  4
10: 10 -666  4
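Note that the call above returns a new table rather than changing tt. A possible variation, sketched here under the assumption that only numeric columns should be filled, combines lapply with := so the result is assigned back into the same table. The table `dt` and its column names are hypothetical.

```r
library(data.table)

# Toy table mixing numeric and character columns (hypothetical data)
dt <- data.table(a = c(1, NA, 3), b = c(NA, 2L, 3L), s = c("x", NA, "z"))

# Names of the numeric columns only
num_cols <- names(dt)[sapply(dt, is.numeric)]

# Replace NA with 0 in those columns and assign the results back with :=
dt[, (num_cols) := lapply(.SD, function(col) replace(col, is.na(col), 0)),
   .SDcols = num_cols]

print(dt)
```

The character column s is left untouched, so its NA remains and no type coercion is involved.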
