If you have an R data.table that has missing values, how do you replace all of them with, say, the value 0? For example:
aa = data.table(V1=1:10,V2=c(1,2,2,3,3,3,4,4,4,4))
bb = data.table(V1=3:6,X=letters[1:4])
setkey(aa,V1)
setkey(bb,V1)
tt = bb[aa]
    V1  X V2
 1:  1 NA  1
 2:  2 NA  2
 3:  3  a  2
 4:  4  b  3
 5:  5  c  3
 6:  6  d  3
 7:  7 NA  4
 8:  8 NA  4
 9:  9 NA  4
10: 10 NA  4
Is there any way to do this in one line? If it were just a matrix, you could do:
tt[is.na(tt)] = 0
3 solutions
#1
29
is.na (being a primitive) has relatively little overhead and is usually quite fast. So you can just loop through the columns and use set to replace NA with 0.
Using <- to assign will result in a copy of all the columns, and this is not the idiomatic way of using data.table.
First I'll illustrate how to do it, and then show how slow the <- approach can get on huge data (due to the copy):
One way to do this efficiently:
for (i in seq_along(tt)) set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
You'll get a warning here that "0" is being coerced to character to match the type of the column (X is a character column). You can ignore it.
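If you would rather not trigger that coercion warning at all, one option is to pick a replacement value of the same type as each column. The following is a minimal sketch, not part of the original answer; using the string "0" for character columns is an assumption made purely for illustration, and the variable name fill is hypothetical.

```r
library(data.table)

# Rebuild the example table from the question
aa <- data.table(V1 = 1:10, V2 = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4))
bb <- data.table(V1 = 3:6, X = letters[1:4])
setkey(aa, V1)
setkey(bb, V1)
tt <- bb[aa]

for (j in seq_along(tt)) {
  col <- tt[[j]]
  # Choose a fill value whose type matches the column
  # (assumption: "0" is an acceptable placeholder for character columns)
  fill <- if (is.character(col)) "0" else if (is.integer(col)) 0L else 0
  # set() updates the column by reference, only on the rows that are NA
  set(tt, i = which(is.na(col)), j = j, value = fill)
}
```

Because the fill value already has the column's type, set has nothing to coerce, and the column types of tt stay as they were.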
Why you shouldn't use <- here:
# by reference - idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# modifies value by reference - no copy
system.time({
  for (i in seq_along(tt))
    set(tt, i=which(is.na(tt[[i]])), j=i, value=0)
})
#    user  system elapsed
#   0.284   0.083   0.386
# by copy - NOT the idiomatic way
set.seed(45)
tt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e7*3, TRUE), ncol=3))
tracemem(tt)
# makes copy
system.time({tt[is.na(tt)] <- 0})
# a bunch of "tracemem" output showing the copies being made
# user system elapsed
# 4.110 0.976 5.187
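For all-numeric tables like the benchmark data above, newer data.table releases (1.12.4 and later, as far as I know) also ship setnafill, which fills NAs by reference without an explicit loop. This is a different function from the set loop shown in this answer, and to my knowledge it only handles integer and double columns, so it would not work on the character column X from the question. A small sketch with a reduced row count (1e5 instead of 1e7, chosen here only to keep the example quick):

```r
library(data.table)

set.seed(45)
# Same construction as the benchmark above, but smaller
dt <- data.table(matrix(sample(c(NA, rnorm(10)), 1e5 * 3, TRUE), ncol = 3))

# type = "const" replaces every NA with the value given in `fill`;
# the table is modified in place, so there is no assignment with <-
setnafill(dt, type = "const", fill = 0)
```
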
#2
13
Nothing unusual here:
tt[is.na(tt)] = 0
...will work.
This is somewhat confusing, however, given that:
tt[is.na(tt)]
...currently returns:
Error in `[.data.table`(tt, is.na(tt)) : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). Please let datatable-help know if you'd like this, or add your comments to FR #1611.
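The "invalid type (matrix)" part of the message makes more sense once you look at what is.na returns for a table: a logical matrix with one cell per element, not a row index. The sketch below uses a small hypothetical table (not the one from the question) just to inspect that mask.

```r
library(data.table)

# Small illustrative table with one NA in each column
tt <- data.table(V1 = c(1, NA, 3), X = c("a", NA, "c"))

na_mask <- is.na(tt)

is.matrix(na_mask)              # the mask is a matrix, not a vector of rows
dim(na_mask)                    # one row per table row, one column per table column
which(na_mask, arr.ind = TRUE)  # (row, col) positions of the missing cells
```

The assignment form tt[is.na(tt)] = 0 goes through the replacement method rather than the extraction method, which appears to be why one accepts a matrix and the other does not.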
#3
0
I would make use of data.table and lapply, namely:
tt[,lapply(.SD,function(kkk) ifelse(is.na(kkk),-666,kkk)),.SDcols=names(tt)]
yielding:
    V1    X V2
 1:  1 -666  1
 2:  2 -666  2
 3:  3    a  2
 4:  4    b  3
 5:  5    c  3
 6:  6    d  3
 7:  7 -666  4
 8:  8 -666  4
 9:  9 -666  4
10: 10 -666  4
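Note that the expression above returns a new data.table and leaves tt itself untouched. If you want the same lapply idea to update tt, one possibility is to combine it with :=. This is a sketch on a small hypothetical table, using base R's replace instead of ifelse and 0 instead of -666; the helper variable cols is introduced here for illustration.

```r
library(data.table)

# Small illustrative table with NAs in a character and a numeric column
tt <- data.table(V1 = 1:4, X = c(NA, "a", "b", NA), V2 = c(1, NA, 3, 4))

cols <- names(tt)
# Assign the processed columns back into tt;
# replace() coerces the numeric 0 to "0" in the character column
tt[, (cols) := lapply(.SD, function(x) replace(x, is.na(x), 0)), .SDcols = cols]
```

The lapply call still builds new column vectors before they are assigned, so this is not as cheap as the set loop, but it avoids copying the whole table with <-.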