I have a data.table with a column that contains NAs. I want to drop rows where that column takes a particular value (which happens to be ""). However, my first attempt led me to lose the rows with NAs as well:
> a = c(1,"",NA)
> x <- data.table(a);x
a
1: 1
2:
3: NA
> y <- x[a!=""];y
a
1: 1
After looking at ?`!=`, I found a one-liner that works, but it's a pain:
> z <- x[!sapply(a,function(x)identical(x,""))]; z
a
1: 1
2: NA
I'm wondering if there's a better way to do this. Also, I see no good way of extending this to exclude multiple non-NA values. Here's a bad way:
> drop_these <- function(these,where){
+ argh <- !sapply(where,
+ function(x)unlist(lapply(as.list(these),function(this)identical(x,this)))
+ )
+ if (is.matrix(argh)){argh <- apply(argh,2,all)}
+ return(argh)
+ }
> x[drop_these("",a)]
a
1: 1
2: NA
> x[drop_these(c(1,""),a)]
a
1: NA
I looked at ?J and tried things out with a data.frame, which seems to work differently, keeping NAs when subsetting:
> w <- data.frame(a,stringsAsFactors=F); w
a
1 1
2
3 <NA>
> d <- w[a!="",,drop=F]; d
a
1 1
NA <NA>
3 Answers
#1
15
To provide a solution to your question:
You should use %in%. It gives you back a logical vector in which an NA element is FALSE rather than NA:
a %in% ""
# [1] FALSE TRUE FALSE
x[!a %in% ""]
# a
# 1: 1
# 2: NA
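The same idea extends to excluding several values at once, which the question also asks about. Here is a minimal sketch, assuming the x and a from the question; drop_vals and is_dropped are names made up for this illustration.

```r
library(data.table)

a <- c(1, "", NA)            # coerced to character: "1", "", NA
x <- data.table(a)

# Hypothetical helper vector: the non-NA values to exclude.
drop_vals <- c("1", "")

# %in% never returns NA, so the NA element gets FALSE here
# and its row survives the negation below.
is_dropped <- a %in% drop_vals
res <- x[!is_dropped]
print(res)
```

Since NA is not listed in drop_vals, the NA row is the only one kept.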
To find out why this is happening in data.table (as opposed to data.frame):
If you look at the data.table source code in the file data.table.R, under the function "[.data.table", there's a set of if statements that check the i argument. One of them is:
if (!missing(i)) {
    # Part (1)
    isub = substitute(i)
    # Part (2)
    if (is.call(isub) && isub[[1L]] == as.name("!")) {
        notjoin = TRUE
        if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
        nomatch = 0L
        isub = isub[[2L]]
    }
    .....
    # "isub" is being evaluated using "eval" to result in a logical vector
    # Part 3
    if (is.logical(i)) {
        # see DT[NA] thread re recycling of NA logical
        if (identical(i,NA)) i = NA_integer_
        # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        else i[is.na(i)] = FALSE
    }
    ....
}
To explain the discrepancy, I've pasted the important piece of code here and marked it into 3 parts.
First, why doesn't dt[a != ""] work as expected (by the OP)?
Part 1 evaluates to an object of class call. The second half of the if condition in part 2 returns FALSE, because the function being called is != rather than !. Following that, the call is evaluated to give c(TRUE, FALSE, NA). Then part 3 is executed, so the NA is replaced with FALSE (the last line of the is.logical block).
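To make those three steps concrete, here is a rough base R emulation. It is only a sketch of the logic described above, not the actual data.table code path; is_not is a name invented for the illustration, and identical() stands in for the == comparison used in the source excerpt.

```r
a <- c("1", "", NA)

# Part 1: capture the unevaluated expression, as substitute(i) would.
isub <- quote(a != "")

# Part 2: is the outermost call a bare "!"? Here it is "!=", so no.
is_not <- is.call(isub) && identical(isub[[1L]], as.name("!"))

# Evaluate the call to a logical vector: TRUE FALSE NA.
i <- eval(isub)

# Part 3: NA is turned into FALSE, so the NA row is dropped.
i[is.na(i)] <- FALSE
print(which(i))
```

Only position 1 survives, which matches the single row returned by x[a != ""] in the question.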
Second, why does x[!(a == "")] work as expected (by the OP)?
Part 1 returns a call once again. But this time the condition in part 2 evaluates to TRUE and therefore sets:
1) `notjoin = TRUE`
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)
That is where the magic happens. The negation has been removed for now. And remember, this is still an object of class call. So this gets evaluated (using eval) to a logical vector again: (a == "") evaluates to c(FALSE, TRUE, NA).
Now, this is checked with is.logical in part 3, so the NA gets replaced with FALSE and the vector becomes c(FALSE, TRUE, FALSE). At some point later, a which(c(F,T,F)) is executed, which results in 2 here. Because notjoin = TRUE (from part 2), seq_len(nrow(x))[-2] = c(1,3) is returned. So x[!(a == "")] basically returns x[c(1,3)], which is the desired result. Here's the relevant code snippet:
if (notjoin) {
    if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
    irows = irows[irows!=0L]
    # WHERE MAGIC HAPPENS (returns c(1,3))
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL  # NULL meaning all rows i.e. seq_len(nrow(x))
    # Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
    # column when irows contains negatives.
}
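The not-join path can be emulated in base R in the same spirit. This is a sketch of the logic as described in this answer for the source version quoted here; other data.table versions may handle a leading ! differently. The variable keep is a name invented for the illustration.

```r
a <- c("1", "", NA)

# Part 1: the unevaluated expression.
isub <- quote(!(a == ""))

# Part 2: the outermost call is "!", so strip it and remember that fact.
notjoin <- is.call(isub) && identical(isub[[1L]], as.name("!"))
if (notjoin) isub <- isub[[2L]]     # now (a == "")

# Evaluate what is left: FALSE TRUE NA.
i <- eval(isub)

# Part 3: NA becomes FALSE, giving FALSE TRUE FALSE.
i[is.na(i)] <- FALSE

# Rows that matched the inner condition, then their complement.
irows <- which(i)
keep <- if (length(irows)) seq_along(a)[-irows] else seq_along(a)
print(a[keep])
```

The NA is set to FALSE before the complement is taken, so position 3 ends up among the kept rows instead of being dropped.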
Given that, I think there are some inconsistencies in the syntax. If I manage to find time to formulate the problem, I'll write a post soon.
#2
3
As you have already figured out, this is the reason:
a != ""
#[1] TRUE FALSE NA
You can do what you figured out already, i.e. x[is.na(a) | a != ""], or you could setkey on a and do the following:
setkey(x, a)
x[!J("")]
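For comparison, the explicit is.na() form mentioned above can be checked on the question's data. This is a small sketch that rebuilds x from scratch so it runs on its own:

```r
library(data.table)

x <- data.table(a = c("1", "", NA))

# is.na(a) is TRUE for the NA row, and TRUE | NA is TRUE,
# so that row is kept even though a != "" is NA there.
res <- x[is.na(a) | a != ""]
print(res)
```

It costs an extra scan compared with a plain comparison, but it needs no key and leaves the row order untouched.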
#3
3
Background answer from Matthew:
The behaviour with != on NA as highlighted by this question wasn't intended, thinking about it. The original intention was indeed to be different from [.data.frame w.r.t. == and NA, and I believe everyone is happy with that. For example, FAQ 2.17 has:
DT[ColA==ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]
That convenience is achieved by dint of:
DT[c(TRUE,NA,FALSE)] treats the NA as FALSE, but DF[c(TRUE,NA,FALSE)] returns NA rows for each NA
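A quick way to see that difference side by side is the sketch below. The names DF, DT, idx and the column v are made up for the illustration, and the data.frame is indexed by row with the usual comma.

```r
library(data.table)

DF <- data.frame(v = 1:3)
DT <- as.data.table(DF)
idx <- c(TRUE, NA, FALSE)

# data.frame: the NA in the index yields an extra all-NA row.
df_res <- DF[idx, , drop = FALSE]

# data.table: the NA in the index is treated as FALSE.
dt_res <- DT[idx]

print(df_res)
print(dt_res)
```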
The motivation is not just convenience but speed, since each and every !, is.na, & and == is itself a vector scan with an associated memory allocation for its result (explained in the intro vignette). So although x[is.na(a) | a!=""] is a working solution, it's exactly the type of logic I was trying to avoid needing in data.table. x[!a %in% ""] is slightly better; i.e., 2 scans (%in% and !) rather than 3 (is.na, | and !=). But really x[a != ""] should do what Frank expected (include NA) in a single scan.
New feature request filed which links back to this question:
DT[col != ""] should include NA
Thanks to Frank, Eddi and Arun. If I haven't understood correctly, feel free to correct me; otherwise the change will get made eventually. It will need to be done in a way that considers compound expressions; e.g., DT[colA=="foo" & colB!="bar"] should exclude rows with NA in colA but include rows where colA is non-NA but colB is NA. Similarly, DT[colA!=colB] should include rows where either colA or colB is NA but not both. And perhaps DT[colA==colB] should include rows where both colA and colB are NA (which it doesn't currently, I believe).
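Until something like that exists, the intended result for the first compound example can be spelled out by hand. This is a sketch on made-up data (the values in colA and colB are hypothetical), showing which rows the proposed semantics would keep:

```r
library(data.table)

DT <- data.table(
  colA = c("foo", "foo", "foo", NA,    "baz"),
  colB = c("bar", "qux", NA,    "qux", "qux")
)

# Keep rows where colA is "foo" and colB is either NA or not "bar".
# An NA in colA makes the whole condition NA, which is treated as FALSE.
res <- DT[colA == "foo" & (is.na(colB) | colB != "bar")]
print(res)
```

Rows 2 and 3 are returned: the row with NA in colA is excluded, while the row with colA equal to "foo" and NA in colB is kept.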