How do I select all the rows that have a missing value in the primary key of a data.table?
library(data.table)
DT = data.table(x=rep(c("a","b",NA),each=3), y=c(1,3,6), v=1:9)
setkey(DT,x)
Selecting for a particular value is easy:
DT["a",]
Selecting for the missing values seems to require a vector search. One cannot use binary search. Am I correct?
DT[NA,]        # does not work
DT[is.na(x),]  # does work
2 Solutions
#1
21
Fortunately, DT[is.na(x),] is nearly as fast as (e.g.) DT["a",], so in practice, this may not really matter much:
library(data.table)
library(rbenchmark)
DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)
benchmark(DT["a",],
DT[is.na(x),],
replications=20)
# test replications elapsed relative user.self sys.self user.child
# 1 DT["a", ] 20 9.18 1.000 7.31 1.83 NA
# 2 DT[is.na(x), ] 20 10.55 1.149 8.69 1.85 NA
===
Addition from Matthew (won't fit in a comment):
The data above has 3 very large groups, though. So the speed advantage of binary search is dominated here by the time to create the large subset (1/3 of the data is copied).
benchmark(DT["a",], # repeat select of large subset on my netbook
DT[is.na(x),],
replications=3)
test replications elapsed relative user.self sys.self
DT["a", ] 3 2.406 1.000 2.357 0.044
DT[is.na(x), ] 3 3.876 1.611 3.812 0.056
benchmark(DT["a",which=TRUE], # isolate search time
DT[is.na(x),which=TRUE],
replications=3)
test replications elapsed relative user.self sys.self
DT["a", which = TRUE] 3 0.492 1.000 0.492 0.000
DT[is.na(x), which = TRUE] 3 2.941 5.978 2.932 0.004
As the size of the subset returned decreases (e.g. by adding more groups), the difference becomes apparent. Vector scans on a single column aren't too bad, but on 2 or more columns they quickly degrade.
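To see that effect on a table with many small groups, something like the following could be tried. It is only a sketch: the table DTm, the group labels and the sizes are made up for illustration, and the timings will depend on the machine and the data.table version.

```r
library(data.table)

set.seed(1)
n <- 1e6
# Hypothetical data: about 10,000 small groups plus some missing keys
keys <- sample(c(sprintf("g%05d", 1:1e4), NA_character_), n, replace = TRUE)
DTm <- data.table(x = keys, v = seq_len(n))
setkey(DTm, x)

# which = TRUE returns row numbers only, so no subset is copied
idx_scan <- DTm[is.na(x), which = TRUE]   # vector scan over the whole column
idx_key  <- DTm["g00001", which = TRUE]   # binary search on the key

# Compare the search cost of the two approaches
system.time(for (i in 1:20) DTm[is.na(x), which = TRUE])
system.time(for (i in 1:20) DTm["g00001", which = TRUE])
```

With which = TRUE the cost of building the result is taken out of the picture, so what remains is mostly the search itself.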
Maybe NAs should be joinable too. I seem to remember a gotcha with that, though. Here's some history linked from FR#1043 "Allow or disallow NA in keys?". It mentions there that NA_integer_ is internally a negative integer. That trips up radix/counting sort (iirc), resulting in setkey going slower. But it's on the list to revisit.
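The point about NA_integer_ can be poked at from R itself. This is a rough illustration rather than a look at data.table's internals: R reserves the most negative 32-bit integer for NA_integer_, so that value cannot be produced by arithmetic, and setkey places NA ahead of every other key value. The table DTn is made up for the example.

```r
library(data.table)

# Largest representable integer; the range is symmetric because the most
# negative 32-bit pattern is reserved for NA_integer_
int_max <- .Machine$integer.max

# One step below -int_max is not representable and overflows to NA
below_min <- suppressWarnings(-int_max - 1L)
is.na(below_min)

# Hypothetical keyed table: NA sorts ahead of all other key values
DTn <- data.table(id = c(3L, NA, 1L))
setkey(DTn, id)
DTn$id
```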
#2
19
This is now implemented in v1.8.11. From NEWS:
o Binary search is now capable of subsetting NA/NaNs and also perform joins and merges by matching NAs/NaNs.
Although you'll have to provide the correct NA (NA_real_, NA_character_ etc.) explicitly at the moment.
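As a small illustration of what "the correct NA" means, here is a sketch assuming data.table 1.8.11 or later. The table DTi and its integer key are hypothetical, chosen only to show a non-character key; the typed NA has to match the storage type of the key column.

```r
library(data.table)

# Each NA constant has its own storage type; a bare NA is logical
typeof(NA)
typeof(NA_integer_)
typeof(NA_real_)
typeof(NA_character_)

# Hypothetical table with an integer key containing missing values
DTi <- data.table(id = c(2L, NA, 1L, NA), v = 1:4)
setkey(DTi, id)

# Keyed lookup using the NA that matches the key column's type
res <- DTi[J(NA_integer_)]
res
```

For a double key the same pattern would use NA_real_ instead.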
On OP's data:
DT[J(NA_character_)] # or for characters simply DT[NA_character_]
# x y v
# 1: NA 1 7
# 2: NA 3 8
# 3: NA 6 9
Also, here's the same benchmark from @JoshOBrien's post, with this binary search for NA added:
library(data.table)
library(rbenchmark)
DT = data.table(x=rep(c("a","b",NA),each=3e6), y=c(1,3,6), v=1:9)
setkey(DT,x)
benchmark(DT["a",],
DT[is.na(x),],
DT[NA_character_],
replications=20)
test replications elapsed relative user.self sys.self
1 DT["a", ] 20 4.763 1.238 4.000 0.567
2 DT[is.na(x), ] 20 5.399 1.403 4.537 0.794
3 DT[NA] 20 3.847 1.000 3.215 0.600 # <~~~