Is there a way to get the data.table equivalent of the following SQL query?
create table C as select * from R, P where
P.x between R.min_x and R.max_x and P.var2 > R.col3
My problem is that I cannot take the cartesian product of R and P, since R itself would crash. I am happy with any technique (even if it takes several steps...)
Typical sizes are 1K rows for R and 3M rows for P
EDIT:
library(data.table)
R = data.table(min_x=c(.6,.4,.01,.8),max_x=c(.7,.51,.05,.95),col3=c(.6,.4,1.2,.6))
P = data.table(x=seq(.1,.9,.1),var2=c(1,.4,.3,.2,0,.5,.65,.7,0))
setkey(P, x)
setkey(R,min_x,max_x) #and max_x is always > min_x
#R
# min_x max_x col3
#1: 0.01 0.05 1.2
#2: 0.40 0.51 0.4
#3: 0.60 0.70 0.6
#4: 0.80 0.95 0.6
#P
# x var2
#1: 0.1 1.00 => x not in any [min_x,max_x]
#2: 0.2 0.40 => same
#3: 0.3 0.30 => same
#4: 0.4 0.20 => .4 in [.4,.51] but .2 < .4 so NO
#5: 0.5 0.00 => .5 in [.4,.51] but 0 < .4 so NO
#6: 0.6 0.50 => .6 in [.6,.7] but .5 < .6 so NO
#7: 0.7 0.65 => .7 in [.6,.7] AND .65 > .6 => SELECTED
#8: 0.8 0.70 => .8 in [.8,.95] AND .7 > .6 => SELECTED
#9: 0.9 0.00 => .9 in [.8,.95] but 0 < .6 so NO
So the expected result is:
# min_x max_x col3 x var2
#1: 0.60 0.70 0.6 0.70 0.65
#2: 0.80 0.95 0.6 0.80 0.70
2 solutions
#1
1
When this FR is implemented (and its links may be useful) :
FR#203 Allow 2 column to specify range in i instead of %between%
it might be :
setkey(B, var1, var2)
B[A[,list(.(col1,col2),.(-Inf,col3))], j]
If that sounds ok? You would want to specify a j which would run per group (per row of i) to avoid a potential cartesian expansion in memory. But if you really wanted the large table returned, the allow.cartesian flag could be set:
B[A[,list(.(col1,col2),.(-Inf,col3))], allow.cartesian=TRUE]
This can't be done right now of course, so this is just an exploratory answer.
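For what it's worth, later data.table releases (1.9.8 onward, to the best of my knowledge) added non-equi joins through the on= argument, which cover this kind of range condition. Below is a minimal sketch on the question's sample data, assuming a recent data.table. As an assumption to keep the boundary comparisons exact in floating point, x is typed out as literals here rather than built with seq(). The name res is only illustrative.

```r
library(data.table)

# Sample data from the question (x written as literals, see the note above).
R <- data.table(min_x = c(.6, .4, .01, .8),
                max_x = c(.7, .51, .05, .95),
                col3  = c(.6, .4, 1.2, .6))
P <- data.table(x    = c(.1, .2, .3, .4, .5, .6, .7, .8, .9),
                var2 = c(1, .4, .3, .2, 0, .5, .65, .7, 0))

# Non-equi join: for each row of R, find the rows of P with min_x <= x <= max_x.
# In j, the i. prefix picks columns of R and the x. prefix picks columns of P.
# nomatch = 0L drops rows of R that match nothing in P.
res <- P[R,
         .(min_x = i.min_x, max_x = i.max_x, col3 = i.col3,
           x = x.x, var2 = x.var2),
         on = .(x >= min_x, x <= max_x),
         nomatch = 0L]

# Second condition of the query, applied to the range matches.
res <- res[var2 > col3]
print(res)
```

The range matches are materialised before the var2 > col3 filter, so on a 3M-row P the memory use depends on how many points fall inside the ranges. That trade-off is not measured here.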
#2
0
I ended up with an answer with the help of @Matthew Dowle from this post
setkey(P,x)
# Sort P by x and mark it as sorted so that later queries can use binary search on P.
# Look up each min_x in the key of P and return the row number of the first x >= min_x. J stands for Join.
from = P[J(R$min_x), roll=-Inf, mult='first', which=TRUE]
# Look up each max_x in the key of P and return the row number of the last x <= max_x.
to = P[J(R$max_x), roll=Inf, mult='last', which=TRUE]
# Vectorized: number of rows of P in each range, to[i]-from[i]+1
len = to-from+1
# A range is valid only if both ends were found (NA means no x >= min_x or no x <= max_x)
# and it is not empty (from > to means that no x lies inside [min_x, max_x])
isValid = !is.na(from) & !is.na(to) & from <= to
# Keep only the valid ranges in from/to
to = to[isValid]
from = from[isValid]
# Set len to 0 for invalid ranges, so that rep() below drops the matching rows of R
len[!isValid] = 0
# Row indices into P
i = unlist(mapply(seq.int, from, to, SIMPLIFY=FALSE))
# Row indices into R
j = rep(seq_len(nrow(R)), len)
# Bind the matching rows side by side, then apply the second condition
res = cbind(R[j], P[i])
res = res[var2>col3]
The result is as expected:
min_x max_x col3 x var2
1: 0.6 0.70 0.6 0.7 0.65
2: 0.8 0.95 0.6 0.8 0.70
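The index expansion step in the code above can be looked at in isolation. Here is a small base R sketch with hypothetical from/to positions (these values are made up for illustration and do not come from the sample data). It also treats a range with from > to as empty, which is an assumption about the desired behaviour when no point of P lies inside a range.

```r
# Hypothetical row positions in P for four ranges of R:
# range 2 is empty (its first candidate row lies after its last one),
# range 3 has no lower match at all (NA).
from <- c(6L, 5L, NA, 8L)
to   <- c(7L, 4L, 9L, 9L)

# A range is usable only if both ends were found and from <= to.
ok  <- !is.na(from) & !is.na(to) & from <= to
len <- ifelse(ok, to - from + 1L, 0L)

# Row numbers of P: every position between from and to, range by range.
p_idx <- unlist(mapply(seq.int, from[ok], to[ok], SIMPLIFY = FALSE))
# Row numbers of R: each range's row repeated once per matching row of P.
r_idx <- rep(seq_along(from), len)

print(p_idx)
print(r_idx)
```

Both index vectors have the same length, so R[r_idx] and P[p_idx] can be bound column-wise without recycling.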