I have a dataset (mydata
) that contains multiple columns which could fit inside ranges that are stored in another dataset (mycomparison
).
我有一个数据集(mydata),其中包含多个列,这些列可以适合存储在另一个数据集(mycomparison)中的范围内。
I would like to join mycomparison
to mydata
where the mydata
values are within the ranges in mycomparison
.
我想加入mycomparison到mydata,其中mydata值在mycomparison的范围内。
MWE
library(data.table)
mydata<-data.table(
id=1:5,
val1=seq(10000, 50000, by=10000),
val2=floor(rnorm(5,mean=400,sd=100)),
val3=rnorm(5,mean=.7,sd=.1)
)
mycomparison<-data.table(
Name=LETTERS[1:3],
minval1=c(0,30000,10000),
maxval1=c(50000,80000,30000),
minval2=c(300,400,300),
maxval2=c(800,800,800),
minval3=c(0,.5,.2),
maxval3=c(1,.9,.8),
correspondingval=c(.1,.2,.3)
)
Desired Output
> mydata.withmatches
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.4844319 A 0 50000 300 800 0 1 0.1
2: 2 20000 425 0.7856313 NA NA NA NA NA NA NA NA
3: 3 30000 324 0.8063969 NA NA NA NA NA NA NA NA
4: 4 40000 263 0.5590113 NA NA NA NA NA NA NA NA
5: 5 50000 187 0.8764396 NA NA NA NA NA NA NA NA
Current solution
This feels/is very clunky and involves cross-joining the data (using optiRum::CJ.dt
), doing a big logical check, and then reassembling the data.
这感觉/非常笨重,涉及交叉连接数据(使用optiRum :: CJ.dt),进行大的逻辑检查,然后重新组装数据。
library(optiRum)
workingdt<-CJ.dt(mydata,mycomparison)
matched<-workingdt[val1>=minval1 &
val1<=maxval1 &
val2>=minval2 &
val2<=maxval2 &
val3>=minval3 &
val3<=maxval3][which.min(correspondingval)]
notmatched<-mydata[id!= matched[,id]]
all<-list(matched,notmatched)
mydata.withmatches<- rbindlist(all, fill=TRUE, use.names=TRUE)
Looking for a better solution - UPDATED
I'm aware of foverlaps
but it will work on a single interval, not on many ranges like in this instance.
我知道foverlaps,但它会在一个区间内工作,而不是像在这种情况下的许多范围。
I'm hoping for a less clunky and more elegant solution.
我希望有一个不那么笨重,更优雅的解决方案。
3 个解决方案
#1
0
Nice question! Below you can find my quick fix, but it's also still a bit too cluncky for my taste.
好问题!下面你可以找到我的快速修复,但它对我的口味来说仍然有点过于笨拙。
The generated mydata
set:
生成的mydata集:
id val1 val2 val3
1: 1 10000 377 0.7912443
2: 2 20000 378 0.7709792
3: 3 30000 484 0.7049517
4: 4 40000 513 0.5169590
5: 5 50000 474 0.7987448
A simple function to filter away the non-matching rows of the comparison dataset (multiple rows can match).
一个简单的函数,用于过滤掉比较数据集的不匹配行(多行可以匹配)。
library(dplyr)
find_interval_func<-function(var.min.name, var.max.name, value, val.to.return){
compset<-data.frame(mycomparison)
for(i in 1:length(var.min.name)){
compset<-
compset %>%
filter_(paste0(var.min.name[[i]], "<=", value[[i]]),
paste0(var.max.name[[i]], ">", value[[i]]))
}
paste(compset[,val.to.return], collapse="|")
}
And the result:
结果如下:
> mydata %>%
+ group_by(1:n()) %>%
+ mutate(matchedValue = find_interval_func(c("minval1", "minval2", "minval3"),
+ c("maxval1", "maxval2", "maxval3"),
+ c(val1, val2, val3),
+ "Name"))
Source: local data table [5 x 6]
id val1 val2 val3 1:n() matchedValue
1 1 10000 377 0.7912443 1 A|C
2 2 20000 378 0.7709792 2 A|C
3 3 30000 484 0.7049517 3 A|B
4 4 40000 513 0.5169590 4 A|B
5 5 50000 474 0.7987448 5 B
#2
0
I do not exactly understand your Desired Output, because multiple id's match the mycomparison
data.table. Using your data (rounded to two decimal places):
我并不完全理解您的Desired Output,因为多个id匹配mycomparison data.table。使用您的数据(四舍五入到小数点后两位):
> mydata
id val1 val2 val3
1: 1 10000 387 0.48
2: 2 20000 425 0.79
3: 3 30000 324 0.81
4: 4 40000 263 0.56
5: 5 50000 187 0.88
And
> mycomparison
Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: A 0 50000 300 800 0.0 1.0 0.1
2: B 30000 80000 400 800 0.5 0.9 0.2
3: C 10000 30000 300 800 0.2 0.8 0.3
This gives:
> workingdt
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0.0 1.0 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0.0 1.0 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
4: 4 40000 263 0.56 A 0 50000 300 800 0.0 1.0 0.1
5: 5 50000 187 0.88 A 0 50000 300 800 0.0 1.0 0.1
6: 1 10000 387 0.48 B 30000 80000 400 800 0.5 0.9 0.2
7: 2 20000 425 0.79 B 30000 80000 400 800 0.5 0.9 0.2
8: 3 30000 324 0.81 B 30000 80000 400 800 0.5 0.9 0.2
9: 4 40000 263 0.56 B 30000 80000 400 800 0.5 0.9 0.2
10: 5 50000 187 0.88 B 30000 80000 400 800 0.5 0.9 0.2
11: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
12: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
13: 3 30000 324 0.81 C 10000 30000 300 800 0.2 0.8 0.3
14: 4 40000 263 0.56 C 10000 30000 300 800 0.2 0.8 0.3
15: 5 50000 187 0.88 C 10000 30000 300 800 0.2 0.8 0.3
And leaving off your which.min()
:
并且离开你的which.min():
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0.0 1.0 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0.0 1.0 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
4: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
5: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
If you use the data.table group-by functionality, you can pick the min(correspondingval)
for each id
(I am leaving off the unmatched data for the moment):
如果你使用data.table分组功能,你可以为每个id选择min(对应的值)(我暂时不使用不匹配的数据):
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3]
[,.SD[which.min(correspondingval)], by=id]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0 1 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0 1 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0 1 0.1
Or, the max(correspondingval)
if you prefer:
或者,如果您愿意,还可以使用max(对应值):
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3]
[,.SD[which.max(correspondingval)], by=id]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
2: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
If all you want--as shown in your Desired Output--is the first row with the minimum correspondingval
and everything else with NA
there are easier ways to do this. If you want to know where each id
matches a range--as I have shown in my output--then a cleaner, more elegant solution is different.
如果您想要的只是 - 如您所需的输出中所示 - 是具有最小对应值的第一行以及具有NA的所有其他行,则有更简单的方法来执行此操作。如果你想知道每个id匹配一个范围 - 正如我在输出中所示 - 那么更清晰,更优雅的解决方案是不同的。
Let me know.
让我知道。
#3
0
There is a simple approach which uses the crossing function from tidyr, along with the very useful between function from dplyr. Of course, this works safely so long as at least one of the tables is relatively small, otherwise the crossing can be a memory hog.
有一种简单的方法,它使用来自tidyr的交叉函数,以及来自dplyr的函数之间非常有用的方法。当然,只要至少一个表格相对较小,这就可以安全地工作,否则交叉可能是记忆猪。
library(tidyr)
library(dplyr)
mydata.withmatches <- mydata %>%
crossing(mycomparison) %>%
filter(between(val1, minval1,maxval1) & between(val2, minval2, maxval2) & between(val3, minval3, maxval3)) %>%
full_join(mydata) %>%
arrange(id, Name)
#1
0
Nice question! Below you can find my quick fix, but it's also still a bit too cluncky for my taste.
好问题!下面你可以找到我的快速修复,但它对我的口味来说仍然有点过于笨拙。
The generated mydata
set:
生成的mydata集:
id val1 val2 val3
1: 1 10000 377 0.7912443
2: 2 20000 378 0.7709792
3: 3 30000 484 0.7049517
4: 4 40000 513 0.5169590
5: 5 50000 474 0.7987448
A simple function to filter away the non-matching rows of the comparison dataset (multiple rows can match).
一个简单的函数,用于过滤掉比较数据集的不匹配行(多行可以匹配)。
library(dplyr)
find_interval_func<-function(var.min.name, var.max.name, value, val.to.return){
compset<-data.frame(mycomparison)
for(i in 1:length(var.min.name)){
compset<-
compset %>%
filter_(paste0(var.min.name[[i]], "<=", value[[i]]),
paste0(var.max.name[[i]], ">", value[[i]]))
}
paste(compset[,val.to.return], collapse="|")
}
And the result:
结果如下:
> mydata %>%
+ group_by(1:n()) %>%
+ mutate(matchedValue = find_interval_func(c("minval1", "minval2", "minval3"),
+ c("maxval1", "maxval2", "maxval3"),
+ c(val1, val2, val3),
+ "Name"))
Source: local data table [5 x 6]
id val1 val2 val3 1:n() matchedValue
1 1 10000 377 0.7912443 1 A|C
2 2 20000 378 0.7709792 2 A|C
3 3 30000 484 0.7049517 3 A|B
4 4 40000 513 0.5169590 4 A|B
5 5 50000 474 0.7987448 5 B
#2
0
I do not exactly understand your Desired Output, because multiple id's match the mycomparison
data.table. Using your data (rounded to two decimal places):
我并不完全理解您的Desired Output,因为多个id匹配mycomparison data.table。使用您的数据(四舍五入到小数点后两位):
> mydata
id val1 val2 val3
1: 1 10000 387 0.48
2: 2 20000 425 0.79
3: 3 30000 324 0.81
4: 4 40000 263 0.56
5: 5 50000 187 0.88
And
> mycomparison
Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: A 0 50000 300 800 0.0 1.0 0.1
2: B 30000 80000 400 800 0.5 0.9 0.2
3: C 10000 30000 300 800 0.2 0.8 0.3
This gives:
> workingdt
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0.0 1.0 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0.0 1.0 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
4: 4 40000 263 0.56 A 0 50000 300 800 0.0 1.0 0.1
5: 5 50000 187 0.88 A 0 50000 300 800 0.0 1.0 0.1
6: 1 10000 387 0.48 B 30000 80000 400 800 0.5 0.9 0.2
7: 2 20000 425 0.79 B 30000 80000 400 800 0.5 0.9 0.2
8: 3 30000 324 0.81 B 30000 80000 400 800 0.5 0.9 0.2
9: 4 40000 263 0.56 B 30000 80000 400 800 0.5 0.9 0.2
10: 5 50000 187 0.88 B 30000 80000 400 800 0.5 0.9 0.2
11: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
12: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
13: 3 30000 324 0.81 C 10000 30000 300 800 0.2 0.8 0.3
14: 4 40000 263 0.56 C 10000 30000 300 800 0.2 0.8 0.3
15: 5 50000 187 0.88 C 10000 30000 300 800 0.2 0.8 0.3
And leaving off your which.min()
:
并且离开你的which.min():
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0.0 1.0 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0.0 1.0 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
4: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
5: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
If you use the data.table group-by functionality, you can pick the min(correspondingval)
for each id
(I am leaving off the unmatched data for the moment):
如果你使用data.table分组功能,你可以为每个id选择min(对应的值)(我暂时不使用不匹配的数据):
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3]
[,.SD[which.min(correspondingval)], by=id]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 A 0 50000 300 800 0 1 0.1
2: 2 20000 425 0.79 A 0 50000 300 800 0 1 0.1
3: 3 30000 324 0.81 A 0 50000 300 800 0 1 0.1
Or, the max(correspondingval)
if you prefer:
或者,如果您愿意,还可以使用max(对应值):
> workingdt[val1>=minval1 & val1<= maxval1 & val2>=minval2 &
val2<=maxval2 & val3>=minval3 & val3<=maxval3]
[,.SD[which.max(correspondingval)], by=id]
id val1 val2 val3 Name minval1 maxval1 minval2 maxval2 minval3 maxval3 correspondingval
1: 1 10000 387 0.48 C 10000 30000 300 800 0.2 0.8 0.3
2: 2 20000 425 0.79 C 10000 30000 300 800 0.2 0.8 0.3
3: 3 30000 324 0.81 A 0 50000 300 800 0.0 1.0 0.1
If all you want--as shown in your Desired Output--is the first row with the minimum correspondingval
and everything else with NA
there are easier ways to do this. If you want to know where each id
matches a range--as I have shown in my output--then a cleaner, more elegant solution is different.
如果您想要的只是 - 如您所需的输出中所示 - 是具有最小对应值的第一行以及具有NA的所有其他行,则有更简单的方法来执行此操作。如果你想知道每个id匹配一个范围 - 正如我在输出中所示 - 那么更清晰,更优雅的解决方案是不同的。
Let me know.
让我知道。
#3
0
There is a simple approach which uses the crossing function from tidyr, along with the very useful between function from dplyr. Of course, this works safely so long as at least one of the tables is relatively small, otherwise the crossing can be a memory hog.
有一种简单的方法,它使用来自tidyr的交叉函数,以及来自dplyr的函数之间非常有用的方法。当然,只要至少一个表格相对较小,这就可以安全地工作,否则交叉可能是记忆猪。
library(tidyr)
library(dplyr)
mydata.withmatches <- mydata %>%
crossing(mycomparison) %>%
filter(between(val1, minval1,maxval1) & between(val2, minval2, maxval2) & between(val3, minval3, maxval3)) %>%
full_join(mydata) %>%
arrange(id, Name)