匹配来自不同来源的数据框中的多个列

时间:2022-05-09 15:45:32

I have the results of games played between multiple players at different points in time. I have this information from two different sources who assign different unique ids to each player. I would like to find an eloquent way to match up the two data sources by player id. The two data sources

我有不同时间点的多个玩家之间玩游戏的结果。我从两个不同的来源获得这些信息,他们为每个玩家分配不同的唯一ID。我想找到一种雄辩的方式来匹配玩家ID的两个数据源。这两个数据源

sourcex <- structure(list(outcomedate = structure(c(12637, 12637, 12637, 
12637, 12637, 12637, 12637, 12637, 12637, 12637, 12637, 12638, 
12639, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 
12641, 12641, 12641, 12643, 12644, 12644, 12644, 12644, 12644, 
12644, 12644, 12644, 12644, 12644, 12645), class = "Date"), xid1 = c(206L, 
208L, 209L, 216L, 233L, 235L, 239L, 241L, 250L, 253L, 259L, 238L, 
236L, 211L, 221L, 234L, 249L, 254L, 255L, 257L, 258L, 207L, 230L, 
248L, 258L, 207L, 211L, 230L, 234L, 236L, 248L, 249L, 254L, 255L, 
257L, 221L), xid2 = c(211L, 207L, 221L, 249L, 248L, 257L, 234L, 
255L, 236L, 258L, 254L, 230L, 241L, 253L, 235L, 238L, 208L, 233L, 
239L, 259L, 206L, 209L, 250L, 216L, 259L, 216L, 241L, 208L, 235L, 
239L, 253L, 250L, 209L, 238L, 206L, 233L), outcome1 = c(2L, 1L, 
0L, 2L, 1L, 3L, 1L, 1L, 2L, 2L, 0L, 2L, 3L, 3L, 1L, 0L, 2L, 0L, 
0L, 0L, 2L, 1L, 2L, 1L, 0L, 3L, 2L, 0L, 0L, 0L, 2L, 2L, 2L, 1L, 
1L, 1L), outcome2 = c(0L, 0L, 0L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 
0L, 1L, 0L, 1L, 0L, 0L, 1L, 2L, 0L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 
2L, 0L, 1L, 1L, 2L, 1L, 0L, 1L, 1L, 3L)), .Names = c("outcomedate", 
"xid1", "xid2", "outcome1", "outcome2"), row.names = c(NA, 36L
), class = "data.frame")

sourcey <- structure(list(outcomedate = structure(c(12637, 12637, 12637, 
12637, 12637, 12637, 12637, 12637, 12637, 12637, 12637, 12638, 
12639, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 12640, 
12641, 12641, 12641, 12643, 12644, 12644, 12644, 12644, 12644, 
12644, 12644, 12644, 12644, 12644, 12645), class = "Date"), yid1 = c(56, 
46, 67, 68, 59, 63, 55, 50, 66, 61, 57, 58, 53, 60, 64, 48, 69, 
54, 51, 65, 62, 47, 49, 52, 64, 60, 47, 48, 69, 49, 54, 51, 65, 
53, 52, 62), yid2 = c(47, 51, 64, 48, 62, 69, 53, 54, 60, 49, 
65, 52, 50, 63, 57, 56, 61, 46, 58, 67, 66, 59, 68, 55, 63, 57, 
68, 55, 59, 67, 58, 66, 50, 46, 56, 61), outcome1 = structure(c(1L, 
1L, 2L, 2L, 3L, 3L, 2L, 1L, 4L, 1L, 2L, 2L, 4L, 3L, 2L, 2L, 3L, 
3L, 3L, 4L, 1L, 1L, 1L, 2L, 3L, 1L, 4L, 2L, 2L, 2L, 1L, 3L, 2L, 
3L, 3L, 1L), .Label = c("1", "2", "0", "3", "4", "5", "6"), class = "factor"), 
    outcome2 = structure(c(1L, 2L, 3L, 2L, 1L, 1L, 2L, 2L, 3L, 
    2L, 1L, 2L, 1L, 3L, 2L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 2L, 3L, 
    2L, 2L, 3L, 2L, 1L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("0", 
    "1", "2", "3", "4"), class = "factor")), .Names = c("outcomedate", 
"yid1", "yid2", "outcome1", "outcome2"), row.names = c(NA, 36L
), class = "data.frame")

Both sources have an outcomedate, outcome1, outcome2 in common. They assign different ids to the individual players in the game. I have done the following to find the match between ids.

两个来源都有结果,结果1,结果2的共同点。他们为游戏中的各个玩家分配不同的ID。我已经完成以下操作来查找ID之间的匹配。

sourcex$ID <- with(sourcex, paste0(outcomedate, outcome1, outcome2))
sourcey$ID <- with(sourcey, paste0(outcomedate, outcome1, outcome2))

uPlayersx <- with(sourcex, unique(c(xid1, xid2)))
uPlayersy <- with(sourcey, unique(c(yid1, yid2)))

comparex <- sapply(uPlayersx, function(x){
  paste0(with(sourcex, ID[xid1 == x| xid2 == x]), collapse = '~')
})

comparey <- sapply(uPlayersy, function(x){
  paste0(with(sourcey, ID[yid1 == x| yid2 == x]), collapse = '~')
})

dumMatch <- data.frame(xid = uPlayersx, yid = uPlayersy[match(comparex, comparey)])

It works ok here on this test dataset however the real application is larger and this feels like a hack. Also the real datasets may have errors in reporting etc so partial matches might be needed. Any help would be appreciated.

它在这个测试数据集上运行正常,但实际应用程序更大,这感觉就像一个黑客。此外,真实数据集可能在报告等方面存在错误,因此可能需要部分匹配。任何帮助,将不胜感激。

1 个解决方案

#1


1  

This will (at least) help filtering out days that match perfectly:

这将(至少)帮助过滤掉完美匹配的日子:

match.day <- function(d)
{
    tempx <- sourcex[sourcex$outcomedate==d,]
    tempy <- sourcey[sourcey$outcomedate==d,]

    if(nrow(tempx)!=nrow(tempy)) stop("matching failed: number of rows differ.")

    P <- outer(tempx$outcome1, tempy$outcome1, `==`) &
         outer(tempx$outcome2, tempy$outcome2, `==`)

    if(any(rowSums(P)!=1)) stop("maching failed: ambiguous or impossible assignment.")

    map <- P %*% seq_len(nrow(tempy))

    cbind(tempx[,c("xid1","xid2")], tempy[map,c("yid1","yid2")])
}

days <- unique(c(sourcex$outcomedate, sourcey$outcomedate))

do.call(rbind, lapply(days[-c(1,4,7)], match.day))

Note that it failed for days 1, 4 and 7 (see days[c(1,4,7)]).

请注意,它在第1,4和7天失败了(见[c(1,4,7)]天)。

Result for other days:

其他日子的结果:

   xid1 xid2 yid1 yid2
12  238  230   58   52
13  236  241   53   50
22  207  209   47   59
23  230  250   52   55
24  248  216   49   68
25  258  259   64   63
36  221  233   62   61

#1


1  

This will (at least) help filtering out days that match perfectly:

这将(至少)帮助过滤掉完美匹配的日子:

match.day <- function(d)
{
    tempx <- sourcex[sourcex$outcomedate==d,]
    tempy <- sourcey[sourcey$outcomedate==d,]

    if(nrow(tempx)!=nrow(tempy)) stop("matching failed: number of rows differ.")

    P <- outer(tempx$outcome1, tempy$outcome1, `==`) &
         outer(tempx$outcome2, tempy$outcome2, `==`)

    if(any(rowSums(P)!=1)) stop("maching failed: ambiguous or impossible assignment.")

    map <- P %*% seq_len(nrow(tempy))

    cbind(tempx[,c("xid1","xid2")], tempy[map,c("yid1","yid2")])
}

days <- unique(c(sourcex$outcomedate, sourcey$outcomedate))

do.call(rbind, lapply(days[-c(1,4,7)], match.day))

Note that it failed for days 1, 4 and 7 (see days[c(1,4,7)]).

请注意,它在第1,4和7天失败了(见[c(1,4,7)]天)。

Result for other days:

其他日子的结果:

   xid1 xid2 yid1 yid2
12  238  230   58   52
13  236  241   53   50
22  207  209   47   59
23  230  250   52   55
24  248  216   49   68
25  258  259   64   63
36  221  233   62   61