R: Matching coordinates from one (large) data frame to the grid cells of another (large) data frame

I have a large data frame (~200,000 rows) that contains X-Y coordinates, e.g.:

points <- data.frame(X = c(1,3,2,5,4), Y = c(4,3,2,2,1))

And another large data frame (~1,000,000 rows) that contains the corner cells of a spatial (rectangular) grid, e.g.:

MINX <- rep(0.5:5.5,6)
MINY <- rep(0.5:5.5,each=6)
grid <- data.frame(GridID = 1:36, MINX, MINY, MAXX = MINX+1, MAXY = MINY+1)

I would like to add a column to the "points" data frame that identifies the ID of the grid cell each point is located in.

I can think of several ways to do this, using loops, using combinations of apply and match, even pulling out some big spatial gun from sp or maptools. But all are prohibitively slow. I have a hunch there's some data.table() one liner that could pull this off in reasonable time. Do any gurus have an idea?

(For the record, this is how I get the grid cell IDs at the moment:

pt.minx <- apply(points,1, 
             function(foo) max(unique(grid)$MINX[unique(grid)$MINX < foo[1]]))
pt.miny <- apply(points,1, 
             function(foo) max(unique(grid)$MINY[unique(grid)$MINY < foo[2]]))
with(grid, GridID[match(pt.minx+1i*pt.miny, MINX + 1i*MINY)])

I can't tell from here whether it's slick or hideous - either way the apply function is way too slow for the complete data frame.)

2 Answers

#1

You just need two merges with rolling:

grid = data.table(grid, key = 'MINX')
points = data.table(points, key = 'X')

# first merge to find correct MAXX
intermediate = grid[points, roll = Inf][, list(MAXX, X = MINX, Y)]

# now merge by Y
setkey(intermediate, MAXX, Y)
setkey(grid, MAXX, MINY)
grid[intermediate, roll = Inf][, list(X, Y = MINY, GridID)]
#   X Y GridID
#1: 1 4     19
#2: 2 2      8
#3: 3 3     15
#4: 4 1      4
#5: 5 2     11
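
For readers on a newer data.table (1.9.8 or later, which added non-equi joins), the same lookup can probably be written as a single join. This is a sketch rather than the answerer's method: the names `pts_dt` and `grid_dt` are made up here, and it treats each cell as closed on its lower edge, which differs slightly from the strict inequalities in the question.

```r
library(data.table)

# Rebuild the question's example data under new (hypothetical) names
pts_dt <- data.table(X = c(1, 3, 2, 5, 4), Y = c(4, 3, 2, 2, 1))
MINX <- rep(0.5:5.5, 6)
MINY <- rep(0.5:5.5, each = 6)
grid_dt <- data.table(GridID = 1:36, MINX, MINY,
                      MAXX = MINX + 1, MAXY = MINY + 1)

# Non-equi join: for each point, keep the grid row whose bounds contain it.
# i.* refers to columns of pts_dt, x.* to columns of grid_dt.
res <- grid_dt[pts_dt,
               on = .(MINX <= X, MAXX > X, MINY <= Y, MAXY > Y),
               .(X = i.X, Y = i.Y, GridID = x.GridID)]
res
```

Whether this beats the two rolling joins on 200,000 points against 1,000,000 cells would need to be timed on the real data.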

#2

Doing it the SQL[df] way:

require(sqldf)
pts <- points  # this answer refers to the question's "points" data frame as pts
sqldf("select X, Y, GridID from grid, pts
       where MINX < X and X < MAXX and MINY < Y and Y < MAXY")

Expanding on @Roland's comment, you can use findInterval here:

MINX <- MINY <- 0.5:5.5
x <- findInterval(pts$X, MINX)
y <- findInterval(pts$Y, MINY)
grid$GridID[match(MINX[x]+1i*MINY[y], grid$MINX+1i*grid$MINY)]

Nice trick to coerce to complex for 2-dimensional matching, btw.
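
If the grid is regular and numbered row by row, as in the question's example, the `match` step can likely be dropped and the ID computed directly from the two `findInterval` indices. A minimal sketch under that assumption (`breaks_x`, `breaks_y`, `nx`, `ix` and `iy` are names introduced here; it would not hold for an irregular or arbitrarily numbered grid):

```r
# Example points from the question
pts <- data.frame(X = c(1, 3, 2, 5, 4), Y = c(4, 3, 2, 2, 1))

# Lower cell edges along each axis (assumed regular, 6 cells per axis)
breaks_x <- 0.5:5.5
breaks_y <- 0.5:5.5
nx <- length(breaks_x)

# Column and row index of the cell containing each point.
# findInterval returns 0 for values below the first break, so points
# outside the grid would need separate handling.
ix <- findInterval(pts$X, breaks_x)
iy <- findInterval(pts$Y, breaks_y)

# IDs run along X first, then Y, matching GridID = 1:36 in the question
pts$GridID <- (iy - 1) * nx + ix
pts$GridID
# 19 15  8 11  4
```

Both `findInterval` calls are vectorised binary searches, so this avoids any per-row `apply`.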
