I have a large data frame (~200,000 rows) that contains X-Y coordinates, e.g.:
points <- data.frame(X = c(1,3,2,5,4), Y = c(4,3,2,2,1))
And another large data frame (~1,000,000 rows) that contains the corner coordinates of the cells of a spatial (rectangular) grid, e.g.:
MINX <- rep(0.5:5.5,6)
MINY <- rep(0.5:5.5,each=6)
grid <- data.frame(GridID = 1:36, MINX, MINY, MAXX = MINX+1, MAXY = MINY+1)
I would like to add a column to the "points" data frame that identifies the ID of the grid cell each point is located in:
X Y GridID
1 4 19
3 3 15
2 2 8
5 2 11
4 1 4
I can think of several ways to do this: using loops, using combinations of apply and match, even pulling out some big spatial gun from sp or maptools. But all are prohibitively slow. I have a hunch there's some data.table() one-liner that could pull this off in reasonable time. Do any gurus have an idea?
(For the record, this is how I got the grid cell IDs above:
pt.minx <- apply(points,1,
function(foo) max(unique(grid)$MINX[unique(grid)$MINX < foo[1]]))
pt.miny <- apply(points,1,
function(foo) max(unique(grid)$MINY[unique(grid)$MINY < foo[2]]))
with(grid, GridID[match(pt.minx+1i*pt.miny, MINX + 1i*MINY)])
I can't tell from here whether it's slick or hideous - either way the apply function is way too slow for the complete data frame.)
2 Solutions
#1
You just need two merges with rolling:
library(data.table)

grid = data.table(grid, key = 'MINX')
points = data.table(points, key = 'X')
# first merge to find correct MAXX
intermediate = grid[points, roll = Inf][, list(MAXX, X = MINX, Y)]
# now merge by Y
setkey(intermediate, MAXX, Y)
setkey(grid, MAXX, MINY)
grid[intermediate, roll = Inf][, list(X, Y = MINY, GridID)]
# X Y GridID
#1: 1 4 19
#2: 2 2 8
#3: 3 3 15
#4: 4 1 4
#5: 5 2 11
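To see what roll = Inf is doing in each of those merges, here is a minimal one-dimensional sketch. The bins and vals tables and their column names are made up for illustration and are not part of the answer; the join columns are given with on= rather than keys, which assumes a reasonably recent data.table.

```r
library(data.table)

# Hypothetical 1-D example: lower edges of three bins
bins <- data.table(lower = c(0.5, 1.5, 2.5), bin = c("a", "b", "c"))
vals <- data.table(v = c(1, 2.9, 1.5))

# roll = Inf carries the last lower edge that is <= v forward to v,
# so each value picks up the bin that starts at or below it
res <- bins[vals, on = .(lower = v), roll = Inf]
print(res)
```

Each value is matched to the last lower edge that does not exceed it, which is the "which cell starts at or below this coordinate" lookup that the two merges perform, first along X and then along Y.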
#2
Doing it the SQL[df] way:
library(sqldf)
pts <- points  # assumption: pts is the question's "points" data frame
sqldf("select X, Y, GridID from grid, pts
       where MINX < X and X < MAXX and MINY < Y and Y < MAXY")
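For anyone who wants to see what that query amounts to without sqldf, here is a rough base R sketch of the same idea: pair every point with every cell, then keep the pairs that satisfy the bounds. The names all_pairs and hits are mine, and this is only meant for the toy data, since it materialises every point-cell pair.

```r
# Toy data from the question
pts <- data.frame(X = c(1, 3, 2, 5, 4), Y = c(4, 3, 2, 2, 1))
MINX <- rep(seq(0.5, 5.5, by = 1), times = 6)
MINY <- rep(seq(0.5, 5.5, by = 1), each = 6)
grid <- data.frame(GridID = 1:36, MINX, MINY, MAXX = MINX + 1, MAXY = MINY + 1)

# by = NULL makes merge() return the Cartesian product: 5 points x 36 cells
all_pairs <- merge(pts, grid, by = NULL)

# Keep only the pairs where the point lies strictly inside the cell
hits <- subset(all_pairs, MINX < X & X < MAXX & MINY < Y & Y < MAXY,
               select = c(X, Y, GridID))
hits <- hits[order(hits$X), ]
print(hits)
```

Note that with strict inequalities a point sitting exactly on a cell edge matches no cell at all.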
Expanding on @Roland's comment, you can use findInterval here:
MINX <- MINY <- 0.5:5.5
x <- findInterval(pts$X, MINX)
y <- findInterval(pts$Y, MINY)
grid$GridID[match(MINX[x]+1i*MINY[y], grid$MINX+1i*grid$MINY)]
Nice trick to coerce to complex for 2-dimensional matching, btw.
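If the grid is regular and numbered the way the example grid is (GridID increasing along X first, then along Y), the match step could probably be skipped and the ID computed directly from the two findInterval indices. A sketch under that assumption; edges, nx, col_idx and row_idx are names I made up:

```r
# Toy points from the question
pts <- data.frame(X = c(1, 3, 2, 5, 4), Y = c(4, 3, 2, 2, 1))

# Lower cell edges, shared by both axes in the example grid (assumed regular)
edges <- seq(0.5, 5.5, by = 1)
nx <- length(edges)                     # number of cells per row

col_idx <- findInterval(pts$X, edges)   # 1-based column of each point
row_idx <- findInterval(pts$Y, edges)   # 1-based row of each point

# Assumes GridID runs 1..nx along the first row, then continues on the next row
pts$GridID <- (row_idx - 1L) * nx + col_idx
print(pts)
```

findInterval treats each interval as closed on the left, so a point exactly on a lower edge is assigned to the cell that starts there. Points outside the grid extent are not detected by this arithmetic and would need a separate range check.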