How to find/match the value at the nearest coordinate in a second dataframe

Posted: 2021-12-12 21:32:24

I have a series of geographical positions at sea for which I am trying to get geological sediment type information. I am using an export of the national British geological sediment database (df1), which is a large data set of coordinates and sediment information. Currently I round the coordinates in the BGS export file (df1) and average/recalculate the sediment type for these coordinate squares; then I round my coordinates in (df2) and match them to these squares to get a sediment classification.

The BGS export looks like this (df1);

    NUM     X       Y           GRAV    SAND    MUD
1   228     1.93656 52.31307    1.07    98.83   0.10
2   142     1.84667 52.45333    0.00    52.60   47.40
3   182     1.91950 52.17750    9.48    90.38   0.14
4   124     1.88333 52.70833    0.00    98.80   1.20
5   2807    1.91050 51.45000    2.05    97.91   0.05
6   2787    1.74683 51.99382    41.32   52.08   6.60
7   2776    1.66117 51.63550    9.83    87.36   2.81
8   2763    1.82467 51.71767    43.92   47.25   8.83
9   2753    1.76867 51.96349    57.66   39.18   3.15
10  68      2.86967 52.96333    0.30    98.90   0.80
11  2912    1.70083 51.77783    26.90   64.87   8.22
12  2914    1.59750 51.88882    32.00   65.02   2.97
13  2886    1.98833 51.34267    1.05    98.91   0.04
14  2891    1.87817 51.31549    68.57   31.34   0.08
15  2898    1.37433 51.41249    35.93   61.48   2.59
16  45      2.06667 51.82500    9.70    88.10   2.20
17  2904    1.63617 51.45999    16.28   66.67   17.05

My positions at sea look like this (df2);

haul    DecStartLat DecStartLong
1993H_2 55.23983    -5.512830
2794H_1 55.26670    -5.516700
1993H_1 55.27183    -5.521330
0709A_71    55.26569    -5.519730
0396H_2 55.44120    -5.917800
0299H_2 55.44015    -5.917310
0514A_26    55.46897    -5.912167
0411A_64    55.47289    -5.911820
0410A_65    55.46869    -5.911930
0514A_24    55.63585    -5.783500
0295H_4 55.57250    -5.754300
0410A_62    55.63656    -6.041870
0413A_53    55.73280    -6.020600
0396H_13    55.66470    -6.002300
2794H_8 55.83330    -5.883300
0612A_15    55.84025    -5.912130
0410A_74    55.84311    -5.910180
0299H_16    55.90568    -5.732490
0200H_18    55.88600    -5.742900
0612A_18    55.90450    -5.835880

This is my script...

get.Sed.type <- function(x,y) {
  x$Y2 <- round(x$Y, digits=1)
  x$X2 <- round(x$X, digits=1)
  x$BGSQ <- paste(x$Y2,x$X2,sep="_")
  x$RATIO <- x$SAND/x$MUD
  x <- aggregate(cbind(GRAV,RATIO)~BGSQ,data=x,FUN=mean)

  FOLK <- (x$GRAV)
  FOLK[(FOLK)<1] <- 0
  FOLK[(FOLK)>=1&(FOLK)<5] <- 1
  FOLK[(FOLK)>=5&(FOLK)<30] <- 5
  FOLK[(FOLK)>=30&(FOLK)<80] <- 30
  FOLK[(FOLK)>=80] <- 80

  R_CLASS <- (x$RATIO)
  R_CLASS[(R_CLASS)<1/9] <- 0
  R_CLASS[(R_CLASS)>=1/9&(R_CLASS)<1] <- 0.1
  R_CLASS[(R_CLASS)>=1&(R_CLASS)<9] <- 1
  R_CLASS[(R_CLASS)>=9] <- 9

  x$FOLK_CLASS <- NULL
  x$FOLK_CLASS[(R_CLASS)==0&(FOLK)==0] <- "M"
  x$FOLK_CLASS[(R_CLASS)%in%c(0,0.1)&(FOLK)==5] <- "gM"
  x$FOLK_CLASS[(R_CLASS)==0.1&(FOLK)==0] <- "sM"
  x$FOLK_CLASS[(R_CLASS)==0&(FOLK)==1] <- "(g)M"
  x$FOLK_CLASS[(R_CLASS)==0.1&(FOLK)==1] <- "(g)sM"
  x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==0] <- "S"
  x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==0] <- "mS"
  x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==1] <- "(g)S"
  x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==1] <- "(g)mS"
  x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==5] <- "gmS"
  x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==5] <- "gS"
  x$FOLK_CLASS[(FOLK)==80] <- "G"
  x$FOLK_CLASS[(R_CLASS)%in%c(0,0.1)&(FOLK)==30] <- "mG"
  x$FOLK_CLASS[(R_CLASS)==1&(FOLK)==30] <- "msG"
  x$FOLK_CLASS[(R_CLASS)==9&(FOLK)==30] <- "sG"

  y$Lat <- round(y$DecStartLat, digits=1)
  y$Long <- round(y$DecStartLong, digits=1)
  y$LATLONG100_sq <- paste(y$Lat,y$Long,sep="_")

  y <- merge(y, x[,c(1,4)],all.x=TRUE,by.x="LATLONG100_sq",by.y="BGSQ")

  #Delete unwanted columns
  y <- y[, !(colnames(y) %in% c("Lat","Long","LATLONG100_sq"))]
  #Name column something logical
  colnames(y)[colnames(y) == 'FOLK_CLASS'] <- 'BGS_class'

  return(y)
}

However, I have a dozen or so positions in df2 for which there are no corresponding values in the BGS export (df1). I want to know how I can either ask it to average over the squares surrounding the respective square (i.e. round to a coarser grid and repeat the process), OR ask it to find the coordinate in the BGS export file that is closest in proximity and take that existing value.
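For reference, the first option (falling back to a coarser grid for the unmatched squares) might look like the sketch below. It assumes df1 already carries a `FOLK_CLASS` column computed as in the script above; the helper name, the `digits` argument, and the majority-vote aggregation are illustrative assumptions, not part of the original workflow.

```r
# Hypothetical sketch of option 1: a coarser-grid fallback for hauls whose
# 0.1-degree square had no BGS data. Assumes df1 already has FOLK_CLASS.
fill_by_coarser_square <- function(df1, df2, class, digits = 0) {
  # 'class' holds one sediment class per row of df2, NA where the
  # first-pass square lookup found nothing
  sq1 <- paste(round(df1$Y, digits), round(df1$X, digits), sep = "_")
  sq2 <- paste(round(df2$DecStartLat, digits),
               round(df2$DecStartLong, digits), sep = "_")
  # most frequent class per coarser square in the BGS data
  by_sq <- tapply(df1$FOLK_CLASS, sq1,
                  function(v) names(which.max(table(v))))
  miss <- is.na(class)
  class[miss] <- by_sq[sq2[miss]]  # stays NA if even the big square is empty
  class
}
```

The same idea can be repeated with progressively smaller `digits` until every haul is classified, though each step trades spatial precision for coverage.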

2 Answers

#1

Going for the second option stated in the question, I suggest framing the problem as follows:

Say that you have a set of m coordinates from db1 and n coordinates from db2, m <= n, and that currently the intersection of these sets is empty.

You'd like to match each point from db1 with a point from db2 such that the "error" of the matching (e.g. the sum of distances) is minimized.

A simple greedy approach for solving this might be to generate an m x n matrix with the distances between each pair of coordinates, and sequentially select the closest match for each point. Of course, if there are many points to match, or if you're after an optimal solution, you may want to consider more elaborate matching algorithms (e.g. the Hungarian algorithm).

Code:

  #generate some data (this data will produce sub-optimal matching under the greedy approach)
  db1 <- data.frame(id=c("a1","a2","a3","a4"), x=c(1,5,10,20), y=c(1,5,10,20))
  db2 <- data.frame(id=c("b1","b2","b3","b4"), x=c(1.1,2.1,8.1,14.1), y=c(1.1,1.1,8.1,14.1))

  #create the Cartesian product of the two point sets
  product <- merge(db1, db2, by=NULL)
  #calculate the Euclidean distance for each possible pairing
  product$d <- sqrt((product$x.x - product$x.y)^2 + (product$y.x - product$y.y)^2)

  #(naively & greedily) find the best match for each point
  sorted <- product[order(product[,"d"]), ]
  found <- character(0)
  res <- character(0) #this vector will hold the result
  for (i in 1:nrow(db1)) {
    for (j in 1:nrow(sorted)) {
      db2_val <- as.character(sorted[j,"id.y"])
      #exact membership test; grep() would also hit partial matches such as "b1" in "b11"
      if (sorted[j,"id.x"] == db1[i, "id"] && !(db2_val %in% found)) {
        #print(paste("matching", db1[i, "id"], "with", db2_val))
        res[i] <- db2_val
        found <- c(found, db2_val)
        break
      }
    }
  }

Note that I'm sure the code can be improved and made more elegant by using vectorized methods rather than explicit loops.
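As one illustration, the nested loop above can be collapsed to a single pass over a distance matrix built with `outer()`. This is only a sketch that reproduces the same greedy behaviour; the toy data frames are repeated here so the snippet stands alone.

```r
# Toy data as above
db1 <- data.frame(id = c("a1","a2","a3","a4"), x = c(1,5,10,20), y = c(1,5,10,20))
db2 <- data.frame(id = c("b1","b2","b3","b4"), x = c(1.1,2.1,8.1,14.1), y = c(1.1,1.1,8.1,14.1))

# Full Euclidean distance matrix in one shot: d[i, j] = dist(db1[i,], db2[j,])
d <- sqrt(outer(db1$x, db2$x, "-")^2 + outer(db1$y, db2$y, "-")^2)

# Greedy matching: each db1 point (in order) claims its closest
# still-unclaimed db2 point
res <- character(nrow(db1))
avail <- rep(TRUE, nrow(db2))
for (i in seq_len(nrow(db1))) {
  j <- which(avail)[which.min(d[i, avail])]
  res[i] <- as.character(db2$id[j])
  avail[j] <- FALSE
}
res
# [1] "b1" "b3" "b4" "b2"
```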

#2

Hopefully I do not misunderstand, but as far as I can tell from the title, you need to match based on minimum distance. If Euclidean distance is acceptable, you can use the fast RANN package; if not, you need to compute great circle distances.

Some of the provided data

BGS_df <- 
  read.table(text = 
               "    NUM     X       Y           GRAV    SAND    MUD
                1   228     1.93656 52.31307    1.07    98.83   0.10
                2   142     1.84667 52.45333    0.00    52.60   47.40
                3   182     1.91950 52.17750    9.48    90.38   0.14
                4   124     1.88333 52.70833    0.00    98.80   1.20
                5   2807    1.91050 51.45000    2.05    97.91   0.05",
             header = TRUE)

my_positions <-
  read.table(text = 
               "haul    DecStartLat DecStartLong
                1993H_2 55.23983    -5.512830
                2794H_1 55.26670    -5.516700
                1993H_1 55.27183    -5.521330",
             header = TRUE)

Euclidean distance (using RANN package)

library(RANN)
# For each point in my_positions, find the nearest neighbor from BGS_df:
# Give X and then Y (longitude and then latitude)
# Note that argument k sets the number of nearest neighbours, here 1 (the closest)
closest_RANN <- RANN::nn2(data = BGS_df[, c("X", "Y")], 
                          query = my_positions[, c("DecStartLong", "DecStartLat")], 
                          k = 1)
results_RANN <- cbind(my_positions[, c("haul", "DecStartLong", "DecStartLat")],
                      BGS_df[closest_RANN$nn.idx, ])
results_RANN
#        haul DecStartLong DecStartLat NUM       X        Y GRAV SAND MUD
# 4   1993H_2     -5.51283    55.23983 124 1.88333 52.70833    0 98.8 1.2
# 4.1 2794H_1     -5.51670    55.26670 124 1.88333 52.70833    0 98.8 1.2
# 4.2 1993H_1     -5.52133    55.27183 124 1.88333 52.70833    0 98.8 1.2

Great circle distance (using geosphere package)

library(geosphere)
# Compute matrix of great circle distances
dist_mat <- geosphere::distm(x = BGS_df[, c("X", "Y")],
                             y = my_positions[, c("DecStartLong", "DecStartLat")],
                             fun = distHaversine) # can try other distances
# For each column (point in my_positions) get the index of row of min dist
# (corresponds to row index in BGS_df) 
BGS_idx <- apply(dist_mat, 2, which.min)

results_geo <- cbind(my_positions[, c("haul", "DecStartLong", "DecStartLat")],
                     BGS_df[BGS_idx, ])
identical(results_geo, results_RANN) # here TRUE, but not always expected
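Whichever distance is used, the nearest-neighbour index can then patch only the hauls that the square-based lookup in the question left unclassified. A base-R sketch of that step (the `patch_missing` name is hypothetical, and it assumes the BGS data frame carries a `FOLK_CLASS` column as computed in the asker's script):

```r
# Sketch: fill only hauls whose square lookup returned NA in BGS_class, using
# the class of the nearest BGS point (plain Euclidean distance in degrees;
# substitute a great-circle distance for more accuracy at these latitudes).
patch_missing <- function(bgs, hauls) {
  miss <- which(is.na(hauls$BGS_class))
  for (i in miss) {
    d2 <- (bgs$X - hauls$DecStartLong[i])^2 +
          (bgs$Y - hauls$DecStartLat[i])^2
    hauls$BGS_class[i] <- bgs$FOLK_CLASS[which.min(d2)]
  }
  hauls
}
```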
