Speed up an algorithm to find the closest items between two 2D point data sets

Time: 2021-05-05 11:29:34

Hi, I'm looking to improve this algorithm, which is really slow. It should just return a single point: one of the pair of closest points between the two data sets.


The method I've used is just brute force, testing the distance between every pair of points. There must be a better way.


#include <opencv2/core.hpp>
#include <limits>
#include <vector>

// Brute force: compares every point of point_a against every point of point_b,
// i.e. O(n * m) distance checks. CvFunctions::DistanceSquared is a helper
// defined elsewhere in the project, presumably returning the squared distance.
cv::Point FindClosesedValue(const std::vector<cv::Point> & point_a, const std::vector<cv::Point> & point_b)
{
  double lowest_distance = std::numeric_limits<double>::max();
  cv::Point best_point;
  for (const cv::Point & a : point_a)
  {
    for (const cv::Point & b : point_b)
    {
      double distance = CvFunctions::DistanceSquared(b, a);
      if (distance < lowest_distance)
      {
        lowest_distance = distance;
        best_point = a;
      }
    }
  }

  return best_point;
}

Could someone please point me in the right direction to speed up this code, hopefully by orders of magnitude? Example code would be amazing.


2 Solutions

#1



I've worked on a similar problem where I managed to get a 100x speedup, but it was dependent on the data.


If you can pre-sort one set of points into a grid of tiles, you can make use of the tile size to narrow down the points you need to test. A given point will have a minimum and maximum distance to any point in a specific tile. You can use these minimum and maximum distances to bound which tiles you check, thus avoiding checking against points in far away tiles.


Once the points are divided into tiles, you can look up which tile a new point would fall into and start there. Depending on your data, that tile might contain no pre-sorted points, so initially you just check the starting tile and its surrounding ones until you find any point. That point gives you a first estimate of the minimum distance. Once you know the minimum distance between your chosen point and the point that was found, you keep checking tiles until the minimum possible distance between your chosen point and any point in a given tile exceeds the minimum found so far; points in tiles farther away cannot be closer than a point you have already found. The minimum distance is of course updated whenever you find a closer point.

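As an illustration only, here is a minimal C++ sketch of that scheme, assuming squared Euclidean distances and a uniform grid over one set's bounding box. The names TileGrid, BuildGrid and FindClosestViaGrid are mine, not from any library, and like the brute-force version the code assumes both point sets are non-empty.

#include <opencv2/core.hpp>
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <vector>

// Pre-sorted grid of tiles over the bounding box of one point set.
struct TileGrid
{
  double tile_size = 1.0;
  cv::Point2d origin;                        // minimum corner of the bounding box
  int cols = 1, rows = 1;
  std::vector<std::vector<cv::Point>> tiles; // row-major buckets of points

  std::vector<cv::Point> & Tile(int cx, int cy) { return tiles[cy * cols + cx]; }
  int CellX(const cv::Point & p) const
  { return std::clamp(int((p.x - origin.x) / tile_size), 0, cols - 1); }
  int CellY(const cv::Point & p) const
  { return std::clamp(int((p.y - origin.y) / tile_size), 0, rows - 1); }
};

// O(n) pre-sort: drop every point of one set into its tile.
TileGrid BuildGrid(const std::vector<cv::Point> & pts, double tile_size)
{
  TileGrid g;
  g.tile_size = tile_size;
  double min_x = pts[0].x, max_x = pts[0].x, min_y = pts[0].y, max_y = pts[0].y;
  for (const cv::Point & p : pts)
  {
    min_x = std::min(min_x, double(p.x)); max_x = std::max(max_x, double(p.x));
    min_y = std::min(min_y, double(p.y)); max_y = std::max(max_y, double(p.y));
  }
  g.origin = cv::Point2d(min_x, min_y);
  g.cols = int((max_x - min_x) / tile_size) + 1;
  g.rows = int((max_y - min_y) / tile_size) + 1;
  g.tiles.resize(std::size_t(g.cols) * g.rows);
  for (const cv::Point & p : pts)
    g.Tile(g.CellX(p), g.CellY(p)).push_back(p);
  return g;
}

// For each point of the other set, search the grid ring by ring outwards,
// stopping once farther rings cannot contain anything closer than the best
// pair found so far. Returns a point of point_a, like the original function.
cv::Point FindClosestViaGrid(const std::vector<cv::Point> & point_a, TileGrid & grid)
{
  double lowest = std::numeric_limits<double>::max();
  cv::Point best_point;
  for (const cv::Point & a : point_a)
  {
    int cx = grid.CellX(a), cy = grid.CellY(a);
    for (int r = 0; r <= std::max(grid.cols, grid.rows); ++r)
    {
      // Any point in a tile at ring r is at least (r - 1) * tile_size away,
      // so once that lower bound exceeds the current best we can stop.
      if (r >= 2 && lowest < std::numeric_limits<double>::max())
      {
        double ring_min = (r - 1) * grid.tile_size;
        if (ring_min * ring_min > lowest) break;
      }
      for (int y = cy - r; y <= cy + r; ++y)
      {
        for (int x = cx - r; x <= cx + r; ++x)
        {
          if (x < 0 || y < 0 || x >= grid.cols || y >= grid.rows) continue;
          if (std::max(std::abs(x - cx), std::abs(y - cy)) != r) continue; // ring cells only
          for (const cv::Point & b : grid.Tile(x, y))
          {
            double dx = a.x - b.x, dy = a.y - b.y;
            double d2 = dx * dx + dy * dy;
            if (d2 < lowest) { lowest = d2; best_point = a; }
          }
        }
      }
    }
  }
  return best_point;
}

Usage would be a single BuildGrid(point_b, tile_size) call followed by FindClosestViaGrid(point_a, grid); the grid can be reused as long as point_b does not change.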

The sorting step is O(n), and the lookup step is bounded between n and n^2, so the expected time should be at most O(n^2), and likely much better, possibly close to linear if you have a suitable distribution of points.


As for tile size, I found that choosing the tiles so that the number of points in each tile was roughly equal to the number of tiles covering the data set yielded optimal runtimes. You could probably do better with a hierarchy of tiles, but I never got that complicated with my solution.

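Spelling that rule of thumb out (this helper is hypothetical, my own reading of the heuristic, not code from the answer): with n pre-sorted points, making points-per-tile roughly equal to the tile count puts both at about sqrt(n), and the tile side length then follows from the bounding-box area.

#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper: choose a tile size so that the expected number of
// points per tile is roughly the total number of tiles (both about sqrt(n)).
double PickTileSize(double width, double height, std::size_t n)
{
  double tiles = std::max(std::sqrt(double(n)), 1.0); // aim for ~sqrt(n) tiles in total
  return std::sqrt(width * height / tiles);           // area per tile -> tile side length
}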

#2



This is a nice problem. The usual recursive closest pair of points algorithm clearly isn't going to work, as the two point sets may be clustered in different areas of the space.


You can still solve this in O(n log n) time, though. Simply create a kd-tree (k=2) of all the points in one set, and query it with all the points from the other set.

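One possible way to do this with the tools already used in the question is OpenCV's FLANN wrapper. The function below is my own sketch (FindClosestViaKdTree is not an existing API); note that cv::flann::KDTreeIndexParams builds randomized kd-trees, so the search is approximate unless you raise the tree count and the SearchParams checks. An exact 2-d kd-tree library such as nanoflann could be used in the same way.

#include <opencv2/core.hpp>
#include <opencv2/flann.hpp>
#include <limits>
#include <vector>

// Build a kd-tree over one set and query it with every point of the other set.
cv::Point FindClosestViaKdTree(const std::vector<cv::Point> & point_a,
                               const std::vector<cv::Point> & point_b)
{
  // FLANN wants float matrices: one row per point, two columns (x, y).
  cv::Mat features, queries;
  cv::Mat(point_b).reshape(1).convertTo(features, CV_32F);
  cv::Mat(point_a).reshape(1).convertTo(queries, CV_32F);

  // Randomized kd-tree index over point_b (approximate nearest neighbours).
  cv::flann::Index index(features, cv::flann::KDTreeIndexParams(4));

  // Nearest neighbour in point_b for every point of point_a.
  cv::Mat indices, dists; // dists holds squared L2 distances
  index.knnSearch(queries, indices, dists, 1, cv::flann::SearchParams(64));

  // Keep the point_a member of the overall closest pair, as in the question.
  float lowest = std::numeric_limits<float>::max();
  cv::Point best_point;
  for (int i = 0; i < dists.rows; ++i)
  {
    if (dists.at<float>(i, 0) < lowest)
    {
      lowest = dists.at<float>(i, 0);
      best_point = point_a[i];
    }
  }
  return best_point;
}

Building the index is the O(n log n) part; each of the m queries is then roughly O(log n), which matches the overall bound the answer gives.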
