Efficient algorithms for processing large network files to compute nearest nodes

Time: 2021-04-09 13:46:47

Problem: I have two network files (say NET1 and NET2). Each contains a set of nodes, and each node has a unique ID and geographic coordinates X and Y. Each node in NET2 must be connected to n nodes in NET1, chosen by minimum straight-line distance. The output has three fields: the ID of the NET1 node, the ID of the NET2 node, and the distance between them. All files are tab-delimited.
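
As a rough illustration of the file layout described above, here is a minimal Python sketch for reading and writing the tab-delimited data. The column order (ID, X, Y), the absence of a header row, the three-decimal distance format, and the helper names `read_nodes` and `write_links` are all assumptions made for the example, not details given in the question.

```python
import csv
import io

def read_nodes(handle):
    """Read tab-delimited rows of (id, x, y) into a list of tuples."""
    nodes = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        # Assumed column order: ID, X, Y (no header row)
        nodes.append((row[0], float(row[1]), float(row[2])))
    return nodes

def write_links(handle, links):
    """Write (net1_id, net2_id, distance) rows, tab-delimited."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for net1_id, net2_id, dist in links:
        writer.writerow([net1_id, net2_id, f"{dist:.3f}"])

# Tiny in-memory stand-in for a real file opened with open(path, newline="")
sample = io.StringIO("a\t0.0\t0.0\nb\t3.0\t4.0\n")
nodes = read_nodes(sample)
```

With real files, the same functions would take handles from `open(...)` instead of `io.StringIO`.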

One way forward: for each node in NET2, loop through every node in NET1 and compute all NET1-NET2 distance combinations. Sort the results by NET2 node ID and by distance, then write out the first four records for each NET2 node. The problem is that there are close to 2 million nodes in NET1 and 2,000 nodes in NET2. That is 4 billion distances to calculate and write in the first step of this algorithm, and the runtime is quite forbidding!
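
For reference, the brute-force approach described above might look like the sketch below. It is a variation rather than the exact procedure: instead of writing out and sorting every distance, it keeps only the n smallest per NET2 node using `heapq.nsmallest`. The function name `nearest_brute_force` and the `(id, x, y)` tuple layout are assumptions for illustration. It still computes every NET1-NET2 distance, so it mainly removes the storage and sorting cost, not the 4 billion distance calculations.

```python
import heapq
import math

def nearest_brute_force(net1, net2, n):
    """For each NET2 node, return its n closest NET1 nodes.

    net1, net2: lists of (id, x, y) tuples.
    Returns a list of (net1_id, net2_id, distance), closest first per NET2 node.
    """
    links = []
    for id2, x2, y2 in net2:
        # nsmallest scans NET1 once and keeps only n candidates in memory
        best = heapq.nsmallest(
            n,
            ((math.hypot(x1 - x2, y1 - y2), id1) for id1, x1, y1 in net1),
        )
        links.extend((id1, id2, dist) for dist, id1 in best)
    return links

net1 = [("a", 0.0, 0.0), ("b", 1.0, 0.0), ("c", 5.0, 5.0)]
net2 = [("q", 0.0, 0.0)]
links = nearest_brute_force(net1, net2, 2)
```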

Request: I am curious whether any of you have faced a similar issue. I would love to hear about any algorithms and data structures that could speed up the processing. I know the scope of this question is very broad, but I hope someone can point me in the right direction, as I have very limited experience optimizing code for data of this scale.

Languages: I am trying in C++, Python and R.

Please pitch in with ideas! Help greatly appreciated!

1 Solution

#1


A kd-tree is one option. It lets you find the nearest neighbor (or a set of nearest neighbors) in reasonable time: on average roughly O(log N) per query in low dimensions, instead of scanning all N nodes. Of course, you have to build the tree first, and that takes some time. In general a kd-tree is suitable if you don't have to add or remove nodes at runtime, which seems to be your case. It also performs better in lower dimensions (in your case the dimension is 2).
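
To make the idea concrete, here is a minimal pure-Python sketch of a 2D kd-tree with a k-nearest-neighbor query. The names `build_kdtree` and `k_nearest` and the `(id, x, y)` tuple layout are assumptions for illustration. The tree would be built once over NET1 and then queried once per NET2 node. For 2 million nodes a compiled library implementation would likely be much faster; this only shows the mechanism.

```python
import heapq
import math

def build_kdtree(points, depth=0):
    """points: list of (id, x, y). Returns (point, left, right) or None."""
    if not points:
        return None
    axis = depth % 2 + 1  # tuple index 1 is x, index 2 is y
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # median split keeps the tree balanced
    return (
        points[mid],
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )

def k_nearest(tree, x, y, k):
    """Return up to k (distance, id) pairs for the query (x, y), closest first."""
    heap = []  # max-heap via negated distances; holds the k best so far

    def visit(node, depth):
        if node is None:
            return
        point, left, right = node
        dist = math.hypot(point[1] - x, point[2] - y)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, point[0]))
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, point[0]))
        axis = depth % 2 + 1
        diff = (x, y)[axis - 1] - point[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        visit(near, depth + 1)
        # Cross the splitting line only if a closer point could lie beyond it
        if len(heap) < k or abs(diff) < -heap[0][0]:
            visit(far, depth + 1)

    visit(tree, 0)
    return sorted((-neg, node_id) for neg, node_id in heap)

net1 = [("a", 0, 0), ("b", 1, 0), ("c", 5, 5), ("d", 2, 2), ("e", -1, -1)]
tree = build_kdtree(net1)
nearest = k_nearest(tree, 0.1, 0.0, 2)  # two closest NET1 nodes to one NET2 node
```

The pruning test in `visit` is what saves work: whole subtrees are skipped when the splitting line is farther away than the current k-th best distance.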

Another possible data structure is an octree (a quadtree in 2D). It is a simpler data structure and quite easy to implement, but a kd-tree can be more efficient.
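
For comparison, a quadtree version could be sketched as follows. This is only one possible design: the class name `QuadTree`, the leaf `capacity`, and the best-first search in `k_nearest_quadtree` are assumptions for illustration, and the sketch assumes the root square covers every inserted point.

```python
import heapq
import math

class QuadTree:
    """Square region that splits into four children once a leaf overflows."""

    def __init__(self, cx, cy, half, capacity=8):
        self.cx, self.cy, self.half = cx, cy, half
        self.capacity = capacity
        self.points = []      # (id, x, y) tuples held by a leaf
        self.children = None  # four sub-squares after a split

    def insert(self, point):
        if self.children is not None:
            self._child_for(point).insert(point)
            return
        self.points.append(point)
        # The half-width guard stops endless splitting on duplicate coordinates
        if len(self.points) > self.capacity and self.half > 1e-9:
            h = self.half / 2
            self.children = [
                QuadTree(self.cx + dx * h, self.cy + dy * h, h, self.capacity)
                for dy in (-1, 1) for dx in (-1, 1)
            ]
            pending, self.points = self.points, []
            for p in pending:
                self._child_for(p).insert(p)

    def _child_for(self, point):
        # Index matches the comprehension above: dy is outer, dx is inner
        return self.children[2 * (point[2] >= self.cy) + (point[1] >= self.cx)]

    def min_dist(self, x, y):
        """Smallest possible distance from (x, y) to any point in this square."""
        dx = max(abs(x - self.cx) - self.half, 0.0)
        dy = max(abs(y - self.cy) - self.half, 0.0)
        return math.hypot(dx, dy)

def k_nearest_quadtree(root, x, y, k):
    """Best-first search: always expand the square closest to the query."""
    best = []  # max-heap of (-distance, id)
    frontier = [(root.min_dist(x, y), 0, root)]
    counter = 1  # tie-breaker so QuadTree objects are never compared
    while frontier:
        bound, _, node = heapq.heappop(frontier)
        if len(best) == k and bound >= -best[0][0]:
            break  # no remaining square can hold a closer point
        if node.children is not None:
            for child in node.children:
                heapq.heappush(frontier, (child.min_dist(x, y), counter, child))
                counter += 1
        else:
            for node_id, px, py in node.points:
                dist = math.hypot(px - x, py - y)
                if len(best) < k:
                    heapq.heappush(best, (-dist, node_id))
                elif dist < -best[0][0]:
                    heapq.heapreplace(best, (-dist, node_id))
    return sorted((-neg, node_id) for neg, node_id in best)
```

A typical use would insert every NET1 node into a root square sized to the data's bounding box, then call `k_nearest_quadtree` once per NET2 node.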
