Let's say I have a 2D accumulator array in Java, int[][] array. The array could look like this:
[Two sample images: the x and z axes represent indexes in the array and the y axis represents values. Each image shows an int[56][56] with values from 0 to 4500.]
What I need to do is find the peaks in the array: there are 2 peaks in the first array and 8 in the second. The peaks are always 'obvious' (there's always a gap between them), but they don't have to look like the ones in these images; they can be more or less random. The images are not based on real data, they are just samples. The real array can be as large as 5000x5000, with peaks ranging from thousands to several hundred thousand. The algorithm has to be universal: I don't know how big the array or the peaks can be, and I don't know how many peaks there are. I do know a threshold, though: a peak can't be smaller than a given value.
The problem is that one peak can consist of several smaller peaks close together (first image), the height can be quite random, and the size can differ significantly within one array (by size I mean the number of array cells a peak occupies: one peak can consist of 6 cells and another of 90). It also has to be fast (all done in 1 iteration), because the array can be really big.
Any help is appreciated - I don't expect code from you, just the right idea :) Thanks!
Edit: You asked about the domain, but it's quite complicated and IMHO it can't help with the problem. It's actually an array of ArrayLists of 3D points, like ArrayList<Point3D>[][], and the value in question is the size of each ArrayList. Each peak contains points that belong to one cluster (a plane, in this case); the array is the result of an algorithm that segments a point cloud. I need to find the highest value in each peak so I can fit the points from the 'biggest' ArrayList to a plane, compute some parameters from it, and then properly cluster most of the points from the peak.
3 Answers
#1
7
He's not interested in estimating the global maximum using some sort of optimization heuristic - he just wants to find the maximum values within each of a number of separate clusters.
These peaks are always 'obvious' (there's always a gap between peaks)
Based on your images, I assume you mean there are always some 0-values separating clusters? If that's the case, you can use a simple flood-fill to identify the clusters. You can also keep track of each cluster's maximum while doing the flood-fill, so you identify the clusters and find their maxima in the same pass.
This is also as fast as you can get without relying on heuristics (which could return the wrong answer): the maximum of each cluster could be any value in the cluster, so you have to check every value at least once.
Note that this will iterate through every item in the array. This is also necessary, since (from the information you've given us) any single item in the array could be its own cluster (which would also make it a peak). With around 25 million items in the array, this should only take a few seconds on a modern computer.
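A minimal sketch of that flood-fill, assuming clusters are separated by 0-values and that cells touch only horizontally or vertically (4-connectivity; the question doesn't say which). The class name FloodFillPeaks, the method findPeaks and the parameter minPeak are made up for illustration; minPeak stands in for the "peaks can't be smaller than a given value" threshold from the question.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class FloodFillPeaks {
    /**
     * Returns one {row, col, value} triple per cluster whose maximum is at
     * least minPeak. A cluster is a 4-connected group of cells with value > 0.
     */
    static List<int[]> findPeaks(int[][] grid, int minPeak) {
        int rows = grid.length;
        int cols = rows == 0 ? 0 : grid[0].length;
        boolean[][] seen = new boolean[rows][cols];
        List<int[]> peaks = new ArrayList<>();
        // Explicit stack instead of recursion, so large clusters cannot
        // overflow the call stack. Cells are encoded as row * cols + col.
        Deque<Integer> stack = new ArrayDeque<>();
        int[] dr = {1, -1, 0, 0};
        int[] dc = {0, 0, 1, -1};

        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (grid[r][c] <= 0 || seen[r][c]) {
                    continue; // background, or already part of a cluster
                }
                // New cluster: fill it and track its maximum as we go.
                int bestR = r, bestC = c, best = grid[r][c];
                seen[r][c] = true;
                stack.push(r * cols + c);
                while (!stack.isEmpty()) {
                    int id = stack.pop();
                    int cr = id / cols, cc = id % cols;
                    if (grid[cr][cc] > best) {
                        best = grid[cr][cc];
                        bestR = cr;
                        bestC = cc;
                    }
                    for (int k = 0; k < 4; k++) {
                        int nr = cr + dr[k], nc = cc + dc[k];
                        if (nr >= 0 && nr < rows && nc >= 0 && nc < cols
                                && !seen[nr][nc] && grid[nr][nc] > 0) {
                            seen[nr][nc] = true; // mark on push: each cell is pushed once
                            stack.push(nr * cols + nc);
                        }
                    }
                }
                if (best >= minPeak) {
                    peaks.add(new int[] {bestR, bestC, best});
                }
            }
        }
        return peaks;
    }
}
```

Every cell is visited a constant number of times, so the whole thing is linear in the array size. If two cells in a cluster tie for the maximum, this sketch keeps the one it reaches first.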
#2
2
This might not be an optimal solution, but since the problem sounds somewhat fluid too, I'll write it down.
1. Construct a list of all the values (and coordinates) that are over your minimum threshold.
2. Sort it in descending order of height.
3. The first element will be the biggest peak; add it to the peak list.
4. Then walk down the list: if the current element is further than the minimum distance from all the existing peaks, add it to the peak list.
This is a sequential description, but all the steps (except 3) can be trivially parallelised. In step 4 you can also use a coverage map: a 2D array of booleans that shows which coordinates have been "covered" by a nearby peak.
(Caveat emptor: once you refine the criteria, this solution might become completely infeasible, but in general it works.)
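The steps above could be sketched like this. It's only an illustration under assumptions: the names SortedPeakPicker, pickPeaks, threshold and minDistance are made up, "over the threshold" is taken to mean value >= threshold, and distance is plain Euclidean distance between array indexes. The answer doesn't say how to pick the minimum distance, so that stays a parameter you have to choose.

```java
import java.util.ArrayList;
import java.util.List;

final class SortedPeakPicker {
    /**
     * Returns {row, col, value} triples, highest first, such that every
     * returned peak is further than minDistance from all peaks before it.
     */
    static List<int[]> pickPeaks(int[][] grid, int threshold, int minDistance) {
        // Step 1: collect every cell at or above the threshold.
        List<int[]> candidates = new ArrayList<>();
        for (int r = 0; r < grid.length; r++) {
            for (int c = 0; c < grid[r].length; c++) {
                if (grid[r][c] >= threshold) {
                    candidates.add(new int[] {r, c, grid[r][c]});
                }
            }
        }
        // Step 2: sort by value, descending (List.sort is stable, so ties
        // keep their row-major order).
        candidates.sort((a, b) -> Integer.compare(b[2], a[2]));

        // Steps 3 and 4: the first candidate is always accepted because the
        // peak list is still empty; later ones must be far from every peak.
        List<int[]> peaks = new ArrayList<>();
        long minDistSq = (long) minDistance * minDistance;
        for (int[] cand : candidates) {
            boolean farEnough = true;
            for (int[] p : peaks) {
                long dr = cand[0] - p[0];
                long dc = cand[1] - p[1];
                if (dr * dr + dc * dc <= minDistSq) {
                    farEnough = false; // within minDistance of an existing peak
                    break;
                }
            }
            if (farEnough) {
                peaks.add(cand);
            }
        }
        return peaks;
    }
}
```

The inner loop compares each candidate against every accepted peak, which is fine for a handful of peaks. With many peaks, the coverage map mentioned above would replace that loop with a single boolean lookup per candidate, at the cost of marking a neighbourhood each time a peak is accepted.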
#3
1
Simulated annealing or hill climbing are what immediately come to mind. These algorithms, though, will not guarantee that all peaks are found.
However, if your "peaks" are separated by gaps of 0-values, maybe a connected components analysis would help. You would label a region as "connected" if its cells are linked through values greater than 0 (or, if you have a certain threshold, through values over that threshold); your number of components would then be your number of peaks. You could then do another pass over the array to find the max of each component.
I should note that connected components can be done in linear time, and finding the peak values can also be done in linear time.
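As a rough sketch of the two passes, here is one way to do the labelling with a union-find structure; this is just one common technique for connected components, not necessarily what the answer had in mind. The names ComponentPeaks, componentMaxima and threshold are hypothetical, a cell counts as foreground when its value is greater than threshold (pass 0 for the "greater than 0" case), and cells are assumed to connect only horizontally and vertically.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ComponentPeaks {
    // Union-find lookup with path halving.
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    /**
     * Returns the maximum value of each connected component, keyed by the
     * component's representative cell index (row * cols + col).
     * The number of entries is the number of components.
     */
    static Map<Integer, Integer> componentMaxima(int[][] grid, int threshold) {
        int rows = grid.length;
        int cols = rows == 0 ? 0 : grid[0].length;
        int[] parent = new int[rows * cols];

        // Pass 1: join each foreground cell with its upper and left
        // foreground neighbours (both were initialised earlier in the scan).
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int id = r * cols + c;
                parent[id] = id;
                if (grid[r][c] <= threshold) {
                    continue;
                }
                if (r > 0 && grid[r - 1][c] > threshold) {
                    parent[find(parent, id)] = find(parent, id - cols);
                }
                if (c > 0 && grid[r][c - 1] > threshold) {
                    parent[find(parent, id)] = find(parent, id - 1);
                }
            }
        }

        // Pass 2: fold every foreground value into its component's maximum.
        Map<Integer, Integer> maxByRoot = new LinkedHashMap<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (grid[r][c] > threshold) {
                    int root = find(parent, r * cols + c);
                    maxByRoot.merge(root, grid[r][c], Math::max);
                }
            }
        }
        return maxByRoot;
    }
}
```

The size of the returned map is the number of peaks, and its values are the peak heights. For a 5000x5000 array the parent array alone holds 25 million ints (roughly 100 MB), so this trades memory for simplicity.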