I am trying to implement the Canopy clustering algorithm along with K-means. From what I've read online, Canopy clustering is used to find initial starting points to feed into K-means. The problem is that Canopy clustering requires two threshold values, T1 and T2: points within the inner threshold are strongly tied to that canopy, and points within the wider threshold are more loosely tied to it. How are these thresholds, i.e. the distances from the canopy center, determined?
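For reference, the procedure described above can be sketched in a few lines. This is a minimal, illustrative version (the function name `canopy` and the random choice of centers are my assumptions; implementations vary in how they pick the next center):

```python
import random

def canopy(points, t1, t2, dist):
    """Canopy clustering sketch. t1 is the loose (outer) threshold and t2
    the tight (inner) one, so t1 > t2 is required."""
    assert t1 > t2, "the outer threshold must exceed the inner one"
    candidates = list(points)
    canopies = []
    while candidates:
        # pick a remaining point at random as the next canopy center
        center = candidates.pop(random.randrange(len(candidates)))
        # everything within the loose threshold joins this canopy
        members = [p for p in points if dist(p, center) < t1]
        canopies.append((center, members))
        # points inside the tight threshold may not seed further canopies
        candidates = [p for p in candidates if dist(p, center) >= t2]
    return canopies
```

For the 1-D number sets described below, `dist=lambda a, b: abs(a - b)` would do; the canopy centers can then seed K-means.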
Problem context:
The problem I'm trying to solve is this: I have sets of about 50 numbers, drawn from ranges such as [1,30] or [1,250]. There can be duplicate elements, and they can be floating-point numbers as well, such as 8, 17.5, 17.5, 23, 66, ... I want to find the optimal clusters, i.e. subsets, of each set of numbers.
So, if Canopy clustering with K-means is a good choice, my question still stands: how do you find the T1 and T2 values? If it is not a good choice, is there a better, simpler but still effective algorithm to use?
2 Answers
#1
2
Perhaps naively, I see the problem in terms of a sort of spectral estimation. Suppose I have 10 vectors. I can compute the distances between all pairs; in this case I'd get 45 such distances. Plot them as a histogram over various distance ranges, e.g. 10 distances between 0.1 and 0.2, 5 between 0.2 and 0.3, and so on, and you get an idea of how the distances between vectors are distributed. From this information you can choose T1 and T2 (e.g. choose them so that they cover the distance range that is the most populated).
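The histogram idea above might look like this in practice. This is just a sketch: the function names and the rule "take the edges of the most populated bin as T2/T1" are my assumptions, one plausible reading of the suggestion:

```python
import math
from itertools import combinations

def pairwise_distances(vectors):
    """All pairwise Euclidean distances: n vectors give n*(n-1)/2 values."""
    return [math.dist(a, b) for a, b in combinations(vectors, 2)]

def suggest_thresholds(vectors, bins=10):
    """Histogram the pairwise distances and read T1/T2 off the busiest bin."""
    dists = pairwise_distances(vectors)
    lo, hi = min(dists), max(dists)
    width = (hi - lo) / bins or 1.0   # guard against all-equal distances
    counts = [0] * bins
    for d in dists:
        counts[min(int((d - lo) / width), bins - 1)] += 1
    k = counts.index(max(counts))     # most populated distance range
    t2 = lo + k * width               # tight threshold: lower edge of that range
    t1 = lo + (k + 1) * width         # loose threshold: upper edge
    return t1, t2
```

With 10 vectors, `pairwise_distances` returns the 45 distances mentioned above; other heuristics for reading thresholds off the histogram are equally defensible.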
Of course, this is not practical for a large dataset, but you could just take a random sample or similar so that you at least know the ballpark of T1 and T2. Using something like Hadoop you could do some sort of prior spectral estimation on a large number of points. If all incoming data you are trying to cluster is distributed in much the same way, then you just need to get T1 and T2 once, then fix them as constants for all future runs.
#2
2
Actually, that is the big issue with Canopy clustering: choosing the thresholds is pretty much as difficult as the actual clustering, particularly in high dimensions. For a 2D geographic data set, a domain expert can probably define the distance thresholds easily. But in high-dimensional data, probably the best you can do is to run k-means on a sample of your data first, then choose the distances based on that sample run.
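One way to turn a sample k-means run into thresholds might look like the sketch below. The plain Lloyd's-iteration `kmeans` and the specific rules (T2 as the median point-to-centroid distance, T1 as twice the largest) are my own assumptions, not part of the answer, just one concrete reading of it:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on tuples of floats (a sketch, not tuned)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # recompute each centroid as the mean of its cluster (keep it if empty)
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

def thresholds_from_kmeans(points, k):
    """Derive canopy thresholds from a sample k-means run (a heuristic)."""
    centroids, clusters = kmeans(points, k)
    # distance of every point to its own cluster's centroid
    radii = [math.dist(p, centroids[i]) for i, cl in enumerate(clusters) for p in cl]
    t2 = sorted(radii)[len(radii) // 2]   # tight threshold: median radius
    t1 = 2.0 * max(radii)                 # loose threshold: beyond the widest cluster
    return t1, t2
```

The sample run would be done on a subset of the data; the thresholds then seed Canopy clustering over the full set.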