What makes the distance measure in k-medoid better than k-means?

Time: 2023-01-15 10:37:08

I am reading about the difference between k-means clustering and k-medoid clustering.

Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the more familiar sum-of-squared-Euclidean-distances metric that k-means uses to evaluate variance. And apparently this different distance metric somehow reduces the influence of noise and outliers.

I have seen this claim, but I have yet to see any good reasoning about the mathematics behind it.

What makes the pairwise distance measure commonly used in k-medoid better? More exactly, how does the lack of a squared term allow k-medoids to have the desirable properties associated with the concept of taking a median?

3 Answers

#1 (score: 28)

1. K-medoid is more flexible

First of all, you can use k-medoids with any similarity measure. K-means, however, may fail to converge; it really must only be used with distances that are consistent with the mean. So, for example, absolute Pearson correlation must not be used with k-means, but it works well with k-medoids.
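
Below is a minimal sketch of this point (illustrative code with made-up data, not any particular library's API): a PAM-style assignment and medoid-update step driven entirely by a precomputed dissimilarity matrix built from absolute Pearson correlation. No mean of points is ever computed, which is why any dissimilarity measure works.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 50))        # 20 series with 50 observations each

    # Dissimilarity = 1 - |Pearson correlation|; only pairwise values are needed.
    D = 1.0 - np.abs(np.corrcoef(X))

    def medoid(members):
        """Return the member minimizing total dissimilarity to the others."""
        sub = D[np.ix_(members, members)]
        return members[np.argmin(sub.sum(axis=1))]

    # One assignment + update round with k = 2:
    medoids = [0, 1]
    labels = np.argmin(D[:, medoids], axis=1)      # assign to nearest medoid
    medoids = [medoid(np.where(labels == j)[0]) for j in range(2)]
    print(medoids)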

2. Robustness of medoid

Secondly, the medoid as used by k-medoids is roughly comparable to the median (in fact, there is also k-medians, which is like k-means but for Manhattan distance). If you look up the literature on the median, you will see plenty of explanations and examples of why the median is more robust to outliers than the arithmetic mean. Essentially, these explanations and examples also hold for the medoid. It is a more robust estimate of a representative point than the mean as used in k-means.

Consider this 1-dimensional example:

1 2 3 4 100000

Both the median and medoid of this set are 3. The mean is 20002.

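As a quick sanity check (a throwaway sketch, nothing library-specific), all three estimates can be computed directly:

    import numpy as np

    x = np.array([1, 2, 3, 4, 100000])
    print(x.mean())        # 20002.0
    print(np.median(x))    # 3.0
    # Medoid: the member with the smallest total absolute distance to the rest.
    print(x[np.argmin(np.abs(x[:, None] - x[None, :]).sum(axis=1))])  # 3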

Which do you think is more representative of the data set? The mean has the lower squared error, but assuming that there might be a measurement error in this data set ...

Technically, the notion of breakdown point is used in statistics. The median has a breakdown point of 50% (i.e. half of the data points can be incorrect, and the result is still unaffected), whereas the mean has a breakdown point of 0 (i.e. a single large observation can yield a bad estimate).

I do not have a proof, but I assume the medoid will have a breakdown point similar to the median's.
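
The breakdown-point claim is easy to illustrate numerically (a hedged sketch with made-up values, not a formal argument): one corrupted observation already ruins the mean, while the median survives even with almost half the points corrupted.

    import numpy as np

    clean = np.arange(1, 11, dtype=float)      # 1..10; mean 5.5, median 5.5

    one_bad = clean.copy()
    one_bad[-1] = 1e6                          # corrupt a single point
    print(one_bad.mean(), np.median(one_bad))  # mean 100004.5, median still 5.5

    four_bad = clean.copy()
    four_bad[-4:] = 1e6                        # corrupt 4 of 10 points (< 50%)
    print(np.median(four_bad))                 # still 5.5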

3. k-medoids is much more expensive

That's the main drawback. Usually, PAM takes much longer to run than k-means. Since it involves computing all pairwise distances, it is O(n^2*k*i); whereas k-means runs in O(n*k*i), where usually k times the number of iterations satisfies k*i << n.
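
A rough sketch of the two inner loops (illustrative only; real implementations such as PAM do more bookkeeping and smarter swap steps) makes the cost difference visible:

    import numpy as np

    def kmeans_iteration(X, centers):
        """One k-means round: n*k distance evaluations, O(n*k) per iteration."""
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        labels = d.argmin(axis=1)
        # Note: a cluster could be empty in a real run; ignored in this sketch.
        new_centers = np.stack([X[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        return new_centers, labels

    def kmedoids_iteration(D, medoids):
        """One k-medoids round on a precomputed n*n matrix; building that
        matrix alone costs O(n^2), dominating the k-means per-iteration cost."""
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = []
        for j in range(len(medoids)):
            members = np.where(labels == j)[0]
            sub = D[np.ix_(members, members)]        # O(|cluster|^2) lookups
            new_medoids.append(members[sub.sum(axis=1).argmin()])
        return new_medoids, labels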

#2 (score: 5)

I think this has to do with how the center of the cluster is selected. k-means will select the "center" of the cluster, while k-medoid will select the "most centered" member of the cluster. In a cluster with outliers (i.e. points far away from the other members of the cluster), k-means will place the center of the cluster towards the outliers, whereas k-medoid will select one of the more clustered members (the medoid) as the center.

It now depends on what you use clustering for. If you just want to classify a bunch of objects, then you don't really care where the center is; but if the clustering is used to train a decider that will then classify new objects based on those center points, k-medoid will give you a center closer to where a human would place it.

In Wikipedia's words:

"It [k-medoid] is more robust to noise and outliers as compared to k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances."

Here's an example:

Suppose you want to cluster in one dimension with k=2. One cluster has most of its members around 1000 and the other around -1000; but there is an outlier (or noise) at 100000. It obviously belongs to the cluster around 1000, but k-means will pull the center point away from 1000 and towards 100000. This may even cause some members of the 1000 cluster (say, a member with value 500) to be assigned to the -1000 cluster. k-medoid will select one of the members around 1000 as the medoid; it will probably select one that is bigger than 1000, but it will not select the outlier.
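
A small numeric sketch (made-up data; the medoid is found by brute force rather than with any particular library) shows the pull of the outlier on the mean:

    import numpy as np

    rng = np.random.default_rng(1)
    data = np.append(rng.normal(1000, 50, size=50), 100000.0)  # one outlier

    print(data.mean())   # roughly 2900: dragged far above the cluster
    # The medoid is an actual member, so it stays near 1000.
    print(data[np.argmin(np.abs(data[:, None] - data[None, :]).sum(axis=1))])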

#3 (score: 2)

Just a tiny note to add to @Eli's answer: k-medoid is more robust to noise and outliers than k-means, because the latter selects a cluster center that is mostly just a "virtual point", whereas the former chooses an actual object from the cluster.

Suppose you have five 2D points in one cluster, with coordinates (1,1), (1,2), (2,1), (2,2), and (100,100). If we don't consider object exchanges among the clusters, with k-means you will get the cluster center (21.2, 21.2), which is pulled far off by the point (100,100). However, k-medoid will choose the center from among (1,1), (1,2), (2,1), and (2,2), according to its algorithm.
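
Checking the arithmetic with a quick sketch (brute-force medoid over Euclidean distances, not tied to any library implementation):

    import numpy as np

    pts = np.array([(1, 1), (1, 2), (2, 1), (2, 2), (100, 100)], dtype=float)

    print(pts.mean(axis=0))                  # [21.2 21.2], dragged by (100, 100)
    # Medoid: the point with the smallest total Euclidean distance to the rest.
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    print(pts[dists.sum(axis=1).argmin()])   # one of the four tight points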

Here is a fun applet (E.M. Mirkes, K-means and K-medoids applet, University of Leicester, 2011) where you can randomly generate a dataset in the 2D plane and compare the k-medoid and k-means learning processes.
