Using a distance matrix in scipy.cluster.hierarchy.linkage()?

Time: 2023-01-21 21:20:30

I have an n*n distance matrix M, where M_ij is the distance between object_i and object_j. As expected, it takes the following form:

   /  0     M_01    M_02    ...    M_0n \
   | M_10    0      M_12    ...    M_1n |
   | M_20   M_21     0      ...    M_2n |
   |                ...                 |
   \ M_n0   M_n1    M_n2    ...     0   /

Now I wish to cluster these n objects with hierarchical clustering. SciPy has an implementation of this called scipy.cluster.hierarchy.linkage(y, method='single', metric='euclidean').

Its documentation says:

y must be a {n \choose 2} sized vector where n is the number of original observations paired in the distance matrix.

y : ndarray

A condensed or redundant distance matrix. A condensed distance matrix is a flat array containing the upper triangular of the distance matrix. This is the form that pdist returns. Alternatively, a collection of m observation vectors in n dimensions may be passed as an m by n array.
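
To make the two forms in that description concrete, here is a minimal sketch. The array X is made-up data (not from the question); pdist and squareform are the SciPy functions the documentation refers to.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: m = 4 observations in n = 2 dimensions (an m-by-n array).
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0],
              [0.0, 1.0]])

# Condensed form: a flat array holding the upper triangle, row by row,
# with m * (m - 1) / 2 entries.
condensed = pdist(X)

# Redundant form: the full symmetric m-by-m matrix with a zero diagonal.
square = squareform(condensed)

print(condensed.shape)  # (6,)
print(square.shape)     # (4, 4)
```

The condensed array is what pdist returns; squareform converts between the two representations in either direction.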

I am confused by this description of y. Can I directly feed my M in as the input y?

Update

@hongbo-zhu-cn has raised this issue on GitHub. This is exactly what I am concerned about. However, as a newcomer to GitHub, I don't know how it works and therefore have no idea how this issue will be dealt with.

2 Answers

#1


29  

It seems that indeed we cannot directly pass the redundant square matrix in, although the documentation claims we can do so.
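
A small sketch of why this matters, under the assumption (true of the SciPy versions I am aware of, but worth checking against yours) that a 2-D input is treated as a set of observation vectors and run through pdist. The matrix M below is made-up example data.

```python
import warnings

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical distances between three points at positions 0, 1 and 5 on a line.
M = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 4.0],
              [5.0, 4.0, 0.0]])

with warnings.catch_warnings():
    # SciPy may warn that M looks like an uncondensed distance matrix.
    warnings.simplefilter("ignore")
    # Assumption: each row of M is interpreted as a 3-dimensional observation.
    Z_square = linkage(M, method="single")

# Condensing first makes linkage use the distances in M as given.
Z_condensed = linkage(squareform(M), method="single")

print(Z_square[:, 2])     # merge heights when M is passed directly
print(Z_condensed[:, 2])  # merge heights from the actual distances
```

The merge heights differ between the two calls, so passing the square matrix directly silently clusters something other than the intended distances.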

To benefit anyone who faces the same problem in the future, I am writing up my solution as an additional answer here, so that anyone who just wants to copy and paste can proceed with the clustering.

Use the following snippet to condense the matrix and happily proceed.

import scipy.spatial.distance as ssd
# convert the redundant n*n square matrix form into a condensed nC2 array
# distArray[{n choose 2} - {n-i choose 2} + (j-i-1)] is the distance between points i and j (for i < j)
distArray = ssd.squareform(distMatrix)
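
For completeness, here is a self-contained sketch of the whole workflow, including a check of the index formula from the comment. The 4x4 distMatrix, the threshold t=2 and the use of fcluster are illustrative assumptions, not part of the original answer.

```python
from math import comb

import numpy as np
import scipy.spatial.distance as ssd
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical distances between four points at positions 0, 1, 5 and 6 on a line.
distMatrix = np.array([[0.0, 1.0, 5.0, 6.0],
                       [1.0, 0.0, 4.0, 5.0],
                       [5.0, 4.0, 0.0, 1.0],
                       [6.0, 5.0, 1.0, 0.0]])
n = distMatrix.shape[0]

# Condense the square matrix, then cluster on the condensed array.
distArray = ssd.squareform(distMatrix)
Z = linkage(distArray, method="single")

# Cut the tree at distance 2 to obtain flat cluster labels.
labels = fcluster(Z, t=2, criterion="distance")

# Check the index formula for one pair (i < j).
i, j = 1, 3
k = comb(n, 2) - comb(n - i, 2) + (j - i - 1)
print(distArray[k], distMatrix[i, j])
print(labels)
```

By default squareform also validates its input (symmetry and a zero diagonal), which catches some malformed matrices before they reach linkage.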

Please correct me if I am wrong.

#2


5  

For now you should pass in the 'condensed distance matrix', i.e. just the upper triangle of the distance matrix in vector form:

import numpy as np
y = M[np.triu_indices(n, 1)]  # n = M.shape[0]; upper triangle without the diagonal, row by row
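
As a quick sanity check, the following sketch compares this indexing approach with scipy.spatial.distance.squareform on a made-up symmetric matrix; both should yield the same condensed vector, since both walk the upper triangle row by row.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Hypothetical symmetric distance matrix with a zero diagonal.
M = np.array([[0.0, 2.0, 7.0, 3.0],
              [2.0, 0.0, 4.0, 8.0],
              [7.0, 4.0, 0.0, 6.0],
              [3.0, 8.0, 6.0, 0.0]])
n = M.shape[0]

# Upper triangle above the diagonal (offset k=1), flattened in row-major order.
y = M[np.triu_indices(n, 1)]

print(y)
print(np.array_equal(y, squareform(M)))
```

One difference worth noting: the indexing version performs no validation, so it will not complain if M is asymmetric or has a non-zero diagonal.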

From the discussion of @hongbo-zhu-cn's pull request it looks as though the solution will be to add an extra keyword argument to the linkage function that will allow the user to explicitly specify that they are passing in an n x n distance matrix rather than an m x n observation matrix.
