Calculating the average weighted Euclidean distance between values with numpy

I searched around a bit and found comparable questions/answers, but none of them returned the correct results for me.

Situation: I have an array with a number of clumps of values == 1, while the rest of the cells are set to zero. Each cell is a square (width=height). Now I want to calculate the average distance between all 1 values. The formula should be like this: d = sqrt ( (( x2 - x1 )*size)**2 + (( y2 - y1 )*size)**2 )

Example:

import numpy as np
from scipy.spatial.distance import pdist

a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])

# Given that each cell is 10m wide/high
val = 10
d = pdist(a, lambda u, v: np.sqrt( ( ((u-v)*val)**2).sum() ) )
d
array([ 14.14213562,  10.        ,  10.        ])

After that I would calculate the average via d.mean(). However the result in d is obviously wrong as the distance between the cells in the top-row should be 20 already (two crossed cells * 10). Is there something wrong with my formula, math or approach?

1 Answer

#1

You need the actual coordinates of the non-zero markers, to compute the distance between them:

>>> import numpy as np
>>> from scipy.spatial.distance import squareform, pdist
>>> a = np.array([[1, 0, 1],
...               [0, 0, 0],
...               [0, 0, 1]])
>>> np.where(a)
(array([0, 0, 2]), array([0, 2, 2]))
>>> x,y = np.where(a)
>>> coords = np.vstack((x,y)).T
>>> coords
array([[0, 0],   # That's the coordinate of the "1" in the top left,
       [0, 2],   # top right,
       [2, 2]])  # and bottom right.

Next you want to calculate the distance between these points. You use pdist for this, like so:

>>> dists = pdist(coords) * 10  # Uses the Euclidean distance metric by default.
>>> squareform(dists)
array([[  0.        ,  20.        ,  28.28427125],
       [ 20.        ,   0.        ,  20.        ],
       [ 28.28427125,  20.        ,   0.        ]])

In this last matrix, you will find (above the diagonal) the distance between each marked point in a and every other marked point. In this case there are 3 coordinates, so it gives you the distance between node 0 (a[0,0]) and node 1 (a[0,2]), between node 0 and node 2 (a[2,2]), and finally between node 1 and node 2. In other words, if S = squareform(dists), then S[i,j] is the distance between the coordinates on row i and row j of coords.
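To make the relationship between the square matrix and the condensed output of pdist concrete, here is a small sketch. It rebuilds coords by hand, and the use of np.triu_indices to pick out the upper triangle is an illustration added here, not part of the original answer.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Coordinates of the three marked cells, as (row, col) pairs.
coords = np.array([[0, 0], [0, 2], [2, 2]])

dists = pdist(coords) * 10   # condensed form: pairs (0,1), (0,2), (1,2)
S = squareform(dists)        # full symmetric 3x3 matrix, zeros on the diagonal

# S[i, j] is the distance between row i and row j of coords.
d_02 = S[0, 2]

# The upper triangle of S, read row by row, is the condensed vector again.
upper = S[np.triu_indices(len(coords), k=1)]
print(d_02, np.allclose(upper, dists))
```

The matrix is symmetric, so S[2, 0] holds the same value as S[0, 2].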

The values in the upper triangle of that last matrix are exactly the values already held in the variable dists, so you can compute the mean directly from dists, without the relatively expensive call to squareform (shown here only for demonstration purposes):

>>> dists
array([ 20.        ,  28.2842712,  20.        ])
>>> dists.mean()
22.761423749153966
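The steps above can be gathered into one small function. This is a minimal sketch: the name mean_pairwise_distance and the cell_size parameter are made up for illustration, and np.argwhere is used as a shorthand for the np.where plus np.vstack combination.

```python
import numpy as np
from scipy.spatial.distance import pdist


def mean_pairwise_distance(grid, cell_size=1.0):
    """Mean Euclidean distance between all pairs of non-zero cells.

    Hypothetical helper for illustration; not part of NumPy or SciPy.
    """
    # One (row, col) pair per non-zero cell.
    coords = np.argwhere(grid)
    if len(coords) < 2:
        raise ValueError("need at least two non-zero cells")
    # pdist returns the condensed pairwise distances in cell units;
    # multiplying by cell_size converts them to real-world units.
    return (pdist(coords) * cell_size).mean()


a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])
result = mean_pairwise_distance(a, cell_size=10)
print(result)
```

For the 3x3 example this should agree with dists.mean() above. Note that pdist computes every pair, so the number of distances grows quadratically with the number of marked cells.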

Note that your computed solution "looks" nearly correct (aside from a factor of 2) only because of the example you chose. pdist takes the Euclidean distance between the first and the second point in n-dimensional space, then between the first and the third, and so on. In your example, each row of a is treated as one point: row 0 has coordinates [1,0,1] in 3-dimensional space, and the 2nd point (row 1) is [0,0,0]. The Euclidean distance between those two is sqrt(2) ~ 1.4. The distance between the 1st and the 3rd point (the last row of a, [0,0,1]) is only 1. Finally, the distance between the 2nd point (row 1: [0,0,0]) and the 3rd (row 2: [0,0,1]) is also 1. So remember, pdist interprets its first argument as a stack of coordinates in n-dimensional space, n being the number of elements in each row.
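The following sketch reproduces the numbers from the question to show this row-wise interpretation. It calls pdist with the default Euclidean metric instead of the lambda from the question, which should give the same values before scaling.

```python
import numpy as np
from scipy.spatial.distance import pdist

a = np.array([[1, 0, 1],
              [0, 0, 0],
              [0, 0, 1]])

# pdist treats each ROW of `a` as one point in 3-dimensional space,
# not each non-zero cell as a point in the 2-D grid.
row_dists = pdist(a)

# Pairs are ordered (row 0, row 1), (row 0, row 2), (row 1, row 2).
expected = np.array([np.sqrt(2), 1.0, 1.0])
print(np.allclose(row_dists, expected))

# Scaling by the cell size of 10 gives the values reported in the question.
print(row_dists * 10)
```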
