Efficient distance calculation between N points and a reference in numpy/scipy

Date: 2022-02-06 19:42:21

I just started using scipy/numpy. I have a 100000*3 array, where each row is a coordinate, and a 1*3 center point. I want to calculate the distance from each row in the array to the center and store the results in another array. What is the most efficient way to do it?

5 Answers

#1


26  

I would take a look at scipy.spatial.distance.cdist:

http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

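import numpy as np
from scipy.spatial import distance  # a bare "import scipy" does not always load the spatial subpackage

a = np.random.normal(size=(10,3))
b = np.random.normal(size=(1,3))

dist = distance.cdist(a, b)  # default metric is Euclidean; pass metric=... to pick another one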

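For the default (Euclidean) metric, dist holds the same values as the expression below. Note that cdist returns them with shape (10, 1), while the expression returns shape (10,):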

np.sqrt(np.sum((a-b)**2,axis=1))  

although cdist is much more efficient for large arrays (on my machine, for a problem of your size, cdist is faster by a factor of ~35x).
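
For the sizes given in the question (100000 rows and one center), the approaches above can be checked against each other directly. The following is a minimal sketch rather than a benchmark; the names `points` and `center` and the fixed seed are assumptions made for illustration. Timings depend on the machine and library versions, so it is worth measuring before relying on any particular speedup.

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)           # fixed seed, illustrative only
points = rng.normal(size=(100000, 3))    # one coordinate per row
center = rng.normal(size=(1, 3))         # the reference point

# cdist returns a 2-D array of shape (100000, 1); ravel() flattens it
d_cdist = distance.cdist(points, center).ravel()

# Broadcasting subtracts the center from every row
d_broadcast = np.sqrt(np.sum((points - center) ** 2, axis=1))

# np.linalg.norm computes the same row-wise Euclidean norm
d_norm = np.linalg.norm(points - center, axis=1)

assert np.allclose(d_cdist, d_broadcast)
assert np.allclose(d_cdist, d_norm)
```

All three variants yield a 1-D array with one distance per row, which is the "another array" the question asks for.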

#2


5  

I would use the sklearn implementation of the Euclidean distance. Its advantage is that it uses a more efficient expression based on matrix multiplication:

dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))

A simple script would look like this:

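import numpy as np

x = np.random.rand(1000, 3)
y = np.random.rand(1000, 3)

xx = np.sum(x * x, axis=1)[:, np.newaxis]   # dot(x, x) for each row of x, as a column
yy = np.sum(y * y, axis=1)[np.newaxis, :]   # dot(y, y) for each row of y, as a row
sq = xx - 2 * np.dot(x, y.T) + yy           # squared distances, shape (1000, 1000)
dist = np.sqrt(np.maximum(sq, 0))           # clip tiny negative rounding errors before sqrt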

The advantage of this approach has been nicely described in the sklearn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html#sklearn.metrics.pairwise.euclidean_distances

I am using this approach to crunch large data matrices (10000, 10000), with some minor modifications such as using the np.einsum function.
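
The einsum modification mentioned above is not spelled out in the answer, so the following is only one plausible reading of it: np.einsum computes the row-wise squared norms, and a matrix product supplies the cross term. The array sizes and the seed are assumptions chosen to keep the check small.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, illustrative only
x = rng.random((200, 3))
y = rng.random((150, 3))

# Row-wise squared norms: multiply elementwise and sum over the coordinate axis j
xx = np.einsum('ij,ij->i', x, x)
yy = np.einsum('ij,ij->i', y, y)

# Squared distances via ||x||^2 - 2 x.y + ||y||^2; result has shape (200, 150)
sq = xx[:, None] - 2.0 * (x @ y.T) + yy[None, :]

# Rounding can leave tiny negative values, so clip before the square root
dist = np.sqrt(np.maximum(sq, 0.0))

# Reference: direct broadcast over all pairs (simpler, but uses more memory)
ref = np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1))
assert np.allclose(dist, ref)
```

The expanded form trades a little numerical accuracy for speed: it can lose precision when two points are very close together, which is why the clipping step is there.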

#3


1  

You can also use the expansion of the squared norm (similar to the binomial identity). This is probably the most efficient way to compute the distances for a matrix of points.

Here is a code snippet that I originally used for a k-Nearest-Neighbors implementation, in Octave, but you can easily adapt it to numpy since it only uses matrix multiplications (the equivalent is numpy.dot()):

% Compute the squared Euclidean distance between each known point (Xapp) and each unknown point (Xtest)
% Note: we expand the squared norm, just like the binomial identity:
% ||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2*<x1,x2>
[napp, d] = size(Xapp);
[ntest, d] = size(Xtest);

A = sum(Xapp.^2, 2);
A = repmat(A, 1, ntest);

B = sum(Xtest.^2, 2);
B = repmat(B', napp, 1);

C = Xapp*Xtest';

dist = A+B-2.*C;   % squared distances; use sqrt(dist) for the distances themselves
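
A possible numpy port of the Octave snippet above, as a sketch. `Xapp` and `Xtest` are filled with random placeholder data here, which is an assumption; broadcasting takes the place of repmat. Like the Octave version, the intermediate result holds squared distances, so a square root is applied at the end.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, illustrative only
Xapp = rng.random((6, 3))        # known points, one per row (placeholder data)
Xtest = rng.random((4, 3))       # unknown points, one per row (placeholder data)

A = np.sum(Xapp ** 2, axis=1)[:, np.newaxis]    # shape (napp, 1)
B = np.sum(Xtest ** 2, axis=1)[np.newaxis, :]   # shape (1, ntest)
C = np.dot(Xapp, Xtest.T)                       # shape (napp, ntest)

# Broadcasting expands A and B to (napp, ntest), replacing repmat
sq_dist = A + B - 2.0 * C

# Clip rounding noise below zero before taking the square root
dist = np.sqrt(np.maximum(sq_dist, 0.0))
```

For a nearest-neighbor search the square root can be skipped, since ordering by squared distance gives the same ranking.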

#4


0  

You may need to specify in more detail which distance function you are interested in, but here is a very simple (and efficient) implementation of the squared Euclidean distance based on the inner product (which can be generalized in a straightforward manner to other kinds of distance measures):

In []: from numpy import dot, ones
In []: from numpy.random import randn
In []: P, c = randn(5, 3), randn(1, 3)
In []: dot((P - c) ** 2, ones(3))
Out[]: array([  8.80512,   4.61693,   2.6002,   3.3293,  12.41800])

Here P holds your points and c is the center; take the square root of the result to get the Euclidean distance.
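
The session above relies on randn, dot, and ones already being available in the namespace. A self-contained sketch of the same computation follows, with explicit imports and a square root added to turn the squared distances into distances. The seed and the einsum variant are assumptions added for illustration, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed, illustrative only
P = rng.standard_normal((5, 3))   # points, one per row
c = rng.standard_normal((1, 3))   # center

# Dotting with a vector of ones sums the squared differences along each row
sq_dist = np.dot((P - c) ** 2, np.ones(3))

# The same row-wise sum written with einsum
diff = P - c
sq_dist_einsum = np.einsum('ij,ij->i', diff, diff)

dist = np.sqrt(sq_dist)   # Euclidean distance of each point to the center
```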

#5


0  

This might not answer your question directly, but if you are after the distances between all pairs of particles, I've found the following solution to be faster than the pdist function in some cases.

import numpy as np

L   = 100       # simulation box dimension
N   = 100       # number of particles
dim = 2         # dimensions

# Generate random positions of particles
r = (np.random.random(size=(N, dim)) - 0.5) * L

# uti is a tuple of two 1-D numpy arrays
# containing the row and column indices of the upper triangular matrix
uti = np.triu_indices(N, k=1)         # k=1 eliminates diagonal indices

# uti[0] holds the i indices and uti[1] the j indices of each pair (i < j)
dr = r[uti[0]] - r[uti[1]]            # computes differences between particle positions
D = np.sqrt(np.sum(dr*dr, axis=1))    # computes distances; D is a 1-D array of length N*(N-1)/2 = 4950

See my blog post for a more in-depth look at this.
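
As a sanity check, the upper-triangular approach can be compared with scipy's pdist, which returns the same pairs in the same order (row by row, with i < j). This is a minimal sketch; the seed is an assumption, and no timing claim is made here.

```python
import numpy as np
from scipy.spatial.distance import pdist

N, dim = 100, 2
rng = np.random.default_rng(0)           # fixed seed, illustrative only
r = (rng.random((N, dim)) - 0.5) * 100   # random positions in a box of side 100

i, j = np.triu_indices(N, k=1)           # all index pairs with i < j
dr = r[i] - r[j]                         # pairwise position differences
D = np.sqrt(np.sum(dr * dr, axis=1))     # one distance per pair

# pdist returns the condensed distance vector in the same pair order
assert D.shape == (N * (N - 1) // 2,)
assert np.allclose(D, pdist(r))
```

Keeping `i` and `j` around makes it easy to map an entry of `D` back to the pair of particles it belongs to.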
