How do I compare two numpy arrays of different sizes and return the index column for the common elements?

Date: 2021-06-04 19:31:36

I have two numpy arrays of different sizes: one has an index column along with the x, y, z coordinates, and the other contains just the coordinates. (Please ignore the leading serial numbers; I can't figure out the formatting.) The second array has fewer coordinates, and I need the indexes (atomID) of those coordinates from the first array.

Array1 (with index column):

    serialNo. moleculeID atomID x y z
  1. 1 1 2 0 7.7590151 7.2925348 12.5933323
  2. 1 12 2 0 7.7590151 7.2925348 12.5933323
  3. 2 1 2 0 7.123642 6.1970949 11.5622416
  4. 2 12 2 0 7.123642 6.1970949 11.5622416
  5. 3 1 6 0 6.944543 7.0390449 12.0713224
  6. 3 1 6 0 6.944543 7.0390449 12.0713224
  7. 4 1 2 0 8.8900348 11.5477333 13.5633965
  8. 4 1 2 0 8.8900348 11.5477333 13.5633965
  9. 5 1 2 0 7.857268 12.8062735 13.4357052
  10. 5 12 0 7.857268 12.8062735 13.4357052
  11. 6 1 6 0 8.2124357 12.1004238 14.0486889
  12. 6 1 6 0 8.2124357 12.1004238 14.0486889

Array2 (just the coordinates):

x          y             z
  1. 7.7590151 7.2925348 12.5933323
  2. 7.7590151 7.2925348 12.5933323
  3. 7.123642 6.1970949 11.5622416
  4. 7.123642 6.1970949 11.5622416
  5. 6.944543 7.0390449 12.0713224
  6. 6.944543 7.0390449 12.0713224
  7. 8.8900348 11.5477333 13.5633965
  8. 8.8900348 11.5477333 13.5633965

The array with the index column (atomID) has the indexes 2, 2, 6, 2, 2 and 6. How can I get the indexes for the coordinates that are common to Array1 and Array2? I expect to get 2, 2, 6, 2 as a list and then concatenate it with the second array. Any easy ideas?

Update:

I tried the following code, but it doesn't work as expected.

import numpy as np

a = np.array([[4, 2.2, 5], [2, -6.3, 0], [3, 3.6, 8], [5, -9.8, 50]])

b = np.array([[2.2, 5], [-6.3, 0], [3.6, 8]])

print a
print b

for i in range(len(b)):
 for j in range(len(a)):
    if a[j,1]==b[i,0]:
        x = np.insert(b, 0, a[i,0], axis=1) #(input array, position to insert, value to insert, axis)
        #continue
    else:
        print 'not true'
print x 

which outputs the following:

not true
not true
not true
not true
not true
not true
not true
not true
not true
[[ 3.   2.2  5. ]
 [ 3.  -6.3  0. ]
 [ 3.   3.6  8. ]]

but I expected:

    [[ 4.   2.2  5. ]
     [ 2.  -6.3  0. ]
     [ 3.   3.6  8. ]]

4 Answers

#1


2  

The numpy_indexed package (disclaimer: I am its author) contains functionality to solve such problems in an elegant and efficient/vectorized manner:

import numpy_indexed as npi
print(a[npi.contains(b, a[:, 1:])])

The currently accepted answer strikes me as being incorrect for points which differ in their latter coordinates. And performance should be much improved here as well; not only is this solution vectorized, but worst case performance is NlogN, as opposed to the quadratic time complexity of the currently accepted answer.
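For readers who prefer not to add a dependency, the same idea (a sort-based, whole-row membership test) can be sketched in plain NumPy. This is not how numpy_indexed is implemented; the helper name `rows_in` is hypothetical, and the arrays `a` and `b` are the ones from the question.

```python
import numpy as np

def rows_in(candidates, reference):
    """Boolean mask: True where a row of `candidates` also occurs in `reference`.

    Hypothetical helper for illustration; rows are compared on all columns.
    """
    candidates = np.asarray(candidates, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Stack both sets and give identical rows the same integer label.
    # np.unique sorts the rows, so the cost is O(N log N), not quadratic.
    stacked = np.vstack([reference, candidates])
    _, labels = np.unique(stacked, axis=0, return_inverse=True)
    labels = labels.ravel()  # some NumPy versions return a 2-D inverse
    n_ref = len(reference)
    return np.isin(labels[n_ref:], labels[:n_ref])

a = np.array([[4, 2.2, 5], [2, -6.3, 0], [3, 3.6, 8], [5, -9.8, 50]])
b = np.array([[2.2, 5], [-6.3, 0], [3.6, 8]])

mask = rows_in(a[:, 1:], b)
print(a[mask])     # rows of a whose coordinates occur in b
print(a[mask, 0])  # the matching index column
```

Because every column takes part in the comparison, two points that share only their first coordinate are not treated as equal.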

#2


2  

Two concise vectorized ways to do it using cdist -

import numpy as np
from scipy.spatial.distance import cdist

out = a[np.any(cdist(a[:,1:],b)==0,axis=1)]

Or if you don't mind getting a bit voodoo-ish, here's np.einsum to replace np.any -

out = a[np.einsum('ij->i',cdist(a[:,1:],b)==0)]

Sample run -

In [15]: from scipy.spatial.distance import cdist

In [16]: a
Out[16]: 
array([[  4. ,   2.2,   5. ],
       [  2. ,  -6.3,   0. ],
       [  3. ,   3.6,   8. ],
       [  5. ,  -9.8,  50. ]])

In [17]: b
Out[17]: 
array([[ 2.2,  5. ],
       [-6.3,  0. ],
       [ 3.6,  8. ]])

In [18]: a[np.any(cdist(a[:,1:],b)==0,axis=1)]
Out[18]: 
array([[ 4. ,  2.2,  5. ],
       [ 2. , -6.3,  0. ],
       [ 3. ,  3.6,  8. ]])

In [19]: a[np.einsum('ij->i',cdist(a[:,1:],b)==0)]
Out[19]: 
array([[ 4. ,  2.2,  5. ],
       [ 2. , -6.3,  0. ],
       [ 3. ,  3.6,  8. ]])
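Testing a distance against exactly 0 can miss matches when the two arrays were read from different sources and differ by floating-point rounding. A minimal sketch of a tolerance-based variant follows. It builds the same pairwise Euclidean distance matrix that cdist would return, but with NumPy broadcasting so that only NumPy is required; the tolerance `tol` is an assumed value, not something from the question. It also returns the rows in the order of `b`, which suits prepending the index column to the second array.

```python
import numpy as np

a = np.array([[4, 2.2, 5], [2, -6.3, 0], [3, 3.6, 8], [5, -9.8, 50]])
b = np.array([[2.2, 5], [-6.3, 0], [3.6, 8]])

tol = 1e-8  # assumed tolerance; choose it to suit the precision of your data

# Pairwise Euclidean distances, shape (len(b), len(a)).
dist = np.linalg.norm(b[:, None, :] - a[None, :, 1:], axis=-1)

nearest = dist.argmin(axis=1)                       # closest row of a for each row of b
matched = dist[np.arange(len(b)), nearest] <= tol   # keep only near-exact matches

ids = a[nearest[matched], 0]                        # index column, in the order of b
out = np.column_stack([ids, b[matched]])
print(out)
```

Like the cdist approach, this needs memory proportional to len(a) * len(b), so it is best suited to moderately sized arrays.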

#3


1  

This is just a sketch for your question, assuming array1 has the atomID in column 0 followed by the coordinates and array2 holds only the coordinates:

import numpy as np

rows = []
for coords in array2:
    for element in array1:
        if np.array_equal(coords, element[1:]):  # compare the coordinates of the two elements
            rows.append(np.insert(coords, 0, element[0]))  # insert the atomID at the beginning of the coordinate array
            break
result = np.array(rows)

#4


0  

Using a list instead of array for the values of np.insert did the trick.

import numpy as np

a = np.array([[4, 2.2, 5], [2, -6.3, 0], [3, 3.6, 8], [5, -9.8, 50]])

b = np.array([[2.2, 5], [-6.3, 0], [3.6, 8]])

print(a)
print(b)
x = []

for i in range(len(b)):
    for j in range(len(a)):
        if a[j, 1] == b[i, 0]:  # match on the first coordinate only
            x.append(a[j, 0])   # collect the index value from a
print(np.insert(b, 0, x, axis=1))

which would output:

[[ 4.   2.2  5. ]
 [ 2.  -6.3  0. ]
 [ 3.   3.6  8. ]]
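Note that the loop above matches on the first coordinate only, and np.insert fails if the number of collected values differs from the number of rows in b. A minimal sketch of a variant that compares every coordinate and drops unmatched rows is shown below. It assumes exact floating-point equality is acceptable, and the names `lookup`, `keep` and `ids` are illustrative only.

```python
import numpy as np

a = np.array([[4, 2.2, 5], [2, -6.3, 0], [3, 3.6, 8], [5, -9.8, 50]])
b = np.array([[2.2, 5], [-6.3, 0], [3.6, 8]])

# Map each full coordinate tuple in a to its index value (first column).
# If the same coordinates appear more than once in a, the last one wins.
lookup = {tuple(row[1:]): row[0] for row in a}

# Keep only the rows of b that have a match, preserving the order of b.
keep = [i for i, row in enumerate(b) if tuple(row) in lookup]
ids = [lookup[tuple(b[i])] for i in keep]

result = np.column_stack([ids, b[keep]])
print(result)
```

The dictionary lookup replaces the inner loop, so the work grows roughly linearly with the number of rows rather than with len(a) * len(b).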
