Return the indices of common elements between two numpy arrays

I have two arrays, a1 and a2. Assume len(a2) >> len(a1), and that a1 is a subset of a2.

I would like a quick way to return the a2 indices of all elements in a1. The time-intensive way to do this is obviously:

from operator import indexOf
indices = []
for i in a1:
    indices.append(indexOf(a2,i))

This of course takes a long time when a2 is large. I could use numpy.where() instead (although each entry in a1 appears just once in a2), but I'm not convinced it would be quicker. I could also traverse the large array just once:

for i in range(len(a2)):  # xrange in Python 2
    if a2[i] in a1:
        indices.append(i)

But I'm sure there is a faster, more 'numpy' way - I've looked through the numpy method list, but cannot find anything appropriate.

Many thanks in advance,

5 Solutions

#1

How about

numpy.nonzero(numpy.in1d(a2, a1))[0]

This should be fast. From my basic testing, it's about 7 times faster than your second code snippet for len(a2) == 100, len(a1) == 10000, and only one common element at index 45. This assumes that both a1 and a2 have no repeating elements.
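
As a small illustration of the expression above, here is a sketch with made-up arrays. It uses numpy.isin, which newer NumPy versions recommend over numpy.in1d (in1d is deprecated as of NumPy 2.0); the array contents are assumptions chosen for the example. Note that the indices come back in a2 order, not in the order the values appear in a1.

```python
import numpy as np

# Hypothetical example data: a1 is a subset of a2, no repeated values.
a2 = np.array([50, 10, 40, 20, 30])
a1 = np.array([20, 50, 40])

# Boolean mask over a2: True where the a2 element also occurs in a1.
mask = np.isin(a2, a1)

# Positions of the True entries, in ascending (a2) order.
indices = np.nonzero(mask)[0]

print(indices)      # [0 2 3]
print(a2[indices])  # [50 40 20]
```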

#2

How about:

wanted = set(a1)
indices = [idx for idx, value in enumerate(a2) if value in wanted]

This should be O(len(a1) + len(a2)) instead of O(len(a1) * len(a2)), because membership tests on a set take constant time on average.

NB: I don't know numpy, so there may be a more 'numpythonic' way to do it, but this is how I would do it in pure Python.
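
If the indices are needed in the order of a1 rather than a2, the same linear-time idea can be sketched with a dictionary instead of a set. This is an illustrative variant, not part of the answer above; the sample lists and the name position are made up, and it assumes every value in a2 is unique.

```python
# Hypothetical example data; plain lists behave the same way as arrays here.
a2 = [50, 10, 40, 20, 30]
a1 = [20, 50, 40]

# Approach from the answer: indices in a2 order, O(len(a1) + len(a2)).
wanted = set(a1)
indices_a2_order = [idx for idx, value in enumerate(a2) if value in wanted]

# Variant: record each a2 value's position once, then look up a1 in order.
position = {value: idx for idx, value in enumerate(a2)}
indices_a1_order = [position[value] for value in a1]

print(indices_a2_order)  # [0, 2, 3]
print(indices_a1_order)  # [3, 0, 2]
```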

#3

from numpy import in1d
index = in1d(a2, a1)  # boolean mask over a2
result = a2[index]    # the matching values (not their positions)

#4

Very similar to @AlokSinghal's answer, but you get an already flattened result.

numpy.flatnonzero(numpy.in1d(a2, a1))

#5

The numpy_indexed package (disclaimer: I am its author) contains a vectorized equivalent of list.index; performance should be similar to the currently accepted answer, but as a bonus, it gives you explicit control over missing values as well, using the 'missing' kwarg.

import numpy_indexed as npi
indices = npi.indices(a2, a1, missing='raise')

It also works on multi-dimensional arrays, i.e., finding the indices of one set of rows in another.
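
For readers without numpy_indexed installed, a comparable lookup can be sketched in plain NumPy with argsort and searchsorted. This is not how numpy_indexed is implemented, only a hedged illustration; the helper name indices_of and the sample arrays are hypothetical, and it assumes a2 is one-dimensional, non-empty, and has unique values.

```python
import numpy as np

def indices_of(haystack, needles):
    """Return, for each needle, its index in haystack (hypothetical helper)."""
    haystack = np.asarray(haystack)
    needles = np.asarray(needles)
    order = np.argsort(haystack)                  # permutation that sorts haystack
    pos = np.searchsorted(haystack, needles, sorter=order)
    pos = np.clip(pos, 0, len(haystack) - 1)      # keep positions in range
    found = order[pos]                            # map back to original indices
    if not np.array_equal(haystack[found], needles):
        raise KeyError("some values are missing from haystack")
    return found

a2 = np.array([50, 10, 40, 20, 30])
a1 = np.array([20, 50, 40])
print(indices_of(a2, a1))  # [3 0 2]
```

Unlike the mask-based answers, this returns the indices in the order of a1 and raises if a value is absent, roughly mirroring missing='raise'.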
