Efficient way to compute the intersecting values between two numpy arrays

I have a bottleneck in my program which is caused by the following:

import numpy

A = numpy.array([10,4,6,7,1,5,3,4,24,1,1,9,10,10,18])
B = numpy.array([1,4,5,6,7,8,9])

# Keep every element of A that also occurs in B (duplicates included)
C = numpy.array([i for i in A if i in B])

The expected outcome for C is the following:

C = [4 6 7 1 5 4 1 1 9]

Is there a more efficient way of doing this operation?

Note that array A contains repeating values and they need to be taken into account. I wasn't able to use set intersection since taking the intersection will omit the repeating values, returning just [1,4,5,6,7,9].

Also note that this is only a simple demonstration. The actual array sizes can be on the order of thousands to well over millions.

3 Solutions

#1

You can use np.in1d:

>>> A[np.in1d(A, B)]
array([4, 6, 7, 1, 5, 4, 1, 1, 9])

np.in1d returns a boolean array indicating whether each value of A also appears in B. This array can then be used to index A and return the common values.
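
To make the masking step concrete, here is a minimal sketch that builds the mask and then indexes with it. It assumes a NumPy version that provides np.isin, which the NumPy documentation recommends over np.in1d for new code; the variable name mask is only illustrative.

```python
import numpy as np

A = np.array([10, 4, 6, 7, 1, 5, 3, 4, 24, 1, 1, 9, 10, 10, 18])
B = np.array([1, 4, 5, 6, 7, 8, 9])

# Boolean array with the same shape as A: True where the element also occurs in B.
mask = np.isin(A, B)

# Boolean indexing keeps the flagged elements in their original order,
# so repeated values in A are preserved.
C = A[mask]

print(mask)
print(C)
```

Because the mask is computed per element of A, duplicates and ordering survive, which is what distinguishes this from a set intersection.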

It's not relevant to your example, but it's also worth mentioning that if A and B each contain unique values then np.in1d can be sped up by setting assume_unique=True:

np.in1d(A, B, assume_unique=True)

You might also be interested in np.intersect1d which returns an array of the unique values common to both arrays (sorted by value):

>>> np.intersect1d(A, B)
array([1, 4, 5, 6, 7, 9])

#2

Use numpy.in1d:

>>> A[np.in1d(A, B)]
array([4, 6, 7, 1, 5, 4, 1, 1, 9])

#3

If you only test for membership in B (if i in B), then you can use a set for B. It doesn't matter how many fours there are in B as long as there is at least one. You are right that you can't use two sets and an intersection, but turning just B into a set should already improve performance, because a set lookup takes O(1) time on average instead of the O(n) scan through an array:

import numpy

A = numpy.array([10,4,6,7,1,5,3,4,24,1,1,9,10,10,18])
B = set([1,4,5,6,7,8,9])  # set membership tests are O(1) on average

C = numpy.array([i for i in A if i in B])
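
As a rough sketch of the same idea, the variant below iterates over A.tolist() so that the loop handles plain Python ints, which is typically faster than iterating over NumPy scalars. The names B_set, C_set and C_vec are illustrative, and the comparison against np.isin is included only to check that both approaches agree.

```python
import numpy as np

A = np.array([10, 4, 6, 7, 1, 5, 3, 4, 24, 1, 1, 9, 10, 10, 18])
B_set = {1, 4, 5, 6, 7, 8, 9}

# One pass over A with an average O(1) set lookup per element.
C_set = np.array([x for x in A.tolist() if x in B_set])

# Vectorized equivalent for comparison; np.isin needs a sequence, not a set.
C_vec = A[np.isin(A, sorted(B_set))]

print(C_set)
print(np.array_equal(C_set, C_vec))
```

For very large arrays the vectorized form is usually the better starting point, but that is worth measuring on your own data.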
