Sparse logical comparison of numpy arrays

I have two numpy arrays, A and B, representing coordinates of points in a 2D plane. Let's say A is 10000-by-2, and B is 20000-by-2. Both have float64 dtype.

I want to find out which of the points in the first array, A, are also in the second, B. Doing this with a for loop would be very slow. I came up with the following broadcasting scheme to perform the comparison (ignoring the floating-point equality vs. closeness issue for the moment):

x_bool_array = A[:,0][numpy.newaxis,...] == B[:,0][...,numpy.newaxis]
y_bool_array = A[:,1][numpy.newaxis,...] == B[:,1][...,numpy.newaxis]
bool_array = numpy.logical_and(x_bool_array, y_bool_array)
indices = numpy.where(bool_array)

However, this produces very large 20000-by-10000 boolean arrays that are mostly sparse, i.e., the number of Trues is far smaller than the number of Falses.
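
For a sense of scale, here is a back-of-the-envelope estimate, assuming NumPy's usual one byte per boolean element. The variable names are only for illustration:

```python
import numpy as np

n_a, n_b = 10_000, 20_000  # number of points in A and B

# Each broadcast comparison yields an n_b-by-n_a array of np.bool_.
bytes_per_mask = n_a * n_b * np.dtype(np.bool_).itemsize

# x_bool_array, y_bool_array and bool_array are all alive at the same time.
total_bytes = 3 * bytes_per_mask

print(bytes_per_mask / 1e6, "MB per mask")
print(total_bytes / 1e6, "MB for all three masks")
```

So the dense approach holds roughly 200 MB per mask even though only a handful of entries are True.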

I'm wondering if there's a way to keep them sparse through some switch or property, or a better way to do this that is fast and doesn't consume a lot of memory. (Doing it piece-wise is probably another option, but I'm also looking for elegance, besides speed and low memory use.)

Edit: Responding to @Tai's comment for clarification, let's take a small example:

A = numpy.array([[0.1, 0.2], [0.34, 0.44], [0.5, 0.6]])
B = numpy.array([[0.05, 0.05], [0.1, 0.2], [0.7, 0.8], [0.5, 0.6]])

In other words, A is an array of 3 2D points (3-by-2), and B is one with 4 2D points (4-by-2).

We can see that B[1,:] is the same as A[0,:], and B[3,:] is the same as A[2,:]. So we have two matches. The final result, indices, would be as follows:

(array([1, 3]), array([0, 2]))

Edit 2: Previously I said piece-wise is an option. I tried it and it is not any better. Essentially I split one of the two arrays into 100 chunks, ran the logical comparison on each chunk against the full second array, and consolidated the results, in a for loop. Unfortunately, there is no way to let the interpreter know that it can use the previous memory (i.e., you cannot explicitly control the garbage collector, or at least it would not be very idiomatic python/numpy), and the allocator keeps allocating new memory for each new chunk.

2 solutions

#1

If you don't mind the extra dependency, pandas offers a workaround.

import pandas as pd
import numpy as np
A = np.array([[0.1, 0.2], [0.34, 0.44], [0.5, 0.6]])
B = np.array([[0.05, 0.05], [0.1, 0.2], [0.7, 0.8], [0.5, 0.6]])

dfA = pd.DataFrame(A, columns=["v1", "v2"]).reset_index()
dfB = pd.DataFrame(B, columns=["v1", "v2"]).reset_index()
common_vals = pd.merge(dfA, dfB, how='inner', on=['v1','v2'])

    index_x  v1   v2    index_y
 0  0        0.1  0.2   1
 1  2        0.5  0.6   3

Then select the two columns index_x and index_y by passing a list of the column names you need, here ["index_x", "index_y"], and convert the result to a numpy array:

# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
common_vals[["index_x", "index_y"]].to_numpy()

Out: array([[0, 1],
            [2, 3]])
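
If you want the result in the same (B indices, A indices) layout as the question's `indices`, the steps above could be wrapped roughly as follows. This is only a sketch: `matching_indices` is a made-up helper name, it assumes exactly two coordinate columns, and the matches come back ordered by A's row order rather than sorted by B index as `numpy.where` would give.

```python
import numpy as np
import pandas as pd

def matching_indices(A, B):
    """Hypothetical helper: return (rows of B, rows of A) whose points are exactly equal."""
    # reset_index() keeps each point's original row number as an "index" column.
    dfA = pd.DataFrame(A, columns=["v1", "v2"]).reset_index()
    dfB = pd.DataFrame(B, columns=["v1", "v2"]).reset_index()
    # Inner join on both coordinates; the clashing "index" columns
    # are suffixed to index_x (from A) and index_y (from B).
    common = pd.merge(dfA, dfB, how="inner", on=["v1", "v2"])
    return common["index_y"].to_numpy(), common["index_x"].to_numpy()

A = np.array([[0.1, 0.2], [0.34, 0.44], [0.5, 0.6]])
B = np.array([[0.05, 0.05], [0.1, 0.2], [0.7, 0.8], [0.5, 0.6]])
b_idx, a_idx = matching_indices(A, B)
print(b_idx, a_idx)
```

Note that the merge compares floats for exact equality, so it shares the equality-vs-closeness caveat mentioned in the question.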

#2

Fundamentally, this is a nearest-neighbor search in which you're looking for neighbors at a distance of zero. You can do this quite efficiently with the appropriate data structure; here a KD-tree is the best option.

Here's a quick example using the arrays you provided:

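import numpy as np
from scipy.spatial import cKDTree
# Build the tree on B, then find the single nearest neighbor in B for every point of A
dist, ind = cKDTree(B).query(A, 1)
# Keep only exact matches: (indices into B, indices into A)
results = (ind[dist == 0], np.where(dist == 0)[0])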

results
# (array([1, 3]), array([0, 2]))

This approach should scale quite well for very large arrays, because it avoids doing all N x M comparisons that a direct approach requires. For the size of the large arrays you propose, this finishes in less than 20 milliseconds:

A = np.random.randint(0, 1000, (10000, 2))
B = np.random.randint(0, 1000, (20000, 2))

%%timeit
dist, ind = cKDTree(B).query(A, 1)
results = ind[dist == 0], np.where(dist == 0)[0]
# 16.9 ms ± 530 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
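
The question set aside the floating-point equality vs. closeness issue. One possible way to handle it with the same tree is to compare the nearest-neighbor distance against a tolerance instead of zero. The sketch below is illustrative only: `match_points` and its `tol` parameter are names invented here, and because it asks for a single nearest neighbor, a point of A that coincides with several points of B is reported only once.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_points(A, B, tol=0.0):
    """Hypothetical helper: return (rows of B, rows of A) for points of A
    whose nearest neighbor in B lies within Euclidean distance tol."""
    dist, ind = cKDTree(B).query(A, k=1)  # one nearest neighbor per point of A
    hit = dist <= tol                     # tol=0.0 reproduces the exact-match test
    return ind[hit], np.nonzero(hit)[0]

A = np.array([[0.1, 0.2], [0.34, 0.44], [0.5, 0.6]])
B = np.array([[0.05, 0.05], [0.1, 0.2], [0.7, 0.8], [0.5, 0.6]])

exact = match_points(A, B)                     # exact matches only
close = match_points(A + 1e-12, B, tol=1e-9)   # slightly perturbed copy of A
print(exact, close)
```

With a small positive `tol`, points that differ only by rounding noise still count as matches, whereas the `dist == 0` test would miss them.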
