Fast way to compute this numpy query

Time: 2021-08-11 21:21:57

I have a boolean numpy array mask of length n. I also have a numpy array a of length <= n, containing numbers ranging from 0 (inclusive) to n-1 (inclusive), and it contains no duplicates. The query I want to compute is np.array([x for x in a if mask[x]]), but I don't think it's the fastest way to do it.

Is there a faster way of doing this in numpy than the way I just wrote?

1 solution

#1

It looks like the fastest way to do this is simply a[mask[a]]. I wrote a quick test which shows the difference in speed of the two methods depending on the coverage of the mask, p (the number of true items / n).
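
To see why the one-liner returns the same elements as the list comprehension, here is a small worked example ahead of the benchmark. The array values are made up for illustration and the names keep, fast and slow are not from the original post.

```python
import numpy as np

# Hypothetical small inputs, chosen only for illustration.
mask = np.array([True, False, True, True, False])
a = np.array([3, 0, 4, 1])  # distinct indices into mask

# Integer-array indexing: look up the mask value at every index stored in a.
keep = mask[a]              # True, True, False, False

# Boolean indexing: keep the entries of a whose mask value is True.
fast = a[keep]

# The original list comprehension, for comparison.
slow = np.array([x for x in a if mask[x]])

print(fast.tolist())  # [3, 0]
print(slow.tolist())  # [3, 0]
```

Both forms preserve the order of a; the indexed version just does the lookup and the filtering inside numpy instead of in a Python loop.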

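import timeit
import matplotlib.pyplot as plt
import numpy as np
n = 10000      # length of mask and of a
n_test = 100   # repetitions per timing
slow_times = []
fast_times = []
p_space = np.linspace(0, 1, 100)
for p in p_space:
    # mask in which each entry is True with probability p
    mask = np.random.choice([True, False], n, p=[p, 1 - p])
    # a is a random permutation of 0..n-1, so it contains no duplicates
    a = np.arange(n)
    np.random.shuffle(a)
    # sanity check: both methods must return the same result
    y = np.array([x for x in a if mask[x]])
    z = a[mask[a]]
    assert np.array_equal(y, z)
    t1 = timeit.timeit(lambda: np.array([x for x in a if mask[x]]), number=n_test)
    t2 = timeit.timeit(lambda: a[mask[a]], number=n_test)
    # timeit returns total seconds for n_test runs; convert to ms per run
    slow_times.append(t1 / n_test * 1000)
    fast_times.append(t2 / n_test * 1000)
plt.plot(p_space, slow_times, label='slow')
plt.plot(p_space, fast_times, label='fast')
plt.xlabel('p (fraction of True items in mask)')
plt.ylabel('time per call (ms)')
plt.legend()
plt.title('Speed of method vs. coverage of mask')
plt.show()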

Which gave me this plot

[Plot: Speed of method vs. coverage of mask]

So this method is a whole lot faster regardless of the coverage of mask.
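
For a numeric view of the same comparison without a plot, a minimal sketch follows. It assumes the same setup as the benchmark above (n = 10000, a random permutation for a); the helper name compare, the fixed seed and the three p values are hypothetical choices for illustration, and the printed timings will vary by machine.

```python
import timeit

import numpy as np


def compare(p, n=10000, n_test=20, seed=0):
    """Return (slow_ms, fast_ms) per call for mask coverage p (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(n) < p   # each entry is True with probability p
    a = rng.permutation(n)     # indices 0..n-1 in random order, no duplicates

    def slow():
        return np.array([x for x in a if mask[x]])

    def fast():
        return a[mask[a]]

    # Both methods must agree before their timings are worth comparing.
    assert np.array_equal(slow(), fast())

    # timeit returns total seconds for n_test runs; convert to ms per run.
    slow_ms = timeit.timeit(slow, number=n_test) / n_test * 1000
    fast_ms = timeit.timeit(fast, number=n_test) / n_test * 1000
    return slow_ms, fast_ms


for p in (0.1, 0.5, 0.9):
    s, f = compare(p)
    print(f"p={p}: loop {s:.3f} ms, indexed {f:.3f} ms, ratio {s / f:.1f}x")
```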
