Why is this numba code 6x slower than the numpy code?

Date: 2021-09-04 21:20:17

Is there any reason why the following code runs in 2s,

import numpy as np

def euclidean_distance_square(x1, x2):
    return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)

while the following numba code runs in 12s?

import numpy as np
from numba import jit

@jit(nopython=True)
def euclidean_distance_square(x1, x2):
    return -2*np.dot(x1, x2.T) + np.expand_dims(np.sum(np.square(x1), axis=1), axis=1) + np.sum(np.square(x2), axis=1)

My x1 is a matrix of dimension (1, 512) and x2 is a matrix of dimension (3000000, 512). It is quite weird that numba can be so much slower. Am I using it wrong?

I really need to speed this up because I need to run this function 3 million times and 2s is still way too slow.

I need to run this on CPU because, as you can see, the dimension of x2 is so huge that it cannot be loaded onto a GPU (or at least my GPU); there is not enough memory.

3 Answers

#1 (6 votes)

It is quite weird that numba can be so much slower.

It's not too weird. When you call NumPy functions inside a numba function, you call the numba version of these functions. These can be faster, slower, or just as fast as the NumPy versions. You might be lucky or unlucky (you were unlucky!). But even in the numba function you still create lots of temporaries because you use the NumPy functions: one temporary array for the dot result, one for each square and sum, and one for the dot plus the first sum. So you don't take advantage of the possibilities with numba.
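To make the temporaries concrete, here is the expression split into the intermediate arrays it allocates (the t* names and the small test shapes are just for illustration):

import numpy as np

x1 = np.random.random((1, 512))
x2 = np.random.random((1000, 512))   # stand-in for the (3000000, 512) array

t1 = np.dot(x1, x2.T)                # (1, n) temporary for the dot result
t2 = np.sum(np.square(x1), axis=1)   # two temporaries: the squares and their sum
t3 = np.expand_dims(t2, axis=1)      # (1, 1)
t4 = np.sum(np.square(x2), axis=1)   # the squares of x2 are a full (n, 512) temporary
res = -2 * t1 + t3 + t4              # broadcasting allocates further temporaries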

Am I using it wrong?

Essentially: Yes.

I really need to speed this up

Okay, I'll give it a try.

Let's start by unrolling the sum-of-squares-along-axis-1 calls:

import numpy as np
import numba as nb

@nb.njit
def sum_squares_2d_array_along_axis1(arr):
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for o_idx in range(arr.shape[0]):
        sum_ = 0.0
        for i_idx in range(arr.shape[1]):
            sum_ += arr[o_idx, i_idx] * arr[o_idx, i_idx]
        res[o_idx] = sum_
    return res


@nb.njit
def euclidean_distance_square_numba_v1(x1, x2):
    return -2 * np.dot(x1, x2.T) + np.expand_dims(sum_squares_2d_array_along_axis1(x1), axis=1) + sum_squares_2d_array_along_axis1(x2)

On my computer that's already 2 times faster than the NumPy code and almost 10 times faster than your original Numba code.

Speaking from experience, getting 2x faster than NumPy is generally the limit (at least if the NumPy version isn't needlessly complicated or inefficient). However, you can squeeze out a bit more by unrolling everything:

import numpy as np
import numba as nb

@nb.njit
def euclidean_distance_square_numba_v2(x1, x2):
    f1 = 0.0
    for i_idx in range(x1.shape[1]):
        f1 += x1[0, i_idx] * x1[0, i_idx]

    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.0
        for i_idx in range(x2.shape[1]):
            val_from_x2 = x2[o_idx, i_idx]
            val += (-2) * x1[0, i_idx] * val_from_x2 + val_from_x2 * val_from_x2
        val += f1
        res[o_idx] = val
    return res

But that only gives a ~10-20% improvement over the previous approach.

At that point you might realize that you can simplify the code (even though it probably won't speed it up):

import numpy as np
import numba as nb

@nb.njit
def euclidean_distance_square_numba_v3(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.0
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

Yeah, that looks pretty straightforward and it's not really slower.

However, in all the excitement I forgot to mention the obvious solution: scipy.spatial.distance.cdist, which has a sqeuclidean (squared euclidean distance) option:

from scipy.spatial import distance
distance.cdist(x1, x2, metric='sqeuclidean')

It's not really faster than numba, but it's available without having to write your own function...

Tests

Test for correctness and do the warmups:

x1 = np.array([[1.,2,3]])
x2 = np.array([[1.,2,3], [2,3,4], [3,4,5], [4,5,6], [5,6,7]])

# euclidean_distance_square_numba_original is the @jit(nopython=True)
# version from the question.
res1 = euclidean_distance_square(x1, x2)
res2 = euclidean_distance_square_numba_original(x1, x2)
res3 = euclidean_distance_square_numba_v1(x1, x2)
res4 = euclidean_distance_square_numba_v2(x1, x2)
res5 = euclidean_distance_square_numba_v3(x1, x2)
np.testing.assert_array_equal(res1, res2)
np.testing.assert_array_equal(res1, res3)
np.testing.assert_array_equal(res1[0], res4)
np.testing.assert_array_equal(res1[0], res5)
np.testing.assert_almost_equal(res1, distance.cdist(x1, x2, metric='sqeuclidean'))

Timings:

x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

%timeit euclidean_distance_square(x1, x2)
# 2.09 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_original(x1, x2)
# 10.9 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v1(x1, x2)
# 907 ms ± 7.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v2(x1, x2)
# 715 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distance_square_numba_v3(x1, x2)
# 731 ms ± 34.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit distance.cdist(x1, x2, metric='sqeuclidean')
# 706 ms ± 4.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note: If you have arrays of integers you might want to change the hard-coded 0.0 in the numba functions to 0.
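For illustration, an integer-friendly variant of the helper might look like this (a sketch; the only change is the accumulator literal):

import numpy as np
import numba as nb

@nb.njit
def sum_squares_int_2d_axis1(arr):
    res = np.empty(arr.shape[0], dtype=arr.dtype)
    for o_idx in range(arr.shape[0]):
        sum_ = 0  # integer literal instead of 0.0, so the result keeps the integer dtype
        for i_idx in range(arr.shape[1]):
            sum_ += arr[o_idx, i_idx] * arr[o_idx, i_idx]
        res[o_idx] = sum_
    return res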

#2 (4 votes)

Despite the fact that the answer of @MSeifert makes this answer quite obsolete, I'm still posting it, because it explains in more detail why the numba version was slower than the numpy version.

As we will see, the main culprit is the different memory access patterns of numpy and numba.

We can reproduce the behavior with a much simpler function:

import numpy as np
import numba as nb

def just_sum(x2):
    return np.sum(x2, axis=1)

@nb.jit('double[:](double[:, :])', nopython=True)
def nb_just_sum(x2):
    return np.sum(x2, axis=1)

x = np.random.random((2048, 2048))

And now the timings:

>>> %timeit just_sum(x)
2.33 ms ± 71.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit nb_just_sum(x)
33.7 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

That means numpy is about 15 times faster!

When compiling the numba code with annotations (e.g. numba --annotate-html sum.html numba_sum.py), we can see how the sum is performed by numba (see the whole listing of the summation in the appendix):

  1. initialize the result column
  2. add the whole first column to the result column
  3. add the whole second column to the result column
  4. and so on

What is the problem with this approach? The memory layout! The array is stored in row-major order, so reading it column-wise leads to many more cache misses than reading it row-wise (which is what numpy does). There is a great article which explains the possible cache effects.
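To make the two access patterns explicit, here is a hand-written sketch (not numba's actual generated code) of a column-wise sum, as numba effectively does it, next to a cache-friendly row-wise sum:

import numpy as np
import numba as nb

@nb.njit
def sum_axis1_columnwise(arr):
    # numba's strategy: add whole columns to the result column.
    # Each inner step jumps a full row length in memory -> cache misses.
    res = np.zeros(arr.shape[0])
    for col in range(arr.shape[1]):
        for row in range(arr.shape[0]):
            res[row] += arr[row, col]
    return res

@nb.njit
def sum_axis1_rowwise(arr):
    # Cache-friendly: each row is contiguous in a row-major array.
    res = np.empty(arr.shape[0])
    for row in range(arr.shape[0]):
        acc = 0.0
        for col in range(arr.shape[1]):
            acc += arr[row, col]
        res[row] = acc
    return res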

As we can see, the sum implementation of numba is not yet very mature. However, from the above consideration, the numba implementation could be competitive for column-major order (i.e. a transposed matrix):

>>> %timeit just_sum(x.T)
3.09 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit nb_just_sum(x.T)
3.58 ms ± 45.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and it really is.

As the code of @MSeifert has shown, the main advantage of numba is that with its help we can reduce the number of temporary numpy arrays. However, some things that look easy are not easy at all, and a naive solution can be pretty bad. Building a sum is such an operation: one should not think that a simple loop is good enough - see for example this question.


Listing numba-summation:

Function name: array_sum_impl_axis
in file: /home/ed/anaconda3/lib/python3.6/site-packages/numba/targets/arraymath.py
with signature: (array(float64, 2d, A), int64) -> array(float64, 1d, C)
194:    def array_sum_impl_axis(arr, axis):
195:        ndim = arr.ndim
196:    
197:        if not is_axis_const:
198:            # Catch where axis is negative or greater than 3.
199:            if axis < 0 or axis > 3:
200:                raise ValueError("Numba does not support sum with axis"
201:                                 "parameter outside the range 0 to 3.")
202:    
203:        # Catch the case where the user misspecifies the axis to be
204:        # more than the number of the array's dimensions.
205:        if axis >= ndim:
206:            raise ValueError("axis is out of bounds for array")
207:    
208:        # Convert the shape of the input array to a list.
209:        ashape = list(arr.shape)
210:        # Get the length of the axis dimension.
211:        axis_len = ashape[axis]
212:        # Remove the axis dimension from the list of dimensional lengths.
213:        ashape.pop(axis)
214:        # Convert this shape list back to a tuple using above intrinsic.
215:        ashape_without_axis = _create_tuple_result_shape(ashape, arr.shape)
216:        # Tuple needed here to create output array with correct size.
217:        result = np.full(ashape_without_axis, zero, type(zero))
218:    
219:        # Iterate through the axis dimension.
220:        for axis_index in range(axis_len):
221:            if is_axis_const:
222:                # constant specialized version works for any valid axis value
223:                index_tuple_generic = _gen_index_tuple(arr.shape, axis_index,
224:                                                       const_axis_val)
225:                result += arr[index_tuple_generic]
226:            else:
227:                # Generate a tuple used to index the input array.
228:                # The tuple is ":" in all dimensions except the axis
229:                # dimension where it is "axis_index".
230:                if axis == 0:
231:                    index_tuple1 = _gen_index_tuple(arr.shape, axis_index, 0)
232:                    result += arr[index_tuple1]
233:                elif axis == 1:
234:                    index_tuple2 = _gen_index_tuple(arr.shape, axis_index, 1)
235:                    result += arr[index_tuple2]
236:                elif axis == 2:
237:                    index_tuple3 = _gen_index_tuple(arr.shape, axis_index, 2)
238:                    result += arr[index_tuple3]
239:                elif axis == 3:
240:                    index_tuple4 = _gen_index_tuple(arr.shape, axis_index, 3)
241:                    result += arr[index_tuple4]
242:    
243:        return result 

#3 (2 votes)

This is a comment on @MSeifert's answer. There are some more things to gain performance. As in every numerical code, it is advisable to think about which datatype is precise enough for your problem. Often float32 is enough; sometimes even float64 isn't enough.

I also want to mention the fastmath keyword here, which can give another 1.7x speedup.

[Edit]

For a simple summation I looked into the LLVM code and found that the summation was split up into partial sums on vectorization (4 partial sums for double and 8 for float using AVX2). This has to be investigated further.
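Hand-written, the reassociated reduction looks roughly like the following sketch (illustration only, not the actual vectorized LLVM output):

import numpy as np
import numba as nb

@nb.njit(fastmath=True)
def partial_sums(arr):
    # Four independent accumulators, as the vectorizer uses for
    # doubles with AVX2 (8 for float32).
    s0 = 0.0
    s1 = 0.0
    s2 = 0.0
    s3 = 0.0
    n4 = (arr.shape[0] // 4) * 4
    for i in range(0, n4, 4):
        s0 += arr[i]
        s1 += arr[i + 1]
        s2 += arr[i + 2]
        s3 += arr[i + 3]
    total = s0 + s1 + s2 + s3
    for i in range(n4, arr.shape[0]):  # handle the remainder
        total += arr[i]
    return total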

Code:

import llvmlite.binding as llvm
llvm.set_option('', '--debug-only=loop-vectorize')

import numpy as np
import numba as nb

@nb.njit
def euclidean_distance_square_numba_v3(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

@nb.njit(fastmath=True)
def euclidean_distance_square_numba_v4(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

@nb.njit(fastmath=True, parallel=True)
def euclidean_distance_square_numba_v5(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in nb.prange(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

Timings

float64:
x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

0.42 s  v3 @MSeifert
0.25 s  v4
0.18 s  v5 (parallel version)
0.48 s  distance.cdist

float32:
x1 = np.random.random((1, 512)).astype(np.float32)
x2 = np.random.random((1000000, 512)).astype(np.float32)

0.09 s  v5

How to explicitly declare types

In general I wouldn't recommend this. Your input arrays can be C-contiguous (like the test data), Fortran-contiguous, or strided. If you know that your data is always C-contiguous, you can write:

@nb.njit('double[:](double[:, ::1], double[:, ::1])', fastmath=True)
def euclidean_distance_square_numba_v6(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

This offers the same performance as the v4 version, but will fail if the input arrays are not C-contiguous or not of dtype=np.float64.
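For example (a sketch; the exact error text depends on the numba version), a non-contiguous view is rejected at dispatch:

x1 = np.random.random((1, 512))
x2 = np.random.random((1000000, 512))

euclidean_distance_square_numba_v6(x1, x2)          # fine: both arrays are C-contiguous
euclidean_distance_square_numba_v6(x1, x2[:, ::2])  # TypeError: no matching definition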

You can also use

@nb.njit('double[:](double[:, :], double[:, :])', fastmath=True)
def euclidean_distance_square_numba_v7(x1, x2):
    res = np.empty(x2.shape[0], dtype=x2.dtype)
    for o_idx in range(x2.shape[0]):
        val = 0.
        for i_idx in range(x2.shape[1]):
            tmp = x1[0, i_idx] - x2[o_idx, i_idx]
            val += tmp * tmp
        res[o_idx] = val
    return res

This will also work on strided arrays, but will be much slower than the version above on C-contiguous arrays (0.66 s vs. 0.25 s). Please note also that your problem is quite limited by memory bandwidth; the difference can be larger for CPU-bound calculations.

If you let Numba do the job for you, it will automatically detect whether the array is contiguous or not (providing contiguous input data on the first try and non-contiguous data afterwards will lead to a recompilation).
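A small sketch of this lazy specialization (the signatures attribute of the compiled dispatcher lists the variants; assuming a reasonably recent numba):

import numpy as np
import numba as nb

@nb.njit(fastmath=True)
def row_sums(x):
    res = np.empty(x.shape[0], dtype=x.dtype)
    for i in range(x.shape[0]):
        s = 0.
        for j in range(x.shape[1]):
            s += x[i, j]
        res[i] = s
    return res

a = np.random.random((100, 64))

row_sums(a)                 # first call compiles a C-contiguous specialization
row_sums(a[:, ::2])         # strided view triggers a recompilation
print(row_sums.signatures)  # two compiled signatures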
