Why is numpy.any so slow over large arrays?

Time: 2021-05-21 21:19:09

I'm looking for the most efficient way to determine whether a large array contains at least one nonzero value. At first glance np.any seems like the obvious tool for the job, but it seems unexpectedly slow over large arrays.

Consider this extreme case:

first = np.zeros(1E3,dtype=np.bool)
last = np.zeros(1E3,dtype=np.bool)

first[0] = True
last[-1] = True

# test 1
%timeit np.any(first)
>>> 100000 loops, best of 3: 6.36 us per loop

# test 2
%timeit np.any(last)
>>> 100000 loops, best of 3: 6.95 us per loop

At least np.any seems to be doing something vaguely sensible here - if the nonzero value is the first one in the array, there should be no need to consider any others before returning True, so I would expect test 1 to be slightly quicker than test 2.

However, what happens when we make the arrays much larger?

first = np.zeros(1E9,dtype=np.bool)
last = np.zeros(1E9,dtype=np.bool)

first[0] = True
last[-1] = True

# test 3
%timeit np.any(first)
>>> 10 loops, best of 3: 21.6 ms per loop

# test 4
%timeit np.any(last)
>>> 1 loops, best of 3: 739 ms per loop

As expected, test 4 is quite a lot slower than test 3. However, in test 3 np.any should still only have to check the value of a single element in first in order to know that it contains at least one nonzero value. Why, then, is test 3 so much slower than test 1?

Edit 1:

I'm using a development version of Numpy (1.8.0.dev-e11cd9b), but I get the exact same timing results using Numpy 1.7.1. I'm running 64 bit Linux, Python 2.7.4. My system is basically idling (I'm running an IPython session, a browser and a text editor), and I'm definitely not hitting the swap. I also replicated the result on another machine running Numpy 1.7.1.

Edit 2:

Using Numpy 1.6.2 I get times of ~1.85 us for both tests 1 and 3, so, as jorgeca suggested, there seems to have been a performance regression between Numpy 1.6.2 and 1.7.0 in this regard.

Edit 3:

Following J.F. Sebastian and jorgeca's lead I've done some more benchmarking using np.all on an array of zeros, which ought to be equivalent to calling np.any on an array where the first element is one.

Test script:

import timeit
import numpy as np
print 'Numpy v%s' %np.version.full_version
stmt = "np.all(x)"
for ii in xrange(10):
    setup = "import numpy as np; x = np.zeros(%d,dtype=np.bool)" %(10**ii)
    timer = timeit.Timer(stmt,setup)
    n,r = 1,3
    t = np.min(timer.repeat(r,n))
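    # scale up the loop count until one timing run takes at least 0.2 s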
    while t < 0.2:
        n *= 10
        t = np.min(timer.repeat(r,n))
    t /= n
    if t < 1E-3:
        timestr = "%1.3f us" %(t*1E6)
    elif t < 1:
        timestr = "%1.3f ms" %(t*1E3)
    else:
        timestr = "%1.3f s" %t
    print "Array size: 1E%i, %i loops, best of %i: %s/loop" %(ii,n,r,timestr)

Results:

Numpy v1.6.2
Array size: 1E0, 1000000 loops, best of 3: 1.738 us/loop
Array size: 1E1, 1000000 loops, best of 3: 1.845 us/loop
Array size: 1E2, 1000000 loops, best of 3: 1.862 us/loop
Array size: 1E3, 1000000 loops, best of 3: 1.858 us/loop
Array size: 1E4, 1000000 loops, best of 3: 1.864 us/loop
Array size: 1E5, 1000000 loops, best of 3: 1.882 us/loop
Array size: 1E6, 1000000 loops, best of 3: 1.866 us/loop
Array size: 1E7, 1000000 loops, best of 3: 1.853 us/loop
Array size: 1E8, 1000000 loops, best of 3: 1.860 us/loop
Array size: 1E9, 1000000 loops, best of 3: 1.854 us/loop

Numpy v1.7.0
Array size: 1E0, 100000 loops, best of 3: 5.881 us/loop
Array size: 1E1, 100000 loops, best of 3: 5.831 us/loop
Array size: 1E2, 100000 loops, best of 3: 5.924 us/loop
Array size: 1E3, 100000 loops, best of 3: 5.864 us/loop
Array size: 1E4, 100000 loops, best of 3: 5.997 us/loop
Array size: 1E5, 100000 loops, best of 3: 6.979 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.196 us/loop
Array size: 1E7, 10000 loops, best of 3: 116.162 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.112 ms/loop
Array size: 1E9, 100 loops, best of 3: 11.061 ms/loop

Numpy v1.7.1
Array size: 1E0, 100000 loops, best of 3: 6.216 us/loop
Array size: 1E1, 100000 loops, best of 3: 6.257 us/loop
Array size: 1E2, 100000 loops, best of 3: 6.318 us/loop
Array size: 1E3, 100000 loops, best of 3: 6.247 us/loop
Array size: 1E4, 100000 loops, best of 3: 6.492 us/loop
Array size: 1E5, 100000 loops, best of 3: 7.406 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.426 us/loop
Array size: 1E7, 10000 loops, best of 3: 115.946 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.102 ms/loop
Array size: 1E9, 100 loops, best of 3: 10.987 ms/loop

Numpy v1.8.0.dev-e11cd9b
Array size: 1E0, 100000 loops, best of 3: 6.357 us/loop
Array size: 1E1, 100000 loops, best of 3: 6.399 us/loop
Array size: 1E2, 100000 loops, best of 3: 6.425 us/loop
Array size: 1E3, 100000 loops, best of 3: 6.397 us/loop
Array size: 1E4, 100000 loops, best of 3: 6.596 us/loop
Array size: 1E5, 100000 loops, best of 3: 7.569 us/loop
Array size: 1E6, 100000 loops, best of 3: 17.445 us/loop
Array size: 1E7, 10000 loops, best of 3: 115.109 us/loop
Array size: 1E8, 1000 loops, best of 3: 1.094 ms/loop
Array size: 1E9, 100 loops, best of 3: 10.840 ms/loop

Edit 4:

Following seberg's comment I tried the same test with an np.float32 array instead of np.bool. In this case, Numpy 1.6.2 also shows a slowdown as array sizes increase:

Numpy v1.6.2
Array size: 1E0, 100000 loops, best of 3: 3.503 us/loop
Array size: 1E1, 100000 loops, best of 3: 3.597 us/loop
Array size: 1E2, 100000 loops, best of 3: 3.742 us/loop
Array size: 1E3, 100000 loops, best of 3: 4.745 us/loop
Array size: 1E4, 100000 loops, best of 3: 14.533 us/loop
Array size: 1E5, 10000 loops, best of 3: 112.463 us/loop
Array size: 1E6, 1000 loops, best of 3: 1.101 ms/loop
Array size: 1E7, 100 loops, best of 3: 11.724 ms/loop
Array size: 1E8, 10 loops, best of 3: 116.924 ms/loop
Array size: 1E9, 1 loops, best of 3: 1.168 s/loop

Numpy v1.7.1
Array size: 1E0, 100000 loops, best of 3: 6.548 us/loop
Array size: 1E1, 100000 loops, best of 3: 6.546 us/loop
Array size: 1E2, 100000 loops, best of 3: 6.804 us/loop
Array size: 1E3, 100000 loops, best of 3: 7.784 us/loop
Array size: 1E4, 100000 loops, best of 3: 17.946 us/loop
Array size: 1E5, 10000 loops, best of 3: 117.235 us/loop
Array size: 1E6, 1000 loops, best of 3: 1.096 ms/loop
Array size: 1E7, 100 loops, best of 3: 12.328 ms/loop
Array size: 1E8, 10 loops, best of 3: 118.431 ms/loop
Array size: 1E9, 1 loops, best of 3: 1.172 s/loop

Why should this happen? As with the boolean case, np.all should still only have to check the first element before returning, so times should still be constant w.r.t. array size.

1 solution

#1

As has been guessed in the comments, I can confirm that the processing of the array is being done in chunks. First, I will show you where things are in the code and then I will show you how you can change the chunk size and the effect that doing so has on your benchmark.

Where to find the reduction processing in the Numpy source files

np.all(x) is the same as x.all(). all() really calls np.core.umath.logical_and.reduce(x).

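As a quick sanity check, all three spellings can be compared directly (a minimal sketch, run under the same Python 2 / NumPy setup as the benchmarks above):

import numpy as np

x = np.zeros(1000, dtype=np.bool)
x[0] = True

# All three spellings give the same answer; the last one is the
# reduction that actually runs under the hood.
print np.all(x)                              # False
print x.all()                                # False
print np.core.umath.logical_and.reduce(x)    # False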

If you want to dig into the numpy source, I will try to guide you to where the buffer/chunk size comes into play. The folder with all of the code we will be looking at is numpy/core/src/umath/.

PyUFunc_Reduce() in ufunc_object.c is the C function that handles the reduce. In PyUFunc_Reduce(), the chunk, or buffer, size is found by looking up the value for reduce in a global dictionary via the PyUFunc_GetPyValues() function (also in ufunc_object.c). On my machine, compiling from the development branch, the chunk size is 8192. PyUFunc_ReduceWrapper() in reduction.c is called to set up the iterator (with a stride equal to the chunk size), and it calls the loop function passed in to it, which is reduce_loop() in ufunc_object.c.

reduce_loop() basically just uses the iterator and calls another innerloop() function for each chunk. The innerloop functions are found in loops.c.src. For a boolean array and our case of all/logical_and, the appropriate innerloop function is BOOL_logical_and. You can find it by searching for BOOLEAN LOOPS; it is the second function below that comment (it is hard to spot because of the template-like programming used there). There you will find that short-circuiting is in fact done within each chunk.

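In other words, the control flow amounts to something like the following (a rough pure-Python model, not the actual C code):

def chunked_all(x, bufsize=8192):
    # Mimics the chunked reduce: the inner loop short-circuits within
    # a chunk, but the outer loop still visits (and buffers) every chunk.
    result = True
    for start in xrange(0, len(x), bufsize):
        chunk = x[start:start + bufsize]
        for value in chunk:      # innerloop: stops at the first False
            if not value:
                result = False
                break
        # note: no early exit here - the next chunk still gets fetched
    return result

This also squares with the timings above: a 1E9-element array splits into roughly 1E9 / 8192 ≈ 122,000 chunks, so the ~11 ms result works out to something on the order of 100 ns of overhead per chunk.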

How to change the buffer size used in ufunctions (and thus in any/all)

You can get the chunk/buffer size with np.getbufsize(). For me, that returns 8192 without my having set it manually, which matches the buffer size I found by printing it out in the code. You can use np.setbufsize() to change the chunk size.

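For example (a minimal sketch):

import numpy as np

old = np.getbufsize()   # 8192 by default
np.setbufsize(10**7)    # big enough to cover a 1E7-element array in one chunk
# ... run the reduction of interest, e.g. np.all(x) or np.any(x) ...
np.setbufsize(old)      # restore the previous buffer size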

Results using a bigger buffer size

I changed your benchmark code to the following:

import timeit
import numpy as np
print 'Numpy v%s' %np.version.full_version
stmt = "np.all(x)"
for ii in xrange(9):
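    # clamp the buffer size to [8192, 1E7] - numpy rejects values outside that range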
    setup = "import numpy as np; x = np.zeros(%d,dtype=np.bool); np.setbufsize(%d)" %(10**ii, max(8192, min(10**ii, 10**7)))
    timer = timeit.Timer(stmt,setup)
    n,r = 1,3
    t = np.min(timer.repeat(r,n))
    while t < 0.2:
        n *= 10
        t = np.min(timer.repeat(r,n))
    t /= n
    if t < 1E-3:
        timestr = "%1.3f us" %(t*1E6)
    elif t < 1:
        timestr = "%1.3f ms" %(t*1E3)
    else:
        timestr = "%1.3f s" %t
    print "Array size: 1E%i, %i loops, best of %i: %s/loop" %(ii,n,r,timestr)

Numpy doesn't accept buffer sizes that are too small or too big (for instance, it rejected a buffer size of 1E8), so I made sure the value never went below 8192 or above 1E7. Otherwise, I set the buffer size to the size of the array being processed. I only went up to arrays of 1E8 elements because my machine only has 4GB of memory at the moment. Here are the results:

Numpy v1.8.0.dev-2a5c2c8
Array size: 1E0, 100000 loops, best of 3: 5.351 us/loop
Array size: 1E1, 100000 loops, best of 3: 5.390 us/loop
Array size: 1E2, 100000 loops, best of 3: 5.366 us/loop
Array size: 1E3, 100000 loops, best of 3: 5.360 us/loop
Array size: 1E4, 100000 loops, best of 3: 5.433 us/loop
Array size: 1E5, 100000 loops, best of 3: 5.400 us/loop
Array size: 1E6, 100000 loops, best of 3: 5.397 us/loop
Array size: 1E7, 100000 loops, best of 3: 5.381 us/loop
Array size: 1E8, 100000 loops, best of 3: 6.126 us/loop

There is a small uptick in the last timing because, given the cap on how big the buffer can be, the 1E8 array has to be processed in multiple chunks.
