Techniques for working with large Numpy arrays?

Date: 2022-12-06 21:28:57

There are times when you have to perform many intermediate operations on one, or more, large Numpy arrays. This can quickly result in MemoryErrors. In my research so far, I have found that Pickling (Pickle, CPickle, Pytables, etc.) and gc.collect() are ways to mitigate this. I was wondering if there are any other techniques experienced programmers use when dealing with large quantities of data (other than removing redundancies in your strategy/code, of course).

Also, if there's one thing I'm sure of, it's that nothing is free. With some of these techniques, what are the trade-offs (e.g., speed, robustness, etc.)?

4 Answers

#1


19  

I feel your pain... You sometimes end up storing several times the size of your array in values you will later discard. When processing one item in your array at a time, this is irrelevant, but can kill you when vectorizing.

I'll use an example from work for illustration purposes. I recently coded the algorithm described here using numpy. It is a color map algorithm, which takes an RGB image, and converts it into a CMYK image. The process, which is repeated for every pixel, is as follows:

  1. Use the most significant 4 bits of every RGB value as indices into a three-dimensional look-up table. This determines the CMYK values for the 8 vertices of a cube within the LUT.
  2. Use the least significant 4 bits of every RGB value to interpolate within that cube, based on the vertex values from the previous step. The most efficient way of doing this requires computing 16 arrays of uint8s the size of the image being processed. For a 24-bit RGB image, that is equivalent to needing 6x the storage of the image itself to process it.

A couple of things you can do to handle this:

1. Divide and conquer

Maybe you cannot process a 1,000x1,000 array in a single pass. But if you can do it with a Python for loop iterating over 10 arrays of 100x1,000, it will still beat, by a very wide margin, a Python iterator over 1,000,000 items! It's going to be slower, yes, but not by much.

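A minimal sketch of the idea, assuming the per-chunk work is a vectorized function whose temporaries are several times larger than its input (the chunk size and the `func` used here are illustrative, not from the original code):

```python
import numpy as np

def process_in_chunks(image, func, chunk_rows=100):
    """Apply `func` to horizontal slices of `image` to cap peak memory use."""
    out = np.empty_like(image)
    for start in range(0, image.shape[0], chunk_rows):
        sl = slice(start, start + chunk_rows)
        # Only this slice's intermediates are alive at any one time.
        out[sl] = func(image[sl])
    return out

# Hypothetical usage: a 1,000x1,000 array processed in 10 passes of 100x1,000.
big = np.random.rand(1000, 1000)
result = process_in_chunks(big, lambda a: a**2 + 2*a + 1)
```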

2. Cache expensive computations

This relates directly to my interpolation example above, and is harder to come across, although worth keeping an eye open for. Because I am interpolating on a three-dimensional cube with 4 bits in each dimension, there are only 16x16x16 possible outcomes, which can be stored in 16 arrays of 16x16x16 bytes. So I can precompute them and store them using 64KB of memory, and look up the values one by one for the whole image, rather than redoing the same operations for every pixel at huge memory cost. This already pays off for images as small as 64x64 pixels, and basically allows processing images with 6x the number of pixels without having to subdivide the array.

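As a rough illustration of the idea (not the actual CMYK code): precompute every possible outcome of an expensive function of a 4-bit input once, then fetch results for the whole image with a single fancy-indexing lookup. The `expensive` function and image shape below are made up:

```python
import numpy as np

def expensive(v):
    # Stand-in for a costly per-value computation with only 16 possible inputs.
    return (np.sin(v / 15.0) * 255).astype(np.uint8)

lut = expensive(np.arange(16))                            # 16 precomputed results
low_nibble = np.random.randint(0, 16, size=(1080, 1920))  # e.g. the low 4 bits of a channel
result = lut[low_nibble]                                  # one lookup for the whole image
```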

3. Use your dtypes wisely

If your intermediate values can fit in a single uint8, don't use an array of int32s! This can turn into a nightmare of mysterious errors due to silent overflows, but if you are careful, it can provide a big saving of resources.

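For instance (sizes are for a hypothetical 1,000x1,000 array), the dtype choice changes the footprint by 4x, but uint8 arithmetic wraps around silently, which is the overflow caveat mentioned above:

```python
import numpy as np

a32 = np.zeros((1000, 1000), dtype=np.int32)   # ~4 MB
a8 = np.zeros((1000, 1000), dtype=np.uint8)    # ~1 MB
print(a32.nbytes, a8.nbytes)                   # 4000000 1000000

# The caveat: uint8 sums wrap modulo 256 without raising an error.
x = np.array([200, 100], dtype=np.uint8)
print(x + x)                                   # [144 200], not [400 200]
```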

#2


9  

First and most important trick: allocate a few big arrays, and use and recycle portions of them, instead of bringing to life and discarding/garbage-collecting lots of temporary arrays. It sounds a little old-fashioned, but with careful programming the speed-up can be impressive. (You have better control of alignment and data locality, so numeric code can be made more efficient.)

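A minimal sketch of the recycling idea using the `out=` argument of numpy ufuncs, so one preallocated scratch buffer is reused instead of each expression allocating fresh temporaries (array sizes are illustrative):

```python
import numpy as np

a = np.random.rand(5000, 5000)
b = np.random.rand(5000, 5000)
scratch = np.empty_like(a)          # allocated once, reused for every step

np.multiply(a, b, out=scratch)      # scratch = a * b, no new temporary
np.add(scratch, a, out=scratch)     # scratch = a * b + a, still in place
```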

Second: use numpy.memmap and hope that OS caching of accesses to the disk is efficient enough.

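A small sketch of numpy.memmap (the filename, shape, and dtype here are placeholders); the array lives on disk and the OS pages slices in and out as they are touched:

```python
import numpy as np

mm = np.memmap('big_array.dat', dtype=np.float32, mode='w+',
               shape=(20000, 20000))           # ~1.5 GB backed by a file, not RAM
mm[:1000] = np.random.rand(1000, 20000)        # only the touched slice needs to be resident
mm.flush()                                     # write dirty pages back to disk
```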

Third: as pointed out by @Jaime, work on block sub-matrices if the whole matrix is too big.

EDIT:

Avoid unnecessary list comprehensions, as pointed out in this answer on SE.

#3


3  

The dask.array library provides a numpy interface that uses blocked algorithms to handle larger-than-memory arrays with multiple cores.

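For example (chunk and array sizes are arbitrary here), dask.array mirrors the numpy API but evaluates lazily in blocks:

```python
import dask.array as da

# 20,000 x 20,000 array split into 1,000 x 1,000 blocks; only a few blocks
# need to be in memory at any moment, and blocks are processed across cores.
x = da.random.random((20000, 20000), chunks=(1000, 1000))
total = (x + x.T).sum().compute()   # .compute() triggers the blocked evaluation
```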

You could also look into Spartan, Distarray, and Biggus.

#4


3  

If it is possible for you, use numexpr. For numeric calculations like a**2 + b**2 + 2*a*b (with a and b being arrays), it

  1. will compile machine code that will execute fast and with minimal memory overhead, taking care of memory locality stuff (and thus cache optimization) if the same array occurs several times in your expression,

  2. uses all cores of your dual or quad core CPU,

  3. is an extension to numpy, not an alternative.

For medium and large sized arrays, it is faster than numpy alone.

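A minimal example of the expression above evaluated with numexpr (array sizes are arbitrary); the whole expression is compiled and streamed through in blocks instead of materializing a**2, b**2 and 2*a*b as separate temporaries:

```python
import numpy as np
import numexpr as ne

a = np.random.rand(10_000_000)
b = np.random.rand(10_000_000)

result = ne.evaluate("a**2 + b**2 + 2*a*b")   # numexpr picks up a and b from the local scope
```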

Take a look at the web page given above; there are examples that will help you understand whether numexpr is for you.
