Dot product of a huge array in numpy

I have a huge array and I want to calculate its dot product with a small array, but I am getting an 'array is too big' error. Is there a workaround?

import numpy as np

eMatrix = np.random.random_integers(low=0,high=100,size=(20000000,50))
pMatrix = np.random.random_integers(low=0,high=10,size=(50,50))

a = np.dot(eMatrix,pMatrix)

Error:
/Library/Python/2.7/site-packages/numpy/random/mtrand.so in mtrand.RandomState.random_integers (numpy/random/mtrand/mtrand.c:9385)()

/Library/Python/2.7/site-packages/numpy/random/mtrand.so in mtrand.RandomState.randint (numpy/random/mtrand/mtrand.c:7051)()

ValueError: array is too big.

3 Solutions

#1

That error is raised when the total size of the array is computed and the result overflows the native int type; see here for the exact source code line.

For this to happen, even if your machine is 64-bit, you are almost certainly running 32-bit versions of Python (and NumPy). You can check whether that is the case by doing:

>>> import sys
>>> sys.maxsize
2147483647 # <--- 2**31 - 1, on a 64 bit version you would get 2**63 - 1
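
The same check can be run as a script rather than at the prompt. This is only a minimal sketch: the helper name `python_bitness` is made up for illustration, and it assumes a standard CPython build, where the pointer size reported by `struct` matches the width behind `sys.maxsize`.

```python
import struct
import sys


def python_bitness():
    """Return the pointer width, in bits, of the running interpreter."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *)
    return struct.calcsize("P") * 8


bits = python_bitness()
print("This Python is a %d-bit build" % bits)

# On a 32-bit build sys.maxsize is 2**31 - 1, on a 64-bit build 2**63 - 1
print("sys.maxsize = %d" % sys.maxsize)
```

If this reports 32 on a 64-bit machine, installing a 64-bit Python (and a matching NumPy) is the first thing to try.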

Then again, your array has "only" 20000000 * 50 = 1000000000 elements, which is just under 2**30. If I try to reproduce your results on a 32-bit numpy, I get a MemoryError:

>>> np.random.random_integers(low=0,high=100,size=(20000000,50))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mtrand.pyx", line 1420, in mtrand.RandomState.random_integers (numpy\random\mtrand\mtrand.c:12943)
  File "mtrand.pyx", line 938, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10338)
MemoryError

unless I increase the size beyond the magic 2**31 - 1 threshold:

>>> np.random.random_integers(low=0,high=100,size=(2**30, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mtrand.pyx", line 1420, in mtrand.RandomState.random_integers (numpy\random\mtrand\mtrand.c:12943)
  File "mtrand.pyx", line 938, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10338)
ValueError: array is too big.

Given the difference in line numbers between your traceback and mine, I suspect you are using an older version. What does this output on your system?

>>> np.__version__
'1.10.0.dev-9c50f98'

#2

I think the only "simple" answer is to get more RAM.

It took 15GB, but I was able to do this on my macbook.

In [1]: import numpy
In [2]: e = numpy.random.random_integers(low=0, high=100, size=(20000000, 50))
In [3]: p = numpy.random.random_integers(low=0, high=10, size=(50, 50))
In [4]: a = numpy.dot(e, p)
In [5]: a[0]
Out[5]:
array([14753, 12720, 15324, 13588, 16667, 16055, 14144, 15239, 15166,
       14293, 16786, 12358, 14880, 13846, 11950, 13836, 13393, 14679,
       15292, 15472, 15734, 12095, 14264, 12242, 12684, 11596, 15987,
       15275, 13572, 14534, 16472, 14818, 13374, 14115, 13171, 11927,
       14226, 13312, 16070, 13524, 16591, 16533, 15466, 15440, 15595,
       13164, 14278, 13692, 12415, 13314])

A possible solution might be using a sparse matrix and the sparse matrix dot operator.

For example, on my machine, constructing just e as a dense matrix used 8GB of RAM. Constructing a similar sparse matrix eprime:

In [1]: from scipy.sparse import rand
In [2]: eprime = rand(20000000, 50)

has negligible cost in terms of memory.
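
To make the sparse route concrete, here is a minimal sketch of the dot product itself. It uses deliberately small, made-up sizes so that it runs quickly, and it assumes SciPy is installed. Note that `scipy.sparse.rand` defaults to a density of 0.01, so roughly 99% of the entries are zero; this only saves memory if your real data is mostly zeros, which is not the case for the dense random integers in the question.

```python
import numpy as np
from scipy import sparse

# Hypothetical small sizes for illustration; the question used 20000000 rows
n_rows, n_cols = 2000, 50

# Only about 1% of the entries are stored; CSR format suits row-wise products
eprime = sparse.rand(n_rows, n_cols, density=0.01, format="csr")

# Dense 50x50 integer matrix with values in 0..10 (upper bound is exclusive)
p = np.random.randint(0, 11, size=(n_cols, n_cols))

# Sparse-times-dense product; the result comes back as a dense ndarray
a = eprime.dot(p)

print(a.shape)
print("%d stored values out of %d" % (eprime.nnz, n_rows * n_cols))
```

The result a is still dense, so for the full-size problem it would need as much memory as before; only the storage for the input shrinks.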

#3

I believe the answer is that you do not have enough RAM, and possibly that you are also running a 32-bit version of Python. It would help to clarify which OS you are running; many OSes can run both 32-bit and 64-bit programs.
