Based on the answers here it doesn't seem like there's an easy way to fill a 2D numpy array with data from a generator.
However, if someone can think of a way to vectorize or otherwise speed up the following function I would appreciate it.
The difference here is that I want to process the values from the generator in batches rather than create the whole array in memory. The only way I could think of doing that was with a for loop.
import numpy as np
from itertools import permutations
permutations_of_values = permutations(range(1,20), 7)
def array_from_generator(generator, arr):
    """Fills the numpy array provided with values from
    the generator provided. Number of columns in arr
    must match the number of values yielded by the
    generator."""
    count = 0
    for row in arr:
        try:
            item = next(generator)
        except StopIteration:
            break
        row[:] = item
        count += 1
    return arr[:count, :]
batch_size = 100000
empty_array = np.empty((batch_size, 7), dtype=int)
batch_of_values = array_from_generator(permutations_of_values, empty_array)
print(batch_of_values[0:5])
Output:
[[ 1 2 3 4 5 6 7]
[ 1 2 3 4 5 6 8]
[ 1 2 3 4 5 6 9]
[ 1 2 3 4 5 6 10]
[ 1 2 3 4 5 6 11]]
Speed test:
%timeit array_from_generator(permutations_of_values, empty_array)
10 loops, best of 3: 137 ms per loop
ADDITION:
As suggested by @COLDSPEED (thanks), here is a version that uses a list to gather the data from the generator. It's about twice as fast as the code above. Can anyone improve on this?
permutations_of_values = permutations(range(1,20), 7)
def array_from_generator2(generator, rows=batch_size):
    """Creates a numpy array from a specified number
    of values from the generator provided."""
    data = []
    for row in range(rows):
        try:
            data.append(next(generator))
        except StopIteration:
            break
    return np.array(data)
batch_size = 100000
batch_of_values = array_from_generator2(permutations_of_values, rows=100000)
print(batch_of_values[0:5])
Output:
[[ 1 2 3 4 5 6 7]
[ 1 2 3 4 5 6 8]
[ 1 2 3 4 5 6 9]
[ 1 2 3 4 5 6 10]
[ 1 2 3 4 5 6 11]]
Speed test:
%timeit array_from_generator2(permutations_of_values, rows=100000)
10 loops, best of 3: 85.6 ms per loop
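For what it's worth, the explicit try/except loop can be delegated to itertools.islice, which yields at most the requested number of items and simply stops if the generator runs out. This is a sketch, not from the original post, and array_from_generator3 is a hypothetical name:

```python
import numpy as np
from itertools import islice, permutations

def array_from_generator3(generator, rows):
    # islice stops early when the generator is exhausted,
    # replacing the manual next()/StopIteration handling
    return np.array(list(islice(generator, rows)))

batch = array_from_generator3(permutations(range(1, 20), 7), 100000)
print(batch.shape)  # (100000, 7)
```

The list materialization is the same as in array_from_generator2; islice just moves the per-item loop out of Python bytecode.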
1 Answer
You can calculate the size ahead of time, in essentially constant time. Just do that, and use numpy.fromiter:
In [1]: import math
   ...: from itertools import permutations, chain
   ...: import numpy as np
In [2]: def n_chose_k(n, k, fac=math.factorial):
   ...:     return fac(n)//fac(n-k)
   ...:
In [3]: def permutations_to_array(r, k):
   ...:     n = len(r)
   ...:     size = int(n_chose_k(n, k))   # number of permutations (rows)
   ...:     it = permutations(r, k)
   ...:     arr = np.fromiter(chain.from_iterable(it),
   ...:                       count=size*k, dtype=int)
   ...:     arr.shape = size, k
   ...:     return arr
   ...:
In [4]: arr = permutations_to_array(range(1,20), 7)
In [5]: arr.shape
Out[5]: (253955520, 7)
In [6]: arr[0:5]
Out[6]:
array([[ 1, 2, 3, 4, 5, 6, 7],
[ 1, 2, 3, 4, 5, 6, 8],
[ 1, 2, 3, 4, 5, 6, 9],
[ 1, 2, 3, 4, 5, 6, 10],
[ 1, 2, 3, 4, 5, 6, 11]])
This will work as long as r is limited to sequences that have a len().
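As a sanity check on the size computation (assuming Python 3.8+, where math.perm is available), the factorial quotient used above agrees with the standard library's k-permutation count:

```python
import math

n, k = 19, 7
# math.perm(n, k) is the number of k-permutations of n items,
# the same quantity n_chose_k computes via factorials
assert math.perm(n, k) == math.factorial(n) // math.factorial(n - k)
print(math.perm(n, k))  # 253955520
```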
Edited to add an implementation I cooked up for a generator of batchsize*k chunks, with a trim option!
import math
from itertools import permutations, repeat, chain
import numpy as np

def n_chose_k(n, k, fac=math.factorial):
    # number of k-permutations of n (despite the name)
    return fac(n)//fac(n-k)

def permutations_in_batches(r, k, batchsize=None, fill=0, dtype=int, trim=False):
    n = len(r)
    size = int(n_chose_k(n, k))  # total number of rows
    if batchsize is None or batchsize > size:
        batchsize = size
    perms = chain.from_iterable(permutations(r, k))
    count = batchsize*k          # elements per full batch
    remaining = size*k           # elements left in the flattened stream
    while remaining >= count:
        current = np.fromiter(perms, count=count, dtype=dtype)
        current.shape = batchsize, k
        yield current
        remaining -= count
    if remaining:                # a final, partial batch
        if trim:
            finalcount = remaining        # remaining is always divisible by k
            finalshape = remaining//k, k
            padding = ()
        else:
            padding = repeat(fill, count - remaining)
            finalcount = count
            finalshape = batchsize, k
        current = np.fromiter(chain(perms, padding), count=finalcount, dtype=dtype)
        current.shape = finalshape
        yield current
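A minimal, self-contained sketch of the same chunking idea, pulling fixed-size batches from the flattened permutations iterator with np.fromiter's count argument (small numbers chosen here so the total divides evenly and no padding is needed):

```python
import numpy as np
from itertools import permutations, chain

k, batchsize = 3, 4
# P(4, 3) = 24 rows, flattened into a stream of 72 scalars
flat = chain.from_iterable(permutations(range(1, 5), k))
n_batches = 24 // batchsize

batches = []
for _ in range(n_batches):
    # each fromiter call consumes exactly batchsize*k elements
    arr = np.fromiter(flat, count=batchsize * k, dtype=int).reshape(batchsize, k)
    batches.append(arr)

print(len(batches), batches[0].shape)  # 6 (4, 3)
```

Each iteration allocates only one batch, which is the point of the exercise: memory stays at batchsize*k elements rather than the full table.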