I generate a list of one-dimensional numpy arrays in a loop and later convert this list to a 2-D numpy array. I would have preallocated a 2-D numpy array if I knew the number of items ahead of time, but I don't, so I put everything in a list.
A mock-up is below:
>>> import numpy as np
>>> list_of_arrays = list(map(lambda x: x * np.ones(2), range(5)))
>>> list_of_arrays
[array([ 0.,  0.]), array([ 1.,  1.]), array([ 2.,  2.]), array([ 3.,  3.]), array([ 4.,  4.])]
>>> arr = np.array(list_of_arrays)
>>> arr
array([[ 0.,  0.],
       [ 1.,  1.],
       [ 2.,  2.],
       [ 3.,  3.],
       [ 4.,  4.]])
My question is the following:
Is there a better way (performance-wise) to collect sequential numerical data (in my case numpy arrays) than putting it in a list and then making a numpy.array out of it (which creates a new object and copies the data)? Is there an "expandable" matrix data structure available in a well-tested module?
A typical size for my 2-D matrix would be between 100x10 and 5000x10 floats.
EDIT: In this example I'm using map, but in my actual application I have a for loop.
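The for-loop version of the mock-up above looks roughly like this (the loop body is illustrative; the real application produces each row from actual data):

```python
import numpy as np

rows = []  # a plain Python list; its final length need not be known up front
for x in range(5):  # stands in for the real data-producing loop
    rows.append(x * np.ones(2))  # each iteration yields one 1-D array

arr = np.array(rows)  # one final copy into a contiguous 2-D array
print(arr.shape)  # (5, 2)
```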
5 Answers
#1
16
Suppose you know that the final array arr will never be larger than 5000x10. Then you could pre-allocate an array of maximum size, populate it with data as you go through the loop, and then use arr.resize to cut it down to the discovered size after exiting the loop.
The tests below suggest doing so will be slightly faster than constructing intermediate python lists no matter what the ultimate size of the array is.
Also, arr.resize de-allocates the unused memory, so the final (though maybe not the intermediate) memory footprint is smaller than what is used by python_lists_to_array.
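A small sketch of that shrink-in-place behavior; passing refcheck=False simply avoids the reference-count check that ndarray.resize performs, which can otherwise raise in interactive sessions:

```python
import numpy as np

arr = np.empty((5000, 10))          # pre-allocate the maximum size
print(arr.nbytes)                   # 400000 bytes (5000 * 10 * 8)

arr.resize((100, 10), refcheck=False)  # shrink in place; the unused tail is freed
print(arr.nbytes)                   # 8000 bytes
```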
This shows numpy_all_the_way is faster:
% python -mtimeit -s"import test" "test.numpy_all_the_way(100)"
100 loops, best of 3: 1.78 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(1000)"
100 loops, best of 3: 18.1 msec per loop
% python -mtimeit -s"import test" "test.numpy_all_the_way(5000)"
10 loops, best of 3: 90.4 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(100)"
1000 loops, best of 3: 1.97 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(1000)"
10 loops, best of 3: 20.3 msec per loop
% python -mtimeit -s"import test" "test.python_lists_to_array(5000)"
10 loops, best of 3: 101 msec per loop
This shows numpy_all_the_way uses less memory:
% test.py
Initial memory usage: 19788
After python_lists_to_array: 20976
After numpy_all_the_way: 20348
test.py:
#!/usr/bin/env python
import numpy as np
import os

def memory_usage():
    # Read this process's virtual memory size from /proc (Linux-only).
    pid = os.getpid()
    return next(line for line in open('/proc/%s/status' % pid).read().splitlines()
                if line.startswith('VmSize')).split()[-2]

N, M = 5000, 10

def python_lists_to_array(k):
    # Build a Python list of 1-D arrays, then copy it into one 2-D array.
    list_of_arrays = list(map(lambda x: x * np.ones(M), range(k)))
    arr = np.array(list_of_arrays)
    return arr

def numpy_all_the_way(k):
    # Pre-allocate the maximum size, fill as we go, then shrink in place.
    arr = np.empty((N, M))
    for x in range(k):
        arr[x] = x * np.ones(M)
    arr.resize((k, M))
    return arr

if __name__ == '__main__':
    print('Initial memory usage: %s' % memory_usage())
    arr = python_lists_to_array(5000)
    print('After python_lists_to_array: %s' % memory_usage())
    arr = numpy_all_the_way(5000)
    print('After numpy_all_the_way: %s' % memory_usage())
#2
14
A convenient way is to use numpy.concatenate. I believe it's also faster than @unutbu's answer:
In [32]: import numpy as np
In [33]: list_of_arrays = list(map(lambda x: x * np.ones(2), range(5)))
In [34]: list_of_arrays
Out[34]:
[array([ 0., 0.]),
array([ 1., 1.]),
array([ 2., 2.]),
array([ 3., 3.]),
array([ 4., 4.])]
In [37]: shape = list(list_of_arrays[0].shape)
In [38]: shape
Out[38]: [2]
In [39]: shape[:0] = [len(list_of_arrays)]
In [40]: shape
Out[40]: [5, 2]
In [41]: arr = np.concatenate(list_of_arrays).reshape(shape)
In [42]: arr
Out[42]:
array([[ 0., 0.],
[ 1., 1.],
[ 2., 2.],
[ 3., 3.],
[ 4., 4.]])
#3
2
What you are doing is the standard way. A property of numpy arrays is that they need contiguous memory. The only possibility of "holes" that I can think of is via the strides member of PyArrayObject, but that doesn't affect the discussion here. Since numpy arrays have contiguous memory and are "preallocated", adding a new row/column means allocating new memory, copying data, and then freeing the old memory. If you do that a lot, it is not very efficient.
One case where someone might not want to create a list and then convert it to a numpy array in the end is when the list contains a lot of numbers: a numpy array of numbers takes much less space than a native Python list of numbers (since the native Python list stores Python objects). For your typical array sizes, I don't think that is an issue.
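A rough way to see that size difference (sys.getsizeof counts the list's pointer array, and each boxed float must be counted separately):

```python
import sys
import numpy as np

n = 1000
py_list = [float(i) for i in range(n)]   # n boxed float objects + n pointers
np_arr = np.arange(n, dtype=np.float64)  # n raw 8-byte doubles

list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)
arr_bytes = np_arr.nbytes  # 8000 bytes

print(list_bytes, arr_bytes)  # the list side is several times larger
```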
When you create your final array from a list of arrays, you are copying all the data to a new location for the new (2-D in your example) array. This is still much more efficient than having a numpy array and doing next = numpy.vstack((next, new_row)) every time you get new data. vstack() will copy all the data for every "row".
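To make that concrete, here is the per-row vstack pattern next to the list-then-array pattern; both produce the same result, but the first recopies the whole array on every iteration (quadratic total work), while the second copies once at the end:

```python
import numpy as np

def grow_with_vstack(k, m=10):
    out = np.empty((0, m))
    for x in range(k):
        out = np.vstack((out, x * np.ones(m)))  # full copy every time: O(k^2) total
    return out

def grow_with_list(k, m=10):
    rows = [x * np.ones(m) for x in range(k)]   # cheap list appends
    return np.array(rows)                        # single copy at the end

print(np.array_equal(grow_with_vstack(50), grow_with_list(50)))  # True
```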
There was a thread on numpy-discussion mailing list some time ago which discussed the possibility of adding a new numpy array type that allows efficient extending/appending. It seems there was significant interest in this at that time, although I don't know if something came out of it. You might want to look at that thread.
I would say that what you're doing is very Pythonic, and efficient, so unless you really need something else (more space efficiency, maybe?), you should be okay. That is how I create my numpy arrays when I don't know the number of elements in the array in the beginning.
#4
2
I'll add my own version of @unutbu's answer. It is similar to numpy_all_the_way, but it dynamically resizes if you hit an index error. I thought it would be a little faster for small data sets, but it's a little slower - the bounds checking slows things down too much.
initial_guess = 1000

def my_numpy_all_the_way(k):
    # M and make_test_data(k) are assumed from the benchmark setup:
    # M is the row width, make_test_data(k) yields k rows.
    arr = np.empty((initial_guess, M))
    for x, row in enumerate(make_test_data(k)):
        try:
            arr[x] = row
        except IndexError:
            arr.resize((arr.shape[0]*2, arr.shape[1]))  # double capacity on overflow
            arr[x] = row
    arr.resize((k, M))
    return arr
#5
2
Even simpler than @Gill Bates' answer, here is a one-line solution:
np.stack(list_of_arrays, axis=0)
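For instance, with the list from the question, np.stack adds the new leading axis itself, so there is no need to compute the target shape by hand:

```python
import numpy as np

list_of_arrays = [x * np.ones(2) for x in range(5)]
arr = np.stack(list_of_arrays, axis=0)  # stacks along a new first axis
print(arr.shape)  # (5, 2)
```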