How can I efficiently expand an array in Python?

My question is how to efficiently expand an array by copying its rows many times. I am trying to expand my survey samples to the full-size dataset by copying every sample N times, where N is the influence factor assigned to that sample. I wrote two nested loops to do this (script pasted below). It works, but it is slow. My sample size is 20,000, and I am trying to expand it to a full size of 3 million. Is there a function I can try? Thank you for your help!

----My script----

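import numpy as np

# `person` is assumed to be an already-open file object holding the samples
lines = np.asarray(person.read().split('\n'))
df_array = np.asarray(lines[0].split(' '))
for j in range(1,len(lines)-1):
    subarray = np.asarray(lines[j].split(' '))
    # the last column holds the influence factor N
    factor = int(round(float(subarray[-1]),0))
    for i in range(1,factor):
        # slow: every vstack copies the whole array built so far
        df_array = np.vstack((df_array, subarray))
print(len(df_array))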

3 solutions

#1

First, you can try loading all the data at once with numpy.loadtxt.

Then, to repeat according to the last column, use numpy.repeat:

>>> data = np.array([[1, 2, 3],
...                  [4, 5, 6]])
>>> np.repeat(data, data[:,-1], axis=0)
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6]])

Finally, if you need to round data[:,-1], replace it with np.round(data[:,-1]).astype(int).
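
Putting the three steps together, a minimal end-to-end sketch might look like the following. It assumes every column is numeric and whitespace-delimited, with the weight in the last column; `sample_text` is made-up data standing in for the real file.

```python
import io
import numpy as np

# Hypothetical sample file: two value columns followed by the weight column.
sample_text = "1 2 2.6\n4 5 1.2\n"

# Load every row in one call; np.loadtxt accepts any file-like object.
data = np.loadtxt(io.StringIO(sample_text))

# Round the last column to whole repeat counts.
counts = np.round(data[:, -1]).astype(int)

# Repeat row i counts[i] times along axis 0.
expanded = np.repeat(data, counts, axis=0)
print(expanded.shape)
```

With a real file, the file name can be passed to np.loadtxt in place of the StringIO object.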

#2

Stacking numpy arrays over and over is not very efficient, because they're not optimized for dynamic growth like that. Every time you call vstack, it allocates a whole new block of memory sized for all the data accumulated so far and copies everything into it.

Use lists, then build your array once at the end, for example with a generator like this:

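import numpy as np

def upsample(stream):
    # yield each record as many times as its last column says
    for line in stream:
        rec = line.strip().split()
        factor = int(round(float(rec[-1]),0))
        for i in range(factor):
            yield rec

# `person` is the open file object from the question
df_array = np.array(list(upsample(person)))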
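
To try this out without the real file, here is a small self-contained sketch. The generator is repeated so the snippet runs on its own (written for Python 3), the blank-line check is an added assumption, and the in-memory `person` stream holds made-up records.

```python
import io
import numpy as np

def upsample(stream):
    # Same generator as above, repeated so this snippet is runnable by itself.
    for line in stream:
        rec = line.strip().split()
        if not rec:
            continue  # skip blank lines (an added assumption)
        factor = int(round(float(rec[-1])))
        for _ in range(factor):
            yield rec

# Hypothetical stand-in for the open survey file.
person = io.StringIO("a b 2\nc d 3\n")
df_array = np.array(list(upsample(person)))
print(df_array.shape)
```

Because the fields are never converted, the result is an array of strings; convert the columns you need afterwards.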

#3

The concept you are looking for is called broadcasting. It allows you to fill an n dimensional array with an n-1 dimensional array's contents.

Looking at your code example, you are calling np.vstack() in a loop. Broadcasting will eliminate the loop.

For example, if you have a 1D array of n elements,

>>> n = 5
>>> df_array = np.arange(n)
>>> df_array
array([0, 1, 2, 3, 4])

you can then create a new 10 x n array:

>>> bigger_array = np.empty([10,n])
>>> bigger_array[:] = df_array
>>> bigger_array
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.],
       [ 0.,  1.,  2.,  3.,  4.]])

So with a single line of code, you can fill it with the contents of the smaller array.

bigger_array[:] = df_array
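
The example above repeats one row a fixed number of times. As a rough sketch of how the same assignment could be applied when each sample has its own repeat count, one option is to allocate the full-size result once and fill it slice by slice. The `samples` array and its counts here are made up, and this still loops once per sample (not once per copy).

```python
import numpy as np

# Hypothetical samples: the last column is the repeat count.
samples = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 2.0]])
counts = samples[:, -1].astype(int)

# Allocate the full-size result once instead of growing it with vstack.
expanded = np.empty((counts.sum(), samples.shape[1]))

start = 0
for row, count in zip(samples, counts):
    # Broadcasting copies the 1-D row into each of the `count` rows in the slice.
    expanded[start:start + count] = row
    start += count

print(expanded.shape)
```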

NB: Avoid using Python lists for large numeric data. Element-by-element work on them is far slower than the equivalent operations on a NumPy ndarray.
