My question is how to efficiently expand an array by repeating its rows many times. I am trying to expand my survey samples to the full-size dataset by copying every sample N times, where N is the influence factor assigned to that sample. I wrote two loops to do this (script pasted below). It works, but it is slow. My sample size is 20,000, and I am trying to expand it to a full size of 3 million. Is there a function I can try? Thank you for your help!
----My script----
lines = np.asarray(person.read().split('\n'))
df_array = np.asarray(lines[0].split(' '))
for j in range(1, len(lines) - 1):
    subarray = np.asarray(lines[j].split(' '))
    factor = int(round(float(subarray[-1]), 0))
    for i in range(1, factor):
        df_array = np.vstack((df_array, subarray))
print(len(df_array))
3 solutions
#1
2
First, you can try to load the data all together with numpy.loadtxt.
Then, to repeat according to the last column, use numpy.repeat:
>>> data = np.array([[1, 2, 3],
...                  [4, 5, 6]])
>>> np.repeat(data, data[:,-1], axis=0)
array([[1, 2, 3],
       [1, 2, 3],
       [1, 2, 3],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6],
       [4, 5, 6]])
Finally, if you need to round data[:,-1], replace it with np.round(data[:,-1]).astype(int).
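Putting the three steps together, a minimal sketch might look like the following. The in-memory `sample_text` is a hypothetical stand-in for the survey file, assumed to be whitespace-separated with the factor in the last column; the variable names are illustrative only.

```python
import io
import numpy as np

# Hypothetical stand-in for the survey file: whitespace-separated columns,
# with the expansion factor in the last column.
sample_text = io.StringIO(
    "1 10 2.0\n"
    "2 20 0.6\n"
    "3 30 3.4\n"
)

# Load every row at once as a 2-D float array.
data = np.loadtxt(sample_text)

# Round the last column and convert it to integer repeat counts.
counts = np.round(data[:, -1]).astype(int)

# Repeat each row as many times as its own count says.
expanded = np.repeat(data, counts, axis=0)

print(expanded.shape)
```

Note that np.repeat needs integer counts, which is why the rounded column is cast with astype(int) before it is used.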
#2
1
Stacking numpy arrays over and over is not very efficient, because they're not really optimized for dynamic growth like that. Every time you vstack, it's allocating a whole new chunk of memory for the size of your data at that point.
Use lists then build your array right at the end, maybe something with a generator like this:
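import numpy as np

def upsample(stream):
    # Yield each record as many times as its (rounded) last column says.
    for line in stream:
        rec = line.strip().split()
        if not rec:
            continue  # skip blank lines, e.g. a trailing newline
        factor = int(round(float(rec[-1]), 0))
        for i in range(factor):
            yield rec

df_array = np.array(list(upsample(person)))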
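To check that collecting rows in a list gives the same result as repeated vstack calls, here is a small self-contained comparison. The `rows` data is made up for illustration; only the two construction patterns matter.

```python
import numpy as np

# Made-up input: 200 short rows.
rows = [np.array([i, i * 10]) for i in range(200)]

# Pattern from the question: each vstack allocates a new array
# and copies everything accumulated so far.
stacked = rows[0]
for row in rows[1:]:
    stacked = np.vstack((stacked, row))

# Alternative: append to a list, then convert to an array once at the end.
collected = []
for row in rows:
    collected.append(row)
built = np.array(collected)

print(np.array_equal(stacked, built))
```

Both arrays hold the same data; the difference is that the list version does one allocation at the end instead of one per row. Timing the two loops (for example with the timeit module) on your own data is the easiest way to see how much that matters at your sizes.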
#3
1
The concept you are looking for is called broadcasting. It allows you to fill an n-dimensional array with the contents of an (n-1)-dimensional array.
Looking at your code example, you are calling np.vstack() in a loop. Broadcasting will eliminate the loop.
For example, if you have a 1D array of n elements,
>>> n = 5
>>> df_array = np.arange(n)
>>> df_array
array([0, 1, 2, 3, 4])
you can then create a new 10 x n array:
>>> bigger_array = np.empty([10,n])
>>> bigger_array[:] = df_array
>>> bigger_array
array([[ 0., 1., 2., 3., 4.],
       [ 0., 1., 2., 3., 4.],
       [ 0., 1., 2., 3., 4.],
       [ 0., 1., 2., 3., 4.],
       [ 0., 1., 2., 3., 4.],
       [ 0., 1., 2., 3., 4.],
       [ 0., 1., 2., 3., 4.],
       [ 0., 1., 2., 3., 4.],
       [ 0., 1., 2., 3., 4.],
       [ 0., 1., 2., 3., 4.]])
So with a single line of code, you can fill it with the contents of the smaller array.
bigger_array[:] = df_array
NB. Avoid using python lists. They are far, far slower than the Numpy ndarray.
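Broadcasting as shown above copies the same row into every slot, so it covers one row repeated a fixed number of times. When each sample has its own factor, as in the question, one option is numpy.repeat with an array of counts. A minimal sketch, where `samples` and `factors` are hypothetical names chosen for illustration:

```python
import numpy as np

row = np.arange(5)

# Broadcasting: copy one row into every row of a preallocated array.
filled = np.empty((10, row.size), dtype=row.dtype)
filled[:] = row

# The same uniform repetition, written with np.repeat on a 2-D view of the row.
repeated = np.repeat(row[np.newaxis, :], 10, axis=0)

# Per-row counts: each sample is repeated by its own factor.
samples = np.array([[1, 2, 3],
                    [4, 5, 6]])
factors = np.array([2, 3])
expanded = np.repeat(samples, factors, axis=0)

print(np.array_equal(filled, repeated))
print(expanded.shape)
```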