NumPy: shuffle a multidimensional array by row only, keeping the column order unchanged

Date: 2021-12-16 21:22:43

How can I shuffle a multidimensional array by row only in Python, without shuffling the columns?

I am looking for the most efficient solution, because my matrix is very large. Is it also possible to do this efficiently in place on the original array (to save memory)?

Example:

import numpy as np
X = np.random.random((6, 2))
print(X)
Y = ...  # placeholder: shuffle X by row only, not by column
print(Y)

Suppose the original matrix is:

[[ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.45174186  0.8782033 ]
 [ 0.75623083  0.71763107]
 [ 0.26809253  0.75144034]
 [ 0.23442518  0.39031414]]

The output should have the rows shuffled but not the columns, e.g.:

[[ 0.45174186  0.8782033 ]
 [ 0.48252164  0.12013048]
 [ 0.77254355  0.74382174]
 [ 0.75623083  0.71763107]
 [ 0.23442518  0.39031414]
 [ 0.26809253  0.75144034]]

3 solutions

#1


16  

That's what numpy.random.shuffle() is for:

>>> X = np.random.random((6, 2))
>>> X
array([[ 0.9818058 ,  0.67513579],
       [ 0.82312674,  0.82768118],
       [ 0.29468324,  0.59305925],
       [ 0.25731731,  0.16676408],
       [ 0.27402974,  0.55215778],
       [ 0.44323485,  0.78779887]])

>>> np.random.shuffle(X)
>>> X
array([[ 0.9818058 ,  0.67513579],
       [ 0.44323485,  0.78779887],
       [ 0.82312674,  0.82768118],
       [ 0.29468324,  0.59305925],
       [ 0.25731731,  0.16676408],
       [ 0.27402974,  0.55215778]])
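
For newer code, the same in-place row shuffle can be sketched with the Generator API. This is not part of the original answer: it assumes NumPy 1.18 or later, where Generator.shuffle accepts an axis argument, and the seed and the small integer array are chosen only to make the check easy to read.

```python
import numpy as np

# Assumption: NumPy >= 1.18, where Generator.shuffle takes an `axis` argument.
rng = np.random.default_rng(seed=0)  # seed used only to make the run repeatable

X = np.arange(12).reshape(6, 2)      # rows are [0, 1], [2, 3], ..., [10, 11]
before = X.copy()

rng.shuffle(X, axis=0)               # reorders the rows in place and returns None

# Every original row is still present and its columns are untouched.
rows_before = sorted(map(tuple, before))
rows_after = sorted(map(tuple, X))
print(rows_before == rows_after)
```

Because the shuffle happens in place, no second copy of the full array is needed; only the row order changes.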

#2


11  

You can also use np.random.permutation to generate a random permutation of the row indices and then index into the rows of X using np.take with axis=0. np.take can also write its result back into the input array X through the out= option, which would save memory. The implementation would look like this:

np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)

Sample run:

In [23]: X
Out[23]: 
array([[ 0.60511059,  0.75001599],
       [ 0.30968339,  0.09162172],
       [ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.0957233 ,  0.96210485],
       [ 0.56843186,  0.36654023]])

In [24]: np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X);

In [25]: X
Out[25]: 
array([[ 0.14673218,  0.09089028],
       [ 0.31663128,  0.10000309],
       [ 0.30968339,  0.09162172],
       [ 0.56843186,  0.36654023],
       [ 0.0957233 ,  0.96210485],
       [ 0.60511059,  0.75001599]])
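
The sample run above can be wrapped in a small helper that also checks the result. The helper name shuffle_rows_take and the seeded Generator are assumptions made for illustration. One caveat worth keeping in mind: the NumPy documentation for np.take notes that out is buffered when mode='raise' (the default), so the memory saving may be smaller than the out= argument suggests.

```python
import numpy as np

def shuffle_rows_take(X, rng):
    """Permute the rows of X with np.take, writing the result back into X.

    Hypothetical helper for illustration; returns the permutation used.
    """
    perm = rng.permutation(X.shape[0])
    np.take(X, perm, axis=0, out=X)  # default mode='raise' buffers the output
    return perm

rng = np.random.default_rng(1)       # seed used only to make the run repeatable
X = np.arange(12, dtype=float).reshape(6, 2)
original = X.copy()

perm = shuffle_rows_take(X, rng)

# X now holds the rows of the original array in permuted order.
print(np.array_equal(X, original[perm]))
```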

Additional performance boost

Here's a trick to speed up np.random.permutation(X.shape[0]) with np.argsort():

np.random.rand(X.shape[0]).argsort()

Speedup results:

In [32]: X = np.random.random((6000, 2000))

In [33]: %timeit np.random.permutation(X.shape[0])
1000 loops, best of 3: 510 µs per loop

In [34]: %timeit np.random.rand(X.shape[0]).argsort()
1000 loops, best of 3: 297 µs per loop
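
As a sanity check, both expressions produce a valid permutation of the row indices. The sketch below uses a seeded Generator and a helper named is_permutation, both assumptions for illustration. Since sorting is O(n log n) while a direct permutation is O(n), the timings above may well differ on other NumPy versions and array sizes, so it is worth re-measuring before relying on the trick.

```python
import numpy as np

def is_permutation(p, n):
    """True if p contains each of 0..n-1 exactly once (hypothetical helper)."""
    return np.array_equal(np.sort(p), np.arange(n))

n = 1000
rng = np.random.default_rng(2)       # seed used only to make the run repeatable

perm_direct = rng.permutation(n)     # direct random permutation
perm_argsort = rng.random(n).argsort()  # ranks of n random floats

print(is_permutation(perm_direct, n), is_permutation(perm_argsort, n))
```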

Thus, the shuffling solution could be modified to:

np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)

Runtime tests:

These tests include the two approaches listed in this post and the np.random.shuffle-based one from @Kasramvd's solution.

In [40]: X = np.random.random((6000, 2000))

In [41]: %timeit np.random.shuffle(X)
10 loops, best of 3: 25.2 ms per loop

In [42]: %timeit np.take(X,np.random.permutation(X.shape[0]),axis=0,out=X)
10 loops, best of 3: 53.3 ms per loop

In [43]: %timeit np.take(X,np.random.rand(X.shape[0]).argsort(),axis=0,out=X)
10 loops, best of 3: 53.2 ms per loop

So it seems these np.take-based approaches are worth using only if memory is a concern; otherwise the np.random.shuffle-based solution looks like the way to go.

#3


2  

After some experimentation, I found that the most memory- and time-efficient way to shuffle the rows of an nd-array is to shuffle the row indices and then fetch the data using the shuffled indices:

rand_num2 = np.random.randint(5, size=(6000, 2000))
perm = np.arange(rand_num2.shape[0])
np.random.shuffle(perm)
rand_num2 = rand_num2[perm]
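
One property of this approach is easy to miss: indexing with an index array returns a new array rather than reordering the original in place. A minimal sketch, assuming a hypothetical helper shuffled_rows and a seeded Generator:

```python
import numpy as np

def shuffled_rows(X, rng):
    """Return a row-shuffled copy of X and the permutation used.

    Hypothetical helper for illustration; X itself is left unchanged.
    """
    perm = rng.permutation(X.shape[0])
    return X[perm], perm             # indexing with an array makes a copy

rng = np.random.default_rng(3)       # seed used only to make the run repeatable
X = np.arange(12).reshape(6, 2)
Y, perm = shuffled_rows(X, rng)

print(np.shares_memory(X, Y))        # the result does not alias the input
```

Rebinding the name, as in rand_num2 = rand_num2[perm], lets the old array be freed afterwards, but both arrays exist for a moment while the copy is built.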

In more detail: here I use memory_profiler to measure memory usage and Python's built-in time module to record the time, comparing all the previous answers.

import time

import numpy as np


def main():
    # shuffle data itself
    rand_num = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.random.shuffle(rand_num)
    print('Time for direct shuffle: {0}'.format((time.time() - start)))

    # Shuffle index and get data from shuffled index
    rand_num2 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    perm = np.arange(rand_num2.shape[0])
    np.random.shuffle(perm)
    rand_num2 = rand_num2[perm]
    print('Time for shuffling index: {0}'.format((time.time() - start)))

    # using np.take()
    rand_num3 = np.random.randint(5, size=(6000, 2000))
    start = time.time()
    np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    print("Time taken by np.take, {0}".format((time.time() - start)))

Timing results:

Time for direct shuffle: 0.03345608711242676   # 33.4msec
Time for shuffling index: 0.019818782806396484 # 19.8msec
Time taken by np.take, 0.06726956367492676     # 67.2msec
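
Single time.time() measurements like these can be noisy, so the ranking is worth re-checking with repeated runs. The sketch below is not the original benchmark: it uses timeit.repeat, a smaller array so it finishes quickly, and hypothetical function names (bench, direct, by_index, by_take). No particular ordering of the results is assumed.

```python
import timeit

import numpy as np

def bench(fn, repeat=5):
    """Best wall-clock time in seconds over `repeat` single calls of fn."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))

# Smaller than the (6000, 2000) array in the answer, only to keep the run short.
X = np.random.randint(5, size=(600, 200))

def direct():
    np.random.shuffle(X)                      # in-place row shuffle

def by_index():
    perm = np.random.permutation(X.shape[0])
    return X[perm]                            # builds a shuffled copy

def by_take():
    np.take(X, np.random.permutation(X.shape[0]), axis=0, out=X)

results = {f.__name__: bench(f) for f in (direct, by_index, by_take)}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e3:.3f} ms")
```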

Memory profiler results:

Line #    Mem usage    Increment   Line Contents
================================================
    39  117.422 MiB    0.000 MiB   @profile
    40                             def main():
    41                                 # shuffle data itself
    42  208.977 MiB   91.555 MiB       rand_num = np.random.randint(5, size=(6000, 2000))
    43  208.977 MiB    0.000 MiB       start = time.time()
    44  208.977 MiB    0.000 MiB       np.random.shuffle(rand_num)
    45  208.977 MiB    0.000 MiB       print('Time for direct shuffle: {0}'.format((time.time() - start)))
    46                             
    47                                 # Shuffle index and get data from shuffled index
    48  300.531 MiB   91.555 MiB       rand_num2 = np.random.randint(5, size=(6000, 2000))
    49  300.531 MiB    0.000 MiB       start = time.time()
    50  300.535 MiB    0.004 MiB       perm = np.arange(rand_num2.shape[0])
    51  300.539 MiB    0.004 MiB       np.random.shuffle(perm)
    52  300.539 MiB    0.000 MiB       rand_num2 = rand_num2[perm]
    53  300.539 MiB    0.000 MiB       print('Time for shuffling index: {0}'.format((time.time() - start)))
    54                             
    55                                 # using np.take()
    56  392.094 MiB   91.555 MiB       rand_num3 = np.random.randint(5, size=(6000, 2000))
    57  392.094 MiB    0.000 MiB       start = time.time()
    58  392.242 MiB    0.148 MiB       np.take(rand_num3, np.random.rand(rand_num3.shape[0]).argsort(), axis=0, out=rand_num3)
    59  392.242 MiB    0.000 MiB       print("Time taken by np.take, {0}".format((time.time() - start)))
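
A line-by-line profile like the one above may not capture memory that is allocated and released within a single statement, such as the temporary second array created by rand_num2[perm]. A rough way to look at peak usage is tracemalloc, which NumPy reports its array allocations to. This is a sketch under that assumption, with a hypothetical helper peak_bytes and a smaller array than in the answer.

```python
import tracemalloc

import numpy as np

def peak_bytes(fn):
    """Peak traced memory in bytes while fn runs (hypothetical helper)."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

X = np.zeros((2000, 500))            # about 8 MB of float64, allocated untraced

# In-place shuffle: only small temporary buffers should be needed.
peak_inplace = peak_bytes(lambda: np.random.shuffle(X))

# Index-based shuffle: a full shuffled copy exists alongside X for a moment.
peak_copy = peak_bytes(lambda: X[np.random.permutation(X.shape[0])])

print(f"in-place peak: {peak_inplace} bytes")
print(f"copy peak:     {peak_copy} bytes (array is {X.nbytes} bytes)")
```

If the copy peak is close to the array size, the index-based approach needs roughly twice the array's memory at its peak, even though the profile shows no increment once the old array has been released.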
