改组数组中每行的非零元素--Python / NumPy

时间:2021-01-20 21:21:53

I have a an array that is relatively sparse, and I would like to go through each row and shuffle only the non-zero elements.

我有一个相对稀疏的数组,我想通过每一行,只调整非零元素。

Example Input:

示例输入:

[2,3,1,0]
[0,0,2,1]

Example Output:

示例输出:

[2,1,3,0]
[0,0,1,2]

Note how the zeros have not changed position.

注意零如何改变位置。

To shuffle all elements in each row (including zeros) I can do this:

要重新排列每行中的所有元素(包括零),我可以这样做:

for i in range(len(X)):
    np.random.shuffle(X[i, :])

What I tried to do then is this:

我试图做的是这样的:

for i in range(len(X)):
    np.random.shuffle(X[i, np.nonzero(X[i, :])])

But it has no effect. I've noticed that the return type of X[i, np.nonzero(X[i, :])] is different from X[i, :] which might be the cause.

但它没有效果。我注意到X [i,np.nonzero(X [i,:])]的返回类型与X [i,:]不同,这可能是原因。

In[30]: X[i, np.nonzero(X[i, :])]
Out[30]: array([[23,  5, 29, 11, 17]])

In[31]: X[i, :]
Out[31]: array([23,  5, 29, 11, 17])

7 个解决方案

#1


15  

You could use the non-inplace numpy.random.permutation with explicit non-zero indexing:

您可以使用带有显式非零索引的非inplace numpy.random.permutation:

>>> X = np.array([[2,3,1,0], [0,0,2,1]])
>>> for i in range(len(X)):
...     idx = np.nonzero(X[i])
...     X[i][idx] = np.random.permutation(X[i][idx])
... 
>>> X
array([[3, 2, 1, 0],
       [0, 0, 2, 1]])

#2


8  

I think I found the three-liner?

我想我找到了三班轮?

i, j = np.nonzero(a.astype(bool))
k = np.argsort(i + np.random.rand(i.size))
a[i,j] = a[i,j[k]]

#3


8  

As promised, this being the fourth day of the bounty period, here's my attempt at a vectorized solution. The steps involved are explained in some details below :

正如所承诺的那样,这是赏金期的第四天,这是我对矢量化解决方案的尝试。所涉及的步骤将在下面详细说明:

  • For easy reference, let's call the input array as a. Generate unique indices per row that covers the range for row length. For this, we can simply generate random numbers of the same shape as the input array and get the argsort indices along each row, which would be those unique indices. This idea has been explored before in this post.

    为了便于参考,让我们将输入数组称为a。每行生成唯一的索引,涵盖行长度的范围。为此,我们可以简单地生成与输入数组相同形状的随机数,并获得每行的argsort索引,这将是那些唯一索引。这篇文章之前已经探讨了这个想法。

  • Index into each row of input array with those indices as columns indices. Thus, we would need advanced-indexing here. Now, this gives us an array with each row being shuffled. Let's call it b.

    将这些索引作为列索引索引到输入数组的每一行。因此,我们需要在这里进行高级索引。现在,这给了我们一个数组,每行都被洗牌。我们称之为b。

  • Since the shuffling is restricted to per row, if we simply use the boolean-indexing : b[b!=0], we would get the non-zero elements being shuffled and also being restricted to lengths of non-zeros per row. This is because of the fact that the elements in a NumPy array are stored in row-major order, so with boolean-indexing it would have selected shuffled non-zero elements on each row first before moving onto the next row. Again, if we use boolean-indexing similarly for a, i.e. a[a!=0], we would have similarly gotten the non-zero elements on each row first before moving onto the next row and these would be in their original order. So, the final step would be to just grab masked elements b[b!=0] and assign into the masked places a[a!=0].

    由于混洗仅限于每行,如果我们只使用布尔索引:b [b!= 0],我们将得到非零元素的混洗,并且还限制为每行的非零长度。这是因为NumPy数组中的元素以行主顺序存储,因此使用布尔索引时,它会在移动到下一行之前先在每行上选择混洗的非零元素。同样,如果我们对a使用布尔索引类似,即a [a!= 0],我们在移动到下一行之前会先在每行上获得非零元素,这些元素将按原始顺序排列。因此,最后一步是只抓住屏蔽元素b [b!= 0]并将掩码分配给[a!= 0]。

Thus, an implementation covering the above mentioned three steps would be -

因此,涵盖上述三个步骤的实施将是 -

m,n = a.shape
rand_idx = np.random.rand(m,n).argsort(axis=1) #step1
b = a[np.arange(m)[:,None], rand_idx]          #step2  
a[a!=0] = b[b!=0]                              #step3 

A sample step-by-step run might make things clearer -

逐步运行的示例可能会使事情变得更加清晰 -

In [50]: a # Input array
Out[50]: 
array([[ 8,  5,  0, -4],
       [ 0,  6,  0,  3],
       [ 8,  5,  0, -4]])

In [51]: m,n = a.shape # Store shape information

# Unique indices per row that covers the range for row length
In [52]: rand_idx = np.random.rand(m,n).argsort(axis=1)

In [53]: rand_idx
Out[53]: 
array([[0, 2, 3, 1],
       [1, 0, 3, 2],
       [2, 3, 0, 1]])

# Get corresponding indexed array
In [54]: b = a[np.arange(m)[:,None], rand_idx]

# Do a check on the shuffling being restricted to per row
In [55]: a[a!=0]
Out[55]: array([ 8,  5, -4,  6,  3,  8,  5, -4])

In [56]: b[b!=0]
Out[56]: array([ 8, -4,  5,  6,  3, -4,  8,  5])

# Finally do the assignment based on masking on a and b
In [57]: a[a!=0] = b[b!=0]

In [58]: a # Final verification on desired result
Out[58]: 
array([[ 8, -4,  0,  5],
       [ 0,  6,  0,  3],
       [-4,  8,  0,  5]])

#4


7  

Benchmarking for the vectorized solutions

We are looking to benchmark vectorized solutions in this post. Now, the vectorization tries to avoid the looping that we would loop through each row and do the shuffling. So, the setup for the input array involves a greater number of rows.

我们期待在这篇文章中对矢量化解决方案进行基准测试现在,向量化试图避免循环,我们将遍历每一行并进行重排。因此,输入数组的设置涉及更多行。

Approaches -

方法 -

def app1(a): # @Daniel F's soln
    i, j = np.nonzero(a.astype(bool))
    k = np.argsort(i + np.random.rand(i.size))
    a[i,j] = a[i,j[k]]
    return a

def app2(x): # @kazemakase's soln
    r, c = np.where(x != 0)
    n = c.size
    perm = np.random.permutation(n)
    i = np.argsort(perm + r * n)
    x[r, c] = x[r, c[i]]
    return x

def app3(a): # @Divakar's soln
    m,n = a.shape
    rand_idx = np.random.rand(m,n).argsort(axis=1)
    b = a[np.arange(m)[:,None], rand_idx]
    a[a!=0] = b[b!=0]
    return a

from scipy.ndimage.measurements import labeled_comprehension
def app4(a): # @FabienP's soln
    def func(array, idx):
        r[idx] = np.random.permutation(array)
        return True
    labels, idx = nz = a.nonzero()
    r = a[nz]
    labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1),\
                                func, int, 0, pass_positions=True)
    a[nz] = r
    return a

Benchmarking procedure #1

基准程序#1

For a fair benchmarking, it seemed reasonable to use OP's sample and simply stack those as more rows to get a bigger dataset. Thus, with that setup we could create two cases with 2 million and 20 million rows datasets.

对于公平的基准测试,使用OP的示例似乎是合理的,只需将它们堆叠为更多行以获得更大的数据集。因此,通过该设置,我们可以创建两个包含200万和2000万行数据集的案例。

Case #1 : Large dataset with 2*1000,000 rows

案例#1:具有2 * 1000,000行的大型数据集

In [174]: a = np.array([[2,3,1,0],[0,0,2,1]])

In [175]: a = np.vstack([a]*1000000)

In [176]: %timeit app1(a)
     ...: %timeit app2(a)
     ...: %timeit app3(a)
     ...: %timeit app4(a)
     ...: 
1 loop, best of 3: 264 ms per loop
1 loop, best of 3: 422 ms per loop
1 loop, best of 3: 254 ms per loop
1 loop, best of 3: 14.3 s per loop

Case #2 : Larger dataset with 2*10,000,000 rows

案例#2:具有2 * 10,000,000行的较大数据集

In [177]: a = np.array([[2,3,1,0],[0,0,2,1]])

In [178]: a = np.vstack([a]*10000000)

# app4 skipped here as it was slower on the previous smaller dataset
In [179]: %timeit app1(a)
     ...: %timeit app2(a)
     ...: %timeit app3(a)
     ...: 
1 loop, best of 3: 2.86 s per loop
1 loop, best of 3: 4.62 s per loop
1 loop, best of 3: 2.55 s per loop

Benchmarking procedure #2 : Extensive one

基准程序#2:广泛的一个

To cover all cases of varying percentage of non-zeros in the input array, we are covering some extensive benchmarking scenarios. Also, since app4 seemed much slower than others, for a closer inspection we are skipping that one in this section.

为了涵盖输入数组中不同百分比的非零的所有情况,我们将介绍一些广泛的基准测试场景。此外,由于app4似乎比其他人慢得多,为了进一步检查,我们在本节中跳过了那个。

Helper function to setup input array :

辅助函数来设置输入数组:

def in_data(n_col, nnz_ratio):
    # max no. of elems that my system can handle, i.e. stretching it to limits.
    # The idea is to use this to decide the number of rows and always use
    # max. possible dataset size
    num_elem = 10000000

    n_row = num_elem//n_col
    a = np.zeros((n_row, n_col),dtype=int)
    L = int(round(a.size*nnz_ratio))
    a.ravel()[np.random.choice(a.size, L, replace=0)] = np.random.randint(1,6,L)
    return a

Main timing and plotting script (Uses IPython magic functions. So, needs to be run opon copying and pasting onto IPython console) -

主要时间和绘图脚本(使用IPython魔术功能。所以,需要运行opon复制并粘贴到IPython控制台上) -

import matplotlib.pyplot as plt

# Setup input params
nnz_ratios = np.array([0.2, 0.4, 0.6, 0.8])
n_cols = np.array([4, 5, 8, 10, 15, 20, 25, 50])

init_arr1 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr2 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr3 = np.zeros((len(nnz_ratios), len(n_cols) ))

timings = {app1:init_arr1, app2:init_arr2, app3:init_arr3}
for i,nnz_ratio in enumerate(nnz_ratios):
    for j,n_col in enumerate(n_cols):
        a = in_data(n_col, nnz_ratio=nnz_ratio)
        for func in timings:
            res = %timeit -oq func(a)
            timings[func][i,j] = res.best
            print func.__name__, i, j, res.best

fig = plt.figure(1)
colors = ['b','k','r']
for i in range(len(nnz_ratios)):
    ax = plt.subplot(2,2,i+1)
    for f,func in enumerate(timings):
        ax.plot(n_cols, 
                [time for time in timings[func][i]], 
                label=str(func.__name__), color=colors[f])
    ax.set_xlabel('No. of cols')
    ax.set_ylabel('time [seconds]')
    ax.grid(which='both')
    ax.legend()
    plt.tight_layout()
    plt.title('Percentage non-zeros : '+str(int(100*nnz_ratios[i])) + '%')
plt.subplots_adjust(wspace=0.2, hspace=0.2)

Timings output -

时间输出 -

改组数组中每行的非零元素--Python / NumPy

Observations :

观察:

  • Approaches #1, #2 does argsort on the non-zero elements across the entire input array. As such, it performs better with lesser percentage of non-zeros.

    方法#1,#2对整个输入数组中的非零元素进行argsort。因此,它使用较少的非零值来表现更好。

  • Approach #3 creates random numbers of the same shape as the input array and then gets argsort indices per row. Thus, for a given number of non-zeros in the input, the timings for it are more steep-ish than first two approaches.

    方法#3创建与输入数组相同形状的随机数,然后获得每行的argsort索引。因此,对于输入中给定数量的非零,其时序比前两种方法更陡峭。

Conclusion :

结论:

Approach #1 seems to be doing pretty well until 60% non-zero mark. For more non-zeros and if the row-lengths are small, approach #3 seems to be performing decently.

方法#1似乎做得很好,直到60%非零标记。对于更多的非零值,如果行长度很小,方法#3似乎表现得不错。

#5


4  

I came up with that:

我想到了:

nz = a.nonzero()                      # Get nonzero indexes
a[nz] = np.random.permutation(a[nz])  # Shuffle nonzero values with mask

Which look simpler (and a little bit faster?) than other proposed solutions.

哪个看起来比其他提议的解决方案更简单(并且更快一点?)。


EDIT: new version that does not mix rows

编辑:不混合行的新版本

 labels, *idx = nz = a.nonzero()                                    # get masks
 a[nz] = np.concatenate([np.random.permutation(a[nz][labels == i])  # permute values
                         for i in np.unique(labels)])               # for each label

Where the first array of a.nonzero() (indexes of non zero values in axis0) is used as labels. This is the trick that does not mix rows.

其中a.nonzero()的第一个数组(axis0中非零值的索引)用作标签。这是不混合行的技巧。

Then np.random.permutation is applied on a[a.nonzero()] for each "label" independently.

然后np.random.permutation独立地应用于每个“标签”的[a.nonzero()]。

Supposedly scipy.ndimage.measurements.labeled_comprehension can be used here, by it seems to fail with np.random.permutation.

据说scipy.ndimage.measurements.labeled_comprehension可以在这里使用,因为np.random.permutation似乎失败了。

And I finally saw that it looks a lot like what @randomir proposed. Anyway, it was just for the challenge of getting it to work.

我终于看到它看起来很像@randomir提出的那样。无论如何,它只是为了让它发挥作用的挑战。


EDIT2:

EDIT2:

Finally got it working with scipy.ndimage.measurements.labeled_comprehension

最后让它使用scipy.ndimage.measurements.labeled_comprehension

def shuffle_rows(a):
    def func(array, idx):
        r[idx] = np.random.permutation(array)
        return True
    labels, *idx = nz = a.nonzero()
    r = a[nz]
    labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1), func, int, 0, pass_positions=True)
    a[nz] = r
    return a

Where:

哪里:

  1. func() shuffles the non zero values
  2. func()改组非零值
  3. labeled_comprehension applies func() label-wise
  4. labeled_comprehension以标签方式应用func()

This replaces the previous for loop and will be faster on arrays with many rows.

这将取代之前的for循环,并且在具有许多行的数组上会更快。

#6


3  

This is one possibility for a vectorized solution:

这是矢量化解决方案的一种可能性:

r, c = np.where(x > 0)
n = c.size

perm = np.random.permutation(n)
i = np.argsort(perm + r * n)

x[r, c] = x[r, c[i]]

The challenge in vectorizing this problem is that np.random.permutation gives only flat indices, which would shuffle the array elements across rows. Sorting the permuted values with an offset added makes sure no shuffling across rows occurs.

向量化这个问题的挑战是np.random.permutation只给出平坦的索引,这会使行间的数组元素混洗。使用添加的偏移量对置换值进行排序可确保不会跨行进行混洗。

#7


3  

Here's your two liner without needing to install numpy.

这是你的两个衬垫,无需安装numpy。

from random import random

def shuffle_nonzeros(input_list):
    ''' returns a list with the non-zero values shuffled '''
    shuffled_nonzero = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
    print([i for i in (i if i==0 else next(shuffled_nonzero) for i in input_list)])

if you dont like one liners though, you can either make this a generator with

如果你不喜欢一个衬垫,你可以把它变成一个发电机

def shuffle_nonzeros(input_list):
    ''' generator that yields a list with the non-zero values shuffled '''
    random_nonzero_values = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
    for i in iterable:
        if i==0:
            yield i
        else:
            yield next(random_nonzero_values)

or if you want a list as the output, and dont like one line comprehensions

或者如果你想要一个列表作为输出,并且不喜欢一行理解

def shuffle_nonzeros(input_list):
    ''' returns a list with the non-zero values shuffled '''
    out = []
    random_nonzero_values = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
    for i in iterable:
        if i==0:
            out.append(i)
        else:
            out.append(next(random_nonzero_values))
    return out

#1


15  

You could use the non-inplace numpy.random.permutation with explicit non-zero indexing:

您可以使用带有显式非零索引的非inplace numpy.random.permutation:

>>> X = np.array([[2,3,1,0], [0,0,2,1]])
>>> for i in range(len(X)):
...     idx = np.nonzero(X[i])
...     X[i][idx] = np.random.permutation(X[i][idx])
... 
>>> X
array([[3, 2, 1, 0],
       [0, 0, 2, 1]])

#2


8  

I think I found the three-liner?

我想我找到了三班轮?

i, j = np.nonzero(a.astype(bool))
k = np.argsort(i + np.random.rand(i.size))
a[i,j] = a[i,j[k]]

#3


8  

As promised, this being the fourth day of the bounty period, here's my attempt at a vectorized solution. The steps involved are explained in some details below :

正如所承诺的那样,这是赏金期的第四天,这是我对矢量化解决方案的尝试。所涉及的步骤将在下面详细说明:

  • For easy reference, let's call the input array as a. Generate unique indices per row that covers the range for row length. For this, we can simply generate random numbers of the same shape as the input array and get the argsort indices along each row, which would be those unique indices. This idea has been explored before in this post.

    为了便于参考,让我们将输入数组称为a。每行生成唯一的索引,涵盖行长度的范围。为此,我们可以简单地生成与输入数组相同形状的随机数,并获得每行的argsort索引,这将是那些唯一索引。这篇文章之前已经探讨了这个想法。

  • Index into each row of input array with those indices as columns indices. Thus, we would need advanced-indexing here. Now, this gives us an array with each row being shuffled. Let's call it b.

    将这些索引作为列索引索引到输入数组的每一行。因此,我们需要在这里进行高级索引。现在,这给了我们一个数组,每行都被洗牌。我们称之为b。

  • Since the shuffling is restricted to per row, if we simply use the boolean-indexing : b[b!=0], we would get the non-zero elements being shuffled and also being restricted to lengths of non-zeros per row. This is because of the fact that the elements in a NumPy array are stored in row-major order, so with boolean-indexing it would have selected shuffled non-zero elements on each row first before moving onto the next row. Again, if we use boolean-indexing similarly for a, i.e. a[a!=0], we would have similarly gotten the non-zero elements on each row first before moving onto the next row and these would be in their original order. So, the final step would be to just grab masked elements b[b!=0] and assign into the masked places a[a!=0].

    由于混洗仅限于每行,如果我们只使用布尔索引:b [b!= 0],我们将得到非零元素的混洗,并且还限制为每行的非零长度。这是因为NumPy数组中的元素以行主顺序存储,因此使用布尔索引时,它会在移动到下一行之前先在每行上选择混洗的非零元素。同样,如果我们对a使用布尔索引类似,即a [a!= 0],我们在移动到下一行之前会先在每行上获得非零元素,这些元素将按原始顺序排列。因此,最后一步是只抓住屏蔽元素b [b!= 0]并将掩码分配给[a!= 0]。

Thus, an implementation covering the above mentioned three steps would be -

因此,涵盖上述三个步骤的实施将是 -

m,n = a.shape
rand_idx = np.random.rand(m,n).argsort(axis=1) #step1
b = a[np.arange(m)[:,None], rand_idx]          #step2  
a[a!=0] = b[b!=0]                              #step3 

A sample step-by-step run might make things clearer -

逐步运行的示例可能会使事情变得更加清晰 -

In [50]: a # Input array
Out[50]: 
array([[ 8,  5,  0, -4],
       [ 0,  6,  0,  3],
       [ 8,  5,  0, -4]])

In [51]: m,n = a.shape # Store shape information

# Unique indices per row that covers the range for row length
In [52]: rand_idx = np.random.rand(m,n).argsort(axis=1)

In [53]: rand_idx
Out[53]: 
array([[0, 2, 3, 1],
       [1, 0, 3, 2],
       [2, 3, 0, 1]])

# Get corresponding indexed array
In [54]: b = a[np.arange(m)[:,None], rand_idx]

# Do a check on the shuffling being restricted to per row
In [55]: a[a!=0]
Out[55]: array([ 8,  5, -4,  6,  3,  8,  5, -4])

In [56]: b[b!=0]
Out[56]: array([ 8, -4,  5,  6,  3, -4,  8,  5])

# Finally do the assignment based on masking on a and b
In [57]: a[a!=0] = b[b!=0]

In [58]: a # Final verification on desired result
Out[58]: 
array([[ 8, -4,  0,  5],
       [ 0,  6,  0,  3],
       [-4,  8,  0,  5]])

#4


7  

Benchmarking for the vectorized solutions

We are looking to benchmark vectorized solutions in this post. Now, the vectorization tries to avoid the looping that we would loop through each row and do the shuffling. So, the setup for the input array involves a greater number of rows.

我们期待在这篇文章中对矢量化解决方案进行基准测试现在,向量化试图避免循环,我们将遍历每一行并进行重排。因此,输入数组的设置涉及更多行。

Approaches -

方法 -

def app1(a): # @Daniel F's soln
    i, j = np.nonzero(a.astype(bool))
    k = np.argsort(i + np.random.rand(i.size))
    a[i,j] = a[i,j[k]]
    return a

def app2(x): # @kazemakase's soln
    r, c = np.where(x != 0)
    n = c.size
    perm = np.random.permutation(n)
    i = np.argsort(perm + r * n)
    x[r, c] = x[r, c[i]]
    return x

def app3(a): # @Divakar's soln
    m,n = a.shape
    rand_idx = np.random.rand(m,n).argsort(axis=1)
    b = a[np.arange(m)[:,None], rand_idx]
    a[a!=0] = b[b!=0]
    return a

from scipy.ndimage.measurements import labeled_comprehension
def app4(a): # @FabienP's soln
    def func(array, idx):
        r[idx] = np.random.permutation(array)
        return True
    labels, idx = nz = a.nonzero()
    r = a[nz]
    labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1),\
                                func, int, 0, pass_positions=True)
    a[nz] = r
    return a

Benchmarking procedure #1

基准程序#1

For a fair benchmarking, it seemed reasonable to use OP's sample and simply stack those as more rows to get a bigger dataset. Thus, with that setup we could create two cases with 2 million and 20 million rows datasets.

对于公平的基准测试,使用OP的示例似乎是合理的,只需将它们堆叠为更多行以获得更大的数据集。因此,通过该设置,我们可以创建两个包含200万和2000万行数据集的案例。

Case #1 : Large dataset with 2*1000,000 rows

案例#1:具有2 * 1000,000行的大型数据集

In [174]: a = np.array([[2,3,1,0],[0,0,2,1]])

In [175]: a = np.vstack([a]*1000000)

In [176]: %timeit app1(a)
     ...: %timeit app2(a)
     ...: %timeit app3(a)
     ...: %timeit app4(a)
     ...: 
1 loop, best of 3: 264 ms per loop
1 loop, best of 3: 422 ms per loop
1 loop, best of 3: 254 ms per loop
1 loop, best of 3: 14.3 s per loop

Case #2 : Larger dataset with 2*10,000,000 rows

案例#2:具有2 * 10,000,000行的较大数据集

In [177]: a = np.array([[2,3,1,0],[0,0,2,1]])

In [178]: a = np.vstack([a]*10000000)

# app4 skipped here as it was slower on the previous smaller dataset
In [179]: %timeit app1(a)
     ...: %timeit app2(a)
     ...: %timeit app3(a)
     ...: 
1 loop, best of 3: 2.86 s per loop
1 loop, best of 3: 4.62 s per loop
1 loop, best of 3: 2.55 s per loop

Benchmarking procedure #2 : Extensive one

基准程序#2:广泛的一个

To cover all cases of varying percentage of non-zeros in the input array, we are covering some extensive benchmarking scenarios. Also, since app4 seemed much slower than others, for a closer inspection we are skipping that one in this section.

为了涵盖输入数组中不同百分比的非零的所有情况,我们将介绍一些广泛的基准测试场景。此外,由于app4似乎比其他人慢得多,为了进一步检查,我们在本节中跳过了那个。

Helper function to setup input array :

辅助函数来设置输入数组:

def in_data(n_col, nnz_ratio):
    # max no. of elems that my system can handle, i.e. stretching it to limits.
    # The idea is to use this to decide the number of rows and always use
    # max. possible dataset size
    num_elem = 10000000

    n_row = num_elem//n_col
    a = np.zeros((n_row, n_col),dtype=int)
    L = int(round(a.size*nnz_ratio))
    a.ravel()[np.random.choice(a.size, L, replace=0)] = np.random.randint(1,6,L)
    return a

Main timing and plotting script (Uses IPython magic functions. So, needs to be run opon copying and pasting onto IPython console) -

主要时间和绘图脚本(使用IPython魔术功能。所以,需要运行opon复制并粘贴到IPython控制台上) -

import matplotlib.pyplot as plt

# Setup input params
nnz_ratios = np.array([0.2, 0.4, 0.6, 0.8])
n_cols = np.array([4, 5, 8, 10, 15, 20, 25, 50])

init_arr1 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr2 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr3 = np.zeros((len(nnz_ratios), len(n_cols) ))

timings = {app1:init_arr1, app2:init_arr2, app3:init_arr3}
for i,nnz_ratio in enumerate(nnz_ratios):
    for j,n_col in enumerate(n_cols):
        a = in_data(n_col, nnz_ratio=nnz_ratio)
        for func in timings:
            res = %timeit -oq func(a)
            timings[func][i,j] = res.best
            print func.__name__, i, j, res.best

fig = plt.figure(1)
colors = ['b','k','r']
for i in range(len(nnz_ratios)):
    ax = plt.subplot(2,2,i+1)
    for f,func in enumerate(timings):
        ax.plot(n_cols, 
                [time for time in timings[func][i]], 
                label=str(func.__name__), color=colors[f])
    ax.set_xlabel('No. of cols')
    ax.set_ylabel('time [seconds]')
    ax.grid(which='both')
    ax.legend()
    plt.tight_layout()
    plt.title('Percentage non-zeros : '+str(int(100*nnz_ratios[i])) + '%')
plt.subplots_adjust(wspace=0.2, hspace=0.2)

Timings output -

时间输出 -

改组数组中每行的非零元素--Python / NumPy

Observations :

观察:

  • Approaches #1, #2 does argsort on the non-zero elements across the entire input array. As such, it performs better with lesser percentage of non-zeros.

    方法#1,#2对整个输入数组中的非零元素进行argsort。因此,它使用较少的非零值来表现更好。

  • Approach #3 creates random numbers of the same shape as the input array and then gets argsort indices per row. Thus, for a given number of non-zeros in the input, the timings for it are more steep-ish than first two approaches.

    方法#3创建与输入数组相同形状的随机数,然后获得每行的argsort索引。因此,对于输入中给定数量的非零,其时序比前两种方法更陡峭。

Conclusion :

结论:

Approach #1 seems to be doing pretty well until 60% non-zero mark. For more non-zeros and if the row-lengths are small, approach #3 seems to be performing decently.

方法#1似乎做得很好,直到60%非零标记。对于更多的非零值,如果行长度很小,方法#3似乎表现得不错。

#5


4  

I came up with that:

我想到了:

nz = a.nonzero()                      # Get nonzero indexes
a[nz] = np.random.permutation(a[nz])  # Shuffle nonzero values with mask

Which look simpler (and a little bit faster?) than other proposed solutions.

哪个看起来比其他提议的解决方案更简单(并且更快一点?)。


EDIT: new version that does not mix rows

编辑:不混合行的新版本

 labels, *idx = nz = a.nonzero()                                    # get masks
 a[nz] = np.concatenate([np.random.permutation(a[nz][labels == i])  # permute values
                         for i in np.unique(labels)])               # for each label

Where the first array of a.nonzero() (indexes of non zero values in axis0) is used as labels. This is the trick that does not mix rows.

其中a.nonzero()的第一个数组(axis0中非零值的索引)用作标签。这是不混合行的技巧。

Then np.random.permutation is applied on a[a.nonzero()] for each "label" independently.

然后np.random.permutation独立地应用于每个“标签”的[a.nonzero()]。

Supposedly scipy.ndimage.measurements.labeled_comprehension can be used here, by it seems to fail with np.random.permutation.

据说scipy.ndimage.measurements.labeled_comprehension可以在这里使用,因为np.random.permutation似乎失败了。

And I finally saw that it looks a lot like what @randomir proposed. Anyway, it was just for the challenge of getting it to work.

我终于看到它看起来很像@randomir提出的那样。无论如何,它只是为了让它发挥作用的挑战。


EDIT2:

EDIT2:

Finally got it working with scipy.ndimage.measurements.labeled_comprehension

最后让它使用scipy.ndimage.measurements.labeled_comprehension

def shuffle_rows(a):
    def func(array, idx):
        r[idx] = np.random.permutation(array)
        return True
    labels, *idx = nz = a.nonzero()
    r = a[nz]
    labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1), func, int, 0, pass_positions=True)
    a[nz] = r
    return a

Where:

哪里:

  1. func() shuffles the non zero values
  2. func()改组非零值
  3. labeled_comprehension applies func() label-wise
  4. labeled_comprehension以标签方式应用func()

This replaces the previous for loop and will be faster on arrays with many rows.

这将取代之前的for循环,并且在具有许多行的数组上会更快。

#6


3  

This is one possibility for a vectorized solution:

这是矢量化解决方案的一种可能性:

r, c = np.where(x > 0)
n = c.size

perm = np.random.permutation(n)
i = np.argsort(perm + r * n)

x[r, c] = x[r, c[i]]

The challenge in vectorizing this problem is that np.random.permutation gives only flat indices, which would shuffle the array elements across rows. Sorting the permuted values with an offset added makes sure no shuffling across rows occurs.

向量化这个问题的挑战是np.random.permutation只给出平坦的索引,这会使行间的数组元素混洗。使用添加的偏移量对置换值进行排序可确保不会跨行进行混洗。

#7


3  

Here's your two liner without needing to install numpy.

这是你的两个衬垫,无需安装numpy。

from random import random

def shuffle_nonzeros(input_list):
    ''' returns a list with the non-zero values shuffled '''
    shuffled_nonzero = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
    print([i for i in (i if i==0 else next(shuffled_nonzero) for i in input_list)])

if you dont like one liners though, you can either make this a generator with

如果你不喜欢一个衬垫,你可以把它变成一个发电机

def shuffle_nonzeros(input_list):
    ''' generator that yields a list with the non-zero values shuffled '''
    random_nonzero_values = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
    for i in iterable:
        if i==0:
            yield i
        else:
            yield next(random_nonzero_values)

or if you want a list as the output, and dont like one line comprehensions

或者如果你想要一个列表作为输出,并且不喜欢一行理解

def shuffle_nonzeros(input_list):
    ''' returns a list with the non-zero values shuffled '''
    out = []
    random_nonzero_values = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
    for i in iterable:
        if i==0:
            out.append(i)
        else:
            out.append(next(random_nonzero_values))
    return out