Scipy.sparse.csr_matrix: how to get the top ten values and indices?

Date: 2022-11-27 21:23:19

I have a large csr_matrix and I am interested in the top ten values and their indices in each row. But I did not find a decent way to manipulate the matrix.

Here is my current solution; the main idea is to process the matrix row by row:

row = csr_matrix.getrow(row_number).toarray()[0].ravel()
top_ten_indices = row.argsort()[-10:]
top_ten_values = row[top_ten_indices]

By doing this, the advantages of csr_matrix are not fully used. It's more like a brute-force solution.

2 Answers

#1


I don't see what the advantages of csr format are in this case. Sure, all the nonzero values are collected in one .data array, with the corresponding column indexes in .indices. But they are in blocks of varying length. And that means they can't be processed in parallel or with numpy array strides.

One solution is to pad those blocks into common-length blocks. That's what .toarray() does. Then you can find the maximum values with argsort(axis=1) or with argpartition.
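A minimal sketch of that dense route (the sample matrix and names here are illustrative assumptions, not from the question; argpartition leaves the top n unsorted, so add a sort afterwards if order matters):

```python
import numpy as np
from scipy import sparse

# an assumed small matrix, just for illustration
A = sparse.random(5, 100, density=0.3, format='csr', random_state=0)
dense = A.toarray()
n = 10

# argpartition does O(cols) work per row; the n largest values of each
# row land in the last n slots, in no particular order
idx = np.argpartition(dense, -n, axis=1)[:, -n:]
vals = np.take_along_axis(dense, idx, axis=1)
```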

Another is to break them into row-sized blocks, and process each of those. That's what you are doing with .getrow. Another way of breaking them up is to convert to lil format, and process the sublists of the .data and .rows arrays.
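A sketch of the lil route (again with an assumed sample matrix; empty rows simply yield empty results):

```python
import numpy as np
from scipy import sparse

A = sparse.random(4, 50, density=0.2, format='csr', random_state=1)
Al = A.tolil()
n = 3
top = []
for data, cols in zip(Al.data, Al.rows):   # per-row sublists of lil format
    d = np.asarray(data)
    c = np.asarray(cols)
    i = d.argsort()[-n:]                   # positions of the n largest stored values
    top.append((d[i], c[i]))               # (values, column indices) for this row
```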

A possible third option is to use the ufunc reduceat method. This lets you apply ufunc reduction methods to sequential blocks of an array. There are established ufuncs like np.add that take advantage of this. argsort is not such a function. But there is a way of constructing a ufunc from a Python function, gaining some modest speed over regular Python iteration. [I need to look up a recent SO question that illustrates this.]

I'll illustrate some of this with a simpler function, sum over rows.

If A2 is a csr matrix:

A2.sum(axis=1)  # the fastest compiled csr method
A2.A.sum(axis=1)  # same, but with a dense intermediary
[np.sum(l.data) for l in A2]  # iterate over the rows of A2
[np.sum(A2.getrow(i).data) for i in range(A2.shape[0])]  # iterate with index
[np.sum(l) for l in A2.tolil().data]  # sum the sublists of lil format
np.add.reduceat(A2.data, A2.indptr[:-1])  # with reduceat
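As a tiny worked check of the reduceat variant, here is a hand-built matrix where the row sums are easy to verify by eye. One caveat worth noting: np.add.reduceat silently misbehaves for empty rows, because a repeated index in indptr makes reduceat return a single element instead of reducing an empty slice to zero.

```python
import numpy as np
from scipy import sparse

# three rows with two stored values each (deliberately no empty rows)
data = np.array([1., 2., 3., 4., 5., 6.])
indices = np.array([0, 3, 1, 2, 4, 0])
indptr = np.array([0, 2, 4, 6])
A2 = sparse.csr_matrix((data, indices, indptr), shape=(3, 5))

# sums each block data[indptr[i]:indptr[i+1]], i.e. each row
row_sums = np.add.reduceat(A2.data, A2.indptr[:-1])
# row_sums → array([ 3.,  7., 11.])
```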

A2.sum(axis=1) is implemented as a matrix multiplication. That's not relevant to the sort problem, but still an interesting way of looking at the summation problem. Remember csr format was developed for efficient multiplication.

For my current sample matrix (created for another SO sparse question):

<8x47752 sparse matrix of type '<class 'numpy.float32'>'
     with 32 stored elements in Compressed Sparse Row format>

some comparative times are:

In [694]: timeit np.add.reduceat(A2.data, A2.indptr[:-1])
100000 loops, best of 3: 7.41 µs per loop

In [695]: timeit A2.sum(axis=1)
10000 loops, best of 3: 71.6 µs per loop

In [696]: timeit [np.sum(l) for l in A2.tolil().data]
1000 loops, best of 3: 280 µs per loop

Everything else is 1ms or more.

I suggest focusing on developing your one-row function, something like:

def max_n(row_data, row_indices, n):
    i = row_data.argsort()[-n:]
    # i = row_data.argpartition(-n)[-n:]
    top_values = row_data[i]
    top_indices = row_indices[i]  # do the sparse indices matter?
    return top_values, top_indices, i

Then see how it fits into one of these iteration methods. tolil() looks the most promising.
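Alternatively, max_n can be driven directly from the csr arrays, slicing .data and .indices by indptr. A sketch (the sample matrix is an assumption; the helper is trimmed to return just values and column indices):

```python
import numpy as np
from scipy import sparse

def max_n(row_data, row_indices, n):
    # positions of the n largest stored values within this row's block
    i = row_data.argsort()[-n:]
    return row_data[i], row_indices[i]

A = sparse.random(6, 30, density=0.4, format='csr', random_state=3)
# each (s, e) pair delimits one row's block in A.data / A.indices
results = [max_n(A.data[s:e], A.indices[s:e], 3)
           for s, e in zip(A.indptr[:-1], A.indptr[1:])]
```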

I haven't addressed the question of how to collect these results. Should they be lists of lists, an array with 10 columns, another sparse matrix with 10 values per row, etc.?
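On the collection question, one option is a new csr_matrix that keeps only the top n stored values of each row (a sketch with assumed names and sample data):

```python
import numpy as np
from scipy import sparse

A = sparse.random(5, 40, density=0.5, format='csr', random_state=4)
n = 3
rows, cols, vals = [], [], []
for r, (s, e) in enumerate(zip(A.indptr[:-1], A.indptr[1:])):
    d, c = A.data[s:e], A.indices[s:e]
    keep = d.argsort()[-n:]            # top n stored values of row r
    rows.extend([r] * len(keep))
    cols.extend(c[keep])
    vals.extend(d[keep])
topA = sparse.csr_matrix((vals, (rows, cols)), shape=A.shape)
```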


sorting each row of a large sparse & saving top K values & column index - Similar question from several years back, but unanswered.

Argmax of each row or column in scipy sparse matrix - Recent question seeking argmax for rows of csr. I discuss some of the same issues.

how to speed up loop in numpy? - example of how to use np.frompyfunc to create a ufunc. I don't know if the resulting function has the .reduceat method.

Increasing value of top k elements in sparse matrix - get the top k elements of csr (not by row). A case for argpartition.


The row summation implemented with np.frompyfunc:

In [741]: def foo(a,b):
    return a+b  
In [742]: vfoo=np.frompyfunc(foo,2,1)
In [743]: timeit vfoo.reduceat(A2.data,A2.indptr[:-1],dtype=object).astype(float)
10000 loops, best of 3: 26.2 µs per loop

That's respectable speed. But I can't think of a way of writing a binary function (taking 2 arguments) that would implement argsort via reduction. So this is probably a dead end for this problem.

#2


Just to answer the original question (for people like me who found this question looking for copy-pasta), here's a solution using multiprocessing, based on @hpaulj's suggestion of converting to lil_matrix and iterating over rows:

from multiprocessing import Pool

def _top_k(args):
    """
    Helper function to process a single row of top_k
    """
    data, row, k = args
    if not data:  # nothing stored in this row
        return data, row
    data, row = zip(*sorted(zip(data, row), reverse=True)[:k])
    return list(data), list(row)

def top_k(m, k):
    """
    Keep only the top k elements of each row in a csr_matrix
    """
    ml = m.tolil()
    with Pool() as p:
        # pass k along with each row's data; a bare global k would not
        # be defined in the worker processes
        ms = p.map(_top_k, [(d, r, k) for d, r in zip(ml.data, ml.rows)])
    for i, (d, r) in enumerate(ms):
        ml.data[i] = d
        ml.rows[i] = r
    out = ml.tocsr()
    out.sort_indices()  # rows came back value-ordered; restore canonical csr
    return out
