I'm wondering what the best way is to iterate nonzero entries of sparse matrices with scipy.sparse. For example, if I do the following:
from scipy.sparse import lil_matrix
x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2
c = 0
for i in x:
    print c, i
    c = c + 1
the output is
0
1
2
3
4
5
6
7
8
9
10
11
12
13 (0, 0) 1.0
14
15 (0, 0) 2.0
16
17
18
19
so it appears the iterator is touching every element, not just the nonzero entries. I've had a look at the API
http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html
and searched around a bit, but I can't seem to find a solution that works.
6 Answers
#1
48
Edit: bbtrb's method (using coo_matrix) is much faster than my original suggestion, which used nonzero. Sven Marnach's suggestion to use itertools.izip also improves the speed. The current fastest is using_tocoo_izip:
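import scipy.sparse
import random
import itertools

def using_nonzero(x):
    # Look up each nonzero coordinate individually (one indexing call per entry)
    rows, cols = x.nonzero()
    for row, col in zip(rows, cols):
        ((row, col), x[row, col])

def using_coo(x):
    # Build a COO copy through the constructor
    cx = scipy.sparse.coo_matrix(x)
    for i, j, v in zip(cx.row, cx.col, cx.data):
        (i, j, v)

def using_tocoo(x):
    # Convert with the matrix's own tocoo() method
    cx = x.tocoo()
    for i, j, v in zip(cx.row, cx.col, cx.data):
        (i, j, v)

def using_tocoo_izip(x):
    # Python 2 only: izip avoids the intermediate list that zip builds
    cx = x.tocoo()
    for i, j, v in itertools.izip(cx.row, cx.col, cx.data):
        (i, j, v)

# Test matrix: N x N with up to N random nonzero entries
N = 200
x = scipy.sparse.lil_matrix((N, N))
for _ in xrange(N):
    x[random.randint(0, N - 1), random.randint(0, N - 1)] = random.randint(1, 100)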
yields these timeit results:
% python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
1000 loops, best of 3: 670 usec per loop
% python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
1000 loops, best of 3: 706 usec per loop
% python -mtimeit -s'import test' 'test.using_coo(test.x)'
1000 loops, best of 3: 802 usec per loop
% python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
100 loops, best of 3: 5.25 msec per loop
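For readers on Python 3, where xrange and itertools.izip no longer exist and the built-in zip is already lazy, a minimal sketch of the tocoo approach might look like the following. The helper name iter_nonzero and the fixed random seed are assumptions made for this illustration; they are not part of the benchmark above, and the timing it prints is not comparable to the numbers shown.

```python
import random
import timeit

import scipy.sparse


def iter_nonzero(x):
    """Yield (row, col, value) for each stored entry of a sparse matrix."""
    cx = x.tocoo()  # COO stores parallel row / col / data arrays
    # zip is lazy in Python 3, so it plays the role izip had in Python 2
    yield from zip(cx.row, cx.col, cx.data)


random.seed(0)  # fixed seed (an assumption) so repeated runs build the same matrix
N = 200
x = scipy.sparse.lil_matrix((N, N))
for _ in range(N):
    x[random.randint(0, N - 1), random.randint(0, N - 1)] = random.randint(1, 100)

entries = list(iter_nonzero(x))
elapsed = timeit.timeit(lambda: list(iter_nonzero(x)), number=100)
print(len(entries), "stored entries;", elapsed / 100, "seconds per pass")
```

The conversion cost is paid once per call, so for repeated passes over the same matrix it is probably worth keeping the COO copy around instead of calling tocoo() each time.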
#2
28
The fastest way should be by converting to a coo_matrix:
import scipy.sparse

cx = scipy.sparse.coo_matrix(x)
for i, j, v in zip(cx.row, cx.col, cx.data):
    print "(%d, %d), %s" % (i, j, v)
#3
2
To loop over the various sparse matrix types in scipy.sparse, I would use this small wrapper function (note that on Python 2 you are encouraged to use xrange and izip for better performance on large matrices):
from scipy.sparse import isspmatrix_coo, isspmatrix_csc, isspmatrix_csr, isspmatrix_lil

def iter_spmatrix(matrix):
    """ Iterator for iterating the elements in a ``scipy.sparse.*_matrix``

    This will always return:
    >>> (row, column, matrix-element)

    Currently this can iterate `coo`, `csc`, `lil` and `csr`, others may easily be added.

    Parameters
    ----------
    matrix : ``scipy.sparse.sp_matrix``
      the sparse matrix to iterate non-zero elements
    """
    if isspmatrix_coo(matrix):
        # COO: parallel arrays of row, column and value
        for r, c, m in zip(matrix.row, matrix.col, matrix.data):
            yield r, c, m

    elif isspmatrix_csc(matrix):
        # CSC: indptr[c]..indptr[c+1] delimits the entries of column c
        for c in range(matrix.shape[1]):
            for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
                yield matrix.indices[ind], c, matrix.data[ind]

    elif isspmatrix_csr(matrix):
        # CSR: indptr[r]..indptr[r+1] delimits the entries of row r
        for r in range(matrix.shape[0]):
            for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
                yield r, matrix.indices[ind], matrix.data[ind]

    elif isspmatrix_lil(matrix):
        # LIL: one list of column indices and one list of values per row
        for r in range(matrix.shape[0]):
            for c, d in zip(matrix.rows[r], matrix.data[r]):
                yield r, c, d

    else:
        raise NotImplementedError("The iterator for this sparse matrix has not been implemented")
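The CSR branch relies on how indptr, indices and data are laid out, which is easier to see on a tiny matrix. The sketch below uses a 3x3 example chosen purely for illustration; it repeats the CSR loop inline so that it runs on its own.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 5, 0],
                  [0, 0, 0],
                  [7, 0, 9]])
m = csr_matrix(dense)

# indptr has one more entry than there are rows; row r owns the
# slice indptr[r]:indptr[r+1] of both indices and data
print(m.indptr)   # [0 1 1 3]
print(m.indices)  # [1 0 2]
print(m.data)     # [5 7 9]

triples = []
for r in range(m.shape[0]):
    for ind in range(m.indptr[r], m.indptr[r + 1]):
        triples.append((r, int(m.indices[ind]), int(m.data[ind])))

print(triples)    # [(0, 1, 5), (2, 0, 7), (2, 2, 9)]
```

Row 1 is empty, so indptr[1] equals indptr[2] and the inner loop simply does not run for it. CSC works the same way with the roles of rows and columns swapped.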
#4
1
I had the same problem and, if your only concern is speed, the fastest way (more than one order of magnitude faster) is to convert the sparse matrix to a dense one (x.todense()) and iterate over the nonzero elements of the dense matrix. (Of course, this approach requires a lot more memory.)
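A minimal sketch of that dense route, using the matrix from the question, could look like this. It uses toarray() in place of todense() (an assumption made here so that the result is a plain ndarray and not a numpy matrix) and np.nonzero to find the coordinates; it makes no claim about the speed-up mentioned above.

```python
import numpy as np
from scipy.sparse import lil_matrix

x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

dense = x.toarray()             # allocates the full 20x1 array, zeros included
rows, cols = np.nonzero(dense)  # coordinates of the nonzero cells
entries = [(int(i), int(j), float(v))
           for i, j, v in zip(rows, cols, dense[rows, cols])]
print(entries)  # [(13, 0, 1.0), (15, 0, 2.0)]
```

The dense array needs memory for every cell, so this is only an option when the full shape fits comfortably in RAM.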
#5
1
tocoo() materializes the entire matrix into a different structure, which is not the preferred way of working in Python 3. You can also consider this iterator, which is especially useful for large matrices.
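from itertools import chain, repeat

def iter_csr(matrix):
    # Number of stored entries in each row
    row_counts = matrix.indptr[1:] - matrix.indptr[:-1]
    for (row, col, val) in zip(
        # Lazily repeat each row index once per stored entry in that row
        chain.from_iterable(repeat(i, r) for (i, r) in enumerate(row_counts)),
        matrix.indices,
        matrix.data
    ):
        yield (row, col, val)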
I have to admit that I'm using a lot of python-constructs which possibly should be replaced by numpy-constructs (especially enumerate).
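One possible numpy-based replacement for the enumerate/repeat/chain part is np.repeat over the per-row counts. This is only a sketch: the function name iter_csr_numpy is made up for this example, and the row-index array it builds holds one integer per stored entry, so it gives back part of the memory saving that motivated avoiding tocoo().

```python
import numpy as np
from scipy.sparse import csr_matrix


def iter_csr_numpy(matrix):
    """Yield (row, col, value) for a CSR matrix, expanding row indices with numpy."""
    counts = np.diff(matrix.indptr)  # stored entries per row
    rows = np.repeat(np.arange(matrix.shape[0]), counts)
    return zip(rows, matrix.indices, matrix.data)


m = csr_matrix(np.array([[0, 5, 0],
                         [0, 0, 0],
                         [7, 0, 9]]))
entries = [(int(r), int(c), int(v)) for r, c, v in iter_csr_numpy(m)]
print(entries)  # [(0, 1, 5), (2, 0, 7), (2, 2, 9)]
```

Whether the extra array is acceptable depends on how close the matrix already is to the memory limit.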
NB:
In [43]: t=time.time(); sum(1 for x in rather_dense_sparse_matrix.data); print(time.time()-t)
52.48686504364014
In [44]: t=time.time(); sum(1 for x in enumerate(rather_dense_sparse_matrix.data)); print(time.time()-t)
70.19013023376465
In [45]: rather_dense_sparse_matrix
<99829x99829 sparse matrix of type '<class 'numpy.float16'>'
with 757622819 stored elements in Compressed Sparse Row format>
So yes, enumerate is somewhat slow(ish)
For the iterator:
In [47]: it = iter_csr(rather_dense_sparse_matrix)
In [48]: t=time.time(); sum(1 for x in it); print(time.time()-t)
113.something something
So you decide whether this overhead is acceptable; in my case, tocoo caused memory overflows.
IMHO: such an iterator should be part of the csr_matrix interface, similar to items() in a dict() :)
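For comparison, the dictionary-of-keys format in scipy.sparse does offer a dict-style items() method. The sketch below converts the small matrix from the question with todok(); note that this conversion also builds a second structure, so it is not a fix for the memory problem described above, only an illustration of what such an interface looks like.

```python
from scipy.sparse import lil_matrix

x = lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

d = x.todok()  # dictionary-of-keys format: maps (row, col) to value
# sorted() only makes the order deterministic for display
entries = sorted(((int(i), int(j)), float(v)) for (i, j), v in d.items())
print(entries)  # [((13, 0), 1.0), ((15, 0), 2.0)]
```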
#6
0
Try filter(lambda x:x, x) instead of x.