I have a SciPy sparse matrix in coo format, 1000 rows x 12000 columns. I want to write it to a file on disk in the following format: one line per row, listing all nonzero columns:
col_id1:value col_id2:value .... col_id2:value ....
col_id1:value col_id2:value .... col_id2:value ....
Is there a fast way to do this (without iterating manually)?
1 Answer
An example of what I suggested in the comment:
In [2]: from scipy import sparse
In [3]: M = sparse.random(10,10,.2)
In [4]: M
Out[4]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in COOrdinate format>
In [5]: print(M)
(1, 9) 0.61465832998
(8, 8) 0.894080347124
(2, 7) 0.709001342736
(3, 2) 0.809025517922
(9, 8) 0.974650428753
(7, 8) 0.495271225449
(5, 6) 0.356408870324
(0, 8) 0.57026318308
(3, 6) 0.69919575217
(5, 8) 0.226445982654
(5, 1) 0.191857394963
(7, 9) 0.121634028589
(6, 6) 0.815836601813
(7, 3) 0.585401171842
(6, 7) 0.526762154792
(6, 9) 0.775136319014
(4, 1) 0.517647147906
(0, 5) 0.484673192725
(7, 5) 0.72827335905
(2, 8) 0.527635893465
The lil format collects values by row:
In [6]: Ml = M.tolil()
In [7]: Ml.rows
Out[7]:
array([list([5, 8]), list([9]), list([7, 8]), list([2, 6]), list([1]),
list([1, 6, 8]), list([6, 7, 9]), list([3, 5, 8, 9]), list([8]),
list([8])], dtype=object)
In [8]: Ml.data
Out[8]:
array([list([0.4846731927245771, 0.5702631830799726]),
list([0.6146583299803253]),
list([0.7090013427361257, 0.5276358934648013]),
list([0.8090255179222732, 0.6991957521702542]),
list([0.5176471479060225]),
list([0.19185739496268694, 0.3564088703236009, 0.2264459826535451]),
list([0.8158366018134895, 0.5267621547920701, 0.7751363190143352]),
list([0.5854011718424482, 0.7282733590496102, 0.49527122544858804, 0.12163402858941941]),
list([0.8940803471238159]), list([0.9746504287533381])], dtype=object)
Format the lines according to your spec with a loop and a list comprehension:
In [9]: for r,d in zip(Ml.rows, Ml.data):
   ...:     print(' '.join(['%s:%s'%(r1,d1) for r1,d1 in zip(r,d)]))
   ...:
5:0.4846731927245771 8:0.5702631830799726
9:0.6146583299803253
7:0.7090013427361257 8:0.5276358934648013
2:0.8090255179222732 6:0.6991957521702542
1:0.5176471479060225
1:0.19185739496268694 6:0.3564088703236009 8:0.2264459826535451
6:0.8158366018134895 7:0.5267621547920701 9:0.7751363190143352
3:0.5854011718424482 5:0.7282733590496102 8:0.49527122544858804 9:0.12163402858941941
8:0.8940803471238159
8:0.9746504287533381
Replace the print with your file write call.
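As a rough sketch of that substitution, the same loop can write straight to a file. The helper name write_rows, the temporary output path, and the tiny hand-built matrix are only for illustration and are not from the question:

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def write_rows(M, path):
    """Write one line per row as 'col:value col:value ...' (hypothetical helper)."""
    Ml = M.tolil()  # lil keeps per-row lists of column indices and values
    with open(path, "w") as f:
        for cols, vals in zip(Ml.rows, Ml.data):
            f.write(" ".join("%s:%s" % (c, v) for c, v in zip(cols, vals)))
            f.write("\n")  # an all-zero row becomes an empty line


# Small coo matrix standing in for the 1000 x 12000 one; row 3 is all zeros.
rows = np.array([0, 0, 2, 1])
cols = np.array([3, 1, 2, 0])
vals = np.array([1.5, 2.0, 3.25, 4.0])
M = sparse.coo_matrix((vals, (rows, cols)), shape=(4, 5))

path = os.path.join(tempfile.mkdtemp(), "rows.txt")
write_rows(M, path)
```

Note that a row with no nonzero entries is written as an empty line here, so the line number still matches the row index.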
We are still looping 'manually', but access to the data elements is relatively fast. It is certainly faster than indexing M[i,j], which isn't possible with the coo format anyway.
Fast row access via the csr format attributes is also possible, but requires a bit more knowledge of how the data is stored.
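For what it's worth, here is a minimal sketch of the csr route. In csr, row i's column indices and values sit in indices[indptr[i]:indptr[i+1]] and data[indptr[i]:indptr[i+1]]. The function name format_rows_csr and the small test matrix are made up for the example:

```python
import numpy as np
from scipy import sparse


def format_rows_csr(M):
    """Return one 'col:value ...' string per row using csr attributes (hypothetical helper)."""
    Mr = M.tocsr()
    Mr.sort_indices()  # make sure columns are in ascending order within each row
    out = []
    for i in range(Mr.shape[0]):
        start, end = Mr.indptr[i], Mr.indptr[i + 1]
        # tolist() converts to plain Python ints/floats before formatting
        cols = Mr.indices[start:end].tolist()
        vals = Mr.data[start:end].tolist()
        out.append(" ".join("%s:%s" % (c, v) for c, v in zip(cols, vals)))
    return out


rows = np.array([0, 0, 2, 1])
cols = np.array([3, 1, 2, 0])
vals = np.array([1.5, 2.0, 3.25, 4.0])
M = sparse.coo_matrix((vals, (rows, cols)), shape=(4, 5))

lines = format_rows_csr(M)
```

This still loops over rows in Python, so it is not loop-free either; it just slices three flat arrays instead of building per-row lists first.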
Your : syntax is not common, so you'll have to do that formatting yourself regardless. How are you intending to read this file?
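If the plan is to read it back in Python, one possible reader is sketched below. It assumes one line per row (empty lines for all-zero rows) and that the number of columns is known separately, since the file itself doesn't record it. The name parse_rows is hypothetical:

```python
import numpy as np
from scipy import sparse


def parse_rows(lines, n_cols):
    """Rebuild a coo matrix from 'col:value ...' lines (hypothetical helper)."""
    rows, cols, vals = [], [], []
    n_rows = 0
    for i, line in enumerate(lines):
        n_rows = i + 1
        for token in line.split():
            c, v = token.split(":")
            rows.append(i)
            cols.append(int(c))
            vals.append(float(v))
    return sparse.coo_matrix(
        (np.array(vals, dtype=float),
         (np.array(rows, dtype=np.int64), np.array(cols, dtype=np.int64))),
        shape=(n_rows, n_cols),
    )


# Lines in the requested format; the last (empty) line is an all-zero row.
text_lines = ["1:2.0 3:1.5", "0:4.0", "2:3.25", ""]
M = parse_rows(text_lines, n_cols=5)
```

For a real file you would pass the open file object (or its lines) instead of the in-memory list used here.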