Storing a scipy sparse matrix as HDF5

Date: 2021-03-03 21:24:46

I want to compress and store a very large SciPy sparse matrix in HDF5 format. How do I do this? I've tried the code below:

a = csr_matrix((dat, (row, col)), shape=(947969, 36039))
f = h5py.File('foo.h5','w')    
dset = f.create_dataset("init", data=a, dtype = int, compression='gzip')

I get errors like these:

TypeError: Scalar datasets don't support chunk/filter options
IOError: Can't prepare for writing data (No appropriate function for conversion path)

I can't convert it to a dense NumPy array because it would not fit in memory. What is the best method?

2 answers

#1


2  

You can use the scipy.sparse.save_npz function.
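As a minimal sketch of that suggestion: save_npz writes a NumPy .npz archive (not an HDF5 file), and load_npz reads it back. The matrix size and the temporary file path below are assumptions made only for illustration.

```python
import os
import tempfile

from scipy import sparse

# Small random CSR matrix standing in for the large one in the question.
M = sparse.random(1000, 500, density=0.01, format='csr')

# Hypothetical file location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), 'matrix.npz')

# compressed=True (the default) writes a zip-compressed .npz archive.
sparse.save_npz(path, M, compressed=True)

# load_npz rebuilds a sparse matrix in the format that was saved.
M_loaded = sparse.load_npz(path)

# The two matrices differ in zero positions if the round trip was lossless.
same = (M != M_loaded).nnz == 0
print(M_loaded.format, M_loaded.shape, same)
```

Only the stored elements and their index arrays are written, so the dense array is never materialised.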

Alternatively, consider using pandas.SparseDataFrame, but be aware that this method is very slow (thanks to @hpaulj for testing and pointing it out).

Demo:

Generating a sparse matrix and a SparseDataFrame:

In [55]: import pandas as pd

In [56]: from scipy.sparse import *

In [57]: m = csr_matrix((20, 10), dtype=np.int8)

In [58]: m
Out[58]:
<20x10 sparse matrix of type '<class 'numpy.int8'>'
        with 0 stored elements in Compressed Sparse Row format>

In [59]: sdf = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel(), fill_value=0)
    ...:                           for i in np.arange(m.shape[0])])
    ...:

In [61]: type(sdf)
Out[61]: pandas.sparse.frame.SparseDataFrame

In [62]: sdf.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 10 columns):
0    20 non-null int8
1    20 non-null int8
2    20 non-null int8
3    20 non-null int8
4    20 non-null int8
5    20 non-null int8
6    20 non-null int8
7    20 non-null int8
8    20 non-null int8
9    20 non-null int8
dtypes: int8(10)
memory usage: 280.0 bytes

Saving the SparseDataFrame to an HDF file:

In [64]: sdf.to_hdf('d:/temp/sparse_df.h5', 'sparse_df')

Reading from the HDF file:

In [65]: store = pd.HDFStore('d:/temp/sparse_df.h5')

In [66]: store
Out[66]:
<class 'pandas.io.pytables.HDFStore'>
File path: d:/temp/sparse_df.h5
/sparse_df            sparse_frame

In [67]: x = store['sparse_df']

In [68]: type(x)
Out[68]: pandas.sparse.frame.SparseDataFrame

In [69]: x.info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 10 columns):
0    20 non-null int8
1    20 non-null int8
2    20 non-null int8
3    20 non-null int8
4    20 non-null int8
5    20 non-null int8
6    20 non-null int8
7    20 non-null int8
8    20 non-null int8
9    20 non-null int8
dtypes: int8(10)
memory usage: 360.0 bytes

#2


5  

A csr matrix stores its values in 3 arrays. It is not an array or array subclass, so h5py cannot save it directly. The best you can do is save the attributes, and recreate the matrix on loading:

In [248]: M = sparse.random(5,10,.1, 'csr')
In [249]: M
Out[249]: 
<5x10 sparse matrix of type '<class 'numpy.float64'>'
    with 5 stored elements in Compressed Sparse Row format>
In [250]: M.data
Out[250]: array([ 0.91615298,  0.49907752,  0.09197862,  0.90442401,  0.93772772])
In [251]: M.indptr
Out[251]: array([0, 0, 1, 2, 3, 5], dtype=int32)
In [252]: M.indices
Out[252]: array([5, 7, 5, 2, 6], dtype=int32)
In [253]: M.data
Out[253]: array([ 0.91615298,  0.49907752,  0.09197862,  0.90442401,  0.93772772])

The coo format has data, row and col attributes, basically the same as the (dat, (row, col)) you use to create your a.

In [254]: M.tocoo().row
Out[254]: array([1, 2, 3, 4, 4], dtype=int32)
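To make the link between the two layouts concrete, here is a small sketch that rebuilds the coo row array from the csr indptr shown in the session above. The array values are copied from that session; the use of np.repeat with np.diff is just one way to do the expansion and is not taken from the answer.

```python
import numpy as np
from scipy import sparse

# Values copied from the IPython session above.
data = np.array([0.91615298, 0.49907752, 0.09197862, 0.90442401, 0.93772772])
indices = np.array([5, 7, 5, 2, 6])
indptr = np.array([0, 0, 1, 2, 3, 5])

M = sparse.csr_matrix((data, indices, indptr), shape=(5, 10))

# np.diff(indptr) is the number of stored elements in each row;
# repeating each row number that many times gives the coo row array.
row = np.repeat(np.arange(M.shape[0]), np.diff(indptr))
print(row)  # [1 2 3 4 4]

# The csr indices array is already the coo col array.
coo = M.tocoo()
print(np.array_equal(coo.row, row), np.array_equal(coo.col, indices))
```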

The new save_npz function does:

arrays_dict = dict(format=matrix.format, shape=matrix.shape, data=matrix.data)
if matrix.format in ('csc', 'csr', 'bsr'):
    arrays_dict.update(indices=matrix.indices, indptr=matrix.indptr)
...
elif matrix.format == 'coo':
    arrays_dict.update(row=matrix.row, col=matrix.col)
...
np.savez(file, **arrays_dict)

In other words, it collects the relevant attributes in a dictionary and uses savez to create the zip archive.

The same sort of method could be used with an h5py file. There is more on save_npz in a recent SO question, with links to the source code:

save_npz method missing from scipy.sparse

See if you can get this working. If you can create a csr matrix, you can recreate it from its attributes (or the coo equivalents). I can make a working example if needed.

csr to h5py example

import numpy as np
import h5py
from scipy import sparse

M = sparse.random(10,10,.2, 'csr')
print(repr(M))

print(M.data)
print(M.indices)
print(M.indptr)

f = h5py.File('sparse.h5','w')
g = f.create_group('Mcsr')
g.create_dataset('data',data=M.data)
g.create_dataset('indptr',data=M.indptr)
g.create_dataset('indices',data=M.indices)
g.attrs['shape'] = M.shape
f.close()

f = h5py.File('sparse.h5','r')
print(list(f.keys()))
print(list(f['Mcsr'].keys()))

g2 = f['Mcsr']
print(g2.attrs['shape'])

M1 = sparse.csr_matrix((g2['data'][:],g2['indices'][:],
    g2['indptr'][:]), g2.attrs['shape'])
print(repr(M1))
print(np.allclose(M1.toarray(), M.toarray()))
f.close()

producing

1314:~/mypy$ python3 stack43390038.py 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>
[ 0.13640389  0.92698959 ....  0.7762265 ]
[4 5 0 3 0 2 0 2 5 6 7 1 7 9 1 3 4 6 8 9]
[ 0  2  4  6  9 11 11 11 14 19 20]
['Mcsr']
['data', 'indices', 'indptr']
[10 10]
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 20 stored elements in Compressed Sparse Row format>
True
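The script above stores the three arrays uncompressed, while the question asked for gzip compression. The sketch below wraps the same idea in two helper functions and passes compression='gzip' to create_dataset. The function names save_csr_h5 and load_csr_h5, the group name and the temporary path are hypothetical, and the code assumes the matrix has at least one stored element.

```python
import os
import tempfile

import h5py
from scipy import sparse


def save_csr_h5(path, name, M, compression='gzip'):
    """Write the three CSR arrays and the shape into one HDF5 group."""
    with h5py.File(path, 'a') as f:
        g = f.create_group(name)
        for key in ('data', 'indices', 'indptr'):
            # Each array is a plain 1-D ndarray, so chunking/filters work.
            g.create_dataset(key, data=getattr(M, key), compression=compression)
        g.attrs['shape'] = M.shape


def load_csr_h5(path, name):
    """Rebuild the CSR matrix from the arrays stored by save_csr_h5."""
    with h5py.File(path, 'r') as f:
        g = f[name]
        shape = tuple(int(n) for n in g.attrs['shape'])
        # [:] reads each dataset into memory before the file is closed.
        return sparse.csr_matrix(
            (g['data'][:], g['indices'][:], g['indptr'][:]), shape=shape)


# Hypothetical demo data and file location.
M = sparse.random(2000, 300, density=0.01, format='csr')
path = os.path.join(tempfile.mkdtemp(), 'sparse_gzip.h5')

save_csr_h5(path, 'Mcsr', M)
M1 = load_csr_h5(path, 'Mcsr')
print(repr(M1))
print((M != M1).nnz == 0)
```

Memory use stays proportional to the number of stored elements, because the dense array is never created.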

coo alternative

# f must be an h5py.File opened in a writable mode ('w' or 'a')
Mo = M.tocoo()
g = f.create_group('Mcoo')
g.create_dataset('data', data=Mo.data)
g.create_dataset('row', data=Mo.row)
g.create_dataset('col', data=Mo.col)
g.attrs['shape'] = Mo.shape

g2 = f['Mcoo']
M2 = sparse.coo_matrix((g2['data'], (g2['row'], g2['col'])),
   g2.attrs['shape'])   # don't need the [:]
# could also use sparse.csr_matrix or M2.tocsr()
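The coo fragment above relies on an open, writable file object. A self-contained version might look like the following sketch. The temporary file path is an assumption, and the datasets are read with [:] here simply to keep the reconstruction independent of the open file.

```python
import os
import tempfile

import h5py
import numpy as np
from scipy import sparse

M = sparse.random(10, 10, density=0.2, format='csr')
Mo = M.tocoo()

# Hypothetical file location, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), 'sparse_coo.h5')

# Write the coo triple (data, row, col) plus the shape.
with h5py.File(path, 'w') as f:
    g = f.create_group('Mcoo')
    g.create_dataset('data', data=Mo.data)
    g.create_dataset('row', data=Mo.row)
    g.create_dataset('col', data=Mo.col)
    g.attrs['shape'] = Mo.shape

# Read the arrays back and rebuild the matrix.
with h5py.File(path, 'r') as f:
    g2 = f['Mcoo']
    shape = tuple(int(n) for n in g2.attrs['shape'])
    M2 = sparse.coo_matrix(
        (g2['data'][:], (g2['row'][:], g2['col'][:])), shape=shape)

print(repr(M2))
print(np.allclose(M2.toarray(), M.toarray()))
```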
