Save/load a scipy sparse csr_matrix in a portable format

Date: 2022-10-03 21:22:55

How do you save/load a scipy sparse csr_matrix in a portable format? The scipy sparse matrix is created on Python 3 (Windows 64-bit) to run on Python 2 (Linux 64-bit). Initially, I used pickle (with protocol=2 and fix_imports=True), but this didn't work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit); I got the error:

TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]')).

Next, I tried numpy.save and numpy.load as well as scipy.io.mmwrite() and scipy.io.mmread(), and none of these methods worked either.

8 solutions

#1


82  

edit: SciPy 0.19 now has scipy.sparse.save_npz and scipy.sparse.load_npz.

from scipy import sparse

sparse.save_npz("yourmatrix.npz", your_matrix)
your_matrix_back = sparse.load_npz("yourmatrix.npz")

For both functions, the file argument may also be a file-like object (i.e. the result of open) instead of a filename.
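For example (a minimal sketch; the file name is arbitrary), passing an open binary file object works the same way as passing a path:

```python
from scipy import sparse

m = sparse.random(5, 5, density=0.2, format='csr')

# save_npz / load_npz accept an open binary file object as well as a filename
with open('matrix.npz', 'wb') as f:
    sparse.save_npz(f, m)
with open('matrix.npz', 'rb') as f:
    m2 = sparse.load_npz(f)

assert (m != m2).nnz == 0  # the round trip preserves every stored entry
```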


Got an answer from the Scipy user group:

A csr_matrix has 3 data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:

new_csr = csr_matrix((data, indices, indptr), shape=(M, N))

So for example:

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
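As a quick sanity check of this pattern (the file name m.npz is arbitrary), the round trip reproduces the matrix exactly:

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[0, 1, 0], [2, 0, 0], [0, 0, 3]]))
np.savez('m.npz', data=m.data, indices=m.indices,
         indptr=m.indptr, shape=m.shape)

loader = np.load('m.npz')
m2 = csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                shape=loader['shape'])

assert (m != m2).nnz == 0  # no entries differ
```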

#2


32  

Though you write that scipy.io.mmwrite and scipy.io.mmread don't work for you, I just want to add how they work. This question is the no. 1 Google hit, so I myself started with np.savez and pickle.dump before switching to the simple and obvious scipy functions. They work for me and shouldn't be overlooked by those who haven't tried them yet.

from scipy import sparse, io

m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

io.mmwrite("test.mtx", m)
del m

newm = io.mmread("test.mtx")
newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)

#3


23  

Here is a performance comparison of the three most upvoted answers, run in a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:

from scipy.sparse import random
matrix = random(1000000, 100000, density=0.001, format='csr')

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

io.mmwrite / io.mmread

from scipy import io

%time io.mmwrite('test_io.mtx', matrix)
CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
Wall time: 4min 39s

%time matrix = io.mmread('test_io.mtx')
CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
Wall time: 2min 43s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in COOrdinate format>    

Filesize: 3.0G.

(note that the format has been changed from csr to coo).

np.savez / np.load

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # note that .npz extension is added automatically
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # here we need to add .npz extension manually
    loader = np.load(filename + '.npz')
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])


%time save_sparse_csr('test_savez', matrix)
CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
Wall time: 2.74 s    

%time matrix = load_sparse_csr('test_savez')
CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
Wall time: 1.73 s

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

cPickle

import cPickle as pickle

def save_pickle(matrix, filename):
    with open(filename, 'wb') as outfile:
        pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)
def load_pickle(filename):
    with open(filename, 'rb') as infile:
        matrix = pickle.load(infile)    
    return matrix    

%time save_pickle(matrix, 'test_pickle.mtx')
CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
Wall time: 1.15 s    

%time matrix = load_pickle('test_pickle.mtx')
CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
Wall time: 1.37 s    

matrix
<1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
with 100000000 stored elements in Compressed Sparse Row format>

Filesize: 1.1G.

Note: cPickle does not work with very large objects (see this answer). In my experience, it didn't work for a 2.7M x 50k matrix with 270M non-zero values. np.savez solution worked well.

Conclusion

(Based on this simple test for CSR matrices:) cPickle is the fastest method, but it doesn't work with very large matrices; np.savez is only slightly slower, while io.mmwrite is much slower, produces a bigger file, and restores the matrix to the wrong format. So np.savez is the winner here.

#4


14  

Now you can use scipy.sparse.save_npz : https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.save_npz.html

#5


11  

Assuming you have scipy on both machines, you can just use pickle.

However, be sure to specify a binary protocol when pickling numpy arrays. Otherwise you'll wind up with a huge file.

At any rate, you should be able to do this:

import cPickle as pickle
import numpy as np
import scipy.sparse

# Just for testing, let's make a dense array and convert it to a csr_matrix
x = np.random.random((10,10))
x = scipy.sparse.csr_matrix(x)

with open('test_sparse_array.dat', 'wb') as outfile:
    pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)

You can then load it with:

import cPickle as pickle

with open('test_sparse_array.dat', 'rb') as infile:
    x = pickle.load(infile)
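Note that cPickle exists only on Python 2; on Python 3 the built-in pickle module plays the same role. A sketch of the Python 3 side (protocol=2 is the highest protocol Python 2 can still read, which matters for the cross-version use case in the question):

```python
import pickle

import numpy as np
import scipy.sparse

x = scipy.sparse.csr_matrix(np.random.random((10, 10)))

# protocol=2 keeps the file readable by Python 2's cPickle
with open('test_sparse_array.dat', 'wb') as outfile:
    pickle.dump(x, outfile, protocol=2)

with open('test_sparse_array.dat', 'rb') as infile:
    y = pickle.load(infile)

assert (x != y).nnz == 0  # matrices match entry for entry
```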

#6


6  

As of scipy 0.19.0, you can save and load sparse matrices this way:

from scipy import sparse

data = sparse.csr_matrix((3, 4))

# Save
sparse.save_npz('data_sparse.npz', data)

# Load
data = sparse.load_npz("data_sparse.npz")

#7


0  

This is what I used to save a lil_matrix.

import numpy as np
from scipy.sparse import lil_matrix

def save_sparse_lil(filename, array):
    # use np.savez_compressed(..) for compression
    np.savez(filename, dtype=array.dtype.str, data=array.data,
        rows=array.rows, shape=array.shape)

def load_sparse_lil(filename):
    # .data and .rows are object arrays, so np.load needs allow_pickle=True
    loader = np.load(filename, allow_pickle=True)
    result = lil_matrix(tuple(loader["shape"]), dtype=str(loader["dtype"]))
    result.data = loader["data"]
    result.rows = loader["rows"]
    return result

I must say I found NumPy's np.load(..) to be very slow. This is my current solution, which I feel runs much faster:

from scipy.sparse import lil_matrix
import numpy as np
import json

def lil_matrix_to_dict(myarray):
    result = {
        "dtype": myarray.dtype.str,
        "shape": myarray.shape,
        "data":  myarray.data,
        "rows":  myarray.rows
    }
    return result

def lil_matrix_from_dict(mydict):
    result = lil_matrix(tuple(mydict["shape"]), dtype=mydict["dtype"])
    # lil_matrix expects object arrays of per-row lists
    result.data = np.array(mydict["data"], dtype=object)
    result.rows = np.array(mydict["rows"], dtype=object)
    return result

def load_lil_matrix(filename):
    result = None
    with open(filename, "r", encoding="utf-8") as infile:
        mydict = json.load(infile)
        result = lil_matrix_from_dict(mydict)
    return result

def save_lil_matrix(filename, myarray):
    with open(filename, "w", encoding="utf-8") as outfile:
        mydict = lil_matrix_to_dict(myarray)
        json.dump(mydict, outfile)

#8


0  

I was asked to send the matrix in a simple and generic format:

<x,y,value>

I ended up with this:

import numpy as np

def save_sparse_matrix(m, filename):
    # write one "x,y,value" line per stored entry; the file is closed when done
    with open(filename, 'w') as thefile:
        nonZeros = np.array(m.nonzero())
        for entry in range(nonZeros.shape[1]):
            thefile.write("%s,%s,%s\n" % (nonZeros[0, entry], nonZeros[1, entry],
                                          m[nonZeros[0, entry], nonZeros[1, entry]]))
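The loop above does one fancy-index lookup per non-zero, which is slow for large matrices. A faster variant of the same idea (a sketch; save_sparse_matrix_coo is a hypothetical name) converts to COO first, which exposes the row/column/value triplets as plain arrays:

```python
import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_matrix_coo(m, filename):
    coo = m.tocoo()  # COO stores the <x,y,value> triplets directly
    with open(filename, 'w') as f:
        for i, j, v in zip(coo.row, coo.col, coo.data):
            f.write("%s,%s,%s\n" % (i, j, v))

m = csr_matrix(np.array([[0, 1], [2, 0]]))
save_sparse_matrix_coo(m, 'm.txt')  # writes "0,1,1" and "1,0,2"
```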
