A summary of CSC matrices in scipy.sparse

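1. Construction

The CSR constructor is easy enough to guess at a glance, but the CSC one took me a while to understand.

Here I quote directly from a blog post that explains it fairly clearly:

Many of you have probably built sparse matrices when doing scientific computing in Python, and scipy.sparse, part of the SciPy scientific computing package, is a good tool for constructing and computing with sparse matrices.

Below I walk through one of the harder-to-understand ways of constructing csc/csr matrices in scipy.sparse.

It is the last of the csc matrix constructors listed in the official documentation (http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html):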

csc_matrix((data, indices, indptr), [shape=(M, N)])

is the standard CSC representation where the row indices for column i are stored in indices[indptr[i]:indptr[i+1]] and their corresponding values are stored in data[indptr[i]:indptr[i+1]]. If the shape parameter is not supplied, the matrix dimensions are inferred from the index arrays.

This constructor is hard to follow: what exactly are indptr and indices here?

Consider the following code:

import numpy as np
from scipy.sparse import csc_matrix

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
csc_matrix((data, indices, indptr), shape=(3, 3)).toarray()

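indices holds the row index of each nonzero element; together with indptr it pins down both the row and the column of every element.

For column 0, indptr[0]:indptr[1] is 0:2, i.e. positions [0, 1]. The rows are indices[0:2] = [0, 2] and the values are data[0:2] = [1, 2], so column 0 holds the values 1 and 2 in rows 0 and 2.

For column 1, indptr[1]:indptr[2] is 2:3, i.e. position [2]. The row is indices[2] = 2 and the value is data[2] = 3, so column 1 holds the value 3 in row 2.

For column 2, indptr[2]:indptr[3] is 3:6, i.e. positions [3, 4, 5]. The rows are indices[3:6] = [0, 1, 2] and the values are data[3:6] = [4, 5, 6], so column 2 holds the values 4, 5 and 6 in rows 0, 1 and 2.

So the code above produces the matrix: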

array([[1, 0, 4],
       [0, 0, 5],
       [2, 3, 6]])
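To double-check the walkthrough, here is a small sketch that decodes the three arrays by hand and compares the result with what scipy builds. The names dense, from_scipy and as_csr are just mine for illustration. It also shows that the same three arrays, read as CSR, describe the transposed matrix, which is probably why the CSR version feels easier to guess.

```python
import numpy as np
from scipy.sparse import csc_matrix, csr_matrix

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

# Decode by hand: column j owns positions indptr[j]:indptr[j + 1].
dense = np.zeros((3, 3), dtype=data.dtype)
for j in range(len(indptr) - 1):
    start, end = indptr[j], indptr[j + 1]
    for row, value in zip(indices[start:end], data[start:end]):
        dense[row, j] = value

# Compare with the matrix scipy builds from the same arrays.
from_scipy = csc_matrix((data, indices, indptr), shape=(3, 3)).toarray()
same_as_scipy = np.array_equal(dense, from_scipy)

# Read as CSR, the same arrays describe rows instead of columns,
# so the result is the transpose of the CSC matrix.
as_csr = csr_matrix((data, indices, indptr), shape=(3, 3)).toarray()
csr_is_transpose = np.array_equal(as_csr, dense.T)
```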

2. Converting between CSC, CSR and other formats

The official documentation for sparse matrices describes this in some detail:

https://docs.scipy.org/doc/scipy/reference/sparse.html
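As a quick illustration, the conversion methods tocsr(), tocoo() and tocsc() can be chained like this. It is only a minimal sketch on a tiny example matrix of my own choosing, not a full tour of the formats.

```python
import numpy as np
from scipy.sparse import csc_matrix

a = csc_matrix(np.array([[1, 0, 4], [0, 0, 5], [2, 3, 6]]))

b = a.tocsr()  # compressed sparse row
c = a.tocoo()  # coordinate (row, col, value) format
d = b.tocsc()  # and back to compressed sparse column

# The .format attribute reports the storage format as a short string.
formats = (a.format, b.format, c.format, d.format)
round_trip_ok = np.array_equal(a.toarray(), d.toarray())
```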

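3. Concatenating matrices with hstack and vstack

hstack joins two matrices horizontally, for example putting two blocks of features for the same samples side by side.

vstack joins two matrices vertically. For example, if you have the features of 100 samples and the features of another 100 samples and want to use them together as training data, vstack([mat1, mat2]) gives you the features of all 200 samples.

The hstack and vstack functions merge sparse matrices horizontally or vertically, for example: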

>>> from scipy.sparse import coo_matrix, vstack
>>> A = coo_matrix([[1, 2], [3, 4]])
>>> B = coo_matrix([[5, 6]])
>>> vstack([A, B]).todense()
matrix([[1, 2],
        [3, 4],
        [5, 6]])

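However, in my tests A and B could not be merged when they held different kinds of data. For example, if A stores strings and B stores numbers, the merge fails. In other words, all the data in one matrix has to be of the same type.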
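The vstack example has an obvious hstack counterpart. Here is a minimal sketch: the matrix C is my own example, and it needs the same number of rows as A for the horizontal join to work.

```python
from scipy.sparse import coo_matrix, hstack

A = coo_matrix([[1, 2], [3, 4]])
C = coo_matrix([[5], [6]])  # one extra column, same number of rows as A

stacked = hstack([A, C])    # result has shape (2, 3)
result = stacked.toarray()
```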

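4. Slicing

I could not find this in the official documentation, so I worked it out by trial and error.

For example, to drop the first column:

b[:, 1:] is all it takes. The first : selects all rows, and the 1: after the comma keeps column 1 onward, which drops the first column.

The result is still a CSC matrix.

Other slices work by analogy.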
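A small sketch of the slicing, on an example matrix I made up for illustration. The row slice is one of the "other slices" I tried by analogy.

```python
import numpy as np
from scipy.sparse import csc_matrix

b = csc_matrix(np.array([[1, 0, 4], [0, 0, 5], [2, 3, 6]]))

no_first_col = b[:, 1:]    # all rows, columns 1 onward
first_two_rows = b[:2, :]  # rows 0 and 1, all columns

# The column slice keeps the CSC storage format.
col_slice_format = no_first_col.format
```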

5. Code for converting a 2D list to COO and CSC (building the CSC arrays directly felt easy to get wrong, so I build a COO matrix first and then convert it)

import numpy as np
from scipy.sparse import coo_matrix

def mat_to_coo(mat, row_len, col_len):
    # Collect the coordinates and values of the nonzero entries.
    row = []
    col = []
    data = []
    for i, line in enumerate(mat):
        for j, v in enumerate(line):
            if v != 0:
                row.append(i)
                col.append(j)
                data.append(v)
    return coo_matrix((np.array(data), (np.array(row), np.array(col))), shape=(row_len, col_len))

def mat_to_csc(mat, row_len, col_len):
    # Build the COO matrix first, then convert it to CSC.
    mat = mat_to_coo(mat, row_len, col_len)
    return mat.tocsc()
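For comparison, here is a shorter sketch of the same conversion that lets NumPy find the nonzero positions instead of looping in Python. It assumes the list is small enough to hold as a dense array first. As a sanity check it also passes the dense array straight to csc_matrix, which should give the same matrix.

```python
import numpy as np
from scipy.sparse import coo_matrix, csc_matrix

mat = [[1, 0, 4], [0, 0, 5], [2, 3, 6]]  # example input, my own choice

# np.nonzero returns the row and column indices of the nonzero entries.
dense = np.array(mat)
row, col = np.nonzero(dense)
via_coo = coo_matrix((dense[row, col], (row, col)), shape=dense.shape).tocsc()

# csc_matrix also accepts a dense 2D array directly.
direct = csc_matrix(dense)
same = np.array_equal(via_coo.toarray(), direct.toarray())
```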

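6. Saving and loading matrices

This is very simple and works fine on small data, although with large data the types seemed to get mixed up for me.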

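import numpy as np
import scipy.sparse as sp

m = sp.lil_matrix((7329, 7329))
path = "m.npy"  # placeholder path; np.save appends .npy if the extension is missing
np.save(path, m)  # save the matrix with numpy's save; it is pickled as an object
mat = np.load(path, allow_pickle=True)[()]  # load it back; [()] extracts the object from the 0-d array
mat = mat.toarray()  # convert the sparse matrix to a dense one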
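An alternative worth trying is scipy.sparse.save_npz and load_npz, which store the sparse arrays themselves instead of pickling the object. The sketch below uses a small matrix and a temporary directory of my own choosing. As far as I know save_npz does not accept the LIL format, so I convert to CSR before saving.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp

m = sp.lil_matrix((5, 5))
m[0, 1] = 1.0
m[3, 4] = 2.5

# Convert to CSR before saving (assumption: LIL is not supported by save_npz).
path = os.path.join(tempfile.mkdtemp(), "m.npz")
sp.save_npz(path, m.tocsr())

# load_npz returns the matrix in the format it was saved in.
loaded = sp.load_npz(path)
round_trip_ok = np.array_equal(loaded.toarray(), m.toarray())
```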
