Building a term-document matrix with scipy.sparse.csr_matrix

Date: 2021-12-08 21:21:13

This section explains how term-document matrices are built in scikit-learn; the technique is used in the following places (a minimal usage sketch follows the list):

The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text documents.

feature_extraction.text.CountVectorizer: Convert a collection of text documents to a matrix of token counts.
feature_extraction.text.HashingVectorizer: Convert a collection of text documents to a matrix of token occurrences.
feature_extraction.text.TfidfTransformer: Transform a count matrix to a normalized tf or tf-idf representation.
feature_extraction.text.TfidfVectorizer: Convert a collection of raw documents to a matrix of TF-IDF features.
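For instance, CountVectorizer turns raw documents straight into a sparse count matrix. A minimal sketch (the corpus here is made up; get_feature_names_out assumes scikit-learn 1.0 or later, older versions use get_feature_names):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["hello world hello", "goodbye cruel world"]
cv = CountVectorizer()
counts = cv.fit_transform(corpus)    # scipy sparse matrix of token counts

print(cv.get_feature_names_out())    # ['cruel' 'goodbye' 'hello' 'world']
print(counts.toarray())
# [[0 0 2 1]
#  [1 1 0 1]]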
For example, the fit_transform method of TfidfVectorizer uses a scipy sparse matrix to build and return the term-document matrix:
fit_transform(raw_documents, y=None)

Learn vocabulary and idf, return term-document matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.

Parameters:

raw_documents : iterable

an iterable which yields either str, unicode or file objects

Returns:

X : sparse matrix, [n_samples, n_features]

Tf-idf-weighted document-term matrix.
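A minimal sketch of what this looks like in use, assuming a made-up two-document corpus (X comes back as a scipy CSR sparse matrix):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["hello world hello", "goodbye cruel world"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # learn vocabulary and idf, return term-document matrix

print(X.shape)       # (2, 4): 2 documents, 4 distinct terms
print(X.toarray())   # dense view of the tf-idf weights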

How Compressed Sparse Row (CSR) sparse-matrix storage works: a CSR matrix is described by three arrays. data holds the non-zero values row by row; indices holds the column index of each value in data; indptr holds the row offsets, so row i owns the slice data[indptr[i]:indptr[i+1]].
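A minimal sketch of this three-array layout on a hand-written 2x3 matrix:

from scipy.sparse import csr_matrix

# target matrix:
# [[1 0 2]
#  [0 0 3]]
data = [1, 2, 3]      # the non-zero values, row by row
indices = [0, 2, 2]   # column index of each value
indptr = [0, 2, 3]    # row i owns data[indptr[i]:indptr[i+1]]

m = csr_matrix((data, indices, indptr), shape=(2, 3), dtype=int)
print(m.toarray())
# [[1 0 2]
#  [0 0 3]]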
Implementation code:
# coding: utf-8
from scipy.sparse import csr_matrix

docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]      # row offsets: row i owns data[indptr[i]:indptr[i+1]]
indices = []      # column index of each entry in data (indices may repeat)
data = []         # the non-zero values
vocabulary = {}   # maps each term (key) to its column number (value)
for d in docs:                # iterate over the documents (rows)
    for term in d:            # iterate over the terms of each document
        # setdefault: if term is new, insert it with the next free
        # column number len(vocabulary) and return that number;
        # if term already exists, return its existing column number
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))
# csr_matrix sums duplicate (row, column) entries, so repeated
# occurrences of a term within one document add up to its count
print(csr_matrix((data, indices, indptr), dtype=int).toarray())
Output:

[[2 1 0 0]
 [0 1 1 1]]
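The column order follows the order in which terms were first seen, i.e. the column numbers assigned in vocabulary: hello=0, world=1, goodbye=2, cruel=3. Row 0 is therefore hello x2 and world x1, while row 1 is world, goodbye and cruel once each. Printing the mapping after the loop confirms this:

print(vocabulary)   # {'hello': 0, 'world': 1, 'goodbye': 2, 'cruel': 3}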