This section covers how to build a term-document matrix with scikit-learn; the relevant classes are listed below:
The sklearn.feature_extraction.text submodule gathers utilities to build feature vectors from text documents.

| Class | Description |
|---|---|
| feature_extraction.text.CountVectorizer | Convert a collection of text documents to a matrix of token counts. |
| feature_extraction.text.HashingVectorizer | Convert a collection of text documents to a matrix of token occurrences. |
| feature_extraction.text.TfidfTransformer | Transform a count matrix to a normalized tf or tf-idf representation. |
| feature_extraction.text.TfidfVectorizer | Convert a collection of raw documents to a matrix of TF-IDF features. |
fit_transform(raw_documents, y=None)

Learn vocabulary and idf, return term-document matrix. This is equivalent to fit followed by transform, but more efficiently implemented.

| Parameters: | raw_documents : iterable |
|---|---|
| Returns: | X : sparse matrix, [n_samples, n_features] |
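For instance (a minimal sketch; the two-document corpus is assumed for illustration), a single fit_transform call learns the vocabulary and idf weights and returns the sparse term-document matrix:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["hello world hello", "goodbye cruel world"]  # illustrative corpus
X = TfidfVectorizer().fit_transform(corpus)  # fit + transform in one pass
print(X.shape)  # (n_samples, n_features): (2, 4) for this corpus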
The following example builds a term-document matrix by hand with scipy's csr_matrix:

from scipy.sparse import csr_matrix

docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]      # row pointer: offsets into indices/data where each row starts
indices = []      # column index of each entry in data (columns may repeat)
data = []         # the nonzero values
vocabulary = {}   # maps each word to its column index
for d in docs:          # iterate over the documents
    for term in d:      # iterate over the terms of each document
        # setdefault: if term is not yet in the vocabulary, insert it with the
        # next free column index len(vocabulary) and return that index; if it
        # already exists, just return the existing index
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))
# csr_matrix sums the values of duplicate (row, column) entries,
# so a word that occurs twice in a document gets a count of 2
csr_matrix((data, indices, indptr), dtype=int).toarray()

Result:

[[2 1 0 0]
 [0 1 1 1]]
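For comparison, a hedged sketch of the same matrix via CountVectorizer: note that it consumes raw strings rather than pre-tokenized lists, and it sorts the vocabulary alphabetically, so the column order differs from the insertion-ordered vocabulary above.

from sklearn.feature_extraction.text import CountVectorizer

raw = ["hello world hello", "goodbye cruel world"]
cv = CountVectorizer()
print(cv.fit_transform(raw).toarray())
# [[0 0 2 1]
#  [1 1 0 1]]   columns: cruel, goodbye, hello, world (alphabetical)
print(cv.vocabulary_)  # maps each term to its column index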