Clustering: Finding Related Posts
Levenshtein distance
The Levenshtein (edit) distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other.
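As a reference (not part of the original notes), a minimal dynamic-programming sketch of the edit distance:
def levenshtein(s, t):
    # prev[j] holds the distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute cs by ct
        prev = cur
    return prev[-1]

print(levenshtein("machine", "machines"))  # 1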
A more robust approach than the Levenshtein distance is the bag-of-words model: it ignores word order and simply counts how often each word occurs in a post.
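The core idea fits in a few lines (a sketch using Python's collections.Counter; the real pipeline below uses scikit-learn instead):
from collections import Counter

post = "how to format my hard disk hard disk"
bag = Counter(post.split())
print(bag)  # 'hard' and 'disk' are counted twice, every other word once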
Scikit-learn's CountVectorizer does this counting and vectorization efficiently.
from sklearn.feature_extraction.text import CountVectorizer
import os
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
Printing the vectorizer shows its default parameters:
CountVectorizer(analyzer=word, binary=False, charset=None, charset_error=None,
decode_error=strict, dtype=<type 'numpy.int64'>, encoding=utf-8,
input=content, lowercase=True, max_df=1.0, max_features=None,
min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern=(?u)\b\w\w+\b, tokenizer=None,
vocabulary=None)
content = ["How to format my hard disk", "Hard disk format problems"]
X = vectorizer.fit_transform(content)
print(vectorizer.get_feature_names())
print(X.toarray().transpose())
The feature names and the transposed count matrix (one row per word, one column per post) are:
[u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']
[[1 1]
[1 1]
[1 1]
[1 0]
[1 0]
[0 1]
[1 0]]
Now let's run the same experiment on a toy data set of five posts.
DIR = './toy/'
posts = [open(os.path.join(DIR, f)).read() for f in os.listdir(DIR)]
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
x_train = vectorizer.fit_transform(posts)
print(x_train)
num_samples, num_features = x_train.shape
print("#samples:%d. #features:%d" % (num_samples, num_features))
print(vectorizer.get_feature_names())
The output, starting with x_train in sparse coordinate form:
(0, 0) 1
(0, 1) 1
(1, 2) 1
(0, 3) 1
(3, 4) 1
(4, 4) 3
(1, 5) 1
(2, 5) 1
(3, 5) 1
(4, 5) 3
(2, 6) 1
(1, 7) 1
(2, 7) 1
(3, 7) 1
(4, 7) 3
(0, 8) 1
(0, 9) 1
(0, 10) 1
(0, 11) 1
(0, 12) 1
(2, 13) 1
(0, 14) 1
(0, 15) 1
(2, 16) 1
(0, 17) 1
(1, 18) 1
(2, 19) 1
(1, 20) 1
(3, 21) 1
(4, 21) 3
(0, 22) 1
(0, 23) 1
(0, 24) 1
#samples:5. #features:25
[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases', u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine', u'most', u'much', u'not', u'permanently', u'post', u'provide', u'save', u'storage', u'store', u'stuff', u'this', u'toy']
Now vectorize a new post:
new_post = "imaging databases"
new_post_vec = vectorizer.transform([new_post])
print(new_post_vec)
print(new_post_vec.toarray())
Output:
(0, 5) 1
(0, 7) 1
[[0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
A distance function for measuring similarity:
import scipy as sp
def dist_raw(v1, v2):
    delta = v1 - v2
    return sp.linalg.norm(delta.toarray())
Iterate over all posts and find the one closest to the new post:
import sys
best_doc = None
best_dist = sys.maxsize
best_i = None
for i in range(0, num_samples):
    post = posts[i]
    if post == new_post:
        continue
    post_vec = x_train.getrow(i)
    d = dist_raw(post_vec, new_post_vec)
    print("== post %i with dist=%.2f: %s" % (i, d, post))
    if d < best_dist:
        best_dist = d
        best_i = i
print("Best post is %i with dist=%.2f" % (best_i, best_dist))
Output:
== post 0 with dist=4.00: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
== post 1 with dist=1.73: Imaging databases provide storage capabilities.
== post 2 with dist=2.00: Most imaging databases save images permanently.
== post 3 with dist=1.41: Imaging databases store data.
== post 4 with dist=5.10: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=1.41
Normalizing the word-count vectors: a post that repeats the same sentence several times (post 4) should be just as similar to the query as a single copy (post 3), so we normalize each count vector to unit length before taking the difference.
def dist_norm(v1, v2):
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())
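To produce the results below, the call inside the loop above switches from dist_raw to dist_norm:
    d = dist_norm(post_vec, new_post_vec)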
The output now becomes:
== post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
== post 1 with dist=0.86: Imaging databases provide storage capabilities.
== post 2 with dist=0.92: Most imaging databases save images permanently.
== post 3 with dist=0.77: Imaging databases store data.
== post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77
Removing less important words
Common words carry little information and should not be counted. If you know your stop words you can pass them in as a list; setting stop_words to 'english' uses a built-in list of 318 English stop words:
vectorizer = CountVectorizer(min_df=1, stop_words='english')
You can inspect which words are included:
print(sorted(vectorizer.get_stop_words()))
['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', ...] (the rest of the list is omitted here)
With stop words removed, the result is:
== post 0 with dist=1.41: This is a toy post about machine learning. Actually, it contains not much interesting stuff.
== post 1 with dist=0.86: Imaging databases provide storage capabilities.
== post 2 with dist=0.86: Most imaging databases save images permanently.
== post 3 with dist=0.77: Imaging databases store data.
== post 4 with dist=0.77: Imaging databases store data. Imaging databases store data. Imaging databases store data.
Best post is 3 with dist=0.77
Stemming
So far we have counted semantically similar but differently inflected words (such as "imaging" and "images") as separate features, which is not what we want. The NLTK package can reduce words to their stems.
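One way to plug stemming into the pipeline (a sketch, assuming NLTK is installed and using its English SnowballStemmer) is to override the vectorizer's analyzer so that every token is stemmed before it is counted:
import nltk.stem
from sklearn.feature_extraction.text import CountVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # reuse the default preprocessing and tokenization, then stem each token
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')
With this vectorizer, "imaging" and "images" both collapse to the stem "imag" and are counted as the same feature.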
Words should also carry different weights: a term that occurs in almost every post says little about any single post, while a rare term is much more informative. This is what TF-IDF (term frequency - inverse document frequency) does: it discounts the weight of terms that appear in many documents. The step-by-step code is not reproduced here.
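For reference only (not from the original notes), a minimal sketch of the idea; scikit-learn's TfidfVectorizer implements a refined version of this (with smoothing and normalization) behind the same interface as CountVectorizer:
import math

def tfidf(term, doc, corpus):
    # term frequency: the share of the document's tokens that are `term`
    tf = doc.count(term) / len(doc)
    # inverse document frequency: discount terms that occur in many documents
    num_docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / num_docs_with_term)
    return tf * idf

corpus = [["a", "b", "b"], ["a", "b", "c"], ["a", "c", "c"]]
print(tfidf("a", corpus[0], corpus))  # 0.0, "a" occurs in every document
print(tfidf("b", corpus[0], corpus))  # > 0, "b" is more specific to this document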
Clustering news posts
Download the 20newsgroup data set from http://mlcomp.org/download/dataset-379-20news-18828_DGKQR.zip and unpack it so that MLCOMP_DIR contains the 379 directory (as in the file paths printed below).
import sklearn.datasets
MLCOMP_DIR = r"D:\PythonWorkPlace\data"
data = sklearn.datasets.load_mlcomp("20news-18828",mlcomp_root=MLCOMP_DIR)
print(data.filenames)
print(len(data.filenames))
print(data.target_names)
Output:
['D:\\PythonWorkPlace\\data\\379\\raw\\comp.graphics\\1190-38614'
 'D:\\PythonWorkPlace\\data\\379\\raw\\comp.graphics\\1383-38616'
'D:\\PythonWorkPlace\\data\\379\\raw\\alt.atheism\\487-53344' ...,
'D:\\PythonWorkPlace\\data\\379\\raw\\rec.sport.hockey\\10215-54303'
'D:\\PythonWorkPlace\\data\\379\\raw\\sci.crypt\\10799-15660'
'D:\\PythonWorkPlace\\data\\379\\raw\\comp.os.ms-windows.misc\\2732-10871']
18828
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball'...] (the remaining group names are omitted here)
# We can load the train and test splits separately
train_data = sklearn.datasets.load_mlcomp("20news-18828","train",mlcomp_root=MLCOMP_DIR)
print(len(train_data.filenames))
test_data = sklearn.datasets.load_mlcomp("20news-18828","test",mlcomp_root=MLCOMP_DIR)
print(len(test_data.filenames))
# We can also restrict ourselves to a subset of the newsgroups
groups = ['comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware', 'comp.windows.x', 'sci.space']
train_data = sklearn.datasets.load_mlcomp("20news-18828", "train", mlcomp_root=MLCOMP_DIR, categories=groups)
print(len(train_data.filenames))
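The K-means step below expects a `vectorized` term matrix built from the training posts; the original notes skip that step, so here is a minimal sketch using a plain TfidfVectorizer (the min_df/max_df values are illustrative, and the stemmed vectorizer from the previous section could be substituted):
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=10, max_df=0.5, stop_words='english', decode_error='ignore')
vectorized = vectorizer.fit_transform(train_data.data)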
# K-means clustering
num_clusters = 50
from sklearn.cluster import KMeans
km = KMeans(n_clusters=num_clusters,init='random',n_init=1,verbose=1)
km.fit(vectorized)
print(km.labels_)
print(km.labels_.shape)
For a new post:
# when a new post comes in...
new_post = "Disk drive problem. Hi, I have a problem with my hard disk. After one year it is working..."
new_post_vec = vectorizer.transform([new_post])
new_post_label = km.predict(new_post_vec)[0]
Rank the posts of the same cluster by their distance to the new post:
similar_indices = (km.labels_ == new_post_label).nonzero()[0]
import scipy as sp
similar = []
for i in similar_indices:
    dist = sp.linalg.norm((new_post_vec - vectorized[i]).toarray())
    similar.append((dist, train_data.data[i]))
similar = sorted(similar)
print(len(similar))
print(similar[0])                   # the most similar post
print(similar[len(similar) // 2])   # a post from the middle of the ranking
print(similar[-1])                  # the least similar post in the same cluster