Python keyword extraction

Date: 2021-01-02 19:08:14

Keyword extraction with jieba

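import jieba.analyse

# text is the content to extract keywords from (sample sentence; any string works)
text = "TF-IDF是用于资讯检索和文本挖掘的加权技术"
# topK is the number of keywords to return
tags = jieba.analyse.extract_tags(text, topK=3)
print(",".join(tags))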


Keyword extraction with TF-IDF

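TF-IDF is a weighting technique used in information retrieval and text mining. It measures how important a word is to one document within a collection of documents. The score is the product TF * IDF.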

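TF (term frequency): tf(w, d) = count(w, d) / size(d), where count(w, d) is the number of times the word w occurs in document d and size(d) is the total number of words in document d.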

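IDF (inverse document frequency): idf(w, D) = log(n / docs(w, D)), where n is the total number of documents in the collection D and docs(w, D) is the number of documents in D that contain w.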
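As a quick check of the two formulas, here is a minimal sketch in plain Python. The toy corpus and the function names tf, idf, and tf_idf are illustrative assumptions, not part of jieba or scikit-learn, and each document is assumed to be already segmented into a list of words.

```python
import math


def tf(word, doc):
    """tf(w, d) = count(w, d) / size(d)."""
    return doc.count(word) / len(doc)


def idf(word, docs):
    """idf(w, D) = log(n / docs(w, D)).

    Assumes the word occurs in at least one document,
    otherwise the division fails.
    """
    n = len(docs)
    containing = sum(1 for d in docs if word in d)
    return math.log(n / containing)


def tf_idf(word, doc, docs):
    """TF * IDF for one word in one document."""
    return tf(word, doc) * idf(word, docs)


# Hypothetical toy corpus: each document is a list of segmented words.
corpus = [
    ["python", "keyword", "extraction", "python"],
    ["keyword", "search"],
    ["text", "mining", "keyword"],
]

print(tf("python", corpus[0]))               # 2 / 4 = 0.5
print(idf("python", corpus))                 # log(3 / 1)
print(tf_idf("keyword", corpus[0], corpus))  # 0.0, since log(3 / 3) = 0
```

A word that appears in every document, such as "keyword" here, gets an IDF of 0 and therefore never ranks as a keyword under this plain formula, no matter how often it occurs.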


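1. Install the scikit-learn package (pip install scikit-learn)

2. Install the jieba word segmentation package (pip install jieba)

3. Implementation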

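import jieba
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def jieba_seg(text):
    """Segment text with jieba and join the words with spaces.

    Assumption: the original helper is not shown in this post. This stand-in
    produces the whitespace-separated input that CountVectorizer expects.
    """
    return " ".join(jieba.cut(text))


def find_keywords(string_list, num):
    """Find keywords; num is the number of keywords to return."""
    # Segment every document into space-separated words.
    fenci_result = [jieba_seg(s) for s in string_list]
    # Note: CountVectorizer's default token pattern drops single-character words.
    vectorizer = CountVectorizer()
    transformer = TfidfTransformer()
    tfidf = transformer.fit_transform(vectorizer.fit_transform(fenci_result))
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    word = vectorizer.get_feature_names_out()
    weight = tfidf.toarray()
    # Sum each word's TF-IDF weight over all documents.
    keywords_dict = {}
    for i in range(len(word)):
        keywords_dict[word[i]] = 0.0
        for j in range(len(weight)):
            keywords_dict[word[i]] += weight[j][i]
    # Rank words by total weight, highest first.
    keyword_rank = sorted(keywords_dict.items(), key=lambda d: d[1], reverse=True)
    result = []
    for term, score in keyword_rank[:num]:
        result.append(term)
        print(term, score)
    return result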
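
With its default settings, scikit-learn's TfidfTransformer does not use the plain log(n / docs(w, D)) formula: it smooths the IDF as ln((1 + n) / (1 + df)) + 1 and then L2-normalizes each document's vector. The sketch below reproduces that weighting and the per-word summation in plain Python, so the ranking logic can be followed without the libraries. The function name rank_keywords and the toy documents are assumptions for illustration, and the sketch skips CountVectorizer's own tokenization (lowercasing, dropping single-character tokens).

```python
import math
from collections import Counter


def rank_keywords(tokenized_docs, num):
    """Sum smoothed, L2-normalized TF-IDF weights over all documents
    and return the num highest-scoring words."""
    n = len(tokenized_docs)
    # Document frequency: how many documents contain each word.
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    # Smoothed IDF, matching TfidfTransformer's defaults.
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in df}
    totals = Counter()
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = {w: c * idf[w] for w, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        if norm == 0:
            continue  # empty document contributes nothing
        for w, v in weights.items():
            totals[w] += v / norm
    return [w for w, _ in totals.most_common(num)]


# Hypothetical documents, already segmented into words.
docs = [
    ["python", "keyword", "extraction", "python"],
    ["keyword", "search"],
    ["text", "mining", "keyword"],
]
print(rank_keywords(docs, 2))  # ['keyword', 'python']
```

Because of the "+ 1" in the smoothed IDF, a word that occurs in every document still receives a positive weight, and summing across documents then favors it. If such common words are unwanted as keywords, filtering stop words before ranking is one option.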