一、Code Practice
(一) Basic Version
The code follows the blog post below:
NLP实战学习(2):基于Bertopic的新闻主题建模_berttopic-****博客
Prerequisite: because Hugging Face is difficult to reach from mainland China, download the "news2016zh_valid.json" dataset and the SentenceTransformer "paraphrase-MiniLM-L12-v2" sentence-embedding model to a local directory in advance. The dataset link attached to that blog post has expired and no model is provided, so netdisk copies are given below:
"news2016zh_valid.json" dataset, "paraphrase-MiniLM-L12-v2" sentence-embedding model
After downloading, adjust the locations and names of the dataset, the stop word list and the model to match your local paths. The model path needs particular care: pass the absolute path as a raw string (r'...'), as in the snippet below (a fuller end-to-end sketch follows it):
%%time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(r'absolute path to paraphrase-MiniLM-L12-v2')
embeddings = model.encode(data['content'].tolist(), show_progress_bar=True)
# Note: "%%time" must be the very first line of the Jupyter cell, flush left,
# otherwise it raises: UsageError: Line magic function `%%time` not found.
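For reference, here is a minimal end-to-end sketch of the basic version. It assumes the dataset is a JSON-lines file with a "content" field (as in news2016zh_valid.json) and that the local paths are filled in; the variable names and the default BERTopic settings are illustrative, not the original post's exact code.

import json
import pandas as pd
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Load the JSON-lines dataset: one JSON object per line with a "content" field
records = []
with open(r'path/to/news2016zh_valid.json', encoding='utf8') as f:
    for line in f:
        records.append(json.loads(line))
data = pd.DataFrame(records)

# Pre-compute sentence embeddings with the local model
st_model = SentenceTransformer(r'path/to/paraphrase-MiniLM-L12-v2')
embeddings = st_model.encode(data['content'].tolist(), show_progress_bar=True)

# Fit BERTopic on the documents and the pre-computed embeddings
topic_model = BERTopic(language="multilingual", calculate_probabilities=True, verbose=True)
topics, probs = topic_model.fit_transform(data['content'].tolist(), embeddings)
topic_model.get_topic_info()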
(二) Enhanced Version
NLP实战之BERTopic主题分析_topic_model.visualize_barchart()-****博客
1 Data Preparation and Preprocessing
1.1 Data Preparation
import pandas as pd
import os

# Folder containing the Excel files to merge
folder_path = os.path.join(os.getcwd(), 'tilu')
data0 = pd.DataFrame()
for file_name in os.listdir(folder_path):
    # Only process Excel files
    if file_name.endswith(".xlsx"):
        # Build the full file path
        file_path = os.path.join(folder_path, file_name)
        # Read the Excel file
        df = pd.read_excel(file_path)
        # Append the current file's data to the combined DataFrame
        data0 = pd.concat([data0, df], ignore_index=True)
data = data0
del data0
data.head()
1.2 Data Preprocessing
# Helper functions
def read(path):
    # Read a text file and return its lines
    with open(path, encoding="utf8") as f:
        data = []
        for line in f.readlines():
            data.append(line)
    return data

# Load a stop word list
def getword(path):
    swlis = []
    for i in read(path):
        outsw = str(i).replace('\n', '')
        swlis.append(outsw)
    return swlis

# Build the tokenizer
import jieba
# Load a user dictionary (fill in your own path)
jieba.load_userdict('')
# Stop word list
stopLists = list(getword('stop_words_jieba.txt'))
# Define the tokenizer: jieba segmentation with stop words and empty tokens filtered out
def paperCut(intxt):
    return [w for w in jieba.cut(intxt, cut_all=False, HMM=True) if w not in stopLists and len(w) >= 1]

def SentenceCut(doc):
    d = str(doc)
    d = paperCut(d)
    d = " ".join(d)
    return d

# Tokenize the relevant column
data['content'] = data['content'].apply(SentenceCut)
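A quick sanity check of the tokenizer; the sample sentence is illustrative, and the exact tokens depend on your user dictionary and stop word list.

print(SentenceCut("基于BERTopic的新闻主题建模实践"))
# Expected form: space-separated tokens with stop words removed, e.g. "BERTopic 新闻 主题 建模 实践"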
2 BERTopic Model Construction
2.1 Import bertopic and related third-party libraries
# Import the bertopic library and related packages
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
2.2 Document Embedding
# Path to the local sentence-embedding model
model_path = os.path.join(os.getcwd(), 'embedding_models', 'paraphrase-MiniLM-L12-v2')
# Load the sklearn-style stop word list
stop_words_path = os.path.join(os.getcwd(), 'stop_words_sklearn.txt')
stop_words_sklearn = list(getword(stop_words_path))
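BERTopic can embed the documents internally once embedding_model=model_path is passed (as in the next step), but you can also pre-compute the embeddings yourself and hand them to fit_transform, which makes repeated experiments faster. A minimal sketch, assuming summary is the list of segmented documents from step 1.2:

from sentence_transformers import SentenceTransformer

summary = data['content'].tolist()          # segmented documents from step 1.2
st_model = SentenceTransformer(model_path)  # local paraphrase-MiniLM-L12-v2
embeddings = st_model.encode(summary, show_progress_bar=True)
# Later: topics, probs = model.fit_transform(summary, embeddings)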
2.3 Train the BERTopic topic model
# Set the model parameters
model = BERTopic(verbose=True,
                 language="multilingual",
                 embedding_model=model_path,
                 umap_model=UMAP(n_neighbors=64,
                                 n_components=128,
                                 min_dist=0.00,
                                 metric='cosine',
                                 random_state=100
                                 ),
                 hdbscan_model=HDBSCAN(min_cluster_size=100,
                                       metric='euclidean',
                                       core_dist_n_jobs=4,
                                       prediction_data=True),
                 vectorizer_model=CountVectorizer(
                     stop_words=stop_words_sklearn,
                     ngram_range=(1,1),
                     binary=False
                 ),
                 ctfidf_model=ClassTfidfTransformer(
                     bm25_weighting=True,
                     reduce_frequent_words=True),
                 n_gram_range=(1,1),
                 calculate_probabilities=True,
                 nr_topics="auto",
                 top_n_words=20,
                 min_topic_size=100)
# ${project_home}/bertopic/_bertopic.py
# UMAP or another algorithm that has .fit and .transform functions
self.umap_model = umap_model or UMAP(n_neighbors=15,
                                     n_components=5,
                                     min_dist=0.0,
                                     metric='cosine',
                                     low_memory=self.low_memory)
# HDBSCAN or another clustering algorithm that has .fit and .predict functions and
# the .labels_ variable to extract the labels
self.hdbscan_model = hdbscan_model or hdbscan.HDBSCAN(min_cluster_size=self.min_topic_size,
                                                      metric='euclidean',
                                                      cluster_selection_method='eom',
                                                      prediction_data=True)
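With the parameters set, the model still has to be fitted before any of the extraction and visualization steps below. The original post does not show this call explicitly, so here is a minimal sketch; it assumes summary holds the segmented documents (e.g. summary = data['content'].tolist()) and, optionally, the pre-computed embeddings from section 2.2:

# Fit the topic model; this populates model.topics_ and model.probabilities_ used later
topics, probs = model.fit_transform(summary, embeddings)
# Without pre-computed embeddings: topics, probs = model.fit_transform(summary)
headline_topics = topics  # name used in the outlier-reduction step in section 3.2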
2.3.1 UMAP parameters
For UMAP, the article 《Understanding UMAP》 explains the algorithm very clearly, and its interactive figures are especially helpful for understanding the effect of each parameter.
n_neighbors: the number of approximate nearest neighbors. It controls the balance between local and global structure: with smaller values UMAP focuses more on local structure; with larger values it attends to global structure and drops some detail.
n_components: the dimensionality of the reduced space the data is embedded into.
min_dist: the minimum distance between points. It controls how tightly UMAP packs points together: smaller values give tighter packing, larger values looser packing. (A standalone usage sketch follows this list.)
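A standalone sketch of how these UMAP parameters are used, reducing the document embeddings to 2D for quick inspection (the parameter values are illustrative, and embeddings is assumed from section 2.2):

from umap import UMAP

reducer = UMAP(n_neighbors=64,      # larger values emphasize global structure
               n_components=2,      # 2 dimensions here just for inspection; the model above uses 128
               min_dist=0.0,        # 0.0 packs points tightly, which suits clustering
               metric='cosine',
               random_state=100)
reduced = reducer.fit_transform(embeddings)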
2.3.2 HDBSCAN parameters
For HDBSCAN, the official documentation explains the algorithm clearly; see 《How HDBSCAN Works》.
min_cluster_size: controls the minimum cluster size and is often left at the default of 10 (BERTopic fills it from min_topic_size). Larger values yield fewer but larger clusters; smaller values produce more micro-clusters.
metric: the distance metric; the default euclidean is usually used.
prediction_data: generally keep this set to True so that new points can be predicted; if you will not do prediction, it can be set to False. (A standalone usage sketch follows this list.)
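A standalone sketch of clustering the reduced embeddings and assigning new points, which is what prediction_data=True enables (variable names are illustrative; reduced is assumed from the UMAP sketch above):

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=100,
                            metric='euclidean',
                            prediction_data=True).fit(reduced)
print(clusterer.labels_[:10])   # -1 marks noise points

# With prediction_data=True, new (already reduced) points can be assigned to clusters
new_labels, strengths = hdbscan.approximate_predict(clusterer, reduced[:5])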
2.3.3 C-TF-IDF parameters
stop_words: the stop word list (or language) applied by the CountVectorizer before the c-TF-IDF weighting.
2.3.4 BERTopic parameters
top_n_words: the number of words extracted for each topic, typically between 10 and 30.
min_topic_size: the minimum topic size. Lower values create more topics; a value that is too high may result in no topics being created at all.
nr_topics: the number of topics. It can be a concrete number, None for no constraint on the number of topics, or "auto" to reduce the number automatically.
diversity: whether to use MMR (Maximal Marginal Relevance) to diversify the topic representations. It takes a value between 0 and 1, where 0 means no diversification and 1 maximum diversification; None disables MMR. (See the sketch after this list for how this is configured in newer BERTopic versions.)
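Note that in newer BERTopic versions the diversity option has moved from a constructor argument to a representation model. The sketch below assumes a version that provides bertopic.representation; treat it as an assumption rather than the post's original setup.

from bertopic.representation import MaximalMarginalRelevance

# diversity near 0 keeps the most relevant words; near 1 favours more diverse words
mmr = MaximalMarginalRelevance(diversity=0.3)
model_mmr = BERTopic(embedding_model=model_path,
                     representation_model=mmr,
                     top_n_words=20)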
3 Topic Extraction and Presentation
3.1 Topic-Word
# Inspect all topics
model.get_topic_info()
# Document counts per topic
model.get_topic_freq()
# Find the topics a search term most likely belongs to (term is a keyword string)
model.find_topics(term, top_n=5)
# Top words of Topic 0
model.get_topic(0)
# The cluster with index -1 is the "noise" cluster
# (top_n_words is the topic-word dictionary built by the custom helpers in section 3.2)
top_n_words[-1]
# Print every topic and its top-word list
from pprint import pprint
for i in list(range(len(top_n_words) - 1)):
    print('Most 20 Important words in TOPIC {} :\n'.format(i))
    pprint(top_n_words[i])
    pprint('***'*20)
# Visualize inter-topic distances
model.visualize_topics()
# Topic distribution of a single document
model.visualize_distribution(probs[0])
# Hierarchical clustering of topics
model.visualize_hierarchy(top_n_topics=20)
# Bar chart of the words of Topic 1
model.visualize_barchart(topics=[1])
# Topic similarity heatmap
model.visualize_heatmap(n_clusters=10)
# Term rank plot
model.visualize_term_rank()
# Save the topic model (the path is illustrative)
model.save("bertopic_model")
# Reduce the number of topics (merge similar topics); summary is the list of fitted documents
model.reduce_topics(summary, nr_topics="auto")
3.2 Document-Topic
# Topic probability distribution of one document
model.visualize_distribution(model.probabilities_[1], min_probability=0.015)
# Reduce the number of outlier documents
new_topics = model.reduce_outliers(summary, headline_topics, strategy="embeddings")
model.update_topics(summary, topics=new_topics,
                    vectorizer_model=CountVectorizer(
                        stop_words=stop_words_sklearn,
                        ngram_range=(1,1),
                        binary=False
                    ),
                    ctfidf_model=ClassTfidfTransformer(
                        bm25_weighting=True,
                        reduce_frequent_words=True)
                    )
# Re-inspect the topics and their top-word lists
model.get_topic_info()
# Top words of a given topic
model.get_topic(0)
# Get the topic predictions
topic_prediction = model.topics_[:]
# Save the predictions in the dataframe
data['主题预测'] = topic_prediction
# Take a look at the data
data.head()
data.to_excel("文档主题预测.xlsx", index=True)
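As an alternative to assembling the predictions by hand, newer BERTopic versions also provide get_document_info, which returns one row per document with its topic, probability and topic name (availability depends on your BERTopic version, and the output filename below is illustrative):

# One row per document: Document, Topic, Name, Probability, ...
doc_info = model.get_document_info(summary)
doc_info.to_excel("document_topic_info.xlsx", index=False)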
# Merge topics
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# tf_idf, docs_df (a DataFrame with 'Doc' and 'Topic' columns) and docs_per_topic are
# assumed to have been built beforehand with the c-TF-IDF helpers (see the sketch after this block)
for i in tqdm(range(20)):
    # Calculate cosine similarity between topic c-TF-IDF vectors
    similarities = cosine_similarity(tf_idf.T)
    np.fill_diagonal(similarities, 0)
    # Extract the label to merge into and from where
    topic_sizes = docs_df.groupby(['Topic']).count().sort_values("Doc", ascending=False).reset_index()
    topic_to_merge = topic_sizes.iloc[-1].Topic
    topic_to_merge_into = np.argmax(similarities[topic_to_merge + 1]) - 1
    # Adjust topics
    docs_df.loc[docs_df.Topic == topic_to_merge, "Topic"] = topic_to_merge_into
    old_topics = docs_df.sort_values("Topic").Topic.unique()
    map_topics = {old_topic: index - 1 for index, old_topic in enumerate(old_topics)}
    docs_df.Topic = docs_df.Topic.map(map_topics)
    docs_per_topic = docs_df.groupby(['Topic'], as_index=False).agg({'Doc': ' '.join})
    # Calculate new topic words
    m = len(data)
    tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m)
    top_n_words = extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20)

topic_sizes = extract_topic_sizes(docs_df); topic_sizes.head(10)
# Number of topics after merging
len(docs_per_topic.Topic.unique())
# 218
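The merging loop above relies on c_tf_idf, extract_top_n_words_per_topic and extract_topic_sizes, which are not defined in this post. Below is a sketch of possible definitions adapted from the c-TF-IDF tutorial this approach is based on; docs_per_topic is assumed to be a DataFrame with 'Doc' and 'Topic' columns, and the details should be treated as an assumption rather than the author's exact code.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf(documents, m, ngram_range=(1, 1)):
    # Class-based TF-IDF: all documents of one topic are treated as a single document
    count = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words_sklearn).fit(documents)
    t = count.transform(documents).toarray()
    w = t.sum(axis=1)
    tf = np.divide(t.T, w)
    sum_t = t.sum(axis=0)
    idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)
    tf_idf = np.multiply(tf, idf)
    return tf_idf, count

def extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20):
    # Top-n highest scoring words per topic, as {topic: [(word, score), ...]}
    words = count.get_feature_names_out()
    labels = list(docs_per_topic.Topic)
    tf_idf_transposed = tf_idf.T
    indices = tf_idf_transposed.argsort()[:, -n:]
    return {label: [(words[j], tf_idf_transposed[i][j]) for j in indices[i]][::-1]
            for i, label in enumerate(labels)}

def extract_topic_sizes(df):
    # Document count per topic, largest first
    return (df.groupby(['Topic'])
              .Doc.count()
              .reset_index()
              .rename(columns={"Doc": "Size"})
              .sort_values("Size", ascending=False))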
4 BERTopic Visualization
4.1 Topic-word probability bar chart
# Bar-chart visualization
fig_bar = model.visualize_barchart(
    top_n_topics=16,
    n_words=8,
    title='主题词得分',  # comment out this line if it raises an error
    width=250, height=300
)
fig_bar
from plotly.io import write_html
with open("主题词得分.html", "w", encoding="utf8") as file:
    write_html(fig_bar, file)
4.2 Latent topic distribution map
fig_clu = model.visualize_topics()
fig_clu
with open("主题关系.html", "w", encoding="utf8") as file:
    write_html(fig_clu, file)
4.3 Hierarchical clustering diagram
fig_hierarchy = model.visualize_hierarchy(top_n_topics=16,
                                          #title='层次聚类图',
                                          width=600,
                                          height=600)
fig_hierarchy
with open("层次聚类图.html", "w", encoding="utf8") as file:
    write_html(fig_hierarchy, file)
4.4 Document-topic clustering plot
fig_doc_topic = model.visualize_documents(
    topics=list(range(0, 16)),
    docs=summary,
    hide_document_hover=False,
    #title='文本主题聚类图',  # documents clustered by topic
    width=1200,
    height=750
)
fig_doc_topic
with open("文档主题聚类.html", "w", encoding="utf8") as file:
    write_html(fig_doc_topic, file)
4.5 Topic similarity heatmap
fig_heatmap = model.visualize_heatmap(top_n_topics=13,
                                      #title='主题相似度热力图',
                                      width=800,
                                      height=600)
fig_heatmap
with open("主题相似度热力图.html", "w", encoding="utf8") as file:
    write_html(fig_heatmap, file)
4.6 DTM (dynamic topic model) plot
timepoint = data['created_at'].tolist()
timepoint = pd.to_datetime(timepoint, format='%Y%m%d', errors='ignore')
topics_over_time = model.topics_over_time(summary,
                                          timepoint,
                                          datetime_format='mixed',
                                          nr_bins=20,
                                          evolution_tuning=True)
fig_DTM = model.visualize_topics_over_time(topics_over_time,
                                           top_n_topics=7,
                                           #title='DTM',
                                           width=800,
                                           height=350)
fig_DTM
with open("DTM图.html", "w", encoding="utf8") as file:
    write_html(fig_DTM, file)
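The topics_over_time object returned above is a regular pandas DataFrame (with columns such as Topic, Words, Frequency, Timestamp), so it can also be exported for further analysis; the filename below is illustrative.

# Persist the dynamic-topic table alongside the HTML figure
topics_over_time.to_excel("topics_over_time.xlsx", index=False)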