BERTopic Practice Notes

Date: 2025-02-21 09:17:49

I. Code Practice

(1) Basic Version

The code follows this blog post:

"NLP实战学习(2):基于Bertopic的新闻主题建模" (a CSDN blog)

Prerequisite: since Hugging Face is difficult to reach from mainland China, download the "news2016zh_valid.json" dataset and the SentenceTransformer "paraphrase-MiniLM-L12-v2" sentence-embedding model to local disk in advance. The dataset link in that blog post is dead, and no model link is provided; netdisk copies:

        the "news2016zh_valid.json" dataset and the "paraphrase-MiniLM-L12-v2" sentence-embedding model

After downloading, adjust the paths and names of the dataset, the stopword list, and the model to match where you saved them. The model path needs special care: pass the absolute path as a raw string (r'...'), like this:

%%time
model = SentenceTransformer(r'absolute path to paraphrase-MiniLM-L12-v2')
embeddings = model.encode(data['content'].tolist(), show_progress_bar=True)
# Note: "%%time" must be the very first line of a Jupyter cell, flush left,
# otherwise it raises:
# UsageError: Line magic function `%%time` not found.

(2) Advanced Version

Reference: "NLP实战之BERTopic主题分析" (a CSDN blog)

1 Data Preparation and Preprocessing

1.1 Data Preparation

import pandas as pd
import os

# Folder holding the raw Excel files
folder_path = os.path.join(os.getcwd(), 'tilu')
data0 = pd.DataFrame()
for file_name in os.listdir(folder_path):
    # Only process Excel files
    if file_name.endswith(".xlsx"):
        # Build the full file path
        file_path = os.path.join(folder_path, file_name)
        # Read the Excel file
        df = pd.read_excel(file_path)
        # Append this file's rows to the combined DataFrame
        data0 = pd.concat([data0, df], ignore_index=True)

data = data0
del data0
data.head()

1.2 Data Preprocessing

# Helper functions
def read(path):
    f = open(path, encoding="utf8")
    data = []
    for line in f.readlines():
        data.append(line)
    return data

# Load the stopword list
def getword(path):
    swlis = []
    for i in read(path):
        outsw = str(i).replace('\n', '')
        swlis.append(outsw)
    return swlis

# Build the tokenizer
import jieba

# Load a user dictionary (fill in the path)
jieba.load_userdict('')

# Stopword list
stopLists = list(getword('stop_words_jieba.txt'))

# Define the tokenizer
def paperCut(intxt):
    return [w for w in jieba.cut(intxt, cut_all=False, HMM=True)
            if w not in stopLists and len(w) >= 1]

def SentenceCut(doc):
    d = str(doc)
    d = paperCut(d)
    d = " ".join(d)
    return d

# Tokenize the target column
data['content'] = data['content'].apply(SentenceCut)
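The filtering step inside paperCut can be checked without jieba: the segmentation itself is jieba's job, so the sketch below starts from an already-split token list (the sample tokens and the tiny stopword set are made up for illustration):

```python
# Minimal, jieba-free sketch of the stopword filtering done in paperCut.
# The token list and the stopword set here are illustrative only.
stop_demo = {"的", "了", "是"}

def filter_tokens(tokens, stopwords):
    # Mirrors the `w not in stopLists and len(w) >= 1` condition above:
    # keep tokens that are not stopwords and at least one character long.
    return [w for w in tokens if w not in stopwords and len(w) >= 1]

tokens = ["今天", "的", "天气", "是", "晴朗"]
print(" ".join(filter_tokens(tokens, stop_demo)))  # 今天 天气 晴朗
```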

2 BERTopic Model Construction

2.1 Import bertopic and related third-party libraries

# Load the bertopic library
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

2.2 Embed the Documents

# Path to the sentence-embedding model
model_path = os.path.join(os.getcwd(), 'embedding_models', 'paraphrase-MiniLM-L12-v2')
# Load the sklearn-style stopword list
stop_words_path = os.path.join(os.getcwd(), 'stop_words_sklearn.txt')
stop_words_sklearn = list(getword(stop_words_path))

2.3 Train the BERTopic Topic Model

# Set the model parameters
model = BERTopic(verbose=True,
                 language="multilingual",
                 embedding_model=model_path,
                 umap_model=UMAP(n_neighbors=64,
                                  n_components=128, 
                                  min_dist=0.00,
                                  metric='cosine',
                                  random_state=100
                                 ),
                 hdbscan_model=HDBSCAN(min_cluster_size=100,
                                       metric='euclidean',
                                       core_dist_n_jobs=4,
                                       prediction_data=True),
                 vectorizer_model=CountVectorizer(
                                                  stop_words=stop_words_sklearn,
                                                  ngram_range=(1,1),
                                                  binary=False
                                                  ),
                 ctfidf_model=ClassTfidfTransformer(
                                                   bm25_weighting=True,
                                                   reduce_frequent_words=True),
                 n_gram_range=(1,1), 
                 calculate_probabilities=True, 
                 nr_topics="auto", 
                 top_n_words=20, 
                 min_topic_size=100)
  # Defaults, from ${project_home}/bertopic/_bertopic.py:

  # UMAP or another algorithm that has .fit and .transform functions
  self.umap_model = umap_model or UMAP(n_neighbors=15,
                                       n_components=5,
                                       min_dist=0.0,
                                       metric='cosine',
                                       low_memory=self.low_memory)

  # HDBSCAN or another clustering algorithm that has .fit and .predict functions
  # and the .labels_ variable to extract the labels
  self.hdbscan_model = hdbscan_model or hdbscan.HDBSCAN(min_cluster_size=self.min_topic_size,
                                                        metric='euclidean',
                                                        cluster_selection_method='eom',
                                                        prediction_data=True)

2.3.1 UMAP Parameters

For UMAP, the article "Understanding UMAP" explains things very clearly; its interactive figures are especially helpful for building intuition about each parameter.

n_neighbors: the (approximate) number of nearest neighbors. It balances UMAP's local versus global structure: smaller values make UMAP focus on local structure; larger values emphasize global structure at the cost of fine detail.

n_components: the dimensionality of the reduced embedding space.

min_dist: the minimum distance between embedded points. It controls how tightly UMAP packs points together: smaller values give tighter clumps, larger values a looser layout.

2.3.2 HDBSCAN Parameters

For HDBSCAN, the official documentation is clear; see "How HDBSCAN Works".

min_cluster_size: the smallest size a cluster may have (BERTopic passes its min_topic_size here, default 10). Larger values yield fewer but larger clusters; smaller values yield many micro-clusters.

metric: the distance metric, usually the default euclidean.

prediction_data: generally keep this set to True so that new points can be predicted; set it to False only if you will never predict.

2.3.3 c-TF-IDF Parameters

stop_words: sets the stopword language (or an explicit list of stopwords).

2.3.4 BERTopic Parameters

top_n_words: the number of words extracted per topic, typically between 10 and 30.

min_topic_size: the minimum topic size. Lower values create more topics; a value that is too high may create no topics at all.

nr_topics: the number of topics. It can be a concrete number, None for no constraint, or "auto" to reduce the topic count automatically.

diversity: whether to use MMR (Maximal Marginal Relevance) to diversify topic representations. Takes a value between 0 and 1, where 0 means no diversification and 1 maximum diversification; None disables MMR.
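To make the diversity parameter concrete, here is a minimal numpy sketch of MMR word selection. The function names, the toy 2-D "embeddings", and the word list are all illustrative, not BERTopic's internals: at each step a word is picked that is relevant to the topic but not redundant with words already chosen.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between matrices a (n, d) and b (m, d)
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_n @ b_n.T

def mmr(topic_emb, word_embs, words, top_n=5, diversity=0.5):
    # Relevance of each candidate word to the topic centroid
    word_topic_sim = cosine_sim(word_embs, topic_emb.reshape(1, -1)).ravel()
    word_word_sim = cosine_sim(word_embs, word_embs)
    selected = [int(np.argmax(word_topic_sim))]
    candidates = [i for i in range(len(words)) if i != selected[0]]
    while len(selected) < top_n and candidates:
        relevance = word_topic_sim[candidates]
        # Redundancy: similarity to the closest already-selected word
        redundancy = word_word_sim[np.ix_(candidates, selected)].max(axis=1)
        scores = (1 - diversity) * relevance - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return [words[i] for i in selected]

embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
topic = np.array([1.0, 0.0])
# High diversity skips the near-duplicate word "b":
print(mmr(topic, embs, ["a", "b", "c"], top_n=2, diversity=0.9))  # ['a', 'c']
```

With diversity=0 the selection degenerates to plain relevance ranking, which is why low values give near-synonymous topic words.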

3 Topic Extraction and Presentation

3.1 Topic-Word

# The model must be fitted before these calls; the source note omits that step, e.g.:
# topics, probs = model.fit_transform(data['content'].tolist())

# Inspect all topics
model.get_topic_info()

# Document count per topic
model.get_topic_freq()

# Find the topics a given search term most likely belongs to
model.find_topics(term, top_n=5)

# Top words of Topic 0
model.get_topic(0)

# Drop the "noise" cluster indexed -1
del top_n_words[-1]
# Print every current topic with its topic-word list
from pprint import pprint
for i in list(range(len(top_n_words) - 1)):
    print('Most 20 Important words in TOPIC {} :\n'.format(i))
    pprint(top_n_words[i])
    pprint('***' * 20)


# Visualize inter-topic distances
model.visualize_topics()

# Topic distribution of a single document
model.visualize_distribution(probs[0])

# Hierarchical clustering of topics
model.visualize_hierarchy(top_n_topics=20)

# Bar chart of the words of Topic 1
model.visualize_barchart(topics=[1])

# Topic similarity heatmap
model.visualize_heatmap(n_clusters=10)

# Term rank visualization
model.visualize_term_rank()

# Save the topic model (path/name is up to you)
model.save("bertopic_model")

# Reduce the number of topics (merge similar ones);
# docs is the list of documents the model was fitted on
model.reduce_topics(docs, nr_topics="auto")

3.2 Document-Topic

# Topic probability distribution of a single document
model.visualize_distribution(model.probabilities_[1], min_probability=0.015)
# Reduce the number of outlier documents
new_topics = model.reduce_outliers(summary, headline_topics, strategy="embeddings")
model.update_topics(summary, topics=new_topics,
                    vectorizer_model=CountVectorizer(
                                                    stop_words=stop_words_sklearn,
                                                    ngram_range=(1,1),
                                                    binary=False
                                                    ),
                    ctfidf_model=ClassTfidfTransformer(
                                                    bm25_weighting=True,
                                                    reduce_frequent_words=True)
                    )
# Refresh the topic and topic-word lists
model.get_topic_info()
# Top words of a given topic
model.get_topic(0)
# Get the topic predictions
topic_prediction = model.topics_[:]
# Save the predictions in the dataframe
data['主题预测'] = topic_prediction
# Take a look at the data
data.head()
data.to_excel("文档主题预测.xlsx", index=True)
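In model.topics_, the label -1 marks outlier documents. A small pandas sketch of attaching predictions and counting documents per topic (the data below is made up):

```python
import pandas as pd

# Hypothetical predictions as model.topics_ would return them; -1 = outlier
preds = [0, 1, 0, -1, 1, 0]
df = pd.DataFrame({"content": ["d1", "d2", "d3", "d4", "d5", "d6"]})
df["topic"] = preds

# Documents per topic; outliers show up under the -1 label
counts = df["topic"].value_counts()
print(counts.to_dict())  # {0: 3, 1: 2, -1: 1}
```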
# Merge topics (from the original BERTopic tutorial; c_tf_idf,
# extract_top_n_words_per_topic, and extract_topic_sizes are that
# tutorial's helper functions)
import numpy as np
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity

for i in tqdm(range(20)):
    # Calculate cosine similarity
    similarities = cosine_similarity(tf_idf.T)
    np.fill_diagonal(similarities, 0)

    # Extract label to merge into and from where
    topic_sizes = docs_df.groupby(['Topic']).count().sort_values("Doc", ascending=False).reset_index()
    topic_to_merge = topic_sizes.iloc[-1].Topic
    topic_to_merge_into = np.argmax(similarities[topic_to_merge + 1]) - 1

    # Adjust topics
    docs_df.loc[docs_df.Topic == topic_to_merge, "Topic"] = topic_to_merge_into
    old_topics = docs_df.sort_values("Topic").Topic.unique()
    map_topics = {old_topic: index - 1 for index, old_topic in enumerate(old_topics)}
    docs_df.Topic = docs_df.Topic.map(map_topics)
    docs_per_topic = docs_df.groupby(['Topic'], as_index=False).agg({'Doc': ' '.join})

    # Calculate new topic words
    m = len(data)
    tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m)
    top_n_words = extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20)

topic_sizes = extract_topic_sizes(docs_df); topic_sizes.head(10)

# Number of topics after merging
len(docs_per_topic.Topic.unique())
# 218
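The merging loop above relies on c_tf_idf and extract_top_n_words_per_topic helpers (from Grootendorst's original BERTopic tutorial) that are not defined in this note. Below is a self-contained numpy sketch of the same idea, using a naive whitespace tokenizer in place of CountVectorizer; the names and shapes follow that tutorial, but this is an approximation for illustration, not the library code:

```python
import numpy as np

def c_tf_idf(docs_per_class, m):
    # docs_per_class: one joined string per topic; m: total number of documents.
    # Returns (tf_idf of shape [n_words, n_classes], vocabulary list).
    vocab = sorted({w for d in docs_per_class for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    t = np.zeros((len(docs_per_class), len(vocab)))
    for row, d in enumerate(docs_per_class):
        for w in d.split():
            t[row, index[w]] += 1
    tf = t.T / t.sum(axis=1)                        # term frequency within each class
    idf = np.log(m / t.sum(axis=0)).reshape(-1, 1)  # rarity across the corpus
    return tf * idf, vocab

def extract_top_n_words_per_topic(tf_idf, vocab, topics, n=20):
    # Highest-scoring words in each topic's tf_idf column
    return {topic: [(vocab[j], tf_idf[j, i])
                    for j in tf_idf[:, i].argsort()[::-1][:n]]
            for i, topic in enumerate(topics)}

classes = ["apple apple banana", "car car bike"]
tf_idf, vocab = c_tf_idf(classes, m=8)
top = extract_top_n_words_per_topic(tf_idf, vocab, topics=[0, 1], n=1)
print(top[0][0][0], top[1][0][0])  # apple car
```

Words that are frequent within one class but rare across the corpus score highest, which is exactly what makes c-TF-IDF topic words discriminative.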

4 BERTopic Visualization

4.1 Topic-Word Probability Bar Chart

# Bar chart
fig_bar = model.visualize_barchart(
                         top_n_topics=16,
                         n_words=8,
                         title='主题词得分',  # if this raises an error, comment this line out
                         width=250, height=300
                         )
fig_bar
from plotly.io import write_html
with open("主题词得分.html", "w", encoding="utf8") as file:
    write_html(fig_bar, file)

4.2 Intertopic Distance Map

fig_clu = model.visualize_topics()
fig_clu
with open("主题关系.html", "w", encoding="utf8") as file:
    write_html(fig_clu, file)

4.3 Hierarchical Clustering Diagram

fig_hierarchy = model.visualize_hierarchy(top_n_topics=16,
                          #title='层次聚类图',
                          width=600,
                          height=600)
fig_hierarchy
with open("层次聚类图.html", "w", encoding="utf8") as file:
    write_html(fig_hierarchy, file)

4.4 Document-Topic Clustering Map

fig_doc_topic = model.visualize_documents(
                          topics=list(range(0,16)),
                          docs=summary,
                          hide_document_hover=False,
                          #title='文本主题聚类图',  # documents clustered by topic
                          width=1200,
                          height=750
                          )
fig_doc_topic
with open("文档主题聚类.html", "w", encoding="utf8") as file:
    write_html(fig_doc_topic, file)

4.5 Topic Similarity Heatmap

fig_heatmap = model.visualize_heatmap(top_n_topics=13,
                                      #title='主题相似度热力图',
                                      width=800,
                                      height=600)
fig_heatmap
with open("主题相似度热力图.html", "w", encoding="utf8") as file:
    write_html(fig_heatmap, file)

4.6 Dynamic Topic Model (DTM) Plot

timepoint = data['created_at'].tolist()
timepoint = pd.to_datetime(timepoint, format='%Y%m%d', errors='ignore')
topics_over_time = model.topics_over_time(summary,  
                                          timepoint,  
                                          datetime_format='mixed', 
                                          nr_bins=20,
                                          evolution_tuning=True)
fig_DTM = model.visualize_topics_over_time(topics_over_time,
                                 top_n_topics=7,
                                 #title='DTM',
                                 width=800,
                                 height=350)
fig_DTM
with open("DTM图.html", "w", encoding="utf8") as file:
    write_html(fig_DTM, file)
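The '%Y%m%d' format at the top of this section expects compact date strings like 20250221. The conversion step can be sanity-checked with pandas alone (the dates below are made up):

```python
import pandas as pd

# Made-up compact date strings in the same '%Y%m%d' shape as created_at
raw = ["20250101", "20250215", "20250302"]
timepoint = pd.to_datetime(raw, format="%Y%m%d")
print(timepoint[0].year, timepoint[1].month)  # 2025 2
```

If the real created_at values do not all match this format, errors='ignore' (as used above) silently returns the input unchanged, so checking the format first saves debugging later.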