I have a corpus of around 160,000 documents. I want to do topic modeling on it using LDA in R (specifically the function lda.collapsed.gibbs.sampler in the lda package).
I want to determine the optimal number of topics. The common procedure seems to be to take a vector of candidate topic numbers, e.g., from 1 to 100, run the model 100 times, and pick the number of topics with the largest harmonic mean or the smallest perplexity.
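For concreteness, a minimal sketch of that grid-search procedure with the lda package is below. Here docs and vocab are placeholders for a corpus already in the format produced by lexicalize(), and the grid, iteration counts, and hyperparameters are only illustrative:

library(lda)

## Log of the harmonic mean of the likelihoods, computed stably in log space.
harmonic_mean <- function(log_lik) {
  m <- max(-log_lik)
  log(length(log_lik)) - (m + log(sum(exp(-log_lik - m))))
}

candidate_k <- seq(10, 100, by = 10)  # illustrative grid, not 1:100
burnin <- 100

scores <- sapply(candidate_k, function(k) {
  fit <- lda.collapsed.gibbs.sampler(documents = docs, K = k, vocab = vocab,
                                     num.iterations = 500, alpha = 50 / k,
                                     eta = 0.1, burnin = burnin,
                                     compute.log.likelihood = TRUE)
  ## first row of log.likelihoods holds the full per-iteration log likelihood;
  ## drop the burn-in iterations before averaging
  harmonic_mean(fit$log.likelihoods[1, -(1:burnin)])
})

best_k <- candidate_k[which.max(scores)]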
However, given the large number of documents, the optimal number of topics can easily reach several hundred or even several thousand. I find that as the number of topics increases, the computation time grows significantly. Even if I use parallel computing, it will take several days or weeks.
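Since the runs for different values of K are independent, the grid itself can at least be spread across cores. A sketch with the base parallel package is below; fit_and_score is a hypothetical helper wrapping one sampler run plus the scoring step, and mclapply falls back to serial execution on Windows:

library(parallel)

## fit_and_score(k) is hypothetical: one run of lda.collapsed.gibbs.sampler
## for a given K followed by the harmonic-mean (or perplexity) computation.
scores <- mclapply(candidate_k, fit_and_score, mc.cores = detectCores() - 1)
scores <- unlist(scores)
best_k <- candidate_k[which.max(scores)]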
I wonder whether there is a better (more time-efficient) way to choose the optimal number of topics, or whether there is any way to reduce the computation time.
Any suggestion is welcome.
1 solution
#1
Start with a guess somewhere in the middle, then decrease and increase the number of topics in steps of, say, 50 or 100 instead of 1. Check in which direction the coherence score increases. I am sure it will converge.
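A sketch of that coarse-to-fine search is below. score_k is a hypothetical scoring function (higher is better); it could be the harmonic-mean score from the question or a probabilistic-coherence measure such as textmineR::CalcProbCoherence:

## wide steps first
coarse_grid <- seq(100, 1000, by = 100)
coarse_scores <- sapply(coarse_grid, score_k)
k_star <- coarse_grid[which.max(coarse_scores)]

## then refine around the best coarse value with a smaller step
fine_grid <- seq(max(k_star - 100, 2), k_star + 100, by = 25)
fine_scores <- sapply(fine_grid, score_k)
best_k <- fine_grid[which.max(fine_scores)]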