I have a corpus of around 160,000 documents. I want to do topic modeling on it using LDA in R (specifically the function lda.collapsed.gibbs.sampler in the lda package).
I want to determine the optimal number of topics. The common procedure seems to be to take a vector of candidate topic numbers, e.g., from 1 to 100, run the model 100 times, and pick the number of topics with the largest harmonic mean or the smallest perplexity.
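For concreteness, a minimal sketch of that grid-search procedure with the lda package is below. Here docs and vocab are placeholders for a corpus already in the format produced by lexicalize(), and the grid, iteration counts, and hyperparameters are only illustrative:

library(lda)

## Log of the harmonic mean of the likelihoods, computed stably in log space.
harmonic_mean <- function(log_lik) {
  m <- max(-log_lik)
  log(length(log_lik)) - (m + log(sum(exp(-log_lik - m))))
}

candidate_k <- seq(10, 100, by = 10)  # illustrative grid, not 1:100
burnin <- 100

scores <- sapply(candidate_k, function(k) {
  fit <- lda.collapsed.gibbs.sampler(documents = docs, K = k, vocab = vocab,
                                     num.iterations = 500, alpha = 50 / k,
                                     eta = 0.1, burnin = burnin,
                                     compute.log.likelihood = TRUE)
  ## first row of log.likelihoods holds the full per-iteration log likelihood;
  ## drop the burn-in iterations before averaging
  harmonic_mean(fit$log.likelihoods[1, -(1:burnin)])
})

best_k <- candidate_k[which.max(scores)]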
However, given the large number of documents, the optimal number of topics can easily reach several hundred or even several thousand. I find that as the number of topics increases, the computation time grows significantly. Even if I use parallel computing, it will take several days or weeks.
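Since the runs for different values of K are independent, the grid itself can at least be spread across cores. A sketch with the base parallel package is below; fit_and_score is a hypothetical helper wrapping one sampler run plus the scoring step, and mclapply falls back to serial execution on Windows:

library(parallel)

## fit_and_score(k) is hypothetical: one run of lda.collapsed.gibbs.sampler
## for a given K followed by the harmonic-mean (or perplexity) computation.
scores <- mclapply(candidate_k, fit_and_score, mc.cores = detectCores() - 1)
scores <- unlist(scores)
best_k <- candidate_k[which.max(scores)]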
I wonder whether there is a better (more time-efficient) way to choose the optimal number of topics, or whether there is any way to reduce the computation time.
Any suggestion is welcome.
1 solution
#1
Start with a guess somewhere in the middle, then decrease and increase the number of topics in steps of, say, 50 or 100 instead of 1. Check in which direction the coherence score increases. I am sure it will converge.
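A sketch of that coarse-to-fine search is below. score_k is a hypothetical scoring function (higher is better); it could be the harmonic-mean score from the question or a probabilistic-coherence measure such as textmineR::CalcProbCoherence:

## wide steps first
coarse_grid <- seq(100, 1000, by = 100)
coarse_scores <- sapply(coarse_grid, score_k)
k_star <- coarse_grid[which.max(coarse_scores)]

## then refine around the best coarse value with a smaller step
fine_grid <- seq(max(k_star - 100, 2), k_star + 100, by = 25)
fine_scores <- sapply(fine_grid, score_k)
best_k <- fine_grid[which.max(fine_scores)]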