Text preprocessing and topic modelling using the text2vec package

Time: 2022-02-27 15:52:34

I have a large number of documents and I want to do topic modelling using text2vec and LDA (Gibbs Sampling).

The steps I need are (in order):

  1. Removing numbers and symbols from the text

    library(stringr)
    docs$text <- stringr::str_replace_all(docs$text,"[^[:alpha:]]", " ")
    docs$text <- stringr::str_replace_all(docs$text,"\\s+", " ")
    
  2. Removing stop words

    library(text2vec)        
    library(tm)
    
    stopwords <- c(tm::stopwords("english"),custom_stopwords)
    
    prep_fun <- tolower
    tok_fun <- word_tokenizer
    tokens <- docs$text %>%
              prep_fun %>%
              tok_fun
    it <- itoken(tokens, 
                ids = docs$id,
                progressbar = FALSE)
    
    v <- create_vocabulary(it, stopwords = stopwords) %>% 
        prune_vocabulary(term_count_min = 10)
    
    vectorizer <- vocab_vectorizer(v)
    
  3. Replacing synonyms by terms

I have an Excel file in which the first column is the main word and the synonyms are listed in the second, third, ... columns. I want to replace all synonyms with the main words (column 1). Each term can have a different number of synonyms. Here is an example using the "tm" package (but I am interested in an equivalent for the text2vec package):

    replaceSynonyms <- content_transformer(function(x, syn = NULL) {
      Reduce(function(a, b) {
        gsub(paste0("\\b(", paste(b$syns, collapse = "|"), ")\\b"), b$word, a, perl = TRUE)
      }, syn, x)
    })

    l <- lapply(as.data.frame(t(Synonyms), stringsAsFactors = FALSE),
                function(x) {
                  x <- unname(x)
                  list(word = x[1], syns = x[-1])
                })
    names(l) <- paste0("list", Synonyms[, 1])
    list2env(l, envir = .GlobalEnv)

    synonyms <- list()
    for (i in 1:length(names(l))) synonyms[i] = l[i]

    MyCorpus <- tm_map(MyCorpus, replaceSynonyms, synonyms)

  4. Convert to a document-term matrix

    dtm  <- create_dtm(it, vectorizer)
    
  5. Apply an LDA model to the document-term matrix

    doc_topic_prior <- 0.1  # can be chosen based on data?
    lda_model <- LDA$new(n_topics = 10,
              doc_topic_prior = doc_topic_prior, topic_word_prior = 0.01)
    # arguments are passed with `=`, not `<-`
    doc_topic_distr <- lda_model$fit_transform(dtm, n_iter = 1000, convergence_tol = 0.01,
                                               check_convergence_every_n = 10)
    

MyCorpus in Step 3 is the corpus obtained using the "tm" package. Step 2 and Step 3 do not work together, because the output of Step 2 is a vocabulary while the input for Step 3 is a "tm" corpus.

My first question is: how can I do all of these steps using the text2vec package (and compatible packages)? I have found it very efficient, thanks to Dmitriy Selivanov.

Second: how do we set optimal values for the LDA parameters in Step 5? Is it possible to set them automatically based on the data?

Thanks to Manuel Bickel for corrections in my post.

Thanks, Sam

1 answer

Updated answer in response to your comment:

First question: The issue of synonym replacement has already been answered here: Replace words in text2vec efficiently. See the answer by count in particular. Patterns and replacements may be ngrams (multi-word phrases). Please note that the second answer, by Dmitriy Selivanov, uses word_tokenizer() and does not cover ngram replacement in the form presented.
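As a rough sketch of replacing synonyms directly in the raw text (before tokenization), something like the following could work. The helper name replace_synonyms is my own and not part of text2vec, and I assume the synonyms are plain strings without regex metacharacters. Sorting by length makes a phrase get replaced before any shorter word contained in it, and the word boundaries keep "hot" from matching inside "hotel".

```r
# Hypothetical helper (not part of text2vec): replace synonyms in raw text.
replace_synonyms <- function(x, synonym, mainword) {
  stopifnot(length(synonym) == length(mainword))
  # Longest patterns first, so a phrase is handled before words inside it
  ord <- order(nchar(synonym), decreasing = TRUE)
  for (i in ord) {
    # \\b restricts the match to whole words; plain strings are assumed
    pattern <- paste0("\\b", synonym[i], "\\b")
    x <- gsub(pattern, mainword[i], x, perl = TRUE)
  }
  x
}

docs <- c("the coffee is hot",
          "the coffee is boiling like lava",
          "a hotel with frozen coffee")

res <- replace_synonyms(docs,
                        synonym  = c("hot", "frozen", "boiling like lava"),
                        mainword = c("warm", "cold", "warm"))
print(res)
```

The result can then be passed on to prep_fun and tok_fun as usual.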

Is there any reason why you need to replace synonyms before stopword removal? Usually this order should not cause problems; or do you have an example in which switching the order produces significantly different results? If you really want to replace synonyms after stopword removal, I guess that you would have to apply such changes to the dtm when using text2vec. If you do so, you need to allow ngrams in your dtm that are at least as long as the longest ngram in your synonyms. I have provided a workaround in the code below as one option. Please note that allowing higher-order ngrams in your dtm produces noise that may or may not influence your downstream tasks (you can probably prune most of the noise in the vocabulary step). Therefore, replacing ngrams earlier seems to be the better solution.
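One case where the order can matter is a multi-word synonym that itself contains a stop word. The following base R sketch illustrates this; the stop word list is made up for the example (I have not checked whether "like" is in tm::stopwords("english")).

```r
# Illustrative stop word list, not tm::stopwords("english")
stop_words <- c("the", "is", "like")
doc <- "the coffee is boiling like lava"

# Remove stop words first, then look for the synonym phrase
tokens  <- strsplit(tolower(doc), "\\s+")[[1]]
no_stop <- paste(tokens[!tokens %in% stop_words], collapse = " ")

found_before <- grepl("boiling like lava", doc, fixed = TRUE)
found_after  <- grepl("boiling like lava", no_stop, fixed = TRUE)

print(no_stop)
print(c(before = found_before, after = found_after))
```

Once "like" has been dropped, the phrase no longer occurs in the text, so a replacement table that still contains "boiling like lava" would silently miss it.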

Second question: You might check the textmineR package (and its source code), which helps you select the best number of topics, or the answer to this question: Topic models: cross validation with loglikelihood or perplexity. Regarding the handling of priors, I have not yet figured out in detail how different packages, e.g., text2vec (WarpLDA algorithm), lda (Collapsed Gibbs Sampling algorithm and others), or topicmodels ('standard' Gibbs Sampling and Variational Expectation-Maximization algorithm), handle these values. As a starting point, you might have a look at the detailed documentation of topicmodels: chapter "2.2. Estimation" tells you how the alpha and beta parameters defined in "2.1 Model specification" are estimated.
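To make the perplexity criterion a bit more concrete, here is a minimal base R sketch of how it can be computed from a count matrix and the two distributions of a fitted topic model. The function name perplexity_dense is hypothetical and it assumes small dense matrices; for real data you would evaluate it on held-out documents and compare values across different numbers of topics or priors.

```r
# Hypothetical helper: perplexity of a document-term count matrix under a
# fitted topic model (dense matrices assumed for brevity).
perplexity_dense <- function(dtm, doc_topic, topic_word) {
  # p[d, w] = sum over topics k of doc_topic[d, k] * topic_word[k, w]
  p <- doc_topic %*% topic_word
  nz <- dtm > 0
  # log-likelihood of the observed counts, skipping zero cells
  log_lik <- sum(dtm[nz] * log(p[nz]))
  exp(-log_lik / sum(dtm))
}

# Toy data: 2 documents, 3 terms, 2 topics, everything uniform
dtm <- matrix(c(2, 0, 1,
                0, 3, 1), nrow = 2, byrow = TRUE)
doc_topic  <- matrix(0.5, nrow = 2, ncol = 2)
topic_word <- matrix(1 / 3, nrow = 2, ncol = 3)

pp <- perplexity_dense(dtm, doc_topic, topic_word)
print(pp)
```

With a uniform model over three terms every word has probability 1/3, so the perplexity equals the vocabulary size, 3. Lower values on held-out data indicate a better fit.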

For the purpose of learning, please note that your code produced errors at two points, which I have revised: (1) you need to use the correct name for the stopwords in create_vocabulary(), i.e., stopwords instead of stop_words, since that is the name you defined; (2) you do not need vocabulary = ... in your LDA model definition. Maybe you are using an older version of text2vec?

library(text2vec) 
library(reshape2)
library(stringi)

#function proposed by @count
mgsub <- function(pattern, replacement, x) {
  if (length(pattern) != length(replacement)) {
    stop("pattern and replacement must have the same length")
  }
  for (v in seq_along(pattern)) {
    x <- gsub(pattern[v], replacement[v], x, perl = TRUE)
  }
  return(x)
}

docs <- c("the coffee is warm",
          "the coffee is cold",
          "the coffee is hot",
          "the coffee is boiling like lava",
          "the coffee is frozen",
          "the coffee is perfect",
          "the coffee is warm almost hot"
)

synonyms <- data.frame(mainword = c("warm", "cold")
                       ,syn1 = c("hot", "frozen")
                       ,syn2 = c("boiling like lava", "")
                       ,stringsAsFactors = FALSE)

synonyms[synonyms == ""] <- NA

synonyms <- reshape2::melt(synonyms
                           ,id.vars = "mainword"
                           ,value.name = "synonym"
                           ,na.rm = TRUE)

synonyms <- synonyms[, c("mainword", "synonym")]


prep_fun <- tolower
tok_fun <- word_tokenizer
tokens <- docs %>% 
  #here is where you might replace synonyms directly in the docs
  #{ mgsub(synonyms[,"synonym"], synonyms[,"mainword"], . ) } %>%
  prep_fun %>% 
  tok_fun
it <- itoken(tokens, 
             progressbar = FALSE)

v <- create_vocabulary(it,
                       sep_ngram = "_",
                       ngram = c(ngram_min = 1L
                                 #allow for ngrams in dtm
                                 #number of words in a phrase = number of spaces + 1
                                 ,ngram_max = max(stri_count_fixed(unlist(synonyms), " ")) + 1L
                                 )
)

vectorizer <- vocab_vectorizer(v)
dtm <- create_dtm(it, vectorizer)

#ngrams in dtm
colnames(dtm)

#ensure that ngrams in synonym replacement table have the same format as ngrams in dtm
synonyms <- apply(synonyms, 2, function(x) gsub(" ", "_", x))

colnames(dtm) <- mgsub(synonyms[,"synonym"], synonyms[,"mainword"], colnames(dtm))


#only zeros/ones in dtm since none of the docs specified in my example
#contains duplicate terms
dim(dtm)
#7 35
max(dtm)
#1

#workaround to aggregate colnames in dtm
#I think there is no function `colsum` that allows grouping
#therefore, a workaround based on rowsum
#not elegant because you have to transpose two times,
#convert to matrix and reconvert to sparse matrix
dtm <-
  Matrix::Matrix(
    t(
      rowsum(t(as.matrix(dtm)), group = colnames(dtm))
    )
    , sparse = TRUE)


#synonyms in columns replaced
dim(dtm)
#7 28
max(dtm)
#2
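
For larger corpora, converting the dtm to a dense matrix may not be feasible. One possible alternative, sketched here under the assumption that the dtm is a sparse matrix from the Matrix package, is to multiply it by a sparse indicator matrix that maps each old column to its merged column. The helper name aggregate_columns is my own; I have not benchmarked it against the rowsum workaround.

```r
library(Matrix)

# Hypothetical helper: sum the columns of a sparse matrix that share a name,
# without converting to a dense matrix.
aggregate_columns <- function(m) {
  groups <- unique(colnames(m))
  # Indicator matrix: old column j contributes to merged column match(...)[j]
  map <- sparseMatrix(i = seq_len(ncol(m)),
                      j = match(colnames(m), groups),
                      x = rep(1, ncol(m)),
                      dims = c(ncol(m), length(groups)),
                      dimnames = list(NULL, groups))
  out <- m %*% map
  rownames(out) <- rownames(m)
  out
}

# Toy dtm with a duplicated column name "warm"
m <- sparseMatrix(i = c(1, 1, 2), j = c(1, 3, 2), x = c(1, 1, 1),
                  dims = c(2, 3),
                  dimnames = list(c("d1", "d2"), c("warm", "cold", "warm")))

agg <- aggregate_columns(m)
print(agg)
```

In the toy example the two "warm" columns of document d1 are added up, so the merged matrix has two columns and a count of 2 for "warm" in d1.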
