I want to identify major n-grams in a bunch of academic papers, including n-grams with nested stopwords, but not n-grams with leading or trailing stopwords.
I have about 100 pdf files. I converted them to plain-text files through an Adobe batch command and collected them within a single directory. From there I use R. (It's a patchwork of code because I'm just getting started with text mining.)
My code:
library(tm)
# Make path for sub-dir which contains corpus files
path <- file.path(getwd(), "txt")
# Load corpus files
docs <- Corpus(DirSource(path), readerControl=list(reader=readPlain, language="en"))
#Cleaning
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
# Merge corpus (Corpus class to character vector)
txt <- c(docs, recursive=T)
# Find trigrams (but I might look for other ngrams as well)
library(quanteda)
myDfm <- dfm(txt, ngrams = 3)
# Remove sparse features
myDfm <- dfm_trim(myDfm, min_count = 5)
# Display top features
topfeatures(myDfm)
#                   as_well_as             of_the_ecosystem                  in_order_to
#                          603                          543                          458
#         a_business_ecosystem       the_business_ecosystem strategic_management_journal
#                          431                          431                          359
#             in_the_ecosystem        academy_of_management                  the_role_of
#                          336                          311                          289
#                the_number_of
#                          276
For example, in the sample of top n-grams above, I'd want to keep "academy_of_management", but neither "as_well_as" nor "the_role_of". I'd like the code to work for any n-gram (preferably including n-grams shorter than 3, although I understand that in that case it's simpler to just remove stopwords first).
2 solutions
#1
Here's how to do it in quanteda: use dfm_remove() with regex patterns that match a stopword followed by the concatenator character ("_") at the beginning of a feature, or the concatenator followed by a stopword at the end. (Note that, for reproducibility, I have used a built-in text object here.)
library("quanteda")
# omit this line to use your own txt
txt <- data_char_ukimmig2010
(myDfm <- dfm(txt, remove_numbers = TRUE, remove_punct = TRUE, ngrams = 3))
## Document-feature matrix of: 9 documents, 5,518 features (88.5% sparse).
(myDfm2 <- dfm_remove(myDfm,
                      pattern = c(paste0("^", stopwords("english"), "_"),
                                  paste0("_", stopwords("english"), "$")),
                      valuetype = "regex"))
## Document-feature matrix of: 9 documents, 1,763 features (88.6% sparse).
head(featnames(myDfm2))
## [1] "immigration_an_unparalleled" "bnp_can_solve" "solve_at_current"
## [4] "immigration_and_birth" "birth_rates_indigenous" "rates_indigenous_british"
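To see why those two anchored patterns do the job for n-grams of any length, here is a minimal base-R sketch of the same idea. The stopword list and feature names below are small hypothetical stand-ins (in the answer they would come from stopwords("english") and featnames(myDfm)), and the single combined alternation is my own simplification, not what dfm_remove() does internally.

```r
# Hypothetical stand-ins for stopwords("english") and featnames(myDfm)
stops <- c("as", "well", "the", "of", "in", "to", "a")
feats <- c("as_well_as", "academy_of_management", "the_role_of",
           "strategic_management_journal", "of_the_ecosystem",
           "business_ecosystem", "in_order")

# Build one alternation instead of one pattern per stopword:
#   "^(w1|w2|...)_" matches a leading stopword,
#   "_(w1|w2|...)$" matches a trailing stopword.
# Nested stopwords sit between two "_" and match neither branch.
alt <- paste(stops, collapse = "|")
edge_pattern <- paste0("^(", alt, ")_|_(", alt, ")$")

# Keep only features with no stopword at either edge
keep <- feats[!grepl(edge_pattern, feats)]
print(keep)
```

Because the patterns are anchored with "^" and "$", "academy_of_management" survives even though "of" is a stopword, and the same filter applies unchanged to bigrams such as "business_ecosystem". A stopword list containing regex metacharacters would need escaping before being pasted into the pattern.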
Bonus answer:
You can read your pdfs using the readtext package, which also works just fine with quanteda using the above code.
library("readtext")
txt <- corpus(readtext("yourpdfolder/*.pdf"))
#2
Using the corpus R package, with The Wizard of Oz as an example (Project Gutenberg ID#55):
library(corpus)
library(Matrix) # needed for sparse matrix operations
# download the corpus
corpus <- gutenberg_corpus(55)
# set the preprocessing options
text_filter(corpus) <- text_filter(drop_punct = TRUE, drop_number = TRUE)
# compute trigram statistics for terms appearing at least 5 times;
# specify `types = TRUE` to report component types as well
stats <- term_stats(corpus, ngrams = 3, min_count = 5, types = TRUE)
# discard trigrams starting or ending with a stopword
stats2 <- subset(stats, !type1 %in% stopwords_en & !type3 %in% stopwords_en)
# print first five results:
print(stats2, 5)
##    term               type1 type2 type3     count support
## 4  said the scarecrow said  the   scarecrow    36       1
## 7  back to kansas     back  to    kansas       28       1
## 16 said the lion      said  the   lion         19       1
## 17 said the tin       said  the   tin          19       1
## 48 road of yellow     road  of    yellow       12       1
## ⋮  (35 rows total)
# form a document-by-term count matrix for these terms
x <- term_matrix(corpus, select = stats2$term)
In your case, you can convert from the tm Corpus object with:
corpus <- as_corpus_frame(docs)
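The type1/type3 filter above is specific to trigrams. As a rough sketch of how the same check could be generalized to any n, the base-R snippet below splits each term on spaces and tests only its first and last token. The term and stopword vectors are hypothetical stand-ins for stats$term and stopwords_en, and has_clean_edges() is a helper name invented for this illustration, not part of the corpus package.

```r
# Hypothetical stand-ins for stats$term and corpus::stopwords_en
terms <- c("said the scarecrow", "back to kansas", "the wicked witch",
           "one of the", "yellow brick road", "emerald city", "of oz")
stops <- c("the", "of", "to", "one", "a")

# TRUE when neither the first nor the last token of a term is a stopword
# (illustrative helper; assumes non-empty, space-separated terms)
has_clean_edges <- function(term, stops) {
  tokens <- strsplit(term, " ", fixed = TRUE)[[1]]
  !(tokens[1] %in% stops) && !(tokens[length(tokens)] %in% stops)
}

# Works for bigrams, trigrams, and longer n-grams alike
keep <- terms[vapply(terms, has_clean_edges, logical(1), stops = stops)]
print(keep)
```

The resulting vector could then be passed on in place of stats2$term when building the term matrix.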