Count the number of times specific words are used

Date: 2020-12-02 23:55:28

I want to perform text mining on several bank account descriptions. My first step would be to get a ranking of the words that are used the most in the descriptions.

So let's say I have a dataframe that looks like this:

    a                       b
    1 1          House expenses
    2 2 Office furniture bought
    3 3 Office supplies ordered

Then I want to create a ranking of word usage, like this:

    Name      Times
    1. Office   2
    2. Furniture 1

Etc...

Any thoughts on how I can quickly get an overview of the words that are used most in the descriptions?

2 solutions

#1

Another way to do this is with the tm package. You can create a corpus:

     library(tm)
     # Build a corpus from the description column; recent versions of tm's
     # DataframeSource expect doc_id/text columns, so VectorSource is simpler
     corpus <- VCorpus(VectorSource(data$b))
     # One row per document, one column per term
     dtm <- DocumentTermMatrix(corpus)
     # inspect() only prints a preview; as.matrix() gives the actual counts
     dtmDataFrame <- as.data.frame(as.matrix(dtm))

By default it computes term frequencies (tf) using "weightTf". I converted the document-term matrix into a dataframe. Now you have one row per document and one column per term, where each value is that term's frequency in that document, so you can build the ranking in a straightforward way by summing the values in each column:

     colSums(dtmDataFrame)

You can also sort the result afterwards. The nice thing about tm is that you can easily filter words out and preprocess them: removing stop words, stripping punctuation, stemming, and dropping sparse terms if you need to.
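As a sketch of that workflow on the question's three example descriptions (the `tm_map` cleanup calls and the `decreasing` sort are illustrative additions, not part of the original answer):

```r
library(tm)

# The three example descriptions from the question
docs <- c("House expenses", "Office furniture bought", "Office supplies ordered")
corpus <- VCorpus(VectorSource(docs))

# Optional preprocessing mentioned above: lowercase, punctuation, stop words
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm  <- DocumentTermMatrix(corpus)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
freq  # "office" appears twice; every other word once
```

`sort(..., decreasing = TRUE)` puts the most frequent terms first, which is exactly the ranking the question asks for.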

#2

    # Sample data (stringsAsFactors = FALSE keeps b as character)
    d <- data.frame(a = c(1, 2, 3),
                    b = c("1          House expenses",
                          "2 Office furniture bought",
                          "3 Office supplies ordered"),
                    stringsAsFactors = FALSE)
    e <- unlist(strsplit(d$b, " "))          # split descriptions into words
    f <- e[e != ""]                          # drop empties from repeated spaces
    g <- sapply(f, function(x) sum(f == x))  # count occurrences of each word
    h <- data.frame(Name = names(g), Times = g)
    h[!duplicated(h), ]                      # one row per distinct word
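For a small task like this, base R's `table()` gives the same ranking in fewer steps (a minimal alternative sketch, not part of the original answer; the regex that drops the leading row numbers is an assumption about the data format):

```r
d <- data.frame(a = c(1, 2, 3),
                b = c("1          House expenses",
                      "2 Office furniture bought",
                      "3 Office supplies ordered"),
                stringsAsFactors = FALSE)

words  <- unlist(strsplit(d$b, " +"))        # split on runs of spaces
words  <- words[!grepl("^[0-9]+$", words)]   # drop the leading row numbers
counts <- sort(table(words), decreasing = TRUE)
counts  # "Office" appears twice; every other word once
```

`table()` counts each distinct word once, so no deduplication step is needed.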
