R - How to: for each word in a list, count how often the word occurs in a column of e.g. 3000 words

Time: 2022-07-30 01:42:09

I have a dataset x with a bunch of text (columns: title, location, contents) in about 3000 rows.

EDIT: an example.

title | location | contents
... | DUBAI | ....
... | DUBAI | ....
... | KHARTOUM | ....
... | KHARTOUMSUDAN | ....
... | JAKARTA | ....

link to image example

I have a list of locations: locations <- c("DUBAI", "KHARTOUM", "JAKARTA", "Paris")

Now I want to write a loop that starts with Dubai, counts how many rows it occurs in, and stores that count in a variable. Then I want to move on to the next word in the locations list (Khartoum) and do the same thing.

So in this case I would expect to see: Dubai = 2, Khartoum = 2, Jakarta = 1.

I have this so far, but I don't know how to generalize it and make it into a loop:

numberDUBAI <- nrow(dplyr::filter(x, grepl(' DUBAI ', location))) 

and then I repeat it for each word

numberLOCATIONS <- c(numberDUBAI, numberKHARTOUM, numberJAKARTA, numberPARIS)

but this feels very inefficient, help? :D

1 solution

#1


4  

We can do this with tidyverse using map

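library(tidyverse)
# For each pattern in 'locations', count the rows of 'x' whose 'location' matches it.
# str_detect() has no ignore_case argument; case-insensitivity goes through regex().
map(locations, ~
               x %>%
                  summarise(n = sum(str_detect(location, regex(.x, ignore_case = TRUE))))
      )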

NOTE: This assumes that 'x' is the dataset, 'location' is the column to search, and 'locations' (from the OP's post) is a vector of patterns. map() returns a list with one single-row summary per pattern.
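
If a single named vector of counts is more convenient than a list, the same idea can be sketched in base R. This is only an illustration under assumptions: the toy 'x' below is made up to mirror the example rows in the question, and the name 'counts' is hypothetical. Note that the match is a substring match, so KHARTOUMSUDAN counts towards KHARTOUM, as the OP expects.

```r
# Toy data mirroring the example in the question (illustrative values only)
x <- data.frame(
  location = c("DUBAI", "DUBAI", "KHARTOUM", "KHARTOUMSUDAN", "JAKARTA"),
  stringsAsFactors = FALSE
)
locations <- c("DUBAI", "KHARTOUM", "JAKARTA", "Paris")

# For each pattern, count the rows whose 'location' contains it
# (case-insensitive). sapply() names the result after 'locations'.
counts <- sapply(locations, function(p) {
  sum(grepl(p, x$location, ignore.case = TRUE))
})

counts
# DUBAI = 2, KHARTOUM = 2, JAKARTA = 1, Paris = 0
```

Individual counts can then be looked up by name, e.g. counts["DUBAI"], instead of keeping one variable per location.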