Count the 100 most frequent words from sentences in a Pandas DataFrame column

Date: 2020-12-21 12:44:58

I have text reviews in one column of a Pandas dataframe and I want to count the N most frequent words with their frequency counts (in the whole column - NOT in a single cell). One approach is to count the words with a Counter, iterating through each row. Is there a better alternative?

Representative data.

0    a heartening tale of small victories and endu
1    no sophomore slump for director sam mendes  w
2    if you are an actor who can relate to the sea
3    it's this memory-as-identity obviation that g
4    boyd's screenplay ( co-written with guardian
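For reference, a minimal sketch of the row-by-row Counter approach mentioned in the question; the column name "text" and the toy reviews are assumptions for illustration.

from collections import Counter
import pandas as pd

# toy data standing in for the review column
df = pd.DataFrame({"text": [
    "a heartening tale of small victories",
    "no sophomore slump for director sam mendes",
]})

# update one Counter per row, then take the top N
counts = Counter()
for review in df["text"]:
    counts.update(review.lower().split())

print(counts.most_common(100))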

2 Answers

#1


from collections import Counter
Counter(" ".join(df["text"]).split()).most_common(100)

I'm pretty sure this would give you what you want (you might have to remove some non-words from the Counter result before calling most_common).

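On removing non-words: one possible way (an assumption on my part, not from the answer) is to tokenize with a regex instead of str.split, so punctuation and digits never reach the Counter.

import re
from collections import Counter

# keep only alphabetic tokens (and apostrophes); the pattern is an assumption
tokens = re.findall(r"[a-z']+", " ".join(df["text"]).lower())
print(Counter(tokens).most_common(100))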

#2


Along with @Joran's solution, you could also use Series.value_counts for large amounts of text/rows:

 pd.Series(' '.join(df['text']).lower().split()).value_counts()[:100]
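As a small follow-up (my addition, not part of the answer), the resulting Series can be turned into a tidy word/count DataFrame if that is easier to work with downstream.

top = pd.Series(' '.join(df['text']).lower().split()).value_counts()[:100]
# the index holds the words, the values hold the counts
top_df = top.rename_axis('word').reset_index(name='count')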

From the benchmarks below, Series.value_counts appears to be about twice (2x) as fast as the Counter method.

For a movie-reviews dataset of 3,000 rows, totaling ~400K characters and ~70K words:

In [448]: %timeit Counter(" ".join(df.text).lower().split()).most_common(100)
10 loops, best of 3: 44.2 ms per loop

In [449]: %timeit pd.Series(' '.join(df.text).lower().split()).value_counts()[:100]
10 loops, best of 3: 27.1 ms per loop
