Count occurrences of certain words in a pandas DataFrame

Date: 2022-08-10 22:17:18

I want to count the number of occurrences of certain words in a DataFrame. I know about using str.contains:

a = df2[df2['col1'].str.contains("sample")].groupby('col2').size()
n = a.apply(lambda x: 1).sum()

Currently I'm using the above code. Is there a method to match a regular expression and get the count of occurrences? In my case I have a large DataFrame and I want to match around 100 strings.

2 Answers

#1


11  

The str.contains method accepts a regular expression:

Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
Docstring:
Check whether given pattern is contained in each string in the array

Parameters
----------
pat : string
    Character sequence or regular expression
case : boolean, default True
    If True, case sensitive
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
na : default NaN, fill value for missing values.

For example:

In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

In [12]: df
Out[12]:
   words
0  hello
1  world

In [13]: df.words.str.contains(r'[hw]')
Out[13]:
0    True
1    True
Name: words, dtype: bool

In [14]: df.words.str.contains(r'he|wo')
Out[14]:
0    True
1    True
Name: words, dtype: bool

To count the occurrences (that is, the number of rows that contain a match), you can just sum this boolean Series:

In [15]: df.words.str.contains(r'he|wo').sum()
Out[15]: 2

In [16]: df.words.str.contains(r'he').sum()
Out[16]: 1
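
For the roughly 100 strings mentioned in the question, one option is to build a single alternation pattern from a list. The following is only a sketch: the `terms` list and the sample data are made up for illustration, and `re.escape` is used on the assumption that the strings should be matched literally rather than as regular expressions. It also shows `str.count`, which tallies every match inside a row, whereas `str.contains` only tells you whether a row matched at all.

```python
import re
import pandas as pd

# Hypothetical sample data and search terms, for illustration only.
df = pd.DataFrame({"words": ["hello world", "hello hello", "goodbye", None]})
terms = ["hello", "world", "c++"]

# Escape each term so characters such as "+" are matched literally,
# then join the terms into one alternation pattern.
pattern = "|".join(re.escape(t) for t in terms)

# Number of rows containing at least one term.
# na=False makes missing values count as non-matches.
rows_matched = int(df["words"].str.contains(pattern, na=False).sum())

# Total number of matches across all rows; one row can contribute several.
# Missing values produce NaN here, which sum() skips.
total_matches = int(df["words"].str.count(pattern).sum())

# Number of matching rows for each term separately.
per_term = {
    t: int(df["words"].str.contains(re.escape(t), na=False).sum())
    for t in terms
}

print(rows_matched, total_matches, per_term)
```

If the terms are meant to be regular expressions themselves, drop the `re.escape` call and join them as they are.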

#2


3  

To count the total number of matches, use s.str.match(...).str.get(0).count().

If your regex will be matching several unique words, to be tallied individually, use s.str.match(...).str.get(0).groupby(lambda x: x).count()

It works like this:

In [12]: s
Out[12]: 
0    ax
1    ay
2    bx
3    by
4    bz
dtype: object

The match string method handles regular expressions...

In [13]: s.str.match('(b[x-y]+)')
Out[13]: 
0       []
1       []
2    (bx,)
3    (by,)
4       []
dtype: object

...but the results, as given, are not very convenient. The string method get takes the matches as strings and converts empty results to NaNs...

In [14]: s.str.match('(b[x-y]+)').str.get(0)
Out[14]: 
0    NaN
1    NaN
2     bx
3     by
4    NaN
dtype: object

...which are not counted.

In [15]: s.str.match('(b[x-y]+)').str.get(0).count()
Out[15]: 2
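
The output above comes from an old pandas release. In recent versions, as far as I know, `str.match` returns a boolean Series instead of the captured groups, so the `.str.get(0)` step no longer applies. A rough equivalent, sketched here with `str.extract` and `value_counts` rather than the calls used in this answer, would be:

```python
import pandas as pd

s = pd.Series(["ax", "ay", "bx", "by", "bz"])

# With a single capture group and expand=False, str.extract returns a Series
# holding the captured text, or NaN where the pattern does not match.
matched = s.str.extract(r"(b[x-y]+)", expand=False)

# count() skips NaN, so this is the number of rows that matched.
total = int(matched.count())

# value_counts() tallies each distinct match separately (NaN is excluded).
tally = matched.value_counts().to_dict()

print(total, tally)
```

`value_counts` plays the role of the `groupby(lambda x: x).count()` suggestion, giving one count per distinct matched word.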
