I have a list as follows:
remove_words = ['abc', 'deff', 'pls']
The following is my data frame, which has a column named 'string':
data['string']
0 abc stack overflow
1 abc123
2 deff comedy
3 definitely
4 pls lkjh
5 pls1234
I want to remove the words in the remove_words list from the pandas dataframe column, but only where they occur as standalone words, not as part of a longer word.
For example, if 'abc' occurs on its own in the column, replace it with '', but if it occurs as part of 'abc123', leave it as it is. The output here should be:
data['string']
0 stack overflow
1 abc123
2 comedy
3 definitely
4 lkjh
5 pls1234
In my actual data, I have 2000 words in the remove_words list and 5 billion records in the pandas dataframe, so I am looking for the most efficient way to do this.
I have tried a few things in Python without much success. Can anybody help me with this? Any ideas would be helpful.
Thanks
2 solutions
#1
6
Try this:
In [98]: pat = r'\b(?:{})\b'.format('|'.join(remove_words))
In [99]: pat
Out[99]: '\\b(?:abc|def|pls)\\b'
In [100]: df['new'] = df['string'].str.replace(pat, '', regex=True)
In [101]: df
Out[101]:
               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2          def comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
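For reference, here is a self-contained sketch of the same idea. It uses the `remove_words` list from the question rather than the one in the session above, and it assumes a recent pandas, where `Series.str.replace` needs `regex=True` to treat the pattern as a regular expression. Two things here are my additions and not part of the answer: `re.escape`, in case a word contains regex metacharacters, and a whitespace cleanup step, since removing a word leaves its neighbouring space behind.

```python
import re
import pandas as pd

remove_words = ['abc', 'deff', 'pls']
df = pd.DataFrame({'string': ['abc stack overflow', 'abc123', 'deff comedy',
                              'definitely', 'pls lkjh', 'pls1234']})

# Escape each word so it is matched literally, then wrap the
# alternation in word boundaries so only whole words match.
pat = r'\b(?:{})\b'.format('|'.join(re.escape(w) for w in remove_words))

df['new'] = (
    df['string']
    .str.replace(pat, '', regex=True)       # drop whole-word matches only
    .str.replace(r'\s+', ' ', regex=True)   # collapse runs of whitespace
    .str.strip()                            # trim leading/trailing spaces
)
print(df)
```

Note that `\b` treats digits and underscores as word characters, which is why 'abc123' and 'pls1234' are left untouched.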
#2
3
Totally taking @MaxU's pattern!
We can use pd.DataFrame.replace by setting the regex parameter to True and passing a dictionary of dictionaries that specifies, for each column, the pattern and what to replace it with.
pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
df.assign(new=df.replace(dict(string={pat: ''}), regex=True)['string'])
               string              new
0  abc stack overflow   stack overflow
1              abc123           abc123
2          def comedy           comedy
3          definitely       definitely
4            pls lkjh             lkjh
5             pls1234          pls1234
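To make the dictionary-of-dictionaries shape concrete, here is a minimal runnable sketch. The outer key is the column name and the inner dictionary maps a pattern to its replacement. The sample frame uses the question's data and `remove_words` list. Selecting the 'string' column from the result and the final `.str.strip()` are my additions, so treat them as assumptions about what you want, not as part of the answer.

```python
import pandas as pd

remove_words = ['abc', 'deff', 'pls']
df = pd.DataFrame({'string': ['abc stack overflow', 'abc123', 'deff comedy',
                              'definitely', 'pls lkjh', 'pls1234']})

# One alternation with word boundaries around every word.
pat = '|'.join(r'\b{}\b'.format(w) for w in remove_words)

# Outer dict: column name -> inner dict of {pattern: replacement}.
replaced = df.replace({'string': {pat: ''}}, regex=True)

# Pick the replaced column explicitly so this also works when df has
# more than one column, then trim the space left by a removed word.
out = df.assign(new=replaced['string'].str.strip())
print(out)
```

The same outer dictionary can carry further column names if several columns need cleaning in one call.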