I have a pandas DataFrame with 3 columns: key1, key2, document. All three are text fields, with document ranging from 50 to 5000 characters. For each (key1, key2) pair I identify a vocabulary, based on a minimum frequency, from that pair's set of documents; for this I use scikit-learn's CountVectorizer and set min_df. I am able to do this with df.groupby(['key1','key2'])['document'].apply(vocab).reset_index(), where vocab is a function that computes and returns the vocabulary (as defined above) as a set.
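For reference, a minimal sketch of what the vocab function looks like (the min_df=2 here is purely for illustration, not the value I actually use):

from sklearn.feature_extraction.text import CountVectorizer

def vocab(docs, min_df=2):
    # Fit on this group's documents; min_df keeps only terms that
    # appear in at least min_df documents of the group.
    cv = CountVectorizer(min_df=min_df)
    cv.fit(docs)
    return set(cv.vocabulary_)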
Now I would like to use these vocabularies (one set per (key1, key2) pair) to filter the corresponding documents, so that each document contains only words that are in its vocabulary. I would appreciate any help with this part.
Sample data
Input
key1 | key2 | document
aa | bb | He went home that evening. Then he had soup for dinner.
aa | bb | We want to sit down and eat dinner
cc | mm | Sometimes people eat in a restaurant
aa | bb | The culinary skills of that chef are terrible. Let us not go there.
cc | mm | People go home after dinner and try to sleep.
Vocabulary - ignoring counts for the purpose of this example
key1 | key2 | vocab
aa | bb | {went, evening, sit, down, culinary, chef, dinner}
cc | mm | {people, restaurant, home, dinner, sleep}
Result - each document keeps only words from its corresponding vocab
key1 | key2 | document
aa | bb | went evening dinner
aa | bb | sit down dinner
cc | mm | people restaurant
aa | bb | culinary chef
cc | mm | people home dinner sleep
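For anyone who wants to run the answer's code, the sample input above can be built as a DataFrame like this:

import pandas as pd

df = pd.DataFrame({
    'key1': ['aa', 'aa', 'cc', 'aa', 'cc'],
    'key2': ['bb', 'bb', 'mm', 'bb', 'mm'],
    'document': [
        'He went home that evening. Then he had soup for dinner.',
        'We want to sit down and eat dinner',
        'Sometimes people eat in a restaurant',
        'The culinary skills of that chef are terrible. Let us not go there.',
        'People go home after dinner and try to sleep.',
    ],
})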
1 Answer
#1
You can first use merge to add a vocab column to the original DataFrame:
# build the vocabulary per (key1, key2) group, then attach it as a 'vocab' column
df2 = df.groupby(['key1','key2'])['document'].apply(vocab).reset_index(name='vocab')
df = pd.merge(df, df2, on=['key1','key2'], how='left')
# another theoretical solution
# df['vocab'] = df.groupby(['key1','key2'])['document'].transform(vocab)
Then lowercase each document and extract all of its words with findall (CountVectorizer lowercases by default, so the vocabulary sets are lowercase); the vocab column is removed at the end:
df['document'] = df['document'].str.lower().str.findall(r'\w+')
Finally, take the intersection of each document's word set with its vocab and convert back to strings with str.join:
df['document'] = df.apply(lambda x: set(x['document']) & x['vocab'], axis=1).str.join(' ')
df = df.drop('vocab', axis=1)
print(df)
key1 key2 document
0 aa bb evening went dinner
1 aa bb sit down dinner
2 cc mm restaurant people
3 aa bb chef culinary
4 cc mm home people sleep dinner
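Note that the set intersection discards duplicate words and the original word order (which is why the output above does not match the result table word for word). If order matters, an alternative last step (replacing the intersection above, before dropping vocab) is to filter the token lists directly:

# keep tokens in document order, dropping only out-of-vocabulary words
df['document'] = df.apply(
    lambda x: ' '.join(w for w in x['document'] if w in x['vocab']),
    axis=1)
df = df.drop('vocab', axis=1)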