I am trying to add up multiple dictionaries (summing the values of common keys), grouped by a categorical variable in another column. I tried groupby with agg, groupby with sum, and Counter(). I have other continuous columns too, but I do not want to add them up. I keep getting errors or undesired output.
import pandas as pd
import numpy as np
from collections import Counter
# input
df1 = pd.DataFrame([
    ['Cat1', {'Word1': 8, 'Word2': 7, 'Word3': 6, 'Word4': 1}],
    ['Cat2', {'Word2': 7, 'Word4': 7, 'Word3': 6}],
    ['Cat2', {'Word3': 3, 'Word5': 2}],
    ['Cat1', {'Word1': 10, 'Word3': 5, 'Word4': 1}]], columns=list('AB'))
# desired output
df_out = pd.DataFrame([
    ['Cat1', {'Word1': 18, 'Word2': 7, 'Word3': 11, 'Word4': 2}],
    ['Cat2', {'Word2': 7, 'Word3': 9, 'Word4': 7, 'Word5': 2}]], columns=list('AB'))
df_out
# Trial 1 - groupby
for i in range(len(df1)):
    df1.groupby('A')['B'].agg({df1['B'][i])
# Trial 2 - Counter
counter = Counter()
for d in range(len(df['B']):
    counter.update(d)
Any help is appreciated. TIA
1 solution
Here's a solution which produces a regular DataFrame instead of a Series of dicts:
pd.DataFrame.from_records(df1.B).groupby(df1.A).sum()
The first step converts your Series of dicts into a regular DataFrame with one column per key. Then it's a simple groupby and sum to get the final result:
      Word1  Word2  Word3  Word4  Word5
A
Cat1   18.0    7.0     11    2.0    0.0
Cat2    0.0    7.0      9    7.0    2.0
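To see why most of the totals come out as floats, it can help to look at the intermediate frame. The following is a minimal sketch that reuses df1 from the question; the names wide and totals are introduced here for illustration only, and the final cast to int is an optional extra step, not part of the one-liner above.

```python
import pandas as pd

df1 = pd.DataFrame([
    ['Cat1', {'Word1': 8, 'Word2': 7, 'Word3': 6, 'Word4': 1}],
    ['Cat2', {'Word2': 7, 'Word4': 7, 'Word3': 6}],
    ['Cat2', {'Word3': 3, 'Word5': 2}],
    ['Cat1', {'Word1': 10, 'Word3': 5, 'Word4': 1}]], columns=list('AB'))

# One column per key. A key missing from a row's dict becomes NaN,
# which forces that column to a float dtype.
wide = pd.DataFrame.from_records(df1.B.tolist())

# sum() skips NaN by default, so a word that a category never
# contained ends up with a total of 0.
totals = wide.groupby(df1.A).sum()

# Optional (an assumption about what you want): once the NaNs are
# gone, the totals can be cast back to integers.
totals = totals.astype(int)
print(totals)
```

Only Word3 appears in every input dict, which is why it is the one column that stays integer in the output shown above.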
Keeping your data in such a format will be much more efficient than a Series of dicts, unless the values are very sparse (i.e. the matrix is large and mostly zeros).
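If the dict form is needed again later, the wide table can be turned back into one dict per category. This is only a sketch: the totals frame is typed in by hand from the output above, as_dicts is a name made up for this example, and dropping zero entries assumes that a total of 0 means the word never occurred in that category.

```python
import pandas as pd

# Totals copied from the groupby/sum output, cast to integers.
totals = pd.DataFrame(
    {'Word1': [18, 0], 'Word2': [7, 7], 'Word3': [11, 9],
     'Word4': [2, 7], 'Word5': [0, 2]},
    index=pd.Index(['Cat1', 'Cat2'], name='A'))

# For each category, keep only the non-zero totals and convert
# the remaining row to a plain dict.
as_dicts = {cat: row[row != 0].to_dict() for cat, row in totals.iterrows()}
print(as_dicts)
```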
If you do need the result to be a Series of dicts, this works:
def add_dicts(s):
    c = Counter()
    s.apply(c.update)
    return dict(c)

df1.groupby('A').B.agg(add_dicts)
It produces the same dicts as your df_out, as a Series indexed by A; append .reset_index() to get the two-column DataFrame.
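The reason Counter works here is that Counter.update adds to the counts of keys that already exist, whereas dict.update overwrites them. A small plain-Python sketch of the difference, using the two Cat1 dicts from the question (the variable names dicts, c and plain are made up for this example):

```python
from collections import Counter

dicts = [
    {'Word1': 8, 'Word2': 7, 'Word3': 6, 'Word4': 1},
    {'Word1': 10, 'Word3': 5, 'Word4': 1},
]

c = Counter()
for d in dicts:       # iterate over the dicts themselves, not over range(len(...))
    c.update(d)       # Counter.update adds to the existing counts

plain = {}
for d in dicts:
    plain.update(d)   # dict.update replaces the existing values instead

print(dict(c))   # {'Word1': 18, 'Word2': 7, 'Word3': 11, 'Word4': 2}
print(plain)     # {'Word1': 10, 'Word2': 7, 'Word3': 5, 'Word4': 1}
```

Note that the loop passes each dict to update; looping over range(len(...)) would hand update plain integers instead.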