熊猫数据堆使用多个列

时间:2021-04-09 22:34:44

Is there a way to write an aggregation function as is used in DataFrame.agg method, that would have access to more than one column of the data that is being aggregated? Typical use cases would be weighted average, weighted standard deviation funcs.

是否有一种方法可以像在DataFrame中那样编写聚合函数。agg方法,可以访问被聚合的数据的多个列吗?典型的用例是加权平均,加权标准差。

I would like to be able to write something like

我想写一些类似的东西

def wAvg(c, w):
    return ((c * w).sum() / w.sum())

df = DataFrame(....) # df has columns c and w, i want weighted average
                     # of c using w as weight.
df.aggregate ({"c": wAvg}) # and somehow tell it to use w column as weights ...

6 个解决方案

#1


71  

Yes; use the .apply(...) function, which will be called on each sub-DataFrame. For example:

是的,使用.apply(…)函数,该函数将对每个子dataframe调用。例如:

grouped = df.groupby(keys)

def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()

grouped.apply(wavg)

#2


4  

The following (based on Wes McKinney' answer) accomplishes exactly what I was looking for. I'd be happy to learn if there's a simpler way of doing this within pandas.

下面(根据韦斯·麦金尼的回答)完成了我想要的。如果在熊猫身上有更简单的方法,我很乐意去了解。

def wavg_func(datacol, weightscol):
    def wavg(group):
        dd = group[datacol]
        ww = group[weightscol] * 1.0
        return (dd * ww).sum() / ww.sum()
    return wavg


def df_wavg(df, groupbycol, weightscol):
    grouped = df.groupby(groupbycol)
    df_ret = grouped.agg({weightscol:sum})
    datacols = [cc for cc in df.columns if cc not in [groupbycol, weightscol]]
    for dcol in datacols:
        try:
            wavg_f = wavg_func(dcol, weightscol)
            df_ret[dcol] = grouped.apply(wavg_f)
        except TypeError:  # handle non-numeric columns
            df_ret[dcol] = grouped.agg({dcol:min})
    return df_ret

The function df_wavg() returns a dataframe that's grouped by the "groupby" column, and that returns the sum of the weights for the weights column. Other columns are either the weighted averages or, if non-numeric, the min() function is used for aggregation.

函数df_wavg()返回按“groupby”列分组的数据aframe,并返回权重列的权重之和。其他列要么是加权平均值,要么(如果不是数值)使用min()函数进行聚合。

#3


3  

I do this a lot and found the following quite handy:

我经常这样做,发现下面的方法非常方便:

def weighed_average(grp):
    return grp._get_numeric_data().multiply(grp['COUNT'], axis=0).sum()/grp['COUNT'].sum()
df.groupby('SOME_COL').apply(weighed_average)

This will compute the weighted average of all the numerical columns in the df and drop non-numeric ones.

这将计算df中所有数值列的加权平均值,并删除非数值列。

#4


1  

Accomplishing this via groupby(...).apply(...) is non-performant. Here's a solution that I use all the time (essentially using kalu's logic).

通过groupby(…)完成此任务。apply(…)是无性能的。这里有一个我一直使用的解决方案(本质上是使用kalu的逻辑)。

def grouped_weighted_average(self, values, weights, *groupby_args, **groupby_kwargs):
   """
    :param values: column(s) to take the average of
    :param weights_col: column to weight on
    :param group_args: args to pass into groupby (e.g. the level you want to group on)
    :param group_kwargs: kwargs to pass into groupby
    :return: pandas.Series or pandas.DataFrame
    """

    if isinstance(values, str):
        values = [values]

    ss = []
    for value_col in values:
        df = self.copy()
        prod_name = 'prod_{v}_{w}'.format(v=value_col, w=weights)
        weights_name = 'weights_{w}'.format(w=weights)

        df[prod_name] = df[value_col] * df[weights]
        df[weights_name] = df[weights].where(~df[prod_name].isnull())
        df = df.groupby(*groupby_args, **groupby_kwargs).sum()
        s = df[prod_name] / df[weights_name]
        s.name = value_col
        ss.append(s)
    df = pd.concat(ss, axis=1) if len(ss) > 1 else ss[0]
    return df

pandas.DataFrame.grouped_weighted_average = grouped_weighted_average

#5


1  

My solution is similar to Nathaniel's solution, only it's for a single column and I don't deep-copy the entire data frame each time, which could be prohibitively slow. The performance gain over the solution groupby(...).apply(...) is about 100x(!)

我的解决方案与Nathaniel的方案类似,只是针对一个列,我不会每次都深入复制整个数据框,这可能会非常慢。与groupby(…).apply(…)相比,性能收益大约为100x(!)

def weighted_average(df,data_col,weight_col,by_col):
    df['_data_times_weight'] = df[data_col]*df[weight_col]
    df['_weight_where_notnull'] = df[weight_col]*pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result

#6


1  

It is possible to return any number of aggregated values from a groupby object with apply. Simply, return a Series and the index values will become the new column names.

可以通过应用程序从groupby对象中返回任意数量的聚合值。简单地说,返回一个序列,索引值将成为新的列名。

Let's see a quick example:

让我们看一个简单的例子:

df = pd.DataFrame({'group':['a','a','b','b'],
                   'd1':[5,10,100,30],
                   'd2':[7,1,3,20],
                   'weights':[.2,.8, .4, .6]},
                 columns=['group', 'd1', 'd2', 'weights'])
df

  group   d1  d2  weights
0     a    5   7      0.2
1     a   10   1      0.8
2     b  100   3      0.4
3     b   30  20      0.6

Define a custom function that will be passed to apply. It implicitly accepts a DataFrame - meaning the data parameter is a DataFrame. Notice how it uses multiple columns, which is not possible with the agg groupby method:

定义要传递给应用程序的自定义函数。它隐式地接受一个DataFrame——这意味着数据参数是一个DataFrame。注意它是如何使用多个列的,这是agg groupby方法无法做到的:

def weighted_average(data):
    d = {}
    d['d1_wa'] = np.average(data['d1'], weights=data['weights'])
    d['d2_wa'] = np.average(data['d2'], weights=data['weights'])
    return pd.Series(d)

Call the groupby apply method with our custom function:

用我们的自定义函数应用方法调用groupby:

df.groupby('group').apply(weighted_average)

       d1_wa  d2_wa
group              
a        9.0    2.2
b       58.0   13.2

You can get better performance by precalculating the weighted totals into new DataFrame columns as explained in other answers and avoid using apply altogether.

如其他答案所述,您可以通过将加权总数预先计算到新的DataFrame列中,并避免使用apply来获得更好的性能。

#1


71  

Yes; use the .apply(...) function, which will be called on each sub-DataFrame. For example:

是的,使用.apply(…)函数,该函数将对每个子dataframe调用。例如:

grouped = df.groupby(keys)

def wavg(group):
    d = group['data']
    w = group['weights']
    return (d * w).sum() / w.sum()

grouped.apply(wavg)

#2


4  

The following (based on Wes McKinney' answer) accomplishes exactly what I was looking for. I'd be happy to learn if there's a simpler way of doing this within pandas.

下面(根据韦斯·麦金尼的回答)完成了我想要的。如果在熊猫身上有更简单的方法,我很乐意去了解。

def wavg_func(datacol, weightscol):
    def wavg(group):
        dd = group[datacol]
        ww = group[weightscol] * 1.0
        return (dd * ww).sum() / ww.sum()
    return wavg


def df_wavg(df, groupbycol, weightscol):
    grouped = df.groupby(groupbycol)
    df_ret = grouped.agg({weightscol:sum})
    datacols = [cc for cc in df.columns if cc not in [groupbycol, weightscol]]
    for dcol in datacols:
        try:
            wavg_f = wavg_func(dcol, weightscol)
            df_ret[dcol] = grouped.apply(wavg_f)
        except TypeError:  # handle non-numeric columns
            df_ret[dcol] = grouped.agg({dcol:min})
    return df_ret

The function df_wavg() returns a dataframe that's grouped by the "groupby" column, and that returns the sum of the weights for the weights column. Other columns are either the weighted averages or, if non-numeric, the min() function is used for aggregation.

函数df_wavg()返回按“groupby”列分组的数据aframe,并返回权重列的权重之和。其他列要么是加权平均值,要么(如果不是数值)使用min()函数进行聚合。

#3


3  

I do this a lot and found the following quite handy:

我经常这样做,发现下面的方法非常方便:

def weighed_average(grp):
    return grp._get_numeric_data().multiply(grp['COUNT'], axis=0).sum()/grp['COUNT'].sum()
df.groupby('SOME_COL').apply(weighed_average)

This will compute the weighted average of all the numerical columns in the df and drop non-numeric ones.

这将计算df中所有数值列的加权平均值,并删除非数值列。

#4


1  

Accomplishing this via groupby(...).apply(...) is non-performant. Here's a solution that I use all the time (essentially using kalu's logic).

通过groupby(…)完成此任务。apply(…)是无性能的。这里有一个我一直使用的解决方案(本质上是使用kalu的逻辑)。

def grouped_weighted_average(self, values, weights, *groupby_args, **groupby_kwargs):
   """
    :param values: column(s) to take the average of
    :param weights_col: column to weight on
    :param group_args: args to pass into groupby (e.g. the level you want to group on)
    :param group_kwargs: kwargs to pass into groupby
    :return: pandas.Series or pandas.DataFrame
    """

    if isinstance(values, str):
        values = [values]

    ss = []
    for value_col in values:
        df = self.copy()
        prod_name = 'prod_{v}_{w}'.format(v=value_col, w=weights)
        weights_name = 'weights_{w}'.format(w=weights)

        df[prod_name] = df[value_col] * df[weights]
        df[weights_name] = df[weights].where(~df[prod_name].isnull())
        df = df.groupby(*groupby_args, **groupby_kwargs).sum()
        s = df[prod_name] / df[weights_name]
        s.name = value_col
        ss.append(s)
    df = pd.concat(ss, axis=1) if len(ss) > 1 else ss[0]
    return df

pandas.DataFrame.grouped_weighted_average = grouped_weighted_average

#5


1  

My solution is similar to Nathaniel's solution, only it's for a single column and I don't deep-copy the entire data frame each time, which could be prohibitively slow. The performance gain over the solution groupby(...).apply(...) is about 100x(!)

我的解决方案与Nathaniel的方案类似,只是针对一个列,我不会每次都深入复制整个数据框,这可能会非常慢。与groupby(…).apply(…)相比,性能收益大约为100x(!)

def weighted_average(df,data_col,weight_col,by_col):
    df['_data_times_weight'] = df[data_col]*df[weight_col]
    df['_weight_where_notnull'] = df[weight_col]*pd.notnull(df[data_col])
    g = df.groupby(by_col)
    result = g['_data_times_weight'].sum() / g['_weight_where_notnull'].sum()
    del df['_data_times_weight'], df['_weight_where_notnull']
    return result

#6


1  

It is possible to return any number of aggregated values from a groupby object with apply. Simply, return a Series and the index values will become the new column names.

可以通过应用程序从groupby对象中返回任意数量的聚合值。简单地说,返回一个序列,索引值将成为新的列名。

Let's see a quick example:

让我们看一个简单的例子:

df = pd.DataFrame({'group':['a','a','b','b'],
                   'd1':[5,10,100,30],
                   'd2':[7,1,3,20],
                   'weights':[.2,.8, .4, .6]},
                 columns=['group', 'd1', 'd2', 'weights'])
df

  group   d1  d2  weights
0     a    5   7      0.2
1     a   10   1      0.8
2     b  100   3      0.4
3     b   30  20      0.6

Define a custom function that will be passed to apply. It implicitly accepts a DataFrame - meaning the data parameter is a DataFrame. Notice how it uses multiple columns, which is not possible with the agg groupby method:

定义要传递给应用程序的自定义函数。它隐式地接受一个DataFrame——这意味着数据参数是一个DataFrame。注意它是如何使用多个列的,这是agg groupby方法无法做到的:

def weighted_average(data):
    d = {}
    d['d1_wa'] = np.average(data['d1'], weights=data['weights'])
    d['d2_wa'] = np.average(data['d2'], weights=data['weights'])
    return pd.Series(d)

Call the groupby apply method with our custom function:

用我们的自定义函数应用方法调用groupby:

df.groupby('group').apply(weighted_average)

       d1_wa  d2_wa
group              
a        9.0    2.2
b       58.0   13.2

You can get better performance by precalculating the weighted totals into new DataFrame columns as explained in other answers and avoid using apply altogether.

如其他答案所述,您可以通过将加权总数预先计算到新的DataFrame列中,并避免使用apply来获得更好的性能。