在Python/ panda中创建部分SAS PROC概要替换

时间:2021-12-06 15:51:55

We are working to get off of SAS and onto Python/Pandas. However, one thing we are having trouble with is creating a replacement for PROC SUMMARY (AKA PROC MEANS) that has the SAS routine's flexibility. For non-SAS users: PROC SUMMARY is just a routine that produces a table containing "descriptive statistics for variables across all observations or within groups of observations" in a dataset, to paraphrase the SAS documentation. Our requirements are just a small subset of the full functionality - outputting a table where we have:

我们正在努力摆脱SAS和Python/熊猫。然而,有一件事我们遇到了麻烦,那就是创建一个替代具有SAS例程灵活性的PROC SUMMARY(又名PROC MEANS)。对于非SAS用户:PROC SUMMARY只是一个例程,它在数据集中生成一个表,其中包含“跨所有观察或在观察组内的变量的描述性统计”,以解释SAS文档。我们的需求只是整个功能的一小部分——输出一个我们拥有的表:

  • Ability to apply different stats to different columns (for now just count, sum, mean, weighted mean)
  • 能够对不同的列应用不同的统计数据(现在只计算、求和、平均值、加权平均值)
  • Ability to handle zero to many grouping variables
  • 能够处理0到多个分组变量
  • Ability to specify a weight variable for weighted mean
  • 能够为加权平均值指定权重变量

We are not trying to do anything else (anything graphical, etc.)

我们不尝试做任何其他事情(任何图形化的东西,等等)

Here is what we have so far:

这是我们目前所拥有的:

def wmean_ungrouped (d,w):
    return (d.dot(w)).sum() / w.sum()

def wmean_grouped (group, var_name_in, var_name_weight):
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()

FUNCS = {
    "mean"   : np.mean ,
    "sum"   : np.sum ,
    "count" : np.count_nonzero
}

def my_summary (
        data ,
        var_names_in ,
        var_names_out ,
        var_functions ,
        var_name_weight = None ,
        var_names_group = None
):
    result = DataFrame()

    if var_names_group is not None:
        grouped = data.groupby (var_names_group)
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in,var_names_out,var_functions):
            if var_function == "wmean":
                func = lambda x : wmean_grouped (x, var_name_in, var_name_weight)
                result[var_name_out] = Series(grouped.apply(func))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = grouped[var_name_in].apply(func)
    else:
        for var_name_in, var_name_out, var_function in \
                zip(var_names_in,var_names_out,var_functions):
            if var_function == "wmean":
                result[var_name_out] = \
                    Series(wmean_ungrouped(data[var_name_in], data[var_name_weight]))
            else:
                func = FUNCS[var_function]
                result[var_name_out] = Series(func(data[var_name_in]))

    return result

Here is a sample call to the my_summary() function:

下面是对my_summary()函数的示例调用:

    my_summary (
        data=df,
        var_names_in=["x_1","x_1","x_1","x_1"] ,
        var_names_out=[
            "x_1_c","x_1_s","x_1_m","x_1_wm"
        ] ,
        var_functions=["count","sum","mean","wmean"] ,
        var_name_weight="val_1" ,
        var_names_group=["Region","Category"]
)

my_summary() works, but as you can see, its implementation is not the prettiest. Here are the main issues:

my_summary()是有效的,但是您可以看到,它的实现并不是最漂亮的。以下是主要问题:

  • Two different code paths depending on grouped or ungrouped - this stems completely from the fact that DataFrame and DataFrameGroupBy have different ways for applying a programmatically-selected reducing function to a single column. For DataFrame, the only way I've found is directly invoking func(data[var_name_in]). data[var_name_in].apply(func) doesn't work because apply() on a Series does not reduce (unlike apply() on a DataFrame). On the other hand, for DataFrameGroupBy, I have to use that very approach: grouped[var_name_in].apply(func). That's because something like func(grouped[var_name_in]) will not work (no reason it should.)
  • 依赖于分组或非分组的两种不同的代码路径——这完全是因为DataFrame和DataFrameGroupBy有不同的方法将程序选择的减少函数应用到单个列。对于DataFrame,我找到的唯一方法是直接调用func(data[var_name_in])。数据[var_name_in].apply(func)不起作用,因为在序列上的apply()不会减少(不像在DataFrame上的apply()))。另一方面,对于DataFrameGroupBy,我必须使用这种方法:分组[var_name_in].apply(func)。这是因为类似func(分组[var_name_in])之类的东西不能工作(没有理由它应该这样做)。
  • Special treatment for weighted mean - this is because it operates on two columns, unlike all the other calculations, which operate on just one; I don't know if this can be helped.
  • 加权平均数的特殊处理——这是因为它作用于两列,不同于所有其他只作用于一列的计算;我不知道这是否有帮助。
  • Two different weighted mean functions - this is a consequence of the first issue. The ungrouped function has Series-type parameters and needs dot() to multiply and reduce them; the grouped function eventually deals with SeriesGroupBy objects and has to use the * operator (acknowledgements to the answer to this SO post for the weighted average function code.)
  • 两个不同的加权平均函数——这是第一个问题的结果。未分组的函数具有序列化类型的参数,需要点()对它们进行相乘和减少;分组函数最终处理SeriesGroupBy对象,并且必须使用*操作符(对于权重平均函数代码的SO post的答案表示感谢)。

So my questions are:

所以我的问题是:

  • Is there something native to pandas that can do all of this (i.e. throw out the above and use that instead)?
  • 熊猫身上是否有什么东西可以做到这一切(比如扔掉上面的东西,用它来代替)?
  • If not, are there any fixes to any of the issues mentioned above?
  • 如果没有,上面提到的问题有什么解决办法吗?
  • By any chance, is there some way to group by nothing - that is, to obtain a DataFrameGroupBy object from a DataFrame without grouping on any variable? Then the code paths would be reduced as we would be dealing with the DataFrameGroupBy interface exclusively.
  • 无论如何,是否有什么方法可以不加分组——也就是说,从DataFrame中获取DataFrameGroupBy对象,而不对任何变量进行分组?这样,代码路径就会减少,因为我们只处理DataFrameGroupBy接口。

Update / Current Solution

@JohnE's answer provides a way to group by nothing: groupby(lambda x: True). This is a workaround that he spotted in this SO post (which, incidentally, features an answer from Wes himself speaking of the need for a DataFrame.agg(), which would serve the same purpose). @JohnE's excellent solution allows us to deal exclusively with objects of type DataFrameGroupBy, and instantly reduces most of the code paths. I was able to reduce further using some functional gimmickry that is now possible because we have only DataFrameGroupBy instances. Basically, all functions are generated as needed - the "generators" (in quotes here so as to not be confused with Python generator expressions) take two parameters: value column name and weight column name, the second of which is ignored in all cases except wmean. The generated functions are always applied over the entire DataFrameGroupBy, as was originally the case just with wmean, with the parameters being the correct column name(s) to use. I also replaced all the np.* implementations with pandas calculations, to better deal with NaN values.

@JohnE的答案提供了一种无分组方式:groupby(lambda x: True)。这是他在SO post中发现的一种变通方法(顺便提一下,它的特点是韦斯自己给出的答案,谈到需要使用DataFrame.agg(),这将起到相同的作用)。@JohnE的优秀解决方案允许我们只处理类型DataFrameGroupBy的对象,并立即减少大多数代码路径。我能够使用一些功能上的花招进一步减少,这现在是可能的,因为我们只有DataFrameGroupBy实例。基本上,所有函数都是根据需要生成的——“生成器”(在这里用引号,以免与Python生成器表达式混淆)有两个参数:值列名和权重列名,第二个参数在除wmean外的所有情况下都被忽略。生成的函数总是应用于整个DataFrameGroupBy,就像wmean一样,参数是要使用的正确列名。我还替换了所有的np。*实现熊猫计算,更好地处理NaN值。

Unless there is something native to pandas that can do this, this is our solution:

除非有熊猫本土的东西可以做到这一点,这就是我们的解决方案:

FUNC_GENS = {
    "mean"  : lambda y,z : lambda x : x[y].mean(),
    "sum"   : lambda y,z : lambda x : x[y].sum() ,
    "count" : lambda y,z : lambda x : x[y].count() ,
    "wmean" : lambda y,z : lambda x : (x[y] * x[z]).sum() / x[z].sum()
}

def my_summary (
        data ,
        var_names_in ,
        var_names_out ,
        var_functions ,
        var_name_weight = None ,
        var_names_group = None ):

    result = pd.DataFrame()

    if var_names_group is None:
        grouped = data.groupby (lambda x: True)
    else:
        grouped = data.groupby (var_names_group)

    for var_name_in, var_name_out, var_function in \
            zip(var_names_in,var_names_out,var_functions):
        func_gen = FUNC_GENS[var_function]
        func = func_gen (var_name_in, var_name_weight)
        result[var_name_out] = grouped.apply(func)

    return result

1 个解决方案

#1


4  

Well, here's a quickie that does get at two issues (but still requires a different function for weighted mean). Mostly it uses the trick here (credit to @DSM) to get around your empty group by doing groupby(lamda x: True). It would be great if there was a kwarg for 'weights' on stuff like means but there is not, to my knowledge. Apparently there is a package for weighted quantiles mentioned here based on numpy but I don't know anything about it. Great project btw!

这里有一个快速解两个问题(但加权平均值仍然需要一个不同的函数)。大多数情况下,它使用这里的技巧(通过@DSM)来通过groupby(lamda x: True)绕过空组。如果能有一个kwarg来表示方法之类的东西的权重,那就太好了,但据我所知,没有。显然这里有一个基于numpy的加权分位数的包,但我对此一无所知。伟大的项目顺便说一句!

(note that names are mostly the same as yours, I just added a '2' to wmean_grouped and my_summary, otherwise you can use the same calling interface)

(注意,名称与您的名称基本相同,我只是向wmean_groups和my_summary添加了一个'2',否则您可以使用相同的调用接口)

def wmean_grouped2 (group, var_name_in, var_name_weight):
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()

FUNCS = { "mean"  : np.mean ,
          "sum"   : np.sum ,
          "count" : np.count_nonzero }

def my_summary2 (
        data ,
        var_names_in ,
        var_names_out ,
        var_functions ,
        var_name_weight = None ,
        var_names_group = None ):

    result = pd.DataFrame()

    if var_names_group is None:
        grouped = data.groupby (lambda x: True)
    else:
        grouped = data.groupby (var_names_group)

    for var_name_in, var_name_out, var_function in \
            zip(var_names_in,var_names_out,var_functions):
        if var_function == "wmean":
            func = lambda x : wmean_grouped2 (x, var_name_in, var_name_weight)
            result[var_name_out] = pd.Series(grouped.apply(func))
        else:
            func = FUNCS[var_function]
            result[var_name_out] = grouped[var_name_in].apply(func)

    return result

#1


4  

Well, here's a quickie that does get at two issues (but still requires a different function for weighted mean). Mostly it uses the trick here (credit to @DSM) to get around your empty group by doing groupby(lamda x: True). It would be great if there was a kwarg for 'weights' on stuff like means but there is not, to my knowledge. Apparently there is a package for weighted quantiles mentioned here based on numpy but I don't know anything about it. Great project btw!

这里有一个快速解两个问题(但加权平均值仍然需要一个不同的函数)。大多数情况下,它使用这里的技巧(通过@DSM)来通过groupby(lamda x: True)绕过空组。如果能有一个kwarg来表示方法之类的东西的权重,那就太好了,但据我所知,没有。显然这里有一个基于numpy的加权分位数的包,但我对此一无所知。伟大的项目顺便说一句!

(note that names are mostly the same as yours, I just added a '2' to wmean_grouped and my_summary, otherwise you can use the same calling interface)

(注意,名称与您的名称基本相同,我只是向wmean_groups和my_summary添加了一个'2',否则您可以使用相同的调用接口)

def wmean_grouped2 (group, var_name_in, var_name_weight):
    d = group[var_name_in]
    w = group[var_name_weight]
    return (d * w).sum() / w.sum()

FUNCS = { "mean"  : np.mean ,
          "sum"   : np.sum ,
          "count" : np.count_nonzero }

def my_summary2 (
        data ,
        var_names_in ,
        var_names_out ,
        var_functions ,
        var_name_weight = None ,
        var_names_group = None ):

    result = pd.DataFrame()

    if var_names_group is None:
        grouped = data.groupby (lambda x: True)
    else:
        grouped = data.groupby (var_names_group)

    for var_name_in, var_name_out, var_function in \
            zip(var_names_in,var_names_out,var_functions):
        if var_function == "wmean":
            func = lambda x : wmean_grouped2 (x, var_name_in, var_name_weight)
            result[var_name_out] = pd.Series(grouped.apply(func))
        else:
            func = FUNCS[var_function]
            result[var_name_out] = grouped[var_name_in].apply(func)

    return result