We are working to get off of SAS and onto Python/Pandas. However, one thing we are having trouble with is creating a replacement for PROC SUMMARY
(AKA PROC MEANS
) that has the SAS routine's flexibility. For non-SAS users: PROC SUMMARY
is just a routine that produces a table containing "descriptive statistics for variables across all observations or within groups of observations" in a dataset, to paraphrase the SAS documentation. Our requirements are just a small subset of the full functionality - outputting a table where we have:
我们正在努力摆脱SAS和Python/熊猫。然而,有一件事我们遇到了麻烦,那就是创建一个替代具有SAS例程灵活性的PROC SUMMARY(又名PROC MEANS)。对于非SAS用户:PROC SUMMARY只是一个例程,它在数据集中生成一个表,其中包含“跨所有观察或在观察组内的变量的描述性统计”,以解释SAS文档。我们的需求只是整个功能的一小部分——输出一个我们拥有的表:
- Ability to apply different stats to different columns (for now just count, sum, mean, weighted mean)
- 能够对不同的列应用不同的统计数据(现在只计算、求和、平均值、加权平均值)
- Ability to handle zero to many grouping variables
- 能够处理0到多个分组变量
- Ability to specify a weight variable for weighted mean
- 能够为加权平均值指定权重变量
We are not trying to do anything else (anything graphical, etc.)
我们不尝试做任何其他事情(任何图形化的东西,等等)
Here is what we have so far:
这是我们目前所拥有的:
def wmean_ungrouped (d,w):
return (d.dot(w)).sum() / w.sum()
def wmean_grouped (group, var_name_in, var_name_weight):
d = group[var_name_in]
w = group[var_name_weight]
return (d * w).sum() / w.sum()
FUNCS = {
"mean" : np.mean ,
"sum" : np.sum ,
"count" : np.count_nonzero
}
def my_summary (
data ,
var_names_in ,
var_names_out ,
var_functions ,
var_name_weight = None ,
var_names_group = None
):
result = DataFrame()
if var_names_group is not None:
grouped = data.groupby (var_names_group)
for var_name_in, var_name_out, var_function in \
zip(var_names_in,var_names_out,var_functions):
if var_function == "wmean":
func = lambda x : wmean_grouped (x, var_name_in, var_name_weight)
result[var_name_out] = Series(grouped.apply(func))
else:
func = FUNCS[var_function]
result[var_name_out] = grouped[var_name_in].apply(func)
else:
for var_name_in, var_name_out, var_function in \
zip(var_names_in,var_names_out,var_functions):
if var_function == "wmean":
result[var_name_out] = \
Series(wmean_ungrouped(data[var_name_in], data[var_name_weight]))
else:
func = FUNCS[var_function]
result[var_name_out] = Series(func(data[var_name_in]))
return result
Here is a sample call to the my_summary()
function:
下面是对my_summary()函数的示例调用:
my_summary (
data=df,
var_names_in=["x_1","x_1","x_1","x_1"] ,
var_names_out=[
"x_1_c","x_1_s","x_1_m","x_1_wm"
] ,
var_functions=["count","sum","mean","wmean"] ,
var_name_weight="val_1" ,
var_names_group=["Region","Category"]
)
my_summary()
works, but as you can see, its implementation is not the prettiest. Here are the main issues:
my_summary()是有效的,但是您可以看到,它的实现并不是最漂亮的。以下是主要问题:
-
Two different code paths depending on grouped or ungrouped - this stems completely from the fact that
DataFrame
andDataFrameGroupBy
have different ways for applying a programmatically-selected reducing function to a single column. ForDataFrame
, the only way I've found is directly invokingfunc(data[var_name_in])
.data[var_name_in].apply(func)
doesn't work becauseapply()
on aSeries
does not reduce (unlikeapply()
on aDataFrame
). On the other hand, forDataFrameGroupBy
, I have to use that very approach:grouped[var_name_in].apply(func)
. That's because something likefunc(grouped[var_name_in])
will not work (no reason it should.) - 依赖于分组或非分组的两种不同的代码路径——这完全是因为DataFrame和DataFrameGroupBy有不同的方法将程序选择的减少函数应用到单个列。对于DataFrame,我找到的唯一方法是直接调用func(data[var_name_in])。数据[var_name_in].apply(func)不起作用,因为在序列上的apply()不会减少(不像在DataFrame上的apply()))。另一方面,对于DataFrameGroupBy,我必须使用这种方法:分组[var_name_in].apply(func)。这是因为类似func(分组[var_name_in])之类的东西不能工作(没有理由它应该这样做)。
- Special treatment for weighted mean - this is because it operates on two columns, unlike all the other calculations, which operate on just one; I don't know if this can be helped.
- 加权平均数的特殊处理——这是因为它作用于两列,不同于所有其他只作用于一列的计算;我不知道这是否有帮助。
-
Two different weighted mean functions - this is a consequence of the first issue. The ungrouped function has
Series
-type parameters and needsdot()
to multiply and reduce them; the grouped function eventually deals withSeriesGroupBy
objects and has to use the*
operator (acknowledgements to the answer to this SO post for the weighted average function code.) - 两个不同的加权平均函数——这是第一个问题的结果。未分组的函数具有序列化类型的参数,需要点()对它们进行相乘和减少;分组函数最终处理SeriesGroupBy对象,并且必须使用*操作符(对于权重平均函数代码的SO post的答案表示感谢)。
So my questions are:
所以我的问题是:
- Is there something native to pandas that can do all of this (i.e. throw out the above and use that instead)?
- 熊猫身上是否有什么东西可以做到这一切(比如扔掉上面的东西,用它来代替)?
- If not, are there any fixes to any of the issues mentioned above?
- 如果没有,上面提到的问题有什么解决办法吗?
- By any chance, is there some way to group by nothing - that is, to obtain a
DataFrameGroupBy
object from aDataFrame
without grouping on any variable? Then the code paths would be reduced as we would be dealing with theDataFrameGroupBy
interface exclusively. - 无论如何,是否有什么方法可以不加分组——也就是说,从DataFrame中获取DataFrameGroupBy对象,而不对任何变量进行分组?这样,代码路径就会减少,因为我们只处理DataFrameGroupBy接口。
Update / Current Solution
@JohnE's answer provides a way to group by nothing: groupby(lambda x: True)
. This is a workaround that he spotted in this SO post (which, incidentally, features an answer from Wes himself speaking of the need for a DataFrame.agg()
, which would serve the same purpose). @JohnE's excellent solution allows us to deal exclusively with objects of type DataFrameGroupBy
, and instantly reduces most of the code paths. I was able to reduce further using some functional gimmickry that is now possible because we have only DataFrameGroupBy
instances. Basically, all functions are generated as needed - the "generators" (in quotes here so as to not be confused with Python generator expressions) take two parameters: value column name and weight column name, the second of which is ignored in all cases except wmean
. The generated functions are always applied over the entire DataFrameGroupBy
, as was originally the case just with wmean
, with the parameters being the correct column name(s) to use. I also replaced all the np.*
implementations with pandas calculations, to better deal with NaN
values.
@JohnE的答案提供了一种无分组方式:groupby(lambda x: True)。这是他在SO post中发现的一种变通方法(顺便提一下,它的特点是韦斯自己给出的答案,谈到需要使用DataFrame.agg(),这将起到相同的作用)。@JohnE的优秀解决方案允许我们只处理类型DataFrameGroupBy的对象,并立即减少大多数代码路径。我能够使用一些功能上的花招进一步减少,这现在是可能的,因为我们只有DataFrameGroupBy实例。基本上,所有函数都是根据需要生成的——“生成器”(在这里用引号,以免与Python生成器表达式混淆)有两个参数:值列名和权重列名,第二个参数在除wmean外的所有情况下都被忽略。生成的函数总是应用于整个DataFrameGroupBy,就像wmean一样,参数是要使用的正确列名。我还替换了所有的np。*实现熊猫计算,更好地处理NaN值。
Unless there is something native to pandas that can do this, this is our solution:
除非有熊猫本土的东西可以做到这一点,这就是我们的解决方案:
FUNC_GENS = {
"mean" : lambda y,z : lambda x : x[y].mean(),
"sum" : lambda y,z : lambda x : x[y].sum() ,
"count" : lambda y,z : lambda x : x[y].count() ,
"wmean" : lambda y,z : lambda x : (x[y] * x[z]).sum() / x[z].sum()
}
def my_summary (
data ,
var_names_in ,
var_names_out ,
var_functions ,
var_name_weight = None ,
var_names_group = None ):
result = pd.DataFrame()
if var_names_group is None:
grouped = data.groupby (lambda x: True)
else:
grouped = data.groupby (var_names_group)
for var_name_in, var_name_out, var_function in \
zip(var_names_in,var_names_out,var_functions):
func_gen = FUNC_GENS[var_function]
func = func_gen (var_name_in, var_name_weight)
result[var_name_out] = grouped.apply(func)
return result
1 个解决方案
#1
4
Well, here's a quickie that does get at two issues (but still requires a different function for weighted mean). Mostly it uses the trick here (credit to @DSM) to get around your empty group by doing groupby(lamda x: True)
. It would be great if there was a kwarg for 'weights' on stuff like means but there is not, to my knowledge. Apparently there is a package for weighted quantiles mentioned here based on numpy but I don't know anything about it. Great project btw!
这里有一个快速解两个问题(但加权平均值仍然需要一个不同的函数)。大多数情况下,它使用这里的技巧(通过@DSM)来通过groupby(lamda x: True)绕过空组。如果能有一个kwarg来表示方法之类的东西的权重,那就太好了,但据我所知,没有。显然这里有一个基于numpy的加权分位数的包,但我对此一无所知。伟大的项目顺便说一句!
(note that names are mostly the same as yours, I just added a '2' to wmean_grouped and my_summary, otherwise you can use the same calling interface)
(注意,名称与您的名称基本相同,我只是向wmean_groups和my_summary添加了一个'2',否则您可以使用相同的调用接口)
def wmean_grouped2 (group, var_name_in, var_name_weight):
d = group[var_name_in]
w = group[var_name_weight]
return (d * w).sum() / w.sum()
FUNCS = { "mean" : np.mean ,
"sum" : np.sum ,
"count" : np.count_nonzero }
def my_summary2 (
data ,
var_names_in ,
var_names_out ,
var_functions ,
var_name_weight = None ,
var_names_group = None ):
result = pd.DataFrame()
if var_names_group is None:
grouped = data.groupby (lambda x: True)
else:
grouped = data.groupby (var_names_group)
for var_name_in, var_name_out, var_function in \
zip(var_names_in,var_names_out,var_functions):
if var_function == "wmean":
func = lambda x : wmean_grouped2 (x, var_name_in, var_name_weight)
result[var_name_out] = pd.Series(grouped.apply(func))
else:
func = FUNCS[var_function]
result[var_name_out] = grouped[var_name_in].apply(func)
return result
#1
4
Well, here's a quickie that does get at two issues (but still requires a different function for weighted mean). Mostly it uses the trick here (credit to @DSM) to get around your empty group by doing groupby(lamda x: True)
. It would be great if there was a kwarg for 'weights' on stuff like means but there is not, to my knowledge. Apparently there is a package for weighted quantiles mentioned here based on numpy but I don't know anything about it. Great project btw!
这里有一个快速解两个问题(但加权平均值仍然需要一个不同的函数)。大多数情况下,它使用这里的技巧(通过@DSM)来通过groupby(lamda x: True)绕过空组。如果能有一个kwarg来表示方法之类的东西的权重,那就太好了,但据我所知,没有。显然这里有一个基于numpy的加权分位数的包,但我对此一无所知。伟大的项目顺便说一句!
(note that names are mostly the same as yours, I just added a '2' to wmean_grouped and my_summary, otherwise you can use the same calling interface)
(注意,名称与您的名称基本相同,我只是向wmean_groups和my_summary添加了一个'2',否则您可以使用相同的调用接口)
def wmean_grouped2 (group, var_name_in, var_name_weight):
d = group[var_name_in]
w = group[var_name_weight]
return (d * w).sum() / w.sum()
FUNCS = { "mean" : np.mean ,
"sum" : np.sum ,
"count" : np.count_nonzero }
def my_summary2 (
data ,
var_names_in ,
var_names_out ,
var_functions ,
var_name_weight = None ,
var_names_group = None ):
result = pd.DataFrame()
if var_names_group is None:
grouped = data.groupby (lambda x: True)
else:
grouped = data.groupby (var_names_group)
for var_name_in, var_name_out, var_function in \
zip(var_names_in,var_names_out,var_functions):
if var_function == "wmean":
func = lambda x : wmean_grouped2 (x, var_name_in, var_name_weight)
result[var_name_out] = pd.Series(grouped.apply(func))
else:
func = FUNCS[var_function]
result[var_name_out] = grouped[var_name_in].apply(func)
return result