How do I create a summary table of unique value counts?

I've scoured many, many other SO posts for an answer to this question, but haven't found quite what I'm looking for. Here goes:

Let's say we have a dataframe that looks like this:

In [7]: df.head(5)
Out[7]:
  bool_flag   group  int_flag
0     False  bottom         0
1     False     mid         1
2     False     top         1
3     False     top         0
4     False    high         1

There are five unique groups, two unique boolean values, and two unique integer values. I'd like to create a summary table like this:

                 bottom  low  mid  high  top
bool_flag  True       5   32    2    12    4
           False      2   42    7     2   10
int_flag   0          1   10   15     3    8
           1         10   31   14     0    1

It summarizes the unique value counts of each non-group column, with one column per value of group.

I've gotten close. The following pivot_table commands give me tables that resemble components of what I'd like to have.

In [8]: pd.pivot_table(df.drop('bool_flag', axis=1), columns=['group'], index=['int_flag'], aggfunc=len)
Out[8]:
group     bottom  high  low  mid  top
int_flag
0             15    11    8   13   13
1             12     5    8    9    6


In [9]: pd.pivot_table(df.drop('int_flag', axis=1), columns=['group'], index=['bool_flag'], aggfunc=len)
Out[9]:
group      bottom  high  low  mid  top
bool_flag
False          19    14   15   18   16
True            8     2    1    4    3

However, the index of each resulting table isn't the MultiIndex I'd like to have, which makes concatenating the int_flag pivot table with the one for bool_flag more difficult.

I would hope that there's a way to either use groupby or pivot_table to get what I want without generating these sub-tabulations and concatenating them, but so far I haven't been able to find it. Pivoting with multiple index columns results in too fine-grained a table (I don't want the count of (False, 0) pairs for (bool_flag, int_flag) values, for example, just the count of each unique value within each group.)

I also attempted to use groupby('group').agg(f), where I defined f to yield the result of calling value_counts() on each series. However, agg isn't compatible with functions that return DataFrames.
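
For a single column, one way to sidestep the agg limitation is to call value_counts() on the grouped Series directly and unstack the group level into columns. The following is only a sketch on made-up data; the sample frame and the name int_counts are assumptions for illustration, not the original data.

```python
import pandas as pd

# Hypothetical sample data; the original frame is not shown in full.
df = pd.DataFrame({
    "group": ["bottom", "mid", "top", "top", "high", "low", "mid", "top"],
    "bool_flag": [False, False, True, False, True, False, True, True],
    "int_flag": [0, 1, 1, 0, 1, 0, 0, 1],
})

# Count each int_flag value within each group, then move the group
# level of the result into the columns; missing combinations become 0.
int_counts = (
    df.groupby("group")["int_flag"]
      .value_counts()
      .unstack("group", fill_value=0)
)
print(int_counts)
```

This gives the same shape as the single-column pivot tables above, so it still has to be combined with the result for bool_flag.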

Any suggestions would be greatly appreciated!

1 solution

#1

Actually, I don't think that what I'm asking for is possible. Setting the indices of the two pivot tables I showed above to MultiIndexes by doing the following:

import pandas as pd

# df is the dataframe from the question.
x = pd.pivot_table(df.drop('int_flag', axis=1), columns=['group'], index=['bool_flag'], aggfunc=len)
y = pd.pivot_table(df.drop('bool_flag', axis=1), columns=['group'], index=['int_flag'], aggfunc=len)

def multiindex_from_pivot_result(df):
    # Pair the index name (the feature) with each index value.
    return pd.MultiIndex.from_tuples([(df.index.name, val) for val in df.index], names=['feature', 'values'])

xx = x.set_index(multiindex_from_pivot_result(x))
yy = y.set_index(multiindex_from_pivot_result(y))

results in tables that look like this:

group             bottom  high  low  mid  top
feature   values
bool_flag False       19    14   15   18   16
          True         8     2    1    4    3

and

group            bottom  high  low  mid  top
feature  values
int_flag 0           15    11    8   13   13
         1           12     5    8    9    6

However, concatenating them like so

pd.concat([yy, xx])

yields a table with the values I want, but the second level of its index is overwritten with the first frame's index values.

In [24]: pd.concat([yy, xx])
Out[24]:
group             bottom  high  low  mid  top
feature   values
int_flag  0           15    11    8   13   13
          1           12     5    8    9    6
bool_flag 0           19    14   15   18   16
          1            8     2    1    4    3

Unfortunately, that leaves me with the choice of resetting that level of the index to a normal column, which doesn't print as nicely.
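
A possible alternative, sketched here on made-up data and not something I verified against the original frame: cast the flag columns to strings first, so that False/True and 0/1 stay distinct labels, then reshape to long form with melt and count with crosstab. The sample frame and the names flags, long_form and summary are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data; the original frame is not shown in full.
df = pd.DataFrame({
    "group": ["bottom", "mid", "top", "top", "high", "low", "mid", "top"],
    "bool_flag": [False, False, True, False, True, False, True, True],
    "int_flag": [0, 1, 1, 0, 1, 0, 0, 1],
})

# Cast the flag columns to str so 'False'/'True' cannot collide with '0'/'1'.
flags = df[["bool_flag", "int_flag"]].astype(str)

# Long form: one row per (group, feature, value) observation.
long_form = flags.assign(group=df["group"]).melt(
    id_vars="group", var_name="feature", value_name="value"
)

# Rows are (feature, value) pairs, columns are the groups, cells are counts.
summary = pd.crosstab(
    [long_form["feature"], long_form["value"]], long_form["group"]
)
print(summary)
```

The trade-off is that the second index level then holds strings rather than the original booleans and integers.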

Hope this helped somebody!
