Pandas groupby on subsets of the same dataframe

Time: 2023-02-11 21:40:47

This question is an extension to my earlier one. I have a pandas dataframe:

import random

import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe
df = pd.DataFrame({'id': range(1, N + 1),
                   'weeks_elapsed': [random.choice(range(1, 25)) for i in range(1, N + 1)],
                   'code': [random.choice(codes) for i in range(1, N + 1)],
                   'colour': [random.choice(colours) for i in range(1, N + 1)],
                   'texture': [random.choice(textures) for i in range(1, N + 1)],
                   'size': [random.randint(1, 100) for i in range(1, N + 1)],
                   'scaled_size': [random.randint(100, 1000) for i in range(1, N + 1)]
                   }, columns=['id', 'weeks_elapsed', 'code', 'colour', 'texture', 'size', 'scaled_size'])

I group it by colour and code and get some statistics on size and scaled_size as below:

grouped = df.groupby(['code', 'colour']).agg(
    {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
     'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]}).reset_index()
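As an aside, passing numpy functions to agg is discouraged in recent pandas releases; string aliases do the same job. A minimal sketch of the equivalent call on a toy dataframe (column names taken from the question):

```python
import random

import pandas as pd

# toy dataframe with the question's columns
random.seed(0)
N = 100
df = pd.DataFrame({
    'code': [random.choice(["one", "two", "three"]) for _ in range(N)],
    'colour': [random.choice(["black", "white"]) for _ in range(N)],
    'size': [random.randint(1, 100) for _ in range(N)],
    'scaled_size': [random.randint(100, 1000) for _ in range(N)],
})

# 'sum', 'mean', 'size', 'idxmax' replace np.sum, np.average, np.size, pd.Series.idxmax
grouped = (df.groupby(['code', 'colour'])
             .agg({'size': ['sum', 'mean', 'size', 'idxmax'],
                   'scaled_size': ['sum', 'mean', 'size', 'idxmax']})
             .reset_index())
```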

Now, what I want to do is run the above calculations on df multiple times, for different weeks_elapsed intervals. Below is a brute-force solution; is there a more succinct and faster way to run this? Also, how can I concatenate the results for the different intervals into a single dataframe?

cut_offs = [4, 12]
grouped = {c: {} for c in cut_offs}
for c in cut_offs:
    grouped[c] = df.loc[df.weeks_elapsed <= c].groupby(['code', 'colour']).agg(
        {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
         'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]
        }).reset_index()
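For the concatenation part, one option is to build the per-cutoff frames in a dict comprehension and stack them with pd.concat, which turns the dict keys into an extra index level. A sketch on toy data; the named-aggregation style (`size_sum=...`, available since pandas 0.25) is my substitution for the np-function lists above:

```python
import random

import pandas as pd

# toy dataframe mirroring the question's setup
random.seed(0)
N = 100
df = pd.DataFrame({
    'weeks_elapsed': [random.choice(range(1, 25)) for _ in range(N)],
    'code': [random.choice(["one", "two", "three"]) for _ in range(N)],
    'colour': [random.choice(["black", "white"]) for _ in range(N)],
    'size': [random.randint(1, 100) for _ in range(N)],
})

cut_offs = [4, 12]
# one aggregated frame per cutoff ...
pieces = {
    c: (df.loc[df.weeks_elapsed <= c]
          .groupby(['code', 'colour'])
          .agg(size_sum=('size', 'sum'),
               size_avg=('size', 'mean'),
               size_n=('size', 'size')))
    for c in cut_offs
}
# ... stacked into a single frame, keyed by cutoff
combined = pd.concat(pieces, names=['cut_off']).reset_index()
```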

I am particularly interested in np.average and np.size for the different weeks_elapsed intervals.

3 Answers

#1


So this is not a fully working answer, but maybe it can be extended to ultimately get you there.

filter = np.array([12, 4])
for f in filter:
    df.loc[(df['weeks_elapsed'] <= f), 'filter'] = f

Now, df looks like

>>> df.head()
Out[384]: 
   id  weeks_elapsed   code colour texture  size  adjusted_size  filter
0   1             20    one  white    soft    64            494     NaN
1   2              3  three  white    hard    22            650       4
2   3             22    two  black    hard    41            770     NaN
3   4              2    two  black    hard     4            325       4
4   5              4    two  black    hard    19            536       4

Here filter contains the smallest group that the row belongs to. The next step would be

>>> df.groupby(['filter', 'code', 'colour']).agg({'size': [np.sum, np.average, np.size, pd.Series.idxmax],
                                    'adjusted_size': [np.sum, np.average, np.size, pd.Series.idxmax]}
).reset_index()
Out[387]: 
    filter   code colour  adjusted_size                            size  \
                                    sum     average  size  idxmax   sum   
0        4    one  black           2195  548.750000     4      45   142   
1        4    one  white            286  286.000000     1      81    58   
2        4  three  black            927  463.500000     2      99   121   
3        4  three  white           5850  585.000000    10      95   511   
4        4    two  black           1102  367.333333     3       4    94   
5        4    two  white            852  852.000000     1      75     2   
6       12    one  white           2499  499.800000     5      72   267   
7       12  three  black           4709  588.625000     8      84   431   
8       12  three  white            569  189.666667     3      97   171   
9       12    two  black           2446  611.500000     4      49   241   
10      12    two  white           2859  714.750000     4      43   203   


      average  size  idxmax  
0   35.500000     4       5  
1   58.000000     1      81  
2   60.500000     2      99  
3   51.100000    10      88  
4   31.333333     3      21  
5    2.000000     1      75  
6   53.400000     5      69  
7   53.875000     8      12  
8   57.000000     3      59  
9   60.250000     4      36  
10  50.750000     4      43  

However, these are not exactly the groups you were looking for: observations with filter=4 will only be in the group belonging to 4, not in the group with filter=12.

I tried looking at expanding_mean; however, that only works row-wise. So far this is incomplete, but maybe it helps someone else answer this.

#2


Alright, here is an alternative. From my research (I'm still learning myself), the only way to have overlapping groups, which is effectively what you want, is apparently TimeGrouper. That one, however, needs your data to be indexed by time. One way to achieve this is the following:

filter = np.array([25, 12, 4])  # we need 25 here so we don't have NaN values later on
for i, f in enumerate(filter):
    df.loc[(df['weeks_elapsed'] <= f), 'filter'] = i + 1
df2 = df.set_index(pd.DatetimeIndex('2014-01-' + df['filter'].astype(int).astype(str)))
results = df2.groupby(pd.TimeGrouper('D')).apply(lambda x: x.groupby(['code', 'colour']).agg(
    {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
     'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]
    }).reset_index())

Now results contains everything, but in an awkward format. Transform it back:

results.set_index(results.index.get_level_values(0).day, drop=True, inplace=True)
results.set_index(filter[results.index.values - 1], drop=True)
Out[490]: 
     code colour  scaled_size                   scaled_size  size             \
                          sum     average  size      idxmax   sum    average   
25    one  black         4655  517.222222     9  2014-01-01   331  36.777778   
25    one  white         2444  305.500000     8  2014-01-01   292  36.500000   
25  three  black         2068  344.666667     6  2014-01-01   246  41.000000   
25  three  white         2859  571.800000     5  2014-01-01   260  52.000000   
25    two  black         6330  575.454545    11  2014-01-01   599  54.454545   
25    two  white         3200  533.333333     6  2014-01-01   291  48.500000   
12    one  black         4004  667.333333     6  2014-01-02   331  55.166667   
12    one  white         2965  741.250000     4  2014-01-02   130  32.500000   
12  three  black         3040  608.000000     5  2014-01-02   344  68.800000   
12  three  white         3795  474.375000     8  2014-01-02   359  44.875000   
12    two  black         2198  314.000000     7  2014-01-02   323  46.142857   
12    two  white         3427  571.166667     6  2014-01-02   271  45.166667   
4     one  black         1501  500.333333     3  2014-01-03    73  24.333333   
4     one  white         1710  570.000000     3  2014-01-03   210  70.000000   
4   three  black         1461  730.500000     2  2014-01-03    14   7.000000   
4   three  white          961  480.500000     2  2014-01-03    14   7.000000   
4     two  black         1656  552.000000     3  2014-01-03   189  63.000000   
4     two  white         2462  410.333333     6  2014-01-03   352  58.666667   

               size  
    size     idxmax  
25     9 2014-01-01  
25     8 2014-01-01  
25     6 2014-01-01  
25     5 2014-01-01  
25    11 2014-01-01  
25     6 2014-01-01  
12     6 2014-01-02  
12     4 2014-01-02  
12     5 2014-01-02  
12     8 2014-01-02  
12     7 2014-01-02  
12     6 2014-01-02  
4      3 2014-01-03  
4      3 2014-01-03  
4      2 2014-01-03  
4      2 2014-01-03  
4      3 2014-01-03  
4      6 2014-01-03 
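Note that pd.TimeGrouper used above was deprecated and eventually removed from pandas; in current versions the same trick can be expressed with pd.Grouper. A hedged sketch on toy data, with the aggregation list shortened for brevity:

```python
import numpy as np
import pandas as pd

# toy data standing in for the question's dataframe
rng = np.random.default_rng(0)
N = 100
df = pd.DataFrame({
    'weeks_elapsed': rng.integers(1, 25, N),
    'code': rng.choice(["one", "two", "three"], N),
    'colour': rng.choice(["black", "white"], N),
    'size': rng.integers(1, 101, N),
})

# map each row to a synthetic date per cutoff bucket, as in the answer
filters = np.array([25, 12, 4])  # 25 avoids NaN buckets
for i, f in enumerate(filters):
    df.loc[df['weeks_elapsed'] <= f, 'filter'] = i + 1
df2 = df.set_index(pd.DatetimeIndex('2014-01-' + df['filter'].astype(int).astype(str)))

# pd.Grouper(freq='D') plays the role of the removed pd.TimeGrouper('D')
results = df2.groupby(pd.Grouper(freq='D')).apply(
    lambda x: x.groupby(['code', 'colour'])['size']
               .agg(['sum', 'mean', 'size'])
               .reset_index())
```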

#3


@FooBar's answer may be better (I haven't fully digested it), but here's one other approach.

First create a function that returns a custom average function, based on your filter condition. The inner function takes just the series; the outer function defines what value to filter on and which dataframe the series comes from.

In [248]: def filter_average(base_df, filter_value, filter_by='weeks_elapsed'):
     ...:     def inner(x):
     ...:         return np.average(x[base_df[filter_by] <= filter_value])
     ...:     inner.__name__ = 'avg<=' + str(filter_value)
     ...:     return inner

Then, in your groupby operation, build versions of the filter average function for different cutoffs with a list comprehension, as below. The __name__ line above is necessary so that the headings under size are distinct.

In [249]: df.groupby(['code','colour']).agg({'size': [filter_average(df, i) 
                                                      for i in cut_offs]})
Out[249]: 
                   size           
                  avg<=4    avg<=12
code  colour                      
one   black   55.166667  56.555556
      white   81.750000  58.583333
three black         NaN  32.000000
      white   40.333333  36.400000
two   black   32.000000  37.714286
      white   95.000000  45.000000

The same approach could be used for np.size, and could maybe even be built into a more generic decorator.


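That generalization might look like the following sketch (my own extension of the answer's idea, not code from it): a factory that wraps any aggregation function in the same filtering closure:

```python
import numpy as np
import pandas as pd

def filtered(agg_func, base_df, filter_value, filter_by='weeks_elapsed'):
    """Wrap agg_func so it only sees rows where filter_by <= filter_value."""
    def inner(x):
        # the boolean mask is aligned to x's index by pandas
        return agg_func(x[base_df[filter_by] <= filter_value])
    inner.__name__ = agg_func.__name__ + '<=' + str(filter_value)
    return inner

# toy data
rng = np.random.default_rng(1)
N = 100
df = pd.DataFrame({
    'weeks_elapsed': rng.integers(1, 25, N),
    'code': rng.choice(["one", "two", "three"], N),
    'size': rng.integers(1, 101, N),
})

cut_offs = [4, 12]
out = df.groupby('code').agg(
    {'size': [filtered(f, df, c) for f in (np.mean, np.size) for c in cut_offs]})
```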