This question is an extension to my earlier one. I have a pandas dataframe:
import random
import numpy as np
import pandas as pd

codes = ["one", "two", "three"]
colours = ["black", "white"]
textures = ["soft", "hard"]
N = 100  # length of the dataframe
df = pd.DataFrame({'id': range(1, N + 1),
                   'weeks_elapsed': [random.choice(range(1, 25)) for i in range(1, N + 1)],
                   'code': [random.choice(codes) for i in range(1, N + 1)],
                   'colour': [random.choice(colours) for i in range(1, N + 1)],
                   'texture': [random.choice(textures) for i in range(1, N + 1)],
                   'size': [random.randint(1, 100) for i in range(1, N + 1)],
                   'scaled_size': [random.randint(100, 1000) for i in range(1, N + 1)]
                   }, columns=['id', 'weeks_elapsed', 'code', 'colour', 'texture', 'size', 'scaled_size'])
I group it by colour and code and get some statistics on size and scaled_size, as below:
grouped = df.groupby(['code', 'colour']).agg( {'size': [np.sum, np.average, np.size, pd.Series.idxmax],'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]}).reset_index()
Now, what I want to do is run the above calculations on the df multiple times, for different weeks_elapsed intervals. Below is a brute-force solution; is there a more succinct and faster way to run this? Also, how can I concatenate the results for the different intervals into a single dataframe?
cut_offs = [4, 12]
grouped = {c: {} for c in cut_offs}
for c in cut_offs:
    # .loc instead of the deprecated .ix
    grouped[c] = df.loc[df.weeks_elapsed <= c].groupby(['code', 'colour']).agg(
        {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
         'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]
         }).reset_index()
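For what it's worth, stacking the per-cutoff frames from grouped into a single dataframe afterwards seems doable with pd.concat (a rough sketch below), but it still runs the whole groupby once per cutoff:
combined = pd.concat(grouped)  # the dict keys (4 and 12) become the outer index level
combined.index = combined.index.set_names('cutoff', level=0)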
I am particularly interested in np.average and np.size for the different weeks_elapsed intervals.
3 Answers
#1
So this is not a fully working answer, but maybe it can be extended to ultimately get you there.
filter = np.array([12, 4])
for f in filter:  # descending order, so each row ends up labelled with the smallest cutoff it satisfies
    df.loc[df['weeks_elapsed'] <= f, 'filter'] = f
Now, df looks like
>>> df.head()
Out[384]:
id weeks_elapsed code colour texture size adjusted_size filter
0 1 20 one white soft 64 494 NaN
1 2 3 three white hard 22 650 4
2 3 22 two black hard 41 770 NaN
3 4 2 two black hard 4 325 4
4 5 4 two black hard 19 536 4
where filter contains the smallest cutoff group that the row belongs to. The next step would be
>>> df.groupby(['filter', 'code', 'colour']).agg({'size': [np.sum, np.average, np.size, pd.Series.idxmax],
                                                  'adjusted_size': [np.sum, np.average, np.size, pd.Series.idxmax]}
                                                 ).reset_index()
Out[387]:
filter code colour adjusted_size size \
sum average size idxmax sum
0 4 one black 2195 548.750000 4 45 142
1 4 one white 286 286.000000 1 81 58
2 4 three black 927 463.500000 2 99 121
3 4 three white 5850 585.000000 10 95 511
4 4 two black 1102 367.333333 3 4 94
5 4 two white 852 852.000000 1 75 2
6 12 one white 2499 499.800000 5 72 267
7 12 three black 4709 588.625000 8 84 431
8 12 three white 569 189.666667 3 97 171
9 12 two black 2446 611.500000 4 49 241
10 12 two white 2859 714.750000 4 43 203
average size idxmax
0 35.500000 4 5
1 58.000000 1 81
2 60.500000 2 99
3 51.100000 10 88
4 31.333333 3 21
5 2.000000 1 75
6 53.400000 5 69
7 53.875000 8 12
8 57.000000 3 59
9 60.250000 4 36
10 50.750000 4 43
However, these are not exactly the groups you were looking for: observations with filter=4 will only be in the group belonging to 4, not in the group with filter=12.
I tried looking at expanding_mean, however that only works row-wise. So far this is incomplete, but maybe it helps someone else to answer this.
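One way around that limitation (a rough sketch reusing df, cut_offs, np and pd from the question, not benchmarked) is to duplicate each row into every cutoff it satisfies and then run a single groupby, so the groups overlap:
cut_offs = [4, 12]
# Each row is repeated once per cutoff it falls under; the original index is kept,
# so idxmax still refers to rows of the original df.
expanded = pd.concat([df[df['weeks_elapsed'] <= c].assign(cutoff=c) for c in cut_offs])
overlapping = expanded.groupby(['cutoff', 'code', 'colour']).agg(
    {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
     'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]})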
#2
Alright, here is an alternative. From my research (I'm only learning myself), the only way to have overlapping groups, which is effectively what you want, is apparently TimeGrouper (pd.Grouper with a frequency in newer pandas). That one, however, needs your data to be indexed by time. One way to achieve this is the following:
filter = np.array([25, 12, 4])  # we need 25 here so we don't have NaN values later on
for i, f in enumerate(filter):
    df.loc[df['weeks_elapsed'] <= f, 'filter'] = i + 1

# map each cutoff to an artificial day, so the rows can be grouped by a time frequency
df2 = df.set_index(pd.DatetimeIndex('2014-01-' + df['filter'].astype(int).astype(str)))
results = df2.groupby(pd.Grouper(freq='D')).apply(  # pd.TimeGrouper('D') in older pandas
    lambda x: x.groupby(['code', 'colour']).agg(
        {'size': [np.sum, np.average, np.size, pd.Series.idxmax],
         'scaled_size': [np.sum, np.average, np.size, pd.Series.idxmax]
         }).reset_index())
Now results contains everything, in a weird format. Transform it back:
# replace the datetime index by the day number (1, 2, 3) of the artificial dates ...
results.set_index(results.index.get_level_values(0).day, drop=True, inplace=True)
# ... and then map each day number back to its cutoff (25, 12, 4)
results.set_index(filter[results.index.values - 1], drop=True)
Out[490]:
code colour scaled_size scaled_size size \
sum average size idxmax sum average
25 one black 4655 517.222222 9 2014-01-01 331 36.777778
25 one white 2444 305.500000 8 2014-01-01 292 36.500000
25 three black 2068 344.666667 6 2014-01-01 246 41.000000
25 three white 2859 571.800000 5 2014-01-01 260 52.000000
25 two black 6330 575.454545 11 2014-01-01 599 54.454545
25 two white 3200 533.333333 6 2014-01-01 291 48.500000
12 one black 4004 667.333333 6 2014-01-02 331 55.166667
12 one white 2965 741.250000 4 2014-01-02 130 32.500000
12 three black 3040 608.000000 5 2014-01-02 344 68.800000
12 three white 3795 474.375000 8 2014-01-02 359 44.875000
12 two black 2198 314.000000 7 2014-01-02 323 46.142857
12 two white 3427 571.166667 6 2014-01-02 271 45.166667
4 one black 1501 500.333333 3 2014-01-03 73 24.333333
4 one white 1710 570.000000 3 2014-01-03 210 70.000000
4 three black 1461 730.500000 2 2014-01-03 14 7.000000
4 three white 961 480.500000 2 2014-01-03 14 7.000000
4 two black 1656 552.000000 3 2014-01-03 189 63.000000
4 two white 2462 410.333333 6 2014-01-03 352 58.666667
size
size idxmax
25 9 2014-01-01
25 8 2014-01-01
25 6 2014-01-01
25 5 2014-01-01
25 11 2014-01-01
25 6 2014-01-01
12 6 2014-01-02
12 4 2014-01-02
12 5 2014-01-02
12 8 2014-01-02
12 7 2014-01-02
12 6 2014-01-02
4 3 2014-01-03
4 3 2014-01-03
4 2 2014-01-03
4 2 2014-01-03
4 3 2014-01-03
4 6 2014-01-03
#3
@FooBar's answer may be better (haven't fully digested it), but here's one other approach.
First create a function that returns a custom average function, based on your filter condition. The inner function takes just the series; the outer function defines what value to filter on and which dataframe the series comes from.
In [248]: def filter_average(base_df, filter_value, filter_by='weeks_elapsed'):
     ...:     def inner(x):
     ...:         return np.average(x[base_df[filter_by] <= filter_value])
     ...:     inner.__name__ = 'avg<=' + str(filter_value)
     ...:     return inner
Then, in your groupby operation, build versions of the filter-average function for the different cutoffs with a list comprehension, as below. The __name__ line above is necessary so that the headings under size are distinct.
In [249]: df.groupby(['code', 'colour']).agg({'size': [filter_average(df, i)
     ...:                                              for i in cut_offs]})
Out[249]:
size
avg<=4 avg<=12
code colour
one black 55.166667 56.555556
white 81.750000 58.583333
three black NaN 32.000000
white 40.333333 36.400000
two black 32.000000 37.714286
white 95.000000 45.000000
The same approach could be used with np.size, and could maybe even be built into a more generic decorator.
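For example, a factory along the following lines (a sketch only; the name filtered and its exact signature are just an illustration) could wrap any reducer so that it only sees the rows within a given cutoff:
def filtered(func, base_df, filter_value, filter_by='weeks_elapsed'):
    # Wrap an arbitrary reducer (np.average, np.size, ...) so it is applied
    # only to the rows of each group that fall within the cutoff.
    def inner(x):
        return func(x[base_df[filter_by] <= filter_value])
    inner.__name__ = '%s<=%s' % (func.__name__, filter_value)
    return inner

# one output column per (reducer, cutoff) pair
df.groupby(['code', 'colour']).agg(
    {'size': [filtered(f, df, c) for f in (np.average, np.size) for c in cut_offs]})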