Groupby, then apply a function, then flatten back to a DataFrame in pandas (Python)

Time: 2021-03-10 21:26:56

I have some data where some of the column values are cumulative sums over the year (YTD sums). I want to change those columns to reflect the quarter-on-quarter differences rather than the cumulative sums. The data looks as follows:

                                ReportNumber   NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                                    
2007-12-31 2008-02-21 00:00:00             1  3131    3073     4    2007
2008-03-31 2008-05-08 00:00:00             1  1189    1482     1    2008
           2009-05-07 16:00:00             2  1149     NaN     1    2008
2008-06-30 2008-08-07 00:00:00             1  2296    2493     2    2008
           2009-08-18 00:00:00             3  2214     NaN     2    2008
2008-09-30 2008-11-06 00:00:00             1  3402    3763     3    2008
           2009-11-07 00:00:00             3  3277     NaN     3    2008
2008-12-31 2009-02-17 16:00:00             1   NaN    4959     4    2008
           2009-02-18 00:00:00             3  4202     NaN     4    2008
           2010-03-21 00:00:00             5  4031     NaN     4    2008
2009-03-31 2009-05-07 16:00:00             1   942    1441     1    2009
2009-06-30 2009-08-06 00:00:00             1  1748    3017     2    2009
2009-09-30 2009-11-07 00:00:00             1  2458    4423     3    2009
2009-12-31 2010-02-24 16:00:00             1  3181    5598     4    2009
2010-03-31 2010-05-07 16:00:00             1   677    1172     1    2010
2010-06-30 2010-08-06 08:00:00             1  1392    2441     2    2010
2010-09-30 2010-11-08 16:00:00             1  1760    3150     3    2010
2010-12-31 2011-02-24 08:00:00             1   961    3946     4    2010
2011-03-31 2011-05-06 16:00:00             1   310     390     1    2011
           2012-08-16 16:00:00             2   319     NaN     1    2011
2011-06-30 2011-08-09 08:00:00             1   465     730     2    2011
           2012-08-16 16:00:00             2   443     NaN     2    2011
2011-09-30 2011-11-09 00:00:00             1   394    1222     3    2011
           2012-11-06 16:00:00             2   411     NaN     3    2011
2011-12-31 2012-03-06 00:00:00             1 -5725    1785     4    2011
           2013-03-05 00:00:00             2 -5754     NaN     4    2011
2012-03-31 2012-05-05 16:00:00             1    42     540     1    2012
           2012-08-16 16:00:00             2    10     NaN     1    2012
2012-06-30 2012-08-02 16:00:00             1  -294     999     2    2012
2012-09-30 2012-11-06 16:00:00             1  -675    1785     3    2012
2012-12-31 2013-03-05 00:00:00             1  -219    2708     4    2012

So I need to take the differences between quarters ('FQTR') within a given 'FYEARQ', ordered by 'FQTR' and then by the 'Published' index, and end up with a single frame. My attempt below works only as long as there aren't multiple values for an item ('NIY', 'OANCFY') for a given reference date.
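The underlying idea can be sketched on a minimal toy frame (hypothetical numbers, ignoring the multiple-report complication for now): within each year, the quarterly figure is the YTD value minus the previous quarter's YTD value, with the first quarter kept as-is.

```python
import pandas as pd

# YTD (cumulative) values for two years, one report number (hypothetical data)
ytd = pd.DataFrame({
    'FYEARQ': [2008, 2008, 2008, 2009, 2009],
    'FQTR': [1, 2, 3, 1, 2],
    'NIY': [1189.0, 2296.0, 3402.0, 942.0, 1748.0],
})

# diff() within each year; the first quarter has no predecessor,
# so fall back to the YTD value itself
ytd['NIY_qtr'] = ytd.groupby('FYEARQ')['NIY'].diff().fillna(ytd['NIY'])

print(ytd)
```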

ft = dict()
for group_name, subdf in cftmp.dropna().drop_duplicates().groupby('FYEARQ'):
    # first row as-is, then row-to-row differences; dropna discards the
    # NaN row that diff() produces for the first observation
    tmp = pd.concat([subdf.head(1), subdf.diff()]).dropna()
    tmp['FQTR'] = subdf['FQTR']
    tmp['FYEARQ'] = subdf['FYEARQ']
    tmp['ReportNumber'] = subdf['ReportNumber']
    ft[group_name] = tmp
    print(group_name)
    print('differences')
    print(tmp)
    print(' ')

The pd.concat call attempts to handle the differences between the quarters ('FQTR'). It returns:

2007
differences
                       ReportNumber   NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                           
2007-12-31 2008-02-21             1  3131    3073     4    2007

2008
differences
                       ReportNumber   NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                           
2008-03-31 2008-05-08             1  1189    1482     1    2008
2008-06-30 2008-08-07             1  1107    1011     2    2008
2008-09-30 2008-11-06             1  1106    1270     3    2008

2009
differences
                                ReportNumber  NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                                   
2009-03-31 2009-05-07 16:00:00             1  942    1441     1    2009
2009-06-30 2009-08-06 00:00:00             1  806    1576     2    2009
2009-09-30 2009-11-07 00:00:00             1  710    1406     3    2009
2009-12-31 2010-02-24 16:00:00             1  723    1175     4    2009

2010
differences
                                ReportNumber  NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                                   
2010-03-31 2010-05-07 16:00:00             1  677    1172     1    2010
2010-06-30 2010-08-06 08:00:00             1  715    1269     2    2010
2010-09-30 2010-11-08 16:00:00             1  368     709     3    2010
2010-12-31 2011-02-24 08:00:00             1 -799     796     4    2010

2011
differences
                                ReportNumber   NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                                    
2011-03-31 2011-05-06 16:00:00             1   310     390     1    2011
2011-06-30 2011-08-09 08:00:00             1   155     340     2    2011
2011-09-30 2011-11-09 00:00:00             1   -71     492     3    2011
2011-12-31 2012-03-06 00:00:00             1 -6119     563     4    2011

2012
differences
                                ReportNumber  NIY  OANCFY  FQTR  FYEARQ
Reference  Published                                                   
2012-03-31 2012-05-05 16:00:00             1   42     540     1    2012
2012-06-30 2012-08-02 16:00:00             1 -336     459     2    2012
2012-09-30 2012-11-06 16:00:00             1 -381     786     3    2012
2012-12-31 2013-03-05 00:00:00             1  456     923     4    2012

The problem with this solution is that it only works for 'ReportNumber' == 1.

I then use pd.concat to flatten it back into one frame:


pd.concat([ft[f] for f in ft])

Any suggestions?

2 Answers

#1

Your cftmp.dropna() call discards the rows whose ReportNumber is not 1; in your example DataFrame, those rows happen to have NaN for OANCFY.

However, to avoid an explicit loop, you can do something like this: get the first observation using head(1) and the deltas using diff(), then concat them together.

In [71]:

newdf = pd.concat((df.groupby(['FYEARQ', 
                               'ReportNumber']).head(1),
                   df.groupby(['FYEARQ', 
                               'ReportNumber']).diff().dropna())).reset_index()\
                                                                 .sort_values('Reference')\
                                                                 .dropna(subset=['OANCFY'])\
                                                                 .reset_index(drop=True)\
                                                                 .ffill()

newdf['FQTR'] = newdf.FQTR.groupby(newdf.FYEARQ).cumsum()

print(newdf)

     Reference           Published  FQTR  FYEARQ   NIY  OANCFY  ReportNumber
0   2007-12-31 2008-02-21 00:00:00     4    2007  3131    3073             1
1   2008-03-31 2008-05-08 00:00:00     1    2008  1189    1482             1
2   2008-06-30 2008-08-07 00:00:00     2    2008  1107    1011             1
3   2008-09-30 2008-11-06 00:00:00     3    2008  1106    1270             1
4   2009-03-31 2009-05-07 16:00:00     1    2009   942    1441             1
5   2009-06-30 2009-08-06 00:00:00     2    2009   806    1576             1
6   2009-09-30 2009-11-07 00:00:00     3    2009   710    1406             1
7   2009-12-31 2010-02-24 16:00:00     4    2009   723    1175             1
8   2010-03-31 2010-05-07 16:00:00     1    2010   677    1172             1
9   2010-06-30 2010-08-06 08:00:00     2    2010   715    1269             1
10  2010-09-30 2010-11-08 16:00:00     3    2010   368     709             1
11  2010-12-31 2011-02-24 08:00:00     4    2010  -799     796             1
12  2011-03-31 2011-05-06 16:00:00     1    2011   310     390             1
13  2011-06-30 2011-08-09 08:00:00     2    2011   155     340             1
14  2011-09-30 2011-11-09 00:00:00     3    2011   -71     492             1
15  2011-12-31 2012-03-06 00:00:00     4    2011 -6119     563             1
16  2012-03-31 2012-05-05 16:00:00     1    2012    42     540             1
17  2012-06-30 2012-08-02 16:00:00     2    2012  -336     459             1
18  2012-09-30 2012-11-06 16:00:00     3    2012  -381     786             1
19  2012-12-31 2013-03-05 00:00:00     4    2012   456     923             1
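A note on the cumsum line at the end: after diff(), the FQTR column holds the first quarter's number for the head row and 1 for every subsequent row, so a groupby-cumsum within the year restores the 1, 2, 3, 4 numbering. A minimal sketch of that trick (hypothetical values):

```python
import pandas as pd

# After concat(head(1), diff()), FQTR holds the first quarter's number
# followed by per-row increments of 1 (hypothetical single-year example)
fqtr = pd.Series([1.0, 1.0, 1.0, 1.0])
fyearq = pd.Series([2008, 2008, 2008, 2008])

# cumsum within the year rebuilds the quarter numbering
restored = fqtr.groupby(fyearq).cumsum()
print(list(restored))
```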

#2



I'm not totally sure from the description of your question what you want, but I'm going to assume the following:

  • The quarter-on-quarter difference, within a year only, per report number.

That certainly looks like what you're after.

Here's how to get started:

  • Sort your DataFrame by FQTR. This makes sure the diff operation orders by quarter.

  • Group by year and report number.

  • Pick the columns you want to diff, and apply pd.DataFrame.diff.
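The three steps above can be sketched on a tiny toy frame (hypothetical numbers) before applying them to the full data:

```python
import pandas as pd

# Toy YTD data: three quarters for one report number, deliberately out of order
toy = pd.DataFrame({
    'FYEARQ': [2008, 2008, 2008],
    'ReportNumber': [1, 1, 1],
    'FQTR': [2, 1, 3],
    'NIY': [2296.0, 1189.0, 3402.0],
})

# 1) sort by quarter, 2) group by year and report number, 3) diff the value column
diffs = (toy.sort_values('FQTR')
            .groupby(['FYEARQ', 'ReportNumber'])[['NIY']]
            .diff())

print(diffs)
```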

Here's an example. First up, some nice copy/pastable data:


In [155]: from numpy import nan   # the dict below uses nan for missing values

In [156]: df = pd.DataFrame({'FQTR': {0: 4, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 1, 11: 2, 12: 3, 13: 4, 14: 1, 15: 2, 16: 3, 17: 4, 18: 1, 19: 1, 20: 2, 21: 2, 22: 3, 23: 3, 24: 4, 25: 4, 26: 1, 27: 1, 28: 2, 29: 3, 30: 4}, 'FYEARQ': {0: 2007, 1: 2008, 2: 2008, 3: 2008, 4: 2008, 5: 2008, 6: 2008, 7: 2008, 8: 2008, 9: 2008, 10: 2009, 11: 2009, 12: 2009, 13: 2009, 14: 2010, 15: 2010, 16: 2010, 17: 2010, 18: 2011, 19: 2011, 20: 2011, 21: 2011, 22: 2011, 23: 2011, 24: 2011, 25: 2011, 26: 2012, 27: 2012, 28: 2012, 29: 2012, 30: 2012}, 'NIY': {0: 3131.0, 1: 1189.0, 2: 1149.0, 3: 2296.0, 4: 2214.0, 5: 3402.0, 6: 3277.0, 7: nan, 8: 4202.0, 9: 4031.0, 10: 942.0, 11: 1748.0, 12: 2458.0, 13: 3181.0, 14: 677.0, 15: 1392.0, 16: 1760.0, 17: 961.0, 18: 310.0, 19: 319.0, 20: 465.0, 21: 443.0, 22: 394.0, 23: 411.0, 24: -5725.0, 25: -5754.0, 26: 42.0, 27: 10.0, 28: -294.0, 29: -675.0, 30: -219.0}, 'OANCFY': {0: 3073.0, 1: 1482.0, 2: nan, 3: 2493.0, 4: nan, 5: 3763.0, 6: nan, 7: 4959.0, 8: nan, 9: nan, 10: 1441.0, 11: 3017.0, 12: 4423.0, 13: 5598.0, 14: 1172.0, 15: 2441.0, 16: 3150.0, 17: 3946.0, 18: 390.0, 19: nan, 20: 730.0, 21: nan, 22: 1222.0, 23: nan, 24: 1785.0, 25: nan, 26: 540.0, 27: nan, 28: 999.0, 29: 1785.0, 30: 2708.0}, 'Published': {0: '2008-02-21 00:00:00', 1: '2008-05-08 00:00:00', 2: '2009-05-07 16:00:00', 3: '2008-08-07 00:00:00', 4: '2009-08-18 00:00:00', 5: '2008-11-06 00:00:00', 6: '2009-11-07 00:00:00', 7: '2009-02-17 16:00:00', 8: '2009-02-18 00:00:00', 9: '2010-03-21 00:00:00', 10: '2009-05-07 16:00:00', 11: '2009-08-06 00:00:00', 12: '2009-11-07 00:00:00', 13: '2010-02-24 16:00:00', 14: '2010-05-07 16:00:00', 15: '2010-08-06 08:00:00', 16: '2010-11-08 16:00:00', 17: '2011-02-24 08:00:00', 18: '2011-05-06 16:00:00', 19: '2012-08-16 16:00:00', 20: '2011-08-09 08:00:00', 21: '2012-08-16 16:00:00', 22: '2011-11-09 00:00:00', 23: '2012-11-06 16:00:00', 24: '2012-03-06 00:00:00', 25: '2013-03-05 00:00:00', 26: '2012-05-05 16:00:00', 27: '2012-08-16 16:00:00', 28: '2012-08-02 16:00:00', 29: '2012-11-06 16:00:00', 30: '2013-03-05 00:00:00'}, 'Reference': {0: '2007-12-31', 1: '2008-03-31', 2: '2008-03-31', 3: '2008-06-30', 4: '2008-06-30', 5: '2008-09-30', 6: '2008-09-30', 7: '2008-12-31', 8: '2008-12-31', 9: '2008-12-31', 10: '2009-03-31', 11: '2009-06-30', 12: '2009-09-30', 13: '2009-12-31', 14: '2010-03-31', 15: '2010-06-30', 16: '2010-09-30', 17: '2010-12-31', 18: '2011-03-31', 19: '2011-03-31', 20: '2011-06-30', 21: '2011-06-30', 22: '2011-09-30', 23: '2011-09-30', 24: '2011-12-31', 25: '2011-12-31', 26: '2012-03-31', 27: '2012-03-31', 28: '2012-06-30', 29: '2012-09-30', 30: '2012-12-31'}, 'ReportNumber': {0: 1, 1: 1, 2: 2, 3: 1, 4: 3, 5: 1, 6: 3, 7: 1, 8: 3, 9: 5, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 2, 20: 1, 21: 2, 22: 1, 23: 2, 24: 1, 25: 2, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1}})
In [165]: df.head()
Out[165]: 
    Reference             Published  ReportNumber   NIY  OANCFY  FQTR  FYEARQ
0  2007-12-31   2008-02-21 00:00:00             1  3131    3073     4    2007
1  2008-03-31   2008-05-08 00:00:00             1  1189    1482     1    2008
2  2008-03-31   2009-05-07 16:00:00             2  1149     NaN     1    2008
3  2008-06-30   2008-08-07 00:00:00             1  2296    2493     2    2008
4  2008-06-30   2009-08-18 00:00:00             3  2214     NaN     2    2008

Now apply the logic described above:


In [166]: diffs = df.sort_values('FQTR').groupby(['FYEARQ',
    'ReportNumber'])[['NIY', 'OANCFY']].apply(pd.DataFrame.diff)

Rename the resulting columns:


In [177]: diffs.columns = ['%s_diff' % col for col in diffs.columns]

and, if you want, join the differenced columns back to the original df:


In [180]: with_diffs = pd.concat([df, diffs], axis=1)

Leaving you with something pretty close to what you want:


In [182]: with_diffs[['Published', 'ReportNumber', 'NIY', 'FQTR', 'FYEARQ', 'NIY_diff']]
Out[182]: 
               Published  ReportNumber   NIY  FQTR  FYEARQ  NIY_diff
0    2008-02-21 00:00:00             1  3131     4    2007       NaN
1    2008-05-08 00:00:00             1  1189     1    2008       NaN
2    2009-05-07 16:00:00             2  1149     1    2008       NaN
3    2008-08-07 00:00:00             1  2296     2    2008      1107
4    2009-08-18 00:00:00             3  2214     2    2008       NaN
5    2008-11-06 00:00:00             1  3402     3    2008      1106
6    2009-11-07 00:00:00             3  3277     3    2008      1063
7    2009-02-17 16:00:00             1   NaN     4    2008       NaN
8    2009-02-18 00:00:00             3  4202     4    2008       925
9    2010-03-21 00:00:00             5  4031     4    2008       NaN
10   2009-05-07 16:00:00             1   942     1    2009       NaN
11   2009-08-06 00:00:00             1  1748     2    2009       806
12   2009-11-07 00:00:00             1  2458     3    2009       710
13   2010-02-24 16:00:00             1  3181     4    2009       723
14   2010-05-07 16:00:00             1   677     1    2010       NaN
15   2010-08-06 08:00:00             1  1392     2    2010       715
16   2010-11-08 16:00:00             1  1760     3    2010       368
17   2011-02-24 08:00:00             1   961     4    2010      -799
18   2011-05-06 16:00:00             1   310     1    2011       NaN
19   2012-08-16 16:00:00             2   319     1    2011       NaN
20   2011-08-09 08:00:00             1   465     2    2011       155
21   2012-08-16 16:00:00             2   443     2    2011       124
22   2011-11-09 00:00:00             1   394     3    2011       -71
23   2012-11-06 16:00:00             2   411     3    2011       -32
24   2012-03-06 00:00:00             1 -5725     4    2011     -6119
25   2013-03-05 00:00:00             2 -5754     4    2011     -6165
26   2012-05-05 16:00:00             1    42     1    2012       NaN
27   2012-08-16 16:00:00             2    10     1    2012       NaN
28   2012-08-02 16:00:00             1  -294     2    2012      -336
29   2012-11-06 16:00:00             1  -675     3    2012      -381
30   2013-03-05 00:00:00             1  -219     4    2012       456

Clearly there's a little bit of work still to do to tidy up (i.e. drop the NaN values if you want) but that's left as an exercise!
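A hedged sketch of that tidy-up exercise (using a hypothetical miniature of the joined frame): dropping the rows where no quarter-on-quarter difference could be computed is a one-liner with dropna on the diff column.

```python
import pandas as pd

# Hypothetical miniature of the joined with_diffs frame: NIY_diff is NaN
# for the first quarter of each (year, report-number) group
with_diffs = pd.DataFrame({
    'FQTR': [1, 2, 3],
    'FYEARQ': [2008, 2008, 2008],
    'NIY': [1189.0, 2296.0, 3402.0],
    'NIY_diff': [float('nan'), 1107.0, 1106.0],
})

# Keep only rows with a computed quarter-on-quarter difference
tidy = with_diffs.dropna(subset=['NIY_diff']).reset_index(drop=True)

print(tidy)
```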

