I have some data where some of the column values are the cumsum of the year (YTD sums). I want to change those columns to reflect the differences rather than the cumulative sums. The data looks as follows:
我有一些数据,其中一些列值是年度的cumsum(年初至今的总和)。我想更改这些列以反映差异而不是累积总和。数据如下:
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2007-12-31 2008-02-21 00:00:00 1 3131 3073 4 2007
2008-03-31 2008-05-08 00:00:00 1 1189 1482 1 2008
2009-05-07 16:00:00 2 1149 NaN 1 2008
2008-06-30 2008-08-07 00:00:00 1 2296 2493 2 2008
2009-08-18 00:00:00 3 2214 NaN 2 2008
2008-09-30 2008-11-06 00:00:00 1 3402 3763 3 2008
2009-11-07 00:00:00 3 3277 NaN 3 2008
2008-12-31 2009-02-17 16:00:00 1 NaN 4959 4 2008
2009-02-18 00:00:00 3 4202 NaN 4 2008
2010-03-21 00:00:00 5 4031 NaN 4 2008
2009-03-31 2009-05-07 16:00:00 1 942 1441 1 2009
2009-06-30 2009-08-06 00:00:00 1 1748 3017 2 2009
2009-09-30 2009-11-07 00:00:00 1 2458 4423 3 2009
2009-12-31 2010-02-24 16:00:00 1 3181 5598 4 2009
2010-03-31 2010-05-07 16:00:00 1 677 1172 1 2010
2010-06-30 2010-08-06 08:00:00 1 1392 2441 2 2010
2010-09-30 2010-11-08 16:00:00 1 1760 3150 3 2010
2010-12-31 2011-02-24 08:00:00 1 961 3946 4 2010
2011-03-31 2011-05-06 16:00:00 1 310 390 1 2011
2012-08-16 16:00:00 2 319 NaN 1 2011
2011-06-30 2011-08-09 08:00:00 1 465 730 2 2011
2012-08-16 16:00:00 2 443 NaN 2 2011
2011-09-30 2011-11-09 00:00:00 1 394 1222 3 2011
2012-11-06 16:00:00 2 411 NaN 3 2011
2011-12-31 2012-03-06 00:00:00 1 -5725 1785 4 2011
2013-03-05 00:00:00 2 -5754 NaN 4 2011
2012-03-31 2012-05-05 16:00:00 1 42 540 1 2012
2012-08-16 16:00:00 2 10 NaN 1 2012
2012-06-30 2012-08-02 16:00:00 1 -294 999 2 2012
2012-09-30 2012-11-06 16:00:00 1 -675 1785 3 2012
2012-12-31 2013-03-05 00:00:00 1 -219 2708 4 2012
So I need to take the differences between FQTR within in a given FYEARQ according to the 'FQTR' then the 'Published' index and have it as one frame. In my attempt which works as long as there isn't multiple values for an item ('NIY', 'OANCFY') for the given reference data.
所以我需要在给定的FYEARQ中根据'FQTR'然后'发布'索引取得FQTR之间的差异并将其作为一个帧。在我的尝试中,只要给定参考数据的项目('NIY','OANCFY')没有多个值,它就会起作用。
cfgtmp = cftmp.groupby('FYEARQ')
ft = dict()
for group_name, subdf in cftmp.dropna().drop_duplicates().groupby('FYEARQ'):
tmp = pd.concat([subdf.head(1), subdf.diff()]).dropna()
tmp['FQTR'] = subdf['FQTR']
tmp['FYEARQ'] = subdf['FYEARQ']
tmp['ReportNumber'] = subdf['ReportNumber']
ft.update({group_name : tmp})
print group_name
print 'differences'
print tmp
print ' '
the pd.concat attempts to handle the differences between the quarters ('FQTR'). It returns:
pd.concat尝试处理季度之间的差异('FQTR')。它返回:
2007
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2007-12-31 2008-02-21 1 3131 3073 4 2007
2008
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2008-03-31 2008-05-08 1 1189 1482 1 2008
2008-06-30 2008-08-07 1 1107 1011 2 2008
2008-09-30 2008-11-06 1 1106 1270 3 2008
2009
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2009-03-31 2009-05-07 16:00:00 1 942 1441 1 2009
2009-06-30 2009-08-06 00:00:00 1 806 1576 2 2009
2009-09-30 2009-11-07 00:00:00 1 710 1406 3 2009
2009-12-31 2010-02-24 16:00:00 1 723 1175 4 2009
2010
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2010-03-31 2010-05-07 16:00:00 1 677 1172 1 2010
2010-06-30 2010-08-06 08:00:00 1 715 1269 2 2010
2010-09-30 2010-11-08 16:00:00 1 368 709 3 2010
2010-12-31 2011-02-24 08:00:00 1 -799 796 4 2010
2011
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2011-03-31 2011-05-06 16:00:00 1 310 390 1 2011
2011-06-30 2011-08-09 08:00:00 1 155 340 2 2011
2011-09-30 2011-11-09 00:00:00 1 -71 492 3 2011
2011-12-31 2012-03-06 00:00:00 1 -6119 563 4 2011
2012
differences
ReportNumber NIY OANCFY FQTR FYEARQ
Reference Published
2012-03-31 2012-05-05 16:00:00 1 42 540 1 2012
2012-06-30 2012-08-02 16:00:00 1 -336 459 2 2012
2012-09-30 2012-11-06 16:00:00 1 -381 786 3 2012
2012-12-31 2013-03-05 00:00:00 1 456 923 4 2012
The problem with this solution is that it is only effective for the 'ReportNumber' == 1
此解决方案的问题在于它仅对'ReportNumber'== 1有效
I then use pd.concat to flatten it back into one frame:
然后我使用pd.concat将其展平为一帧:
pd.concat([ft[f] for f in ft])
Any suggestions?
2 个解决方案
#1
1
Your cftmp.dropna().
parts discarded the data associated with ReportNumber other than 1. In you example dataframe, those data happen to have nan
for OANCFY.
你的cftmp.dropna()。部件丢弃与ReportNumber相关的数据而不是1.在您的示例数据框中,这些数据恰好具有针对OANCFY的nan。
However, to avoid using loop, you can do something like this: get the 1st observation using head
and the delta's using diff()
, then concat
them together.
但是,为了避免使用循环,您可以执行以下操作:使用head获取第一个观察值,使用diff()获取delta,然后将它们连接在一起。
In [71]:
newdf = pd.concat((df.groupby(['FYEARQ',
'ReportNumber']).head(1),
df.groupby(['FYEARQ',
'ReportNumber']).diff().dropna())).reset_index()\
.sort('Reference')\
.dropna(subset=['OANCFY'])\
.reset_index(drop=True)\
.fillna(method='pad')
newdf['FQTR'] = newdf.FQTR.groupby(newdf.FYEARQ).cumsum()
print newdf
Reference Published FQTR FYEARQ NIY OANCFY ReportNumber
0 2007-12-31 2008-02-21 00:00:00 4 2007 3131 3073 1
1 2008-03-31 2008-05-08 00:00:00 1 2008 1189 1482 1
2 2008-06-30 2008-08-07 00:00:00 2 2008 1107 1011 1
3 2008-09-30 2008-11-06 00:00:00 3 2008 1106 1270 1
4 2009-03-31 2009-05-07 16:00:00 1 2009 942 1441 1
5 2009-06-30 2009-08-06 00:00:00 2 2009 806 1576 1
6 2009-09-30 2009-11-07 00:00:00 3 2009 710 1406 1
7 2009-12-31 2010-02-24 16:00:00 4 2009 723 1175 1
8 2010-03-31 2010-05-07 16:00:00 1 2010 677 1172 1
9 2010-06-30 2010-08-06 08:00:00 2 2010 715 1269 1
10 2010-09-30 2010-11-08 16:00:00 3 2010 368 709 1
11 2010-12-31 2011-02-24 08:00:00 4 2010 -799 796 1
12 2011-03-31 2011-05-06 16:00:00 1 2011 310 390 1
13 2011-06-30 2011-08-09 08:00:00 2 2011 155 340 1
14 2011-09-30 2011-11-09 00:00:00 3 2011 -71 492 1
15 2011-12-31 2012-03-06 00:00:00 4 2011 -6119 563 1
16 2012-03-31 2012-05-05 16:00:00 1 2012 42 540 1
17 2012-06-30 2012-08-02 16:00:00 2 2012 -336 459 1
18 2012-09-30 2012-11-06 16:00:00 3 2012 -381 786 1
19 2012-12-31 2013-03-05 00:00:00 4 2012 456 923 1
#2
1
I'm not totally sure from the description of your question what you want, but I'm going to assume the following:
我不完全确定你的问题描述你想要什么,但我将假设以下内容:
- The quarter-on-quarter difference, within a year only, per report number.
每个报告编号仅在一年内的季度差异。
That certainly looks like what you're looking for.
这肯定看起来像你在寻找什么。
Here's how to get started:
以下是如何入门:
- Sort your
DataFrame
byFQTR
. This will make sure thediff
operation orders by quarter - group by year and report number.
- pick the columns you want to diff, and apply
pd.DataFrame.diff
按FQTR对DataFrame排序。这将确保差异操作按季度排序
按年份分组和报告编号。
选择要diff的列,并应用pd.DataFrame.diff
Here's an example. First up, some nice copy/pastable data:
这是一个例子。首先,一些很好的复制/可处理数据:
In [156]: df = {'FQTR': {0: 4, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 1, 11: 2, 12: 3, 13: 4, 14: 1, 15: 2, 16: 3, 17: 4, 18: 1, 19: 1, 20: 2, 21: 2, 22: 3, 23: 3, 24: 4, 25: 4, 26: 1, 27: 1, 28: 2, 29: 3, 30: 4}, 'FYEARQ': {0: 2007, 1: 2008, 2: 2008, 3: 2008, 4: 2008, 5: 2008, 6: 2008, 7: 2008, 8: 2008, 9: 2008, 10: 2009, 11: 2009, 12: 2009, 13: 2009, 14: 2010, 15: 2010, 16: 2010, 17: 2010, 18: 2011, 19: 2011, 20: 2011, 21: 2011, 22: 2011, 23: 2011, 24: 2011, 25: 2011, 26: 2012, 27: 2012, 28: 2012, 29: 2012, 30: 2012}, 'NIY': {0: 3131.0, 1: 1189.0, 2: 1149.0, 3: 2296.0, 4: 2214.0, 5: 3402.0, 6: 3277.0, 7: nan, 8: 4202.0, 9: 4031.0, 10: 942.0, 11: 1748.0, 12: 2458.0, 13: 3181.0, 14: 677.0, 15: 1392.0, 16: 1760.0, 17: 961.0, 18: 310.0, 19: 319.0, 20: 465.0, 21: 443.0, 22: 394.0, 23: 411.0, 24: -5725.0, 25: -5754.0, 26: 42.0, 27: 10.0, 28: -294.0, 29: -675.0, 30: -219.0}, 'OANCFY': {0: 3073.0, 1: 1482.0, 2: nan, 3: 2493.0, 4: nan, 5: 3763.0, 6: nan, 7: 4959.0, 8: nan, 9: nan, 10: 1441.0, 11: 3017.0, 12: 4423.0, 13: 5598.0, 14: 1172.0, 15: 2441.0, 16: 3150.0, 17: 3946.0, 18: 390.0, 19: nan, 20: 730.0, 21: nan, 22: 1222.0, 23: nan, 24: 1785.0, 25: nan, 26: 540.0, 27: nan, 28: 999.0, 29: 1785.0, 30: 2708.0}, 'Published': {0: '2008-02-21 00:00:00', 1: '2008-05-08 00:00:00', 2: '2009-05-07 16:00:00', 3: '2008-08-07 00:00:00', 4: '2009-08-18 00:00:00', 5: '2008-11-06 00:00:00', 6: '2009-11-07 00:00:00', 7: '2009-02-17 16:00:00', 8: '2009-02-18 00:00:00', 9: '2010-03-21 00:00:00', 10: '2009-05-07 16:00:00', 11: '2009-08-06 00:00:00', 12: '2009-11-07 00:00:00', 13: '2010-02-24 16:00:00', 14: '2010-05-07 16:00:00', 15: '2010-08-06 08:00:00', 16: '2010-11-08 16:00:00', 17: '2011-02-24 08:00:00', 18: '2011-05-06 16:00:00', 19: '2012-08-16 16:00:00', 20: '2011-08-09 08:00:00', 21: '2012-08-16 16:00:00', 22: '2011-11-09 00:00:00', 23: '2012-11-06 16:00:00', 24: '2012-03-06 00:00:00', 25: '2013-03-05 00:00:00', 26: '2012-05-05 16:00:00', 27: '2012-08-16 16:00:00', 28: '2012-08-02 16:00:00', 29: '2012-11-06 16:00:00', 30: '2013-03-05 00:00:00'}, 'Reference': {0: '2007-12-31', 1: '2008-03-31', 2: '2008-03-31', 3: '2008-06-30', 4: '2008-06-30', 5: '2008-09-30', 6: '2008-09-30', 7: '2008-12-31', 8: '2008-12-31', 9: '2008-12-31', 10: '2009-03-31', 11: '2009-06-30', 12: '2009-09-30', 13: '2009-12-31', 14: '2010-03-31', 15: '2010-06-30', 16: '2010-09-30', 17: '2010-12-31', 18: '2011-03-31', 19: '2011-03-31', 20: '2011-06-30', 21: '2011-06-30', 22: '2011-09-30', 23: '2011-09-30', 24: '2011-12-31', 25: '2011-12-31', 26: '2012-03-31', 27: '2012-03-31', 28: '2012-06-30', 29: '2012-09-30', 30: '2012-12-31'}, 'ReportNumber': {0: 1, 1: 1, 2: 2, 3: 1, 4: 3, 5: 1, 6: 3, 7: 1, 8: 3, 9: 5, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 2, 20: 1, 21: 2, 22: 1, 23: 2, 24: 1, 25: 2, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1}}
In [165]: df.head()
Out[165]:
Reference Published ReportNumber NIY OANCFY FQTR FYEARQ
0 2007-12-31 2008-02-21 00:00:00 1 3131 3073 4 2007
1 2008-03-31 2008-05-08 00:00:00 1 1189 1482 1 2008
2 2008-03-31 2009-05-07 16:00:00 2 1149 NaN 1 2008
3 2008-06-30 2008-08-07 00:00:00 1 2296 2493 2 2008
4 2008-06-30 2009-08-18 00:00:00 3 2214 NaN 2 2008
Now apply the logic described above:
现在应用上面描述的逻辑:
In [166]: diffs = df.sort('FQTR').groupby(['FYEARQ',
'ReportNumber'])[['NIY', 'OANCFY']].apply(pd.DataFrame.diff)
Rename the resulting columns:
重命名结果列:
In [177]: diffs.columns = ['%s_diff' % col for col in diffs.columns]
and, if you want, join the differenced columns back to the original df
:
并且,如果需要,将差异列连接回原始df:
In [180]: with_diffs = pd.concat([df, diffs], axis=1)
Leaving you with something pretty close to what you want:
离开你的东西非常接近你想要的东西:
In [182]: with_diffs[['Published', 'ReportNumber', 'NIY', 'FQTR', 'FYEARQ', 'NIY_diff']]
Out[182]:
Published ReportNumber NIY FQTR FYEARQ NIY_diff
0 2008-02-21 00:00:00 1 3131 4 2007 NaN
1 2008-05-08 00:00:00 1 1189 1 2008 NaN
2 2009-05-07 16:00:00 2 1149 1 2008 NaN
3 2008-08-07 00:00:00 1 2296 2 2008 1107
4 2009-08-18 00:00:00 3 2214 2 2008 NaN
5 2008-11-06 00:00:00 1 3402 3 2008 1106
6 2009-11-07 00:00:00 3 3277 3 2008 1063
7 2009-02-17 16:00:00 1 NaN 4 2008 NaN
8 2009-02-18 00:00:00 3 4202 4 2008 925
9 2010-03-21 00:00:00 5 4031 4 2008 NaN
10 2009-05-07 16:00:00 1 942 1 2009 NaN
11 2009-08-06 00:00:00 1 1748 2 2009 806
12 2009-11-07 00:00:00 1 2458 3 2009 710
13 2010-02-24 16:00:00 1 3181 4 2009 723
14 2010-05-07 16:00:00 1 677 1 2010 NaN
15 2010-08-06 08:00:00 1 1392 2 2010 715
16 2010-11-08 16:00:00 1 1760 3 2010 368
17 2011-02-24 08:00:00 1 961 4 2010 -799
18 2011-05-06 16:00:00 1 310 1 2011 NaN
19 2012-08-16 16:00:00 2 319 1 2011 NaN
20 2011-08-09 08:00:00 1 465 2 2011 155
21 2012-08-16 16:00:00 2 443 2 2011 124
22 2011-11-09 00:00:00 1 394 3 2011 -71
23 2012-11-06 16:00:00 2 411 3 2011 -32
24 2012-03-06 00:00:00 1 -5725 4 2011 -6119
25 2013-03-05 00:00:00 2 -5754 4 2011 -6165
26 2012-05-05 16:00:00 1 42 1 2012 NaN
27 2012-08-16 16:00:00 2 10 1 2012 NaN
28 2012-08-02 16:00:00 1 -294 2 2012 -336
29 2012-11-06 16:00:00 1 -675 3 2012 -381
30 2013-03-05 00:00:00 1 -219 4 2012 456
Clearly there's a little bit of work still to do to tidy up (i.e. drop the NaN
values if you want) but that's left as an exercise!
显然,还有一些工作要做,以便整理(即如果你想要的话,可以降低NaN值),但这仍然是一个练习!
#1
1
Your cftmp.dropna().
parts discarded the data associated with ReportNumber other than 1. In you example dataframe, those data happen to have nan
for OANCFY.
你的cftmp.dropna()。部件丢弃与ReportNumber相关的数据而不是1.在您的示例数据框中,这些数据恰好具有针对OANCFY的nan。
However, to avoid using loop, you can do something like this: get the 1st observation using head
and the delta's using diff()
, then concat
them together.
但是,为了避免使用循环,您可以执行以下操作:使用head获取第一个观察值,使用diff()获取delta,然后将它们连接在一起。
In [71]:
newdf = pd.concat((df.groupby(['FYEARQ',
'ReportNumber']).head(1),
df.groupby(['FYEARQ',
'ReportNumber']).diff().dropna())).reset_index()\
.sort('Reference')\
.dropna(subset=['OANCFY'])\
.reset_index(drop=True)\
.fillna(method='pad')
newdf['FQTR'] = newdf.FQTR.groupby(newdf.FYEARQ).cumsum()
print newdf
Reference Published FQTR FYEARQ NIY OANCFY ReportNumber
0 2007-12-31 2008-02-21 00:00:00 4 2007 3131 3073 1
1 2008-03-31 2008-05-08 00:00:00 1 2008 1189 1482 1
2 2008-06-30 2008-08-07 00:00:00 2 2008 1107 1011 1
3 2008-09-30 2008-11-06 00:00:00 3 2008 1106 1270 1
4 2009-03-31 2009-05-07 16:00:00 1 2009 942 1441 1
5 2009-06-30 2009-08-06 00:00:00 2 2009 806 1576 1
6 2009-09-30 2009-11-07 00:00:00 3 2009 710 1406 1
7 2009-12-31 2010-02-24 16:00:00 4 2009 723 1175 1
8 2010-03-31 2010-05-07 16:00:00 1 2010 677 1172 1
9 2010-06-30 2010-08-06 08:00:00 2 2010 715 1269 1
10 2010-09-30 2010-11-08 16:00:00 3 2010 368 709 1
11 2010-12-31 2011-02-24 08:00:00 4 2010 -799 796 1
12 2011-03-31 2011-05-06 16:00:00 1 2011 310 390 1
13 2011-06-30 2011-08-09 08:00:00 2 2011 155 340 1
14 2011-09-30 2011-11-09 00:00:00 3 2011 -71 492 1
15 2011-12-31 2012-03-06 00:00:00 4 2011 -6119 563 1
16 2012-03-31 2012-05-05 16:00:00 1 2012 42 540 1
17 2012-06-30 2012-08-02 16:00:00 2 2012 -336 459 1
18 2012-09-30 2012-11-06 16:00:00 3 2012 -381 786 1
19 2012-12-31 2013-03-05 00:00:00 4 2012 456 923 1
#2
1
I'm not totally sure from the description of your question what you want, but I'm going to assume the following:
我不完全确定你的问题描述你想要什么,但我将假设以下内容:
- The quarter-on-quarter difference, within a year only, per report number.
每个报告编号仅在一年内的季度差异。
That certainly looks like what you're looking for.
这肯定看起来像你在寻找什么。
Here's how to get started:
以下是如何入门:
- Sort your
DataFrame
byFQTR
. This will make sure thediff
operation orders by quarter - group by year and report number.
- pick the columns you want to diff, and apply
pd.DataFrame.diff
按FQTR对DataFrame排序。这将确保差异操作按季度排序
按年份分组和报告编号。
选择要diff的列,并应用pd.DataFrame.diff
Here's an example. First up, some nice copy/pastable data:
这是一个例子。首先,一些很好的复制/可处理数据:
In [156]: df = {'FQTR': {0: 4, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 1, 11: 2, 12: 3, 13: 4, 14: 1, 15: 2, 16: 3, 17: 4, 18: 1, 19: 1, 20: 2, 21: 2, 22: 3, 23: 3, 24: 4, 25: 4, 26: 1, 27: 1, 28: 2, 29: 3, 30: 4}, 'FYEARQ': {0: 2007, 1: 2008, 2: 2008, 3: 2008, 4: 2008, 5: 2008, 6: 2008, 7: 2008, 8: 2008, 9: 2008, 10: 2009, 11: 2009, 12: 2009, 13: 2009, 14: 2010, 15: 2010, 16: 2010, 17: 2010, 18: 2011, 19: 2011, 20: 2011, 21: 2011, 22: 2011, 23: 2011, 24: 2011, 25: 2011, 26: 2012, 27: 2012, 28: 2012, 29: 2012, 30: 2012}, 'NIY': {0: 3131.0, 1: 1189.0, 2: 1149.0, 3: 2296.0, 4: 2214.0, 5: 3402.0, 6: 3277.0, 7: nan, 8: 4202.0, 9: 4031.0, 10: 942.0, 11: 1748.0, 12: 2458.0, 13: 3181.0, 14: 677.0, 15: 1392.0, 16: 1760.0, 17: 961.0, 18: 310.0, 19: 319.0, 20: 465.0, 21: 443.0, 22: 394.0, 23: 411.0, 24: -5725.0, 25: -5754.0, 26: 42.0, 27: 10.0, 28: -294.0, 29: -675.0, 30: -219.0}, 'OANCFY': {0: 3073.0, 1: 1482.0, 2: nan, 3: 2493.0, 4: nan, 5: 3763.0, 6: nan, 7: 4959.0, 8: nan, 9: nan, 10: 1441.0, 11: 3017.0, 12: 4423.0, 13: 5598.0, 14: 1172.0, 15: 2441.0, 16: 3150.0, 17: 3946.0, 18: 390.0, 19: nan, 20: 730.0, 21: nan, 22: 1222.0, 23: nan, 24: 1785.0, 25: nan, 26: 540.0, 27: nan, 28: 999.0, 29: 1785.0, 30: 2708.0}, 'Published': {0: '2008-02-21 00:00:00', 1: '2008-05-08 00:00:00', 2: '2009-05-07 16:00:00', 3: '2008-08-07 00:00:00', 4: '2009-08-18 00:00:00', 5: '2008-11-06 00:00:00', 6: '2009-11-07 00:00:00', 7: '2009-02-17 16:00:00', 8: '2009-02-18 00:00:00', 9: '2010-03-21 00:00:00', 10: '2009-05-07 16:00:00', 11: '2009-08-06 00:00:00', 12: '2009-11-07 00:00:00', 13: '2010-02-24 16:00:00', 14: '2010-05-07 16:00:00', 15: '2010-08-06 08:00:00', 16: '2010-11-08 16:00:00', 17: '2011-02-24 08:00:00', 18: '2011-05-06 16:00:00', 19: '2012-08-16 16:00:00', 20: '2011-08-09 08:00:00', 21: '2012-08-16 16:00:00', 22: '2011-11-09 00:00:00', 23: '2012-11-06 16:00:00', 24: '2012-03-06 00:00:00', 25: '2013-03-05 00:00:00', 26: '2012-05-05 16:00:00', 27: '2012-08-16 16:00:00', 28: '2012-08-02 16:00:00', 29: '2012-11-06 16:00:00', 30: '2013-03-05 00:00:00'}, 'Reference': {0: '2007-12-31', 1: '2008-03-31', 2: '2008-03-31', 3: '2008-06-30', 4: '2008-06-30', 5: '2008-09-30', 6: '2008-09-30', 7: '2008-12-31', 8: '2008-12-31', 9: '2008-12-31', 10: '2009-03-31', 11: '2009-06-30', 12: '2009-09-30', 13: '2009-12-31', 14: '2010-03-31', 15: '2010-06-30', 16: '2010-09-30', 17: '2010-12-31', 18: '2011-03-31', 19: '2011-03-31', 20: '2011-06-30', 21: '2011-06-30', 22: '2011-09-30', 23: '2011-09-30', 24: '2011-12-31', 25: '2011-12-31', 26: '2012-03-31', 27: '2012-03-31', 28: '2012-06-30', 29: '2012-09-30', 30: '2012-12-31'}, 'ReportNumber': {0: 1, 1: 1, 2: 2, 3: 1, 4: 3, 5: 1, 6: 3, 7: 1, 8: 3, 9: 5, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1, 19: 2, 20: 1, 21: 2, 22: 1, 23: 2, 24: 1, 25: 2, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1}}
In [165]: df.head()
Out[165]:
Reference Published ReportNumber NIY OANCFY FQTR FYEARQ
0 2007-12-31 2008-02-21 00:00:00 1 3131 3073 4 2007
1 2008-03-31 2008-05-08 00:00:00 1 1189 1482 1 2008
2 2008-03-31 2009-05-07 16:00:00 2 1149 NaN 1 2008
3 2008-06-30 2008-08-07 00:00:00 1 2296 2493 2 2008
4 2008-06-30 2009-08-18 00:00:00 3 2214 NaN 2 2008
Now apply the logic described above:
现在应用上面描述的逻辑:
In [166]: diffs = df.sort('FQTR').groupby(['FYEARQ',
'ReportNumber'])[['NIY', 'OANCFY']].apply(pd.DataFrame.diff)
Rename the resulting columns:
重命名结果列:
In [177]: diffs.columns = ['%s_diff' % col for col in diffs.columns]
and, if you want, join the differenced columns back to the original df
:
并且,如果需要,将差异列连接回原始df:
In [180]: with_diffs = pd.concat([df, diffs], axis=1)
Leaving you with something pretty close to what you want:
离开你的东西非常接近你想要的东西:
In [182]: with_diffs[['Published', 'ReportNumber', 'NIY', 'FQTR', 'FYEARQ', 'NIY_diff']]
Out[182]:
Published ReportNumber NIY FQTR FYEARQ NIY_diff
0 2008-02-21 00:00:00 1 3131 4 2007 NaN
1 2008-05-08 00:00:00 1 1189 1 2008 NaN
2 2009-05-07 16:00:00 2 1149 1 2008 NaN
3 2008-08-07 00:00:00 1 2296 2 2008 1107
4 2009-08-18 00:00:00 3 2214 2 2008 NaN
5 2008-11-06 00:00:00 1 3402 3 2008 1106
6 2009-11-07 00:00:00 3 3277 3 2008 1063
7 2009-02-17 16:00:00 1 NaN 4 2008 NaN
8 2009-02-18 00:00:00 3 4202 4 2008 925
9 2010-03-21 00:00:00 5 4031 4 2008 NaN
10 2009-05-07 16:00:00 1 942 1 2009 NaN
11 2009-08-06 00:00:00 1 1748 2 2009 806
12 2009-11-07 00:00:00 1 2458 3 2009 710
13 2010-02-24 16:00:00 1 3181 4 2009 723
14 2010-05-07 16:00:00 1 677 1 2010 NaN
15 2010-08-06 08:00:00 1 1392 2 2010 715
16 2010-11-08 16:00:00 1 1760 3 2010 368
17 2011-02-24 08:00:00 1 961 4 2010 -799
18 2011-05-06 16:00:00 1 310 1 2011 NaN
19 2012-08-16 16:00:00 2 319 1 2011 NaN
20 2011-08-09 08:00:00 1 465 2 2011 155
21 2012-08-16 16:00:00 2 443 2 2011 124
22 2011-11-09 00:00:00 1 394 3 2011 -71
23 2012-11-06 16:00:00 2 411 3 2011 -32
24 2012-03-06 00:00:00 1 -5725 4 2011 -6119
25 2013-03-05 00:00:00 2 -5754 4 2011 -6165
26 2012-05-05 16:00:00 1 42 1 2012 NaN
27 2012-08-16 16:00:00 2 10 1 2012 NaN
28 2012-08-02 16:00:00 1 -294 2 2012 -336
29 2012-11-06 16:00:00 1 -675 3 2012 -381
30 2013-03-05 00:00:00 1 -219 4 2012 456
Clearly there's a little bit of work still to do to tidy up (i.e. drop the NaN
values if you want) but that's left as an exercise!
显然,还有一些工作要做,以便整理(即如果你想要的话,可以降低NaN值),但这仍然是一个练习!