It is actually a disaggregation, because I have a dataset structured this way:
id type first_year last_year
A t1 2009 2014
A t1 2010 2015
B t1 2007 2009
B t2 2008 2011
But I need to aggregate by id/year, where the start/end ranges overlap.
The data is in a pandas DataFrame like so:
import pandas as pd

test_frame = pd.DataFrame([['A','t1',2009,2014],
                           ['A','t1',2010,2015],
                           ['B','t1',2007,2009],
                           ['B','t2',2008,2011]],
                          columns = ['id','type','first_year','last_year'])
I'm hoping to get data returned in a few different ways:
id year count
A 2009 1
A 2010 2
A 2011 2
...
B 2007 1
B 2008 2
B 2009 1
And perhaps like this:
id year type count
A 2009 t1 1
A 2010 t1 2
A 2011 t1 2
...
B 2007 t1 1
B 2008 t1 1
B 2008 t2 1
B 2009 t2 1
B 2010 t2 1
This basically works for the first output format, but as you can imagine, using itertuples is pretty slow for a large dataset. Is there a more pandas-ic way?
out_frame = pd.DataFrame(columns = ['id','type','year'])
for rows in test_frame.itertuples():
    for year in range(int(rows[3]),int(rows[4])):
        d2 = pd.DataFrame({'id': [rows[1]],'year': [year]},columns = ['id','year'])
        out_frame = out_frame.append(d2)
output1 = out_frame.groupby(['id','year'])['year'].count()
output1
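For what it's worth, a vectorized sketch of the same expansion (keeping the half-open range(first_year, last_year) from the loop above, and assuming pandas >= 0.25 for DataFrame.explode) could look like this:
# Sketch only: build a list of years per row, then explode to one row per year.
expanded = test_frame.assign(
    year=[list(range(f, l)) for f, l in
          zip(test_frame['first_year'], test_frame['last_year'])]
).explode('year')
output1 = expanded.groupby(['id', 'year'])['year'].count()
output2 = expanded.groupby(['id', 'year', 'type'])['year'].count()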
2 Answers
#1
You can use stack and resample:
import pandas as pd
test_frame = pd.DataFrame([['A','t1',2009,2014],
                           ['A','t1',2010,2015],
                           ['B','t1',2007,2009],
                           ['B','t2',2008,2011]],
                          columns = ['id','type','first_year','last_year'])
print test_frame
id type first_year last_year
0 A t1 2009 2014
1 A t1 2010 2015
2 B t1 2007 2009
3 B t2 2008 2011
#stack df, drop and rename column year
test_frame = test_frame.set_index(['id','type'], append=True).stack().reset_index(level=[1,2,3])
test_frame = test_frame.drop('level_3', axis=1).rename(columns={0:'year'})
#convert year to datetime
test_frame['year'] = pd.to_datetime(test_frame['year'], format="%Y")
print test_frame
id type year
0 A t1 2009-01-01
0 A t1 2014-01-01
1 A t1 2010-01-01
1 A t1 2015-01-01
2 B t1 2007-01-01
2 B t1 2009-01-01
3 B t2 2008-01-01
3 B t2 2011-01-01
#resample and fill missing data
out_frame = test_frame.groupby(test_frame.index).apply(lambda x: x.set_index('year').resample('1AS', how='first',fill_method='ffill')).reset_index(level=1)
print out_frame
year id type
0 2009-01-01 A t1
0 2010-01-01 A t1
0 2011-01-01 A t1
0 2012-01-01 A t1
0 2013-01-01 A t1
0 2014-01-01 A t1
1 2010-01-01 A t1
1 2011-01-01 A t1
1 2012-01-01 A t1
1 2013-01-01 A t1
1 2014-01-01 A t1
1 2015-01-01 A t1
2 2007-01-01 B t1
2 2008-01-01 B t1
2 2009-01-01 B t1
3 2008-01-01 B t2
3 2009-01-01 B t2
3 2010-01-01 B t2
3 2011-01-01 B t2
#convert to year
out_frame['year'] = out_frame['year'].dt.year
output1 = out_frame.groupby(['id','year', 'type'])['year'].count().reset_index(name='count')
print output1
id year type count
0 A 2009 t1 1
1 A 2010 t1 2
2 A 2011 t1 2
3 A 2012 t1 2
4 A 2013 t1 2
5 A 2014 t1 2
6 A 2015 t1 1
7 B 2007 t1 1
8 B 2008 t1 1
9 B 2008 t2 1
10 B 2009 t1 1
11 B 2009 t2 1
12 B 2010 t2 1
13 B 2011 t2 1
output2 = out_frame.groupby(['id','year'])['year'].count().reset_index(name='count')
print output2
id year count
0 A 2009 1
1 A 2010 2
2 A 2011 2
3 A 2012 2
4 A 2013 2
5 A 2014 2
6 A 2015 1
7 B 2007 1
8 B 2008 2
9 B 2009 2
10 B 2010 1
11 B 2011 1
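Note that the how= and fill_method= arguments to resample were removed in later pandas versions. A rough sketch of the same stack-and-resample idea on current pandas, starting again from the original test_frame (the 'YS' alias and the ffill chaining are assumptions, not part of the original answer):
import pandas as pd
# Original frame from the top of this answer.
test_frame = pd.DataFrame([['A','t1',2009,2014],
                           ['A','t1',2010,2015],
                           ['B','t1',2007,2009],
                           ['B','t2',2008,2011]],
                          columns = ['id','type','first_year','last_year'])
# Same stacking step as above.
stacked = (test_frame.set_index(['id','type'], append=True)
                     .stack()
                     .reset_index(level=[1,2,3])
                     .drop('level_3', axis=1)
                     .rename(columns={0:'year'}))
stacked['year'] = pd.to_datetime(stacked['year'], format='%Y')
# resample(...).ffill() replaces how='first', fill_method='ffill';
# use 'AS' instead of 'YS' on pandas older than 2.2.
out_frame = (stacked.groupby(level=0)
                    .apply(lambda x: x.set_index('year').resample('YS').ffill())
                    .reset_index(level=1))
out_frame['year'] = out_frame['year'].dt.year
output1 = out_frame.groupby(['id','year','type'])['year'].count().reset_index(name='count')
output2 = out_frame.groupby(['id','year'])['year'].count().reset_index(name='count')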
#2
My answer is based on expanding the year columns into a PeriodIndex, then stacking (rotating) that period range so that it creates a multi-level index.
def expand_period(row):
    "Creates a series from first to last year and appends it to the row"
    p = pd.period_range(row["first_year"], row["last_year"], freq="A")
    return row.append(p.to_series()).drop(["first_year","last_year"])
#original data frame
tf = pd.DataFrame([['A','t1',2009,2014],
                   ['A','t1',2010,2015],
                   ['B','t1',2007,2009],
                   ['B','t2',2008,2011]],
                  columns = ['id','type','first_year','last_year'])
#Drop the year columns but replace them with expanded series
tfexpanded = tf.apply(expand_period, 1).set_index(["id","type"])
#Rotate the axis so that you have a 3 level index
tfindexed = tfexpanded.stack()
#This is not necessary, but it improves readability when inspecting the output
tfindexed[:] = 1
#Group-By as you did before
answer = tfindexed.groupby(level=[0,1,2]).count()
Of course, the whole thing can be simplified using lambdas and chained methods.
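For example, a chained sketch on current pandas might look roughly like this (Series.append, used in expand_period above, was removed in pandas 2.0, so the period expansion is built directly inside the lambda; the 'Y' frequency alias is an assumption, use 'A' on older versions):
import pandas as pd
tf = pd.DataFrame([['A','t1',2009,2014],
                   ['A','t1',2010,2015],
                   ['B','t1',2007,2009],
                   ['B','t2',2008,2011]],
                  columns = ['id','type','first_year','last_year'])
# Sketch only: expand each row into an annual period range, stack into a
# (id, type, year) multi-level index, then count as before.
answer = (
    tf.set_index(['id','type'])
      .apply(lambda row: pd.period_range(row['first_year'],
                                         row['last_year'],
                                         freq='Y').to_series(),
             axis=1)
      .stack()
      .groupby(level=[0,1,2])
      .count()
)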