I have a pandas DataFrame with years as the index, one column with a stock ID and a second column with returns. The DataFrame has ~200k rows. I want to add 3 extra columns holding the cumulative returns of each stock over the next 5, 10 and 20 years respectively. To do this, I group by the ID column and apply a function to the grouped object, which I show on a simple example below. I knew this was going to take some time, but the code has now been running for 23 hours and still hasn't finished.
I have two questions:
- Why exactly is Python taking so long to execute this code? Where is the bottleneck?
- Any ideas on how I can change the code to make it faster?
Here is my code, applied to a simpler example.
In [1]: import pandas as pd
In [2]: simple_df = pd.DataFrame([[1,1,1,2,2],[0.1,0.05,0.15,0.3,0.2]], columns=[2010,2011,2012,2011,2012], index=['ID','Return']).T
In [3]: simple_df
Out[3]:
       ID  Return
2010  1.0    0.10
2011  1.0    0.05
2012  1.0    0.15
2011  2.0    0.30
2012  2.0    0.20
In [4]: grouped = simple_df.groupby('ID', sort=False)
In [5]: create_df = lambda x: pd.DataFrame({i: x.Return.shift(-i) for i in range(0,3)})
In [6]: df_1 = grouped.apply(create_df)
In [7]: df_1
Out[7]:
         0     1     2
2010  0.10  0.05  0.15
2011  0.05  0.15   NaN
2012  0.15   NaN   NaN
2011  0.30  0.20   NaN
2012  0.20   NaN   NaN
In [8]: df_2 =(df_1+1).cumprod(axis=1)-1
In [9]: df_2
Out[9]:
         0       1        2
2010  0.10  0.1550  0.32825
2011  0.05  0.2075      NaN
2012  0.15     NaN      NaN
2011  0.30  0.5600      NaN
2012  0.20     NaN      NaN
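(Each row now compounds forward from its own year: for 2010, (1 + 0.10)(1 + 0.05)(1 + 0.15) - 1 = 0.32825, i.e. the 3-year cumulative return.)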
In [10]: simple_df['Return_3y'] = df_2.iloc[:,2]
In [11]: simple_df
Out[11]:
       ID  Return  Return_3y
2010  1.0    0.10    0.32825
2011  1.0    0.05        NaN
2012  1.0    0.15        NaN
2011  2.0    0.30        NaN
2012  2.0    0.20        NaN
1 Answer
Instead of apply, use DataFrameGroupBy.shift with concat. The bottleneck is apply itself: it calls the Python-level lambda once per group and builds a brand-new DataFrame for each one, so with many distinct IDs the per-group overhead swamps the actual computation. A vectorized groupby shift handles all groups in one pass:
import numpy as np
import pandas as pd

np.random.seed(234)
N = 10000
idx = np.random.randint(1990, 2020, size=N)
simple_df = pd.DataFrame({'ID': np.random.randint(1000, size=N),
                          'Return': np.random.rand(N)}, index=idx).sort_values('ID')
print(simple_df)
In [147]: %%timeit
...: grouped = simple_df.groupby('ID', sort=False)
...: create_df = lambda x: pd.DataFrame({i: x.Return.shift(-i) for i in range(0,3)})
...: df_1 = grouped.apply(create_df)
...: df_2 =(df_1+1).cumprod(axis=1)-1
...:
1.01 s ± 6.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [148]: %%timeit
...: g = simple_df.groupby('ID', sort=False)
...: df2 = pd.concat([g['Return'].shift(-i) for i in range(3)], axis=1, keys=range(3))
...: df2 =(df2+1).cumprod(axis=1)-1
...:
3.91 ms ± 53.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
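Applied to the actual problem in the question (cumulative returns over the next 5, 10 and 20 years), the same pattern just needs a longer shift range. A minimal sketch, assuming the real DataFrame has the same year-index/ID/Return layout as the example; the synthetic data and the column names Return_5y/Return_10y/Return_20y are made up for illustration:

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real ~200k-row DataFrame:
# years as index, a stock ID column and a Return column.
# (Real data should be sorted by year within each ID before shifting.)
np.random.seed(0)
N = 200000
df = pd.DataFrame({'ID': np.random.randint(1000, size=N),
                   'Return': np.random.rand(N) * 0.1},
                  index=np.random.randint(1980, 2020, size=N)).sort_values('ID')

g = df.groupby('ID', sort=False)

# One column per forward offset 0..19: column i holds each stock's
# return i years ahead (NaN where the series runs out).
shifted = pd.concat([g['Return'].shift(-i) for i in range(20)],
                    axis=1, keys=range(20))

# Compound across the columns: column k becomes the cumulative
# return over the next k+1 years, (1+r_t)...(1+r_{t+k}) - 1.
cum = (shifted + 1).cumprod(axis=1) - 1

# The 5-, 10- and 20-year horizons are therefore columns 4, 9 and 19.
# .to_numpy() assigns by position, sidestepping alignment on the
# non-unique year index.
for horizon in (5, 10, 20):
    df[f'Return_{horizon}y'] = cum[horizon - 1].to_numpy()

With groupby.shift the whole computation is a fixed number of vectorized passes (20 shifts plus one cumprod) regardless of how many distinct IDs there are, which is why it scales to 200k rows while the apply version does not.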