Running multiple OLS regressions in Python

Time: 2021-10-20 15:49:01

I need to run a lot of OLS regressions (~1,600). I have collected 60 data points for each of the ~1,600 observations.

I am using the Fama-French five-factor model, where the 60 data points for each observation are matched with the dates in the sample. For example, I have the five factor parameters from a start date ['2010-1-1'] to an end date ['2015-1-1'] in a dataframe.

I need to run these parameters against stock returns for a given stock. The five factor parameters are collected in a dataframe with around 96,000 rows (1600*60) and five columns (one per factor). So I need to select the first 60 observations, regress a set of returns on them with OLS, store the estimated coefficients, and then select the next 60 observations of both the factor parameters and the stock returns.

I have tried using slicing like:

start = 0
stop = 59

empty_list = []

for i in my_data:
    coef = my_date[i][start:stop]
    # run regression with the coef slice and store them in a dataframe
    start += 60
    stop += 60

However, I can't seem to get this to work. Any suggestions for how to solve this?

1 Solution

#1


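Use groupby with np.arange(len(df)) // 60 as the grouping key. Integer-dividing the row positions by 60 gives each consecutive block of 60 rows its own group label, so one regression is fitted per block.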
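
To make the grouping key concrete, here is a minimal sketch of the labels on their own. The block size of 3 and the length of 7 are toy values chosen only for illustration; the answer uses 60 and len(df).

```python
import numpy as np

block_size = 3  # illustrative; the answer uses 60
n_rows = 7      # illustrative toy length; the answer uses len(df)

# Row positions 0..6, integer-divided by the block size:
# every run of block_size consecutive rows shares one label
labels = np.arange(n_rows) // block_size
print(labels)  # [0 0 0 1 1 1 2]
```

Note that a trailing partial block (the single row labelled 2 here) still forms its own group, so the row count should be a multiple of the block size if every regression is meant to use the same number of points.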

import numpy as np
import pandas as pd
from statsmodels.api import formula

# Random stand-in data: 1600 blocks of 60 rows, five factors plus returns
df = pd.DataFrame(
    np.random.randn(96000, 6),
    columns=['f1', 'f2', 'f3', 'f4', 'f5', 'r']
)

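# Formula: returns regressed on the five factors (an intercept is added automatically)
f = 'r ~ f1 + f2 + f3 + f4 + f5'

def regress(df, f):
    # Fit OLS on one 60-row block and return its coefficients as a Series
    return formula.ols(f, df).fit().params

# Rows 0-59 get label 0, rows 60-119 get label 1, and so on; fit once per label
results = df.groupby(np.arange(len(df)) // 60).apply(regress, f=f)

results.head()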

   Intercept        f1        f2        f3        f4        f5
0  -0.108910  0.205059  0.006981  0.088200  0.064486 -0.003423
1   0.155242 -0.057223 -0.097207 -0.098114  0.163142 -0.029543
2   0.014305 -0.123687 -0.120924  0.017383 -0.168981  0.090547
3  -0.254084 -0.063028 -0.092831  0.137913  0.185524 -0.088452
4   0.025795 -0.126270  0.043018 -0.064970 -0.034431  0.081162
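
For comparison, the explicit slicing loop from the question can probably be made to work as well. The sketch below is one possible version under stated assumptions: the data is synthetic (5 blocks instead of 1,600), the column names follow the answer, and np.linalg.lstsq is swapped in for statsmodels so that the example needs only NumPy and pandas. It returns least-squares point estimates only, with no standard errors or other fit statistics.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 5 blocks of 60 rows (assumption)
rng = np.random.default_rng(0)
window = 60
n_blocks = 5
df = pd.DataFrame(
    rng.standard_normal((window * n_blocks, 6)),
    columns=['f1', 'f2', 'f3', 'f4', 'f5', 'r'],
)
factors = ['f1', 'f2', 'f3', 'f4', 'f5']

rows = []
# Step through the frame by position, 60 rows at a time.
# The slice end is exclusive, so start:start + window holds exactly 60 rows
# (a 0:59 slice would hold only 59).
for start in range(0, len(df), window):
    block = df.iloc[start:start + window]
    # Design matrix with an explicit intercept column
    X = np.column_stack([np.ones(len(block)), block[factors].to_numpy()])
    y = block['r'].to_numpy()
    # Least-squares solution of X @ coef = y
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rows.append(coef)

# One row of coefficients per 60-row block
results = pd.DataFrame(rows, columns=['Intercept'] + factors)
print(results.shape)  # (5, 6)
```

Two details differ from the loop in the question: the loop runs over row offsets rather than over the DataFrame itself (iterating a DataFrame yields its column labels, not its rows), and the slice covers 60 rows rather than 59.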
