How to iterate over the columns of a pandas DataFrame to run regressions

Time: 2023-02-11 21:40:41

I'm sure this is simple, but as a complete newcomer to Python, I'm having trouble figuring out how to iterate over the columns of a pandas DataFrame and run a regression with each.

Here's what I'm doing:

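import pandas_datareader.data as web  # assumption: the original import was not shown
from pandas import DataFrame

all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')

# dict.items() works on both Python 2 and 3; dict.iteritems() is Python 2 only
prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.items()})
returns = prices.pct_change()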

I know I can run a regression like this:

regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()

but suppose I want to do this for each column in the dataframe. In particular, I want to regress FIUIX on FSTMX, and then FSAIX on FSTMX, and then FSAVX on FSTMX. After each regression I want to store the residuals.

I've tried various versions of the following, but I must be getting the syntax wrong:

resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k],returns.FSTMX).fit()
    resids[k] = reg.resid

I think the problem is I don't know how to refer to the returns column by key, so returns[k] is probably wrong.

Any guidance on the best way to do this would be much appreciated. Perhaps there's a common pandas approach I'm missing.

8 Answers

#1


168  

for column in df:
    print(df[column])

#2


30  

You can use items() (named iteritems() in older pandas versions; iteritems() was removed in pandas 2.0):

for name, values in df.items():
    print('{name}: {value}'.format(name=name, value=values.iloc[0]))

#3


14  

You can index DataFrame columns by position using iloc (older pandas versions used ix for this; ix was deprecated in 0.20 and removed in 1.0).

df1.iloc[:, 1]

This returns the column at position 1. Positions are zero-based, so this is the second column; position 0 is the first.

df1.iloc[0]

This returns the first row.

df1.iloc[0, 1]

This is the value at the intersection of row 0 and column 1, and so on. So you can enumerate() returns.keys() and use the number to index the DataFrame.
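
The enumerate() idea can be sketched as follows. The small returns frame here is hypothetical stand-in data, and first_values is a name introduced only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the question's `returns` frame
returns = pd.DataFrame({
    "FIUIX": [0.01, 0.02, -0.01],
    "FSAIX": [0.03, -0.02, 0.00],
    "FSTMX": [0.02, 0.01, -0.02],
})

first_values = {}
for position, name in enumerate(returns.keys()):
    # iloc selects by zero-based position; this is the same Series as returns[name]
    column = returns.iloc[:, position]
    first_values[name] = column.iloc[0]
    print(position, name, column.iloc[0])
```

Selecting by position and selecting by name give the same Series here, so the positional form is mainly useful when the column names are unknown or duplicated.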

#4


9  

This answer covers iterating over selected columns as well as over all columns in a DataFrame.

df.columns returns an Index containing the names of all columns in the DataFrame. That adds little when you want every column, since iterating over the DataFrame itself already yields the names, but it comes in handy when you want to iterate over only the columns you choose.

Because the Index supports ordinary Python slicing, we can slice df.columns as needed. For example, to iterate over all columns but the first one, we can do:

for column in df.columns[1:]:
    print(column)

Similarly, to iterate over all the columns in reverse order, we can do:

for column in df.columns[::-1]:
    print(column)

This technique lets us iterate over the columns in many different ways. Also remember that you can easily get the positional index of every column using:

for ind, column in enumerate(df.columns):
    print(ind, column)
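
When the column to skip is known by name rather than by position, Index.drop is an alternative to slicing. The sketch below assumes a small made-up frame that reuses the question's ticker names; others and sums are illustrative names.

```python
import pandas as pd

# Made-up frame reusing the question's ticker names
df = pd.DataFrame({
    "FIUIX": [1.0, 2.0],
    "FSAIX": [3.0, 4.0],
    "FSAVX": [5.0, 6.0],
    "FSTMX": [7.0, 8.0],
})

# Index.drop returns a new Index without the given label; df itself is unchanged
others = df.columns.drop("FSTMX")

sums = {}
for column in others:
    sums[column] = df[column].sum()

print(list(others))
print(sums)
```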

#5


4  

A workaround is to transpose the DataFrame and iterate over the rows.

for column_name, column in df.transpose().iterrows():
    print(column_name)
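
One caveat with the transpose approach: when the columns have different dtypes, transposing forces them into a common dtype. The sketch below uses a made-up two-column frame to compare the dtypes seen through transpose().iterrows() with those seen through items().

```python
import pandas as pd

# Made-up frame with one numeric and one string column
df = pd.DataFrame({"price": [1.5, 2.5], "ticker": ["FIUIX", "FSAIX"]})

# After transposing, every column mixes floats and strings, so pandas falls back to object dtype
via_transpose = {name: col.dtype for name, col in df.transpose().iterrows()}

# items() yields each column with its original dtype
via_items = {name: col.dtype for name, col in df.items()}

print(via_transpose)
print(via_items)
```

For a frame that is entirely numeric, such as the returns in the question, both approaches yield the same values.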

#6


1  

Using a list comprehension, you can get all the column names (the header):

[column for column in df]

#7


1  

I'm a bit late, but here's how I did this. The steps:

  1. Create a list of all columns
  2. Use itertools to take x combinations
  3. Append each result's R-squared value to a result DataFrame, along with the list of excluded columns
  4. Sort the result DataFrame in descending order of R-squared to see which is the best fit

This is the code I used on a DataFrame called aft_tmt. Feel free to adapt it to your use case.

import itertools

import pandas as pd
import statsmodels.formula.api as smf

# Print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# Get the column names of the DF and remove the columns I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
for col in ["sc97", "sc", "grc", "grc97"]:
    itercols.remove(col)
print(itercols)
print(len(itercols))

# Collect one row per fitted model; DataFrame.append was removed in pandas 2.0
rows = []

# Change 9 to the number of columns you want to combine from N columns.
# Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    f = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt).fit()
    excluded = "+".join(y for y in itercols if y not in x)
    rows.append([f.rsquared, lmstr, excluded])

regression_res = pd.DataFrame(rows, columns=["Rsq", "predictors", "excluded"])
regression_res = regression_res.sort_values(by="Rsq", ascending=False)
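
Since aft_tmt is not available here, the following is a self-contained sketch of the same idea on synthetic data, including the outer loop over subset sizes that the comment above suggests. The frame data and its columns a, b, c are made up for illustration. Plain R-squared never decreases when predictors are added, so this sketch also records adjusted R-squared and sorts by it when comparing subsets of different sizes.

```python
import itertools

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for aft_tmt (assumption: column names a, b, c are made up)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
data["sc"] = 2.0 * data["a"] - 1.0 * data["b"] + rng.normal(scale=0.1, size=200)

predictors = ["a", "b", "c"]
rows = []
# Outer loop over subset sizes 1..N instead of a single fixed size
for size in range(1, len(predictors) + 1):
    for combo in itertools.combinations(predictors, size):
        formula = "sc ~ " + " + ".join(combo)
        fit = smf.ols(formula=formula, data=data).fit()
        rows.append({
            "Rsq": fit.rsquared,
            "Rsq_adj": fit.rsquared_adj,
            "predictors": "+".join(combo),
        })

# Sort by adjusted R-squared so larger models are not favored automatically
results = pd.DataFrame(rows).sort_values(by="Rsq_adj", ascending=False)
print(results.head())
```

The number of combinations grows quickly with the number of candidate columns, so an exhaustive search like this is only practical for small predictor sets.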

#8


1  

Based on the accepted answer, if an index corresponding to each column is also desired:

for i, column in enumerate(df):
    print(i, df[column])

Each df[column] above is a Series, which can easily be converted to a NumPy ndarray:

import numpy as np

for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
