Part 1
For each of the four datasets...
- Compute the mean and variance of both x and y
- Compute the correlation coefficient between x and y
- Compute the linear regression line: y = β₀ + β₁x + ϵ (hint: use statsmodels and look at the Statsmodels notebook)
The pandas library provides mean() for the mean, var() for the variance, and corr() for the correlation coefficient. For the linear regression, OLS from the statsmodels library is used, and summary() extracts the relevant statistics. (The approach follows https://nbviewer.jupyter.org/github/schmit/cme193-ipython-notebooks-lecture/blob/master/3.%20Statsmodels.ipynb.)

The code is as follows:
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

anascombe = pd.read_csv('https://raw.githubusercontent.com/schmit/cme193-ipython-notebooks-lecture/master/data/anscombe.csv')

print("Means:")
print(anascombe.groupby(['dataset']).mean())
print("\nVariances:")
print(anascombe.groupby(['dataset']).var())
print("\nCorrelation coefficients:")
print(anascombe.groupby(['dataset']).corr())

print("\nLinear regression:")
for i in range(4):
    # Each dataset occupies 11 consecutive rows, so slice rows
    # i*11 to (i+1)*11 to get exactly one dataset per iteration.
    data = anascombe[i * 11:(i + 1) * 11]
    X = sm.add_constant(np.array(data.x))
    Y = np.array(data.y)
    res = sm.OLS(Y, X).fit()
    print(res.summary())
```
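As a minimal alternative sketch (assuming the same anascombe DataFrame), the formula interface ols mentioned above can be combined with groupby, which avoids slicing by row index; here only the fitted intercept and slope are printed for each dataset:

```python
import pandas as pd
import statsmodels.formula.api as smf

anascombe = pd.read_csv('https://raw.githubusercontent.com/schmit/cme193-ipython-notebooks-lecture/master/data/anscombe.csv')

# Fit y ~ x separately within each dataset and print the
# intercept and slope of each fit.
for name, group in anascombe.groupby('dataset'):
    res = smf.ols('y ~ x', data=group).fit()
    print(name, res.params.values)  # [intercept, slope]
```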
Results:
```
Means:
           x         y
dataset
I        9.0  7.500909
II       9.0  7.500909
III      9.0  7.500000
IV       9.0  7.500909

Variances:
            x         y
dataset
I       11.0  4.127269
II      11.0  4.127629
III     11.0  4.122620
IV      11.0  4.123249

Correlation coefficients:
                  x         y
dataset
I       x  1.000000  0.816421
        y  0.816421  1.000000
II      x  1.000000  0.816237
        y  0.816237  1.000000
III     x  1.000000  0.816287
        y  0.816287  1.000000
IV      x  1.000000  0.816521
        y  0.816521  1.000000

Linear regression:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.667
Model:                            OLS   Adj. R-squared:                  0.629
Method:                 Least Squares   F-statistic:                     17.99
Date:                Sun, 10 Jun 2018   Prob (F-statistic):            0.00217
Time:                        22:36:49   Log-Likelihood:                -16.841
No. Observations:                  11   AIC:                             37.68
Df Residuals:                       9   BIC:                             38.48
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.0001      1.125      2.667      0.026       0.456       5.544
x1             0.5001      0.118      4.241      0.002       0.233       0.767
==============================================================================
Omnibus:                        0.082   Durbin-Watson:                   3.212
Prob(Omnibus):                  0.960   Jarque-Bera (JB):                0.289
Skew:                          -0.122   Prob(JB):                        0.865
Kurtosis:                       2.244   Cond. No.                         29.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```

Only the summary for dataset I is shown above; the other three are essentially identical. Each dataset gives an intercept of about 3.00, a slope of about 0.50, and R² ≈ 0.67: nearly identical summary statistics despite very different point patterns, which is exactly the point of Anscombe's quartet.
Part 2
Using Seaborn, visualize all four datasets.
hint: use sns.FacetGrid combined with plt.scatter
I introduced seaborn in an earlier exercise. The library is a higher-level wrapper around matplotlib: its functions are simpler to use than matplotlib's, and the resulting plots look better.
The code is as follows:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# One scatter panel per dataset.
pic = sns.FacetGrid(anascombe, col='dataset')
pic = pic.map(plt.scatter, 'x', 'y')
plt.show()
```
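As an optional extension (a sketch beyond what the hint asks for, assuming the anascombe DataFrame from Part 1 is still in scope), sns.lmplot draws the same per-dataset scatter panels with the fitted regression line overlaid, which makes it visible that all four fits from Part 1 are essentially the same line:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plus fitted OLS line, one panel per dataset.
sns.lmplot(x='x', y='y', col='dataset', data=anascombe)
plt.show()
```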