Jupyter 设置

    In [1]: 
  

%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

使用pandas 和 pyplot

    In [2]: 
  

import pandas as pd, numpy as np
import  as plt
#Jupyter 设置
%matplotlib inline

读取数据及简单的预处理

载入数据集

    In [3]: 
  

df_fx_data = pd.read_csv('')
df_fx_data.head()

      Out[3]: 
    

	Date	Value
0	2017-06-09	0.8941
1	2017-06-08	0.8912
2	2017-06-07	0.8878
3	2017-06-06	0.8875
4	2017-06-05	0.8888

查看日期的数据类型：str

    In [4]: 
  

type(df_fx_data['Date'][0])

      Out[4]: 
    

str

    In [5]: 
  

df_fx_data['Date'] = pd.to_datetime(df_fx_data['Date'])
df_fx_data.head()

      Out[5]: 
    

	Date	Value
0	2017-06-09	0.8941
1	2017-06-08	0.8912
2	2017-06-07	0.8878
3	2017-06-06	0.8875
4	2017-06-05	0.8888

现在日期类型为： pandas._libs.

    In [6]: 
  

type(df_fx_data['Date'][0])

      Out[6]: 
    

pandas._libs.

用变量Date作为索引

    In [7]: 
  

indexed_df = df_fx_data.set_index('Date')

时间序列的变量即为Value

    In [8]: 
  

ts = indexed_df['Value']
ts.head(5)

      Out[8]: 
    

Date
2017-06-09    0.8941
2017-06-08    0.8912
2017-06-07    0.8878
2017-06-06    0.8875
2017-06-05    0.8888
Name: Value, dtype: float64

绘制折线图

    In [9]: 
  

plt.plot(ts)

      Out[9]: 
    

[<.Line2D at 0x23e7a80d470>]

重取样时间

以周为最小单位重取样时间，重取样方式为每周的均值

    In [10]: 
  

ts_week = ts.resample('W').mean()
ts_week.plot()

      Out[10]: 
    

<._subplots.AxesSubplot at 0x23e7a8472e8>

resample的规则：
年月日周：A M D W
时分秒：H T S
每3年：3A
每5小时：5H

以每两年为最小单位重取样时间，重取样方式为每年的加总

    In [11]: 
  

ts.resample('2A').sum().plot()

      Out[11]: 
    

<._subplots.AxesSubplot at 0x23e7a82b6a0>

自相关与偏自相关函数

自相关: lag=n时为 Yt 和 Yt−n 的相关系数
偏自相关: lag=n时在不考虑 Yt 和 Yt−n 与之间其他$Y_{t-m}\quad(m<n)$ 相关性的情况下="" $y_t$="" 和="" $y_{t-n}$的相关性

    In [12]: 
  

from  import plot_pacf, plot_acf
from  import MaxNLocator
def acf_pacf(series, lags=40, title=None, figsize=(10,6)):
    #ACF and PACF plots
    # 求自相关和偏自相关函数
    # series 输入的时间序列
    # lags 自相关和偏自相关函数的滞后取值范围
    fig = plt.figure(figsize=figsize)
    ax1 = fig.add_subplot(211)
    ax1.set_xlabel('lags')
    ax1.set_ylabel('Autocorrelation')
    # x坐标为整数
    ax1.xaxis.set_major_locator(MaxNLocator(integer=True))
    plot_acf(series.dropna(), lags=lags, ax=ax1, title='ACF of %s'%title)
    ax2 = fig.add_subplot(212)
    ax2.set_xlabel('lags')
    ax2.set_ylabel('Partial Autocorrelation')
    ax2.xaxis.set_major_locator(MaxNLocator(integer=True))
    plot_pacf(series.dropna(), lags=lags, ax=ax2, title='PACF of %s'%title)
    fig.tight_layout()

C:\Users\iota\Anaconda3\lib\site-packages\statsmodels\compat\:56: FutureWarning: The  module is deprecated and will be removed in a future version. Please use the  module instead.
  from  import datetools

ts_week在lag数较高的区间自相关强并显著, 但lags=2以上的自相关很可能仅是由的lag=1的自相关累积下来的. PACF的lag=1 可以印证这一点.

    In [13]: 
  

acf_pacf(ts_week, lags=20, title='Weekly Euro Dollar Exchange Rate')

白噪声检验

一个白噪声序列
浅色部分为95%置信度区间

    In [14]: 
  

from random import gauss
from random import seed
from pandas import Series
from  import autocorrelation_plot
# seed random number generator
seed(1)
# create white noise series
whitenoise = Series([gauss(0.0, 1.0) for i in range(1000)])
acf_pacf(whitenoise, lags=40, title='White Noise Series')

汇率时间序列, 显然不是一个白噪声序列

    In [15]: 
  

acf_pacf(ts_week, lags=40, title='Weekly Euro Dollar Exchange Rate')

稳定时间序列

检查稳定性(Stationarity)

稳定性(Stationarity): 在稳定的时间序列中变量的统计属性，诸如均值(mean)，方差(variance)，自相关(autocorrelation)，自协方差(autocovariance) 等不随时间改变。

adfuller(timeseries, autolag='AIC') autolag: 按某个度量自动选择测试使用的lag数量以使该度量最小化
AIC: Akaike Information Criterion, 估算两个回归模型之间的KL散度/相对熵; KL散度/相对熵用于衡量近似分布与真实分布的接近程度
p-value: 显著性概率

    In [16]: 
  

from  import adfuller

def test_stationarity(timeseries, window=10):
    # 求移动平均和移动标准差
    # window为选取的时间窗口个数
    rolmean = timeseries.rolling(window=window, center=False).mean()
    rolstd = timeseries.rolling(window=window, center=False).std()

    # 将以周重取样后的时间序列、移动平均和移动标准差制成折线图
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    # 用Augmented Dickey-Fuller检验测试时间序列稳定性:
    print('Results of Augmented Dickey-Fuller Test:')
    # 使用减小AIC的办法估算ADF测试所需的滞后数
    dftest = adfuller(timeseries, autolag='AIC')
    # 将ADF测试结果、显著性概率、所用的滞后数和所用的观测数打印出来
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', 
                                             'Num Lags Used', 
                                             'Num Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

    In [17]: 
  

test_stationarity(ts_week)

Results of Augmented Dickey-Fuller Test:
Test Statistic             -2.058902
p-value                     0.261358
Num Lags Used               2.000000
Num Observations Used    2212.000000
Critical Value (1%)        -3.433310
Critical Value (10%)       -2.567466
Critical Value (5%)        -2.862848
dtype: float64

-2.217962 > Critical Value (5%), p-value > 0.05

移除趋势和季节性

seasonaldecompose: 按照滑动均值将维经过指数转化的时间序列分为趋势(trend), 季节性(seasonality)和残差(residual)三部分
有明显的趋势但季节性并不明显(数值变化范围小）
残差不是白噪音
对数转换: (series)
使用第一差分去掉趋势: $Y' = Y{t} - Y_{t-1}$
比较发现这里log再差分的平稳性较仅差分要好

    In [18]: 
  

from  import seasonal_decompose
from  import acorr_ljungbox
def decompose_plot(series, title=''):
    decomposition = seasonal_decompose(series)
    trend = decomposition.trend
    seasonal = decomposition.seasonal
    residual = decomposition.resid
    fig = decomposition.plot()
    fig.set_size_inches(10,6)
    fig.suptitle(title)
    fig.tight_layout()
    fig2 = acf_pacf(residual, title='Residuals', figsize=(10,4))

# 每周汇率
decompose_plot(ts_week, title='ts_week')

    In [19]: 
  

# 使用第一差分去掉趋势
ts_week_diff = ts_week - ts_week.shift(1)
test_stationarity(ts_week_diff.dropna())
decompose_plot(ts_week_diff.dropna(), title='ts_week_diff')

Results of Augmented Dickey-Fuller Test:
Test Statistic            -29.988468
p-value                     0.000000
Num Lags Used               1.000000
Num Observations Used    2212.000000
Critical Value (1%)        -3.433310
Critical Value (10%)       -2.567466
Critical Value (5%)        -2.862848
dtype: float64

    In [20]: 
  

# 对数转换, 使用第一差分去掉趋势
ts_week_log = np.log(ts_week)
ts_week_log_diff = ts_week_log - ts_week_log.shift(1)
test_stationarity(ts_week_log_diff.dropna())
decompose_plot(ts_week_log_diff.dropna(), title='ts_week_log_diff')

plt.show()

Results of Augmented Dickey-Fuller Test:
Test Statistic            -36.417784
p-value                     0.000000
Num Lags Used               0.000000
Num Observations Used    2213.000000
Critical Value (1%)        -3.433308
Critical Value (10%)       -2.567466
Critical Value (5%)        -2.862847
dtype: float64

建立模型

ARIMA(p, d, q)

找到合适的差分阶数 d

刚才通过一次差分已达到平稳 d=1
也可以通过ACF和PACF来找到合适的d

    In [21]: 
  

acf_pacf(ts_week_log, title='ts_week_log')

序列在lag>1出有显著的正自相关, 需提高差分阶数
lag=1 时自相关为0或负值, 无需差分
lag=1 时自相关< -0.5, 则为过差分
因此ts_week_log 需要差分一次, ts_week_log_diff是差分一次后的ts_week_log

    In [22]: 
  

acf_pacf(ts_week_log_diff, title='ts_week_log_diff', lags=10)

序列在lag>1已无显著的正自相关, 无需差分

AR和MA项数 p, q

得到平稳的时间序列后需要决定p和q
参照上图

p: 如果PACF(lag<=k)显著且PACF(lag>k)均不显著, 考虑将p设为k
q: 如果ACF(lag<=k)显著且ACF(lag>k)均不显著, 考虑将q设为k

p: 如果差分后序列的PACF迅速衰减到0而ACF在lag较高处仍有显著的较强的相关, 和/或 ACF(lag=1) > 0, 考虑增加p
q: 如果差分后序列的ACF迅速衰减到0, 和/或 ACF(lag=1) < 0, 考虑增加q

绘出拟合后的ARIMA的拟合残差的ACF和PACF, 按上述规则调整p和/或q
下图为ARIMA(0,1,0)的拟合残差
可以考虑p=1, q=1

    In [23]: 
  

from .arima_model import ARIMA
# ARIMA的参数: order = p, d, q 
model = ARIMA(ts_week_log, order=(0, 1, 0))  

# 已拟合的模型
results_ARIMA = model.fit()  

# 拟合的数据和原始数据的残差
residuals = pd.DataFrame(results_ARIMA.resid)
acf_pacf(residuals, lags=10, title='Residuals, p=0, d=1, q=0')

这里ARIMA(1,1,0), ARIMA(0,1,1) 和 ARIMA(1,1,1)都是可用的
ARIMA模型是通过优化AIC, HQIC, BIC等得出的, 一般来说AIC较小的模型拟合度较好, 但拟合度较好不能说明预测能力好

    In [24]: 
  

a110=ARIMA(ts_week_log, order=(1, 1, 0)).fit()
a011=ARIMA(ts_week_log, order=(0, 1, 1)).fit()
a111=ARIMA(ts_week_log, order=(1, 1, 1)).fit()
acf_pacf(a110.resid, lags=10, title='Residuals, p=1, d=1, q=0, aic=%f'%a110.aic)
acf_pacf(a011.resid, lags=10, title='Residuals, p=1, d=1, q=1, aic=%f'%a011.aic)
acf_pacf(a111.resid, lags=10, title='Residuals, p=0, d=1, q=1, aic=%f'%a111.aic)

拟合ARIMA模型

    In [25]: 
  

from .arima_model import ARIMA
# ARIMA的参数: order = p, d, q 
model = ARIMA(ts_week_log, order=(1, 1, 1))  

# 已拟合的模型
results_ARIMA = model.fit()  

# ARIMA的结果
print(results_ARIMA.summary())

# 比较拟合的数据和原始数据,
# 注意fittedvalues返回的是d次差分后的序列, 因此和ts_week_log_diff比较
plt.plot(ts_week_log_diff)
plt.plot(results_ARIMA.fittedvalues, color='red')

# 拟合的数据和原始数据的残差
residuals = pd.DataFrame(results_ARIMA.resid)

# 计算拟残差平方和(RSS)
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues-ts_week_log_diff)**2))
acf_pacf(residuals, lags=10, title='Residuals')
# 残差的核密度估计
residuals.plot(kind='kde')
print('\n\nResiduals Summary:\n', residuals.describe())

plt.show()

                             ARIMA Model Results                              
==============================================================================
Dep. Variable:                   No. Observations:                 2214
Model:                 ARIMA(1, 1, 1)   Log Likelihood                6796.505
Method:                       css-mle   . of innovations              0.011
Date:                Tue, 20 Jun 2017   AIC                         -13585.010
Time:                        05:37:20   BIC                         -13562.200
Sample:                    01-12-1975   HQIC                        -13576.678
                         - 06-11-2017                                         
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const          6.925e-05      0.000      0.223      0.824      -0.001       0.001
ar.     0.1447      0.087      1.666      0.096      -0.026       0.315
ma.     0.1125      0.087      1.286      0.199      -0.059       0.284
                                    Roots                                    
=============================================================================
                 Real           Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            6.9095           +0.0000j            6.9095            0.0000
MA.1           -8.8887           +0.0000j            8.8887            0.5000
-----------------------------------------------------------------------------


Residuals Summary:
                  0
count  2214.000000
mean      0.000001
std       0.011237
min      -0.061446
25%      -0.006740
50%      -0.000251
75%       0.006910
max       0.063923

ARIMA模型会使拟合和原始数据的残差变为白噪声

    In [26]: 
  

acf_pacf(results_ARIMA.fittedvalues-ts_week_log_diff, title='Residuals')

将拟合的值逆向转换并可视化

    In [27]: 
  

predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
predictions_ARIMA_diff.head()

      Out[27]: 
    

Date
1975-01-12    0.000069
1975-01-19   -0.002406
1975-01-26    0.001011
1975-02-02   -0.004028
1975-02-09   -0.001174
Freq: W-SUN, dtype: float64

差分的逆向转换为累计和
指数的逆向转换为对数

    In [28]: 
  

# 一阶差分的逆向转换为累计和
predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()

# 对数函数的逆向转换为指数函数
predictions_ARIMA_log = pd.Series(ts_week_log.iloc [0], index=ts_week_log.index)
predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum,fill_value=0)

比较拟合的序列和原序列的均根差, 即为拟合错误

    In [29]: 
  

predictions_ARIMA = np.exp(predictions_ARIMA_log)
plt.plot(ts_week)
plt.plot(predictions_ARIMA)
plt.title('RMSE: %.4f'% np.sqrt(sum((predictions_ARIMA-ts_week)**2)/len(ts_week)))

      Out[29]: 
    

< at 0x23e7bace5c0>

预测并校验模型

这种校验方法称为 one-step out-of-sample forecast
out-of-sample 是指预测时所用样本/观测值不在训练集内

将原序列分为训练和测试集
将训练集设为历史, 此时时间为t
迭代, 迭代次数为测试集的大小:
1. 在历史上拟合ARIMA模型, 参数为先前得到的p,d,q
2. 生成t+1的预测, 并将其加入历史

比较预测值和真实值

    In [30]: 
  

from  import mean_squared_error
# 我们的ARIMA模型针对的是序列ts_week_log
# 因此使用的数据为ts_week_log
# 因此须将预测做指数转换
# 使用 n_test_obs 个点校验
n_test_obs = 50
size = int(len(ts_week_log) - n_test_obs)

# 切分数据集
train, test = ts_week_log[:size], ts_week_log[size:len(ts_week_log)]

# 历史值
history = [x for x in train]

# 预测值
predictions = list()

# 置信区间
confidence_intervals = list()

print('Predicted vs Expected Values...')
print('\n')
for t in range(len(test)):
    model = ARIMA(history, order=(1,1,1))
    model_fit = model.fit()
    output = model_fit.forecast(steps=1)
    yhat = output[0]
    predictions.append(float(yhat))
    confidence_intervals.append(output[2])
    obs = test[t]
    history.append(obs)
    
    # 须将预测做指数转换
    print('predicted=%f, expected=%f' % (np.exp(yhat), np.exp(obs)))

error = mean_squared_error(test, predictions)

print('\n')
predictions_series = pd.Series(predictions, index = test.index)
print('Test MSE: %.6f' % error)

Predicted vs Expected Values...


predicted=0.886215, expected=0.902880
predicted=0.907080, expected=0.902180
predicted=0.901661, expected=0.902660
predicted=0.902891, expected=0.908120
predicted=0.909561, expected=0.905840
predicted=0.905181, expected=0.896540
predicted=0.894294, expected=0.897600
predicted=0.898145, expected=0.886780
predicted=0.884036, expected=0.885600
predicted=0.885608, expected=0.896075
predicted=0.898827, expected=0.891100
predicted=0.889625, expected=0.891240
predicted=0.891471, expected=0.893720
predicted=0.894386, expected=0.890820
predicted=0.890071, expected=0.894120
predicted=0.895091, expected=0.904860
predicted=0.907586, expected=0.913620
predicted=0.915668, expected=0.917360
predicted=0.918185, expected=0.903520
predicted=0.899997, expected=0.913160
predicted=0.916041, expected=0.936860
predicted=0.942785, expected=0.945060
predicted=0.946632, expected=0.941900
predicted=0.941006, expected=0.936860
predicted=0.935742, expected=0.948440
predicted=0.951627, expected=0.958640
predicted=0.961011, expected=0.954200
predicted=0.952895, expected=0.951600
predicted=0.951158, expected=0.945180
predicted=0.943669, expected=0.938440
predicted=0.936958, expected=0.932520
predicted=0.931239, expected=0.928520
predicted=0.927704, expected=0.935940
predicted=0.938014, expected=0.942780
predicted=0.944393, expected=0.945760
predicted=0.946430, expected=0.945500
predicted=0.945439, expected=0.943900
predicted=0.943574, expected=0.935900
predicted=0.933967, expected=0.926700
predicted=0.924628, expected=0.927140
predicted=0.927547, expected=0.938880
predicted=0.941948, expected=0.941850
predicted=0.942353, expected=0.933150
predicted=0.930947, expected=0.918800
predicted=0.915453, expected=0.914425
predicted=0.913740, expected=0.917740
predicted=0.918738, expected=0.900400
predicted=0.895946, expected=0.891920
predicted=0.890302, expected=0.890350
predicted=0.890181, expected=0.889880


Test MSE: 0.000073

    In [31]: 
  

predictions_series = pd.Series(predictions, index = test.index)

fig, ax = plt.subplots(figsize=(10, 6))
ax.set(title='Spot Exchange Rate, Euro into USD', xlabel='Date', ylabel='Euro into USD')
ax.plot(ts_week[-n_test_obs:], 'o', label='observed')
# 须将预测做指数转换
ax.plot(np.exp(predictions_series), 'g', label='rolling one-step out-of-sample forecast')
legend = ax.legend(loc='upper left')

可以对季节性建模的 SARIMA 模型

    In [32]: 
  

%matplotlib inline
import pandas as pd
import numpy as np
import  as plt
import datetime
from  import relativedelta
import seaborn as sns
import  as sm  
from  import acf  
from  import pacf
from  import seasonal_decompose
from pandas import DatetimeIndex

Portland Oregon 月平均乘公车人数 (/100) January 1960 至 June 1969, 样本数=114

    In [33]: 
  

# 读取CSV
df = pd.read_csv('')
print(df.head(), '\nindex type:\n', type(df.index))

# 按'%Y-%m'格式转换CSV的日期, 为了方便处理, 将时间段转为timestamp
df['Month'] = pd.to_datetime(df['Month'], format='%Y-%m')

# 索引并resample为月
indexed_df = df.set_index('Month')
ts = indexed_df['riders']
ts = ts.resample('M').sum()
print(ts.head(), '\nindex type:\n', type(ts.index))

     Month  riders
0  1960-01     648
1  1960-02     646
2  1960-03     639
3  1960-04     654
4  1960-05     630 
index type:
 <class ''>
Month
1960-01-31    648
1960-02-29    646
1960-03-31    639
1960-04-30    654
1960-05-31    630
Freq: M, Name: riders, dtype: int64 
index type:
 <class ''>

可以看出数据有很明显的上升趋势和季节性(每12个月)

    In [34]: 
  

ts.plot(title='Monthly Num. of Ridership')
test_stationarity(ts)
decomposition = seasonal_decompose(ts)  
fig = plt.figure()  
fig = decomposition.plot()  
fig.set_size_inches(10, 6)
acf_pacf(decomposition.resid, title='Residuals')

Results of Augmented Dickey-Fuller Test:
Test Statistic            -1.536597
p-value                    0.515336
Num Lags Used             12.000000
Num Observations Used    101.000000
Critical Value (1%)       -3.496818
Critical Value (10%)      -2.582277
Critical Value (5%)       -2.890611
dtype: float64

< at 0x23e00b5e128>

平稳化序列

做季节差分, shift参数为12, 即12个时间窗口, 12个月

    In [35]: 
  

ts_sdiff = ts - ts.shift(12)
test_stationarity(ts_sdiff.dropna())

decomposition = seasonal_decompose(ts_sdiff.dropna(), freq=12)  
fig = plt.figure()  
fig = decomposition.plot()  
fig.set_size_inches(10, 6)
acf_pacf(decomposition.resid, title='Residuals')

Results of Augmented Dickey-Fuller Test:
Test Statistic           -2.469741
p-value                   0.123011
Num Lags Used             3.000000
Num Observations Used    98.000000
Critical Value (1%)      -3.498910
Critical Value (10%)     -2.582760
Critical Value (5%)      -2.891516
dtype: float64

< at 0x23e009d1ef0>

用差分去除季节差分后的趋势, 即季节差分的第一差分
注意第一差分后趋势的数量级
得到平稳的时间序列

    In [36]: 
  

ts_diff_of_sdiff = ts_sdiff - ts_sdiff.shift(1)
test_stationarity(ts_diff_of_sdiff.dropna())

decomposition = seasonal_decompose(ts_diff_of_sdiff.dropna(), freq=12)  
fig = plt.figure()  
fig = decomposition.plot()  
fig.set_size_inches(10, 6)
acf_pacf(decomposition.resid, title='Residuals')

Results of Augmented Dickey-Fuller Test:
Test Statistic          -9.258520e+00
p-value                  1.427874e-15
Num Lags Used            0.000000e+00
Num Observations Used    1.000000e+02
Critical Value (1%)     -3.497501e+00
Critical Value (10%)    -2.582435e+00
Critical Value (5%)     -2.890906e+00
dtype: float64

< at 0x23e04b76748>

指数转换不能使序列更平稳

    In [37]: 
  

ts_log = np.log(ts)
ts_log_sdiff = ts_log - ts_log.shift(12)
ts_diff_of_log_sdiff = ts_log_sdiff - ts_log_sdiff.shift(1)
test_stationarity(ts_diff_of_log_sdiff.dropna())

decomposition = seasonal_decompose(ts_diff_of_log_sdiff.dropna(), freq=12)  
fig = plt.figure()  
fig = decomposition.plot()  
fig.set_size_inches(10, 6)
acf_pacf(decomposition.resid, title='Residuals')

Results of Augmented Dickey-Fuller Test:
Test Statistic          -8.882112e+00
p-value                  1.309452e-14
Num Lags Used            0.000000e+00
Num Observations Used    1.000000e+02
Critical Value (1%)     -3.497501e+00
Critical Value (10%)    -2.582435e+00
Critical Value (5%)     -2.890906e+00
dtype: float64

< at 0x23e02c95358>

SARIMA的参数 p, d, q, P, D, Q

已经得知 d=1, D=1 得到平稳的时间序列后需要决定p和q

p: 如果PACF(lag<=k)显著且PACF(lag>k)均不显著, 考虑将p设为k
q: 如果ACF(lag<=k)显著且ACF(lag>k)均不显著, 考虑将q设为k

p: 如果差分后序列的PACF迅速衰减到0而ACF在lag较高处仍有显著的较强的相关, 和/或 ACF(lag=1) > 0, 考虑增加p
q: 如果差分后序列的ACF迅速衰减到0, 和/或 ACF(lag=1) < 0, 考虑增加q

P: 如果季节差分后序列的ACF(lag=季节周期)显著为正, 考虑增加P
Q: 如果季节差分后序列的ACF(lag=季节周期)显著为负, 考虑增加Q

绘出拟合后的ARIMA的拟合残差的ACF和PACF, 按上述规则调整p, q, P, Q
通常P, Q最多为1

这里考虑p=0, q=0
季节周期=12, 季节差分后的ACF(12)显著为负, 可以考虑P=0, Q=1

    In [38]: 
  

acf_pacf(ts_diff_of_sdiff, title='ts_diff_of_sdiff')

拟合SARIMA

    In [39]: 
  

from  import SARIMAX

# SARIMA的参数: order = p, d, q, seasonal_order=P, D, Q, season的周期=12
model = SARIMAX(ts, order=(0,1,0),seasonal_order=(0,1,1,12))

# 已拟合的模型
fitted = model.fit()  

# ARIMA的结果
print(fitted.summary())

# 比较拟合的数据和原始数据,
ts.plot()
fitted.fittedvalues.plot(color='red')

# 拟合的数据和原始数据的残差
residuals = pd.DataFrame(fitted.resid)

acf_pacf(residuals, title='Residuals', figsize=(10,4))

# 计算拟残差平方和(RSS)
plt.title('RSS: %.4f'% sum((fitted.fittedvalues-ts)**2))

# 残差的核密度估计
residuals.plot(kind='kde')
print('\n\nResiduals Summary:\n', residuals.describe())

plt.show()

                                 Statespace Model Results                                 
==========================================================================================
Dep. Variable:                             riders   No. Observations:                  114
Model:             SARIMAX(0, 1, 0)x(0, 1, 1, 12)   Log Likelihood                -504.683
Date:                            Tue, 20 Jun 2017   AIC                           1013.365
Time:                                    05:37:32   BIC                           1018.838
Sample:                                01-31-1960   HQIC                          1015.586
                                     - 06-30-1969                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
.L12      -0.6937      0.118     -5.867      0.000      -0.925      -0.462
sigma2      1185.5644    183.539      6.459      0.000     825.834    1545.295
===================================================================================
Ljung-Box (Q):                       43.16   Jarque-Bera (JB):                 2.01
Prob(Q):                              0.34   Prob(JB):                         0.37
Heteroskedasticity (H):               1.49   Skew:                             0.28
Prob(H) (two-sided):                  0.25   Kurtosis:                         3.39
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).


Residuals Summary:
                 0
count  114.000000
mean     0.920410
std     72.659634
min   -214.018001
25%    -22.898384
50%     -5.576031
75%     18.514757
max    648.000000

预测并校验模型

这种校验方法称为 one-step out-of-sample forecast
out-of-sample 是指预测时所用样本/观测值不在训练集内

将原序列分为训练和测试集
将训练集设为历史, 此时时间为t
迭代, 迭代次数为测试集的大小:
1. 在历史上拟合SARIMA模型, 参数为先前得到的p,d,q,P,D,Q
2. 生成t+1的预测, 并将其加入历史

比较预测值和真实值

    In [40]: 
  

from  import mean_squared_error
# 我们的ARIMA模型针对的是序列ts_week_log
# 因此使用的数据为ts_week_log
# 因此须将预测做指数转换
# 使用 n_test_obs 个点校验
n_test_obs = 20
size = int(len(ts) - n_test_obs)

# 切分数据集
train, test = ts[:size], ts[size:len(ts)]

# 历史值
history = [x for x in train]

# 预测值
predictions = list()

# 置信区间
confidence_intervals = list()

print('Predicted vs Expected Values...')
print('\n')
for t in range(len(test)):
    model = SARIMAX(history, order=(0,1,0),seasonal_order=(0,1,1,12))
    model_fit = model.fit()
    output = model_fit.forecast(steps=1)
    yhat = output[0]
    predictions.append(float(yhat))
    obs = test[t]
    history.append(obs)
    
    print('predicted=%f, expected=%f' % (yhat, obs))

error = mean_squared_error(test, predictions)

print('\n')
predictions_series = pd.Series(predictions, index = test.index)
print('Test MSE: %.6f' % error)

predictions_series = pd.Series(predictions, index = test.index)

fig, ax = plt.subplots(figsize=(10, 6))
ax.set(title='Bus Riders', xlabel='Month', ylabel='100 Bus Riders')
ax.plot(ts[-n_test_obs:], 'o', label='observed')
ax.plot(predictions_series, 'g', label='rolling one-step out-of-sample forecast')
legend = ax.legend(loc='upper left')

Predicted vs Expected Values...


predicted=1475.509981, expected=1424.000000
predicted=1367.194981, expected=1360.000000
predicted=1441.147682, expected=1429.000000
predicted=1461.899309, expected=1440.000000
predicted=1406.502132, expected=1414.000000
predicted=1438.593286, expected=1424.000000
predicted=1396.225163, expected=1408.000000
predicted=1386.002350, expected=1337.000000
predicted=1299.644308, expected=1258.000000
predicted=1214.892724, expected=1214.000000
predicted=1324.750929, expected=1326.000000
predicted=1353.201730, expected=1417.000000
predicted=1424.747027, expected=1417.000000
predicted=1357.550530, expected=1329.000000
predicted=1407.819886, expected=1461.000000
predicted=1483.069065, expected=1425.000000
predicted=1392.212819, expected=1419.000000
predicted=1441.571043, expected=1432.000000
predicted=1408.266795, expected=1394.000000
predicted=1357.971777, expected=1327.000000


Test MSE: 1049.765808

    In [ ]: 
  

秒客网

TS sample code

Jupyter 设置

读取数据及简单的预处理

载入数据集

重取样时间

自相关与偏自相关函数

白噪声检验

稳定时间序列

检查稳定性(Stationarity)

移除趋势和季节性

建立模型

找到合适的差分阶数 d

AR和MA项数 p, q

拟合ARIMA模型

将拟合的值逆向转换并可视化

预测并校验模型

可以对季节性建模的 SARIMA 模型

平稳化序列

SARIMA的参数 p, d, q, P, D, Q

拟合SARIMA

预测并校验模型

相关文章