The previous part mainly covered how a neural network processes data during forward propagation: mean-subtraction preprocessing, weight initialization, batch normalization (BN) between the linear output and the activation function, and dropout before entering the next layer.
This part looks at how the gradient-descent step itself is handled during back-propagation, i.e. several optimization algorithms.
I really admire these people... For gradient descent you might think the plain update $w := w - \alpha\,dw$ is all there is, yet researchers still managed to make a big deal out of it. Impressive, impressive!!
- Stochastic gradient descent (SGD)
- Momentum
- RMSprop
- Adam
- Learning rate decay
- Local optima
1. Stochastic gradient descent (SGD)
This is really just mini-batch gradient descent: instead of processing the whole dataset, each iteration randomly samples batch_size examples from it, runs forward and backward propagation on that mini-batch, and performs one gradient-descent step.
Python code:
def sgd(w, dw, config=None):
    """
    Performs vanilla stochastic gradient descent.

    config format:
    - learning_rate: Scalar learning rate.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)   # {'learning_rate': 1e-2}
    w -= config['learning_rate'] * dw          # the plain SGD update
    return w, config
2. Exponentially weighted averages
Also known in statistics as a moving weighted average.
The update rule is $v_t = \beta v_{t-1} + (1-\beta)\,\theta_t$.
Assuming $\beta = 0.9$, we get the exponentially weighted average shown in the figure, which is roughly an average over the last $\frac{1}{1-\beta}$ values.
Clearly, the larger $\beta$ is, the larger the weight $\beta^n$ given to the value from $n$ days back, and the more days $n$ the average effectively covers. With $\beta = 0.9$ and $n = 10$, $\beta^n \approx \frac{1}{e} \approx 0.35$; beyond that the weights are too small to bother counting.
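To make this concrete, here is a minimal sketch (my own, not from the original notes) that smooths a noisy sequence; the names theta, beta, and ewa are illustrative:

import numpy as np

def ewa(theta, beta=0.9):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v = 0.0
    out = []
    for x in theta:
        v = beta * v + (1 - beta) * x
        out.append(v)                  # biased toward 0 for the first few steps (see bias correction below)
    return np.array(out)

# usage: smooth a noisy signal, roughly averaging over the last 1/(1-beta) = 10 values
theta = np.sin(np.linspace(0, 3, 50)) + 0.3 * np.random.randn(50)
v = ewa(theta, beta=0.9)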
3. Bias correction for exponentially weighted averages
This mainly targets the early phase of the estimate: since $v_0 = 0$, the first few values of $v_t$ are biased toward zero, so we use $\frac{v_t}{1-\beta^t}$ instead. For example, with $\beta = 0.9$ the uncorrected $v_1 = 0.1\,\theta_1$, while the corrected value $v_1/(1-0.9) = \theta_1$.
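A quick sketch of the corrected version (again my own illustration, mirroring the ewa sketch above):

def ewa_corrected(theta, beta=0.9):
    """Exponentially weighted average with the bias correction v_t / (1 - beta**t)."""
    v = 0.0
    out = []
    for t, x in enumerate(theta, start=1):
        v = beta * v + (1 - beta) * x
        out.append(v / (1 - beta ** t))   # undoes the bias toward 0 when t is small
    return out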
4. Gradient descent with momentum
The idea: oscillations in the vertical direction roughly cancel out in the weighted average, while the horizontal direction consistently points toward decreasing loss, so movement along that direction is accelerated.
The update is $v_{dW} = \beta\, v_{dW} + (1-\beta)\,dW$, followed by $W = W - \alpha\, v_{dW}$.
Hyperparameters: the learning rate $\alpha$ and the momentum coefficient $\beta$ (typically $\beta = 0.9$).
Python code:
import numpy as np

def sgd_momentum(w, dw, config=None):
    """
    Performs stochastic gradient descent with momentum.

    config format:
    - learning_rate: Scalar learning rate.
    - momentum: Scalar between 0 and 1 giving the momentum value.  ## exponentially weighted average coefficient
      Setting momentum = 0 reduces to sgd.
    - velocity: A numpy array of the same shape as w and dw used to store a
      moving average of the gradients.  ## the averaged velocity, same shape as w
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('momentum', 0.9)
    v = config.get('velocity', np.zeros_like(w))  # initialized to zero
    next_w = None
    ## classic momentum update
    v = config['momentum'] * v - config['learning_rate'] * dw
    next_w = w + v
    ## Ng's formulation is slightly different:
    # v = config['momentum'] * v + (1 - config['momentum']) * dw
    # next_w = w - config['learning_rate'] * v
    config['velocity'] = v
    return next_w, config
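As a usage sketch (not in the original), these update rules are meant to be called once per iteration, with config carrying state between calls; the toy quadratic loss and the names below are just illustrative:

w = np.array([5.0, -3.0])                      # toy parameters
config = None
for step in range(100):
    dw = 2 * w                                 # gradient of the toy loss ||w||^2
    w, config = sgd_momentum(w, dw, config)    # config carries the velocity between steps
# after the loop w is close to the minimum at [0, 0]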
5. RMSprop (root mean square prop)
How RMSprop works: suppose the parameter whose vertical oscillation we want to damp is $b$. When the oscillation is large, $db$ is large; after the exponentially weighted average (unlike momentum, we average the squared gradient), $S_{db} = \beta S_{db} + (1-\beta)(db)^2$ is clearly a fairly large number, so the update $b := b - \alpha\, db / \sqrt{S_{db}}$ becomes relatively small and the movement along the vertical direction shrinks.
The parameter we want to accelerate horizontally is $w$: when $dw$ is small, the same argument runs in reverse (dividing by a small $\sqrt{S_{dw}}$ enlarges the step).
Of course, in a high-dimensional space the directions whose oscillation needs damping might be $W_1, W_3, W_{10}\ldots$ while the ones to accelerate might be $W_2, W_5\ldots$
def rmsprop(x, dx, config=None):
    """
    Uses the RMSProp update rule, which uses a moving average of squared
    gradient values to set adaptive per-parameter learning rates.

    config format:
    - learning_rate: Scalar learning rate.
    - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared
      gradient cache.
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - cache: Moving average of second moments of gradients.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)   ## learning rate
    config.setdefault('decay_rate', 0.99)      ## decay rate of the exponentially weighted average
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(x))
    next_x = None
    # moving average of the squared gradient
    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx ** 2)
    # divide the step by sqrt(cache); epsilon avoids division by zero
    next_x = x - config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
    return next_x, config
6. Adam optimization algorithm
Adam stands for Adaptive Moment Estimation.
Ng's formulas are unusually tidy here, so I won't retype them all... though I'm not sure what the $\hat{y}$ in the middle is about...
The idea behind Adam is also simple: it combines momentum and RMSprop, and applies bias correction to both.
Hyperparameter settings:
- $\alpha$: the learning rate, needs to be tuned
- $\beta_1 = 0.9$: the decay coefficient for the exponentially weighted average of $dW$
- $\beta_2 = 0.999$: the decay coefficient for the exponentially weighted average of $(dW)^2$
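For reference, the update that the code below implements can be written out compactly (my own transcription of the standard Adam rule, matching the code rather than a copy of Ng's slide):

$$v = \beta_1 v + (1-\beta_1)\,dW, \qquad s = \beta_2 s + (1-\beta_2)\,(dW)^2$$
$$v^{\text{corrected}} = \frac{v}{1-\beta_1^{\,t}}, \qquad s^{\text{corrected}} = \frac{s}{1-\beta_2^{\,t}}, \qquad W = W - \alpha\,\frac{v^{\text{corrected}}}{\sqrt{s^{\text{corrected}}}+\varepsilon}$$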
Python code:
def adam(x, dx, config=None):
    """
    Uses the Adam update rule, which incorporates moving averages of both the
    gradient and its square and a bias correction term.

    config format:
    - learning_rate: Scalar learning rate.
    - beta1: Decay rate for moving average of first moment of gradient.   ## decay rate for the EWA of dw
    - beta2: Decay rate for moving average of second moment of gradient.  ## decay rate for the EWA of dw^2
    - epsilon: Small scalar used for smoothing to avoid dividing by zero.
    - m: Moving average of gradient.          ## moving average of dw
    - v: Moving average of squared gradient.  ## moving average of dw^2
    - t: Iteration number.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))
    config.setdefault('v', np.zeros_like(x))
    config.setdefault('t', 0)   ## iteration counter; must be incremented each call or the bias correction never changes
    next_x = None
    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx ** 2)
    ## bias correction
    m_correct = config['m'] / (1 - config['beta1'] ** config['t'])
    v_correct = config['v'] / (1 - config['beta2'] ** config['t'])
    next_x = x - config['learning_rate'] * m_correct / (np.sqrt(v_correct) + config['epsilon'])
    return next_x, config
7. Learning rate decay
1) Exponential decay: $\alpha = 0.95^{\text{epoch\_num}} \cdot \alpha_0$ (any constant less than 1 works in place of 0.95)
2)
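As an illustrative sketch (the function name decayed_lr and the schedule choices are mine; the formulas are the common ones from Ng's course):

def decayed_lr(alpha0, epoch_num, decay_rate=0.95, schedule='exponential'):
    """Return the learning rate for a given epoch under a simple decay schedule."""
    if schedule == 'exponential':      # alpha = decay_rate**epoch_num * alpha0
        return (decay_rate ** epoch_num) * alpha0
    elif schedule == 'inverse':        # alpha = alpha0 / (1 + decay_rate * epoch_num)
        return alpha0 / (1 + decay_rate * epoch_num)
    raise ValueError('unknown schedule')

# usage: shrink the step size epoch by epoch
lrs = [decayed_lr(0.01, e) for e in range(4)]   # [0.01, 0.0095, 0.009025, 0.00857375]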
8. Local optima
The drawing style is soooo cute!!! People draw little stick figures the same way all over the world hahahhhhh
A saddle point is a point where the gradient is also zero, but which is not an optimum.
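A tiny worked example (my own, not from the post): for $f(x, y) = x^2 - y^2$ the gradient $(2x,\,-2y)$ vanishes at the origin, yet the origin is a minimum along the $x$ direction and a maximum along the $y$ direction, so it is a saddle point rather than an optimum.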
import numpy as np
# quick check that ** squares a NumPy array element-wise, as used for dx**2 in the update rules above
a = np.array([1, 2])
a ** 2
# output: array([1, 4])