1,Fancier optimization
Problems with SGD:
- If the loss changes quickly in one direction and slowly in another, SGD makes very slow progress along the shallow dimension and jitters along the steep direction.
- It can get stuck at local minima or saddle points, where the gradient is zero.
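For reference, plain SGD is just the loop below (a minimal sketch; compute_gradient, x, and learning_rate are placeholders matching the code used later in this section):
# plain SGD, for comparison with the variants below
while True:
    dx = compute_gradient(x)    # gradient of the loss at the current x
    x -= learning_rate * dx     # step straight down the gradient; no state is kept between steps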
Several gradient descent update rules:
A, SGD+Momentum: tries to fix the saddle-point and local-minimum problems of SGD
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx            # accumulate a running "velocity" of gradients (rho ~ 0.9)
    x -= learning_rate * vx       # step along the velocity instead of the raw gradient
## Or (the variant used in the CS231n assignments):
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx - learning_rate * dx   # fold the learning rate into the velocity
    x += vx
B, AdaGrad: tries to fix the zigzag problem of SGD
import numpy as np

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx       # accumulate the element-wise squared gradients
    # scale each parameter's step by the inverse of its historical gradient magnitude
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
However, after many iterations grad_squared becomes very large and x can barely change, which motivates the improved version, RMSProp.
C, RMSProp: uses a decay rate to limit the growth of grad_squared
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # leaky accumulation: old history decays, so the denominator cannot grow without bound
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
D, Adam: combines Momentum and RMSProp
first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):    # start at 1 so the bias correction is well defined
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx            ## Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx     ## RMSProp
    ## Bias correction for the fact that the first and second moment estimates start at zero
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)   ## Momentum + RMSProp step
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
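As a usage sketch (my own packaging, not from the lecture), the Adam update above can be wrapped in a step function with those recommended defaults; adam_step and its state dict are hypothetical names:
import numpy as np

def adam_step(x, dx, state, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    # state holds the running moment estimates and the step counter t
    state['t'] += 1
    state['m'] = beta1 * state['m'] + (1 - beta1) * dx
    state['v'] = beta2 * state['v'] + (1 - beta2) * dx * dx
    m_hat = state['m'] / (1 - beta1 ** state['t'])   # bias-corrected first moment
    v_hat = state['v'] / (1 - beta2 ** state['t'])   # bias-corrected second moment
    return x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)

# usage: state = {'m': 0.0, 'v': 0.0, 't': 0}, then x = adam_step(x, dx, state) each iteration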
2,Regularization
2.1,Dropout:
Each neuron is forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
The neurons which are "dropped out" in this way do not contribute to the forward pass and do not participate in back-propagation. At test time, we use all the neurons but multiply their outputs by the dropout rate (0.5 in that paper). (Krizhevsky et al., NIPS 2012)
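A minimal sketch of dropout on one layer. Note this uses the "inverted dropout" form common in the CS231n assignments, which rescales at train time instead of multiplying at test time as described above; p and the layer activations h are assumptions:
import numpy as np

p = 0.5   # probability of keeping a unit (an assumed value)

def dropout_forward(h, train=True):
    # inverted dropout: scale the surviving activations at train time,
    # so the forward pass at test time needs no extra multiplication
    if train:
        mask = (np.random.rand(*h.shape) < p) / p
        return h * mask
    return h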
2.2,Data augmentation
1,Horizontal flips
2,Random crops and scales
- Training: sample random crops / scales
ResNet:
- Pick random L in range [256, 480]
- Resize training image, short side = L
- Sample random 224 x 224 patch
- Testing: average a fixed set of crops
ResNet:
- Resize image at 5 scales: {224, 256, 384, 480, 640}
- For each size, use 10 224 x 224 crops: 4 corners + center, + flips
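A minimal sketch of the training-time recipe described above (random scale, random 224 x 224 crop, random horizontal flip) with NumPy; it assumes img is an H x W x 3 array and uses a simple nearest-neighbour resize so the snippet stays self-contained:
import numpy as np

def augment(img, crop=224):
    # pick a random scale L in [256, 480] and resize so the short side equals L (nearest neighbour)
    L = np.random.randint(256, 481)
    h, w = img.shape[:2]
    scale = L / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    rows = (np.arange(new_h) * h / new_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    resized = img[rows][:, cols]
    # sample a random 224 x 224 patch
    y = np.random.randint(new_h - crop + 1)
    x = np.random.randint(new_w - crop + 1)
    patch = resized[y:y + crop, x:x + crop]
    # random horizontal flip
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]
    return patch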
3,PCA jittering
Derivation of PCA:
Given a data matrix $X$ with samples $x_1, x_2, \ldots, x_N$, the goal is to find a vector $u$ that maximizes the variance of the samples after projection onto $u$, i.e. the $u$ that maximizes the expression below, where $\bar{x} = \frac{1}{N}\sum_{k=1}^{N} x_k$.
$$\frac{1}{N}\sum_{k=1}^{N}\left(x_k^T u - \bar{x}^T u\right)^2 = u^T\left\{\frac{1}{N}\sum_{k=1}^{N}(x_k - \bar{x})(x_k - \bar{x})^T\right\} u = u^T S u$$
Let $S = \frac{1}{N}\sum_{k=1}^{N}(x_k - \bar{x})(x_k - \bar{x})^T$ and require $u$ to be a unit vector ($u^T u = 1$). Maximizing $u^T S u$ under this constraint with a Lagrange multiplier $\lambda$ and differentiating with respect to $u$ gives $Su - \lambda u = 0$,
so $u$ is an eigenvector of $S$, whose dimension equals that of the data.
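The derivation above translates directly into code: build the covariance matrix S of the pixel colours, take its eigenvectors, and (for AlexNet-style PCA colour jittering) shift every pixel by a random combination of those eigenvectors. A minimal sketch, assuming img is an H x W x 3 RGB array and alpha_std is a made-up name for the jitter strength:
import numpy as np

def pca_color_jitter(img, alpha_std=0.1):
    pixels = img.reshape(-1, 3).astype(float)     # each pixel is one 3-D sample x_k
    # covariance matrix S of the pixel colours (np.cov subtracts the mean x_bar internally)
    S = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)          # solutions of S u = lambda u
    alphas = np.random.normal(0, alpha_std, size=3)
    # AlexNet-style jitter: add the same random RGB shift, built from the principal components, to all pixels
    shift = eigvecs @ (alphas * eigvals)
    return img + shift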