1,Fancier optimization
Problems with SGD:
- If the loss changes quickly in one direction and slowly in another, SGD makes very slow progress along the shallow dimension and jitters along the steep direction.
- It can get stuck at local minima or saddle points, where the gradient is zero.
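For reference, plain SGD is just the loop below (a minimal sketch; compute_gradient, x, and learning_rate are placeholders matching the code used later in this section):
# plain SGD, for comparison with the variants below
while True:
    dx = compute_gradient(x)    # gradient of the loss at the current x
    x -= learning_rate * dx     # step straight down the gradient; no state is kept between steps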
Several gradient descent update rules:
A, SGD+Momentum: tries to fix the saddle-point and local-minimum problems of SGD
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx            # accumulate a running "velocity" of gradients (rho ~ 0.9)
    x -= learning_rate * vx       # step along the velocity instead of the raw gradient
## Or (the variant used in the CS231n assignments):
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx - learning_rate * dx   # fold the learning rate into the velocity
    x += vx
B, AdaGrad: tries to fix the zigzag problem of SGD
import numpy as np

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx       # accumulate the element-wise squared gradients
    # scale each parameter's step by the inverse of its historical gradient magnitude
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
However, after many iterations grad_squared becomes very large and x can barely change, which motivates the improved version, RMSProp.
C, RMSProp: uses a decay rate to limit the growth of grad_squared
grad_squared = 0
while True:
    dx = compute_gradient(x)
    # leaky accumulation: old history decays, so the denominator cannot grow without bound
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
D, Adam: combines Momentum and RMSProp
first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):    # start at 1 so the bias correction is well defined
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx            ## Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx     ## RMSProp
    ## Bias correction for the fact that the first and second moment estimates start at zero
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)   ## Momentum + RMSProp step
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
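As a usage sketch (my own packaging, not from the lecture), the Adam update above can be wrapped in a step function with those recommended defaults; adam_step and its state dict are hypothetical names:
import numpy as np

def adam_step(x, dx, state, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    # state holds the running moment estimates and the step counter t
    state['t'] += 1
    state['m'] = beta1 * state['m'] + (1 - beta1) * dx
    state['v'] = beta2 * state['v'] + (1 - beta2) * dx * dx
    m_hat = state['m'] / (1 - beta1 ** state['t'])   # bias-corrected first moment
    v_hat = state['v'] / (1 - beta2 ** state['t'])   # bias-corrected second moment
    return x - learning_rate * m_hat / (np.sqrt(v_hat) + eps)

# usage: state = {'m': 0.0, 'v': 0.0, 't': 0}, then x = adam_step(x, dx, state) each iteration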
2,Regularization
2.1,Dropout:
Each neuron is forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
The neurons which are "dropped out" in this way do not contribute to the forward pass and do not participate in back-propagation. At test time, we use all the neurons but multiply their outputs by the dropout rate (0.5 in that paper). (Krizhevsky et al., NIPS 2012)
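A minimal sketch of dropout on one layer. Note this uses the "inverted dropout" form common in the CS231n assignments, which rescales at train time instead of multiplying at test time as described above; p and the layer activations h are assumptions:
import numpy as np

p = 0.5   # probability of keeping a unit (an assumed value)

def dropout_forward(h, train=True):
    # inverted dropout: scale the surviving activations at train time,
    # so the forward pass at test time needs no extra multiplication
    if train:
        mask = (np.random.rand(*h.shape) < p) / p
        return h * mask
    return h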
2.2,Data augmentation
1,Horizontal flips
2,Random crops and scales
- Training: sample random crops / scales
ResNet:
- Pick random L in range [256, 480]
- Resize training image, short side = L
- Sample random 224 x 224 patch
- Testing: average a fixed set of crops
ResNet:
- Resize image at 5 scales: {224, 256, 384, 480, 640}
- For each size, use 10 224 x 224 crops: 4 corners + center, + flips
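A minimal sketch of the training-time recipe described above (random scale, random 224 x 224 crop, random horizontal flip) with NumPy; it assumes img is an H x W x 3 array and uses a simple nearest-neighbour resize so the snippet stays self-contained:
import numpy as np

def augment(img, crop=224):
    # pick a random scale L in [256, 480] and resize so the short side equals L (nearest neighbour)
    L = np.random.randint(256, 481)
    h, w = img.shape[:2]
    scale = L / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    rows = (np.arange(new_h) * h / new_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    resized = img[rows][:, cols]
    # sample a random 224 x 224 patch
    y = np.random.randint(new_h - crop + 1)
    x = np.random.randint(new_w - crop + 1)
    patch = resized[y:y + crop, x:x + crop]
    # random horizontal flip
    if np.random.rand() < 0.5:
        patch = patch[:, ::-1]
    return patch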
3,PCA jittering
Derivation of PCA:
Given a data matrix $X$ with samples $x_1, x_2, \ldots, x_N$, the goal is to find a vector $u$ that maximizes the variance of the samples after projection onto $u$, i.e. the $u$ that maximizes the expression below, where $\bar{x} = \frac{1}{N}\sum_{k=1}^{N} x_k$.
$$\frac{1}{N}\sum_{k=1}^{N}\left(x_k^T u - \bar{x}^T u\right)^2 = u^T\left\{\frac{1}{N}\sum_{k=1}^{N}(x_k - \bar{x})(x_k - \bar{x})^T\right\} u = u^T S u$$
Let $S = \frac{1}{N}\sum_{k=1}^{N}(x_k - \bar{x})(x_k - \bar{x})^T$ and require $u$ to be a unit vector ($u^T u = 1$). Maximizing $u^T S u$ under this constraint with a Lagrange multiplier $\lambda$ and differentiating with respect to $u$ gives $Su - \lambda u = 0$,
so $u$ is an eigenvector of $S$, whose dimension equals that of the data.
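The derivation above translates directly into code: build the covariance matrix S of the pixel colours, take its eigenvectors, and (for AlexNet-style PCA colour jittering) shift every pixel by a random combination of those eigenvectors. A minimal sketch, assuming img is an H x W x 3 RGB array and alpha_std is a made-up name for the jitter strength:
import numpy as np

def pca_color_jitter(img, alpha_std=0.1):
    pixels = img.reshape(-1, 3).astype(float)     # each pixel is one 3-D sample x_k
    # covariance matrix S of the pixel colours (np.cov subtracts the mean x_bar internally)
    S = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)          # solutions of S u = lambda u
    alphas = np.random.normal(0, alpha_std, size=3)
    # AlexNet-style jitter: add the same random RGB shift, built from the principal components, to all pixels
    shift = eigvecs @ (alphas * eigvals)
    return img + shift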