1. Optimizer
1.0 Basic Usage
- An optimizer updates the model's learnable parameters during training. Commonly used optimizers include SGD, RMSprop, and Adam.
- When an optimizer is constructed, it is given the model's learnable parameters together with hyperparameters such as `lr` and `momentum`.
- During training, first call `optimizer.zero_grad()` to clear the gradients, then call `loss.backward()` to backpropagate, and finally call `optimizer.step()` to update the model parameters.

A simple usage example is shown below:
```python
import torch
import numpy as np
import warnings

warnings.filterwarnings('ignore')  # ignore warnings

# Fit y = sin(x) with a third-order polynomial
x = torch.linspace(-np.pi, np.pi, 2000)
y = torch.sin(x)

# Build the input features (x, x^2, x^3)
p = torch.tensor([1, 2, 3])
xx = x.unsqueeze(-1).pow(p)

model = torch.nn.Sequential(
    torch.nn.Linear(3, 1),
    torch.nn.Flatten(0, 1)
)
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-3
# RMSprop is assumed here; other optimizers such as SGD or Adam are constructed the same way
optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)

for t in range(1, 1001):
    y_pred = model(xx)
    loss = loss_fn(y_pred, y)
    if t % 100 == 0:
        print('No.{: 5d}, loss: {:.6f}'.format(t, loss.item()))
    optimizer.zero_grad()  # clear the gradients
    loss.backward()        # backpropagate to compute gradients
    optimizer.step()       # update the parameters by gradient descent
```
```
No.  100, loss: 26215.714844
No.  200, loss: 11672.815430
No.  300, loss: 4627.826172
No.  400, loss: 1609.388062
No.  500, loss: 677.805115
No.  600, loss: 473.932159
No.  700, loss: 384.862396
No.  800, loss: 305.365143
No.  900, loss: 229.774719
No. 1000, loss: 161.483841
```
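The hyperparameters passed to the constructor end up in `optimizer.param_groups`, which can also be inspected or modified after construction, for example to decay the learning rate by hand. A minimal sketch continuing the example above (the decay factor is arbitrary):

```python
# each entry of optimizer.param_groups is a dict holding that group's hyperparameters
print(optimizer.param_groups[0]['lr'])  # 0.001

# manually decay the learning rate of every group by a factor of 10
for group in optimizer.param_groups:
    group['lr'] *= 0.1
```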
1.1 Optimizers in PyTorch
All optimizers inherit from the base class `Optimizer`. The optimizers provided by PyTorch are listed below (a short sketch after the list shows that they all share the same construction interface):
- SGD
- ASGD
- Adadelta
- Adagrad
- Adam
- AdamW
- Adamax
- SparseAdam
- RMSprop
- Rprop
- LBFGS
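Despite implementing different algorithms, these classes are constructed the same way: the parameters to optimize come first, followed by algorithm-specific hyperparameters. A minimal sketch (the toy model and hyperparameter values are arbitrary):

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 2)  # arbitrary toy model

# different algorithms, same interface: parameters first, then hyperparameters
sgd = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
adam = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
rmsprop = optim.RMSprop(model.parameters(), lr=1e-2, alpha=0.99)

# they all inherit from the common Optimizer base class
print(all(isinstance(o, optim.Optimizer) for o in (sgd, adam, rmsprop)))  # True
```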
1.2 The Optimizer Base Class
`Optimizer` is the parent class of all optimizers. Its main public methods are the following (each is exercised in the sketch after this list):
- add_param_group(param_group): add a group of learnable model parameters
- step(closure): perform a single parameter update
- zero_grad(): clear the gradients recorded in the previous iteration
- state_dict(): return the optimizer's parameters and state as a dict
- load_state_dict(state_dict): load parameters and state from a dict returned by state_dict()
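A short, hedged walkthrough of these methods on a toy model (all names and sizes below are made up for illustration):

```python
import torch
from torch import nn, optim

model = nn.Linear(4, 1)                                    # toy model
opt = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# zero_grad() + step(): one full update after computing gradients
loss = model(torch.randn(8, 4)).sum()
opt.zero_grad()        # clear gradients left over from the previous iteration
loss.backward()
opt.step()             # apply one SGD update

# add_param_group(): register extra parameters later, e.g. a newly unfrozen layer
extra = nn.Linear(4, 1)
opt.add_param_group({'params': extra.parameters(), 'lr': 0.01})

# state_dict() / load_state_dict(): serialize and restore the optimizer state
snapshot = opt.state_dict()
opt.load_state_dict(snapshot)
```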
1.2.1 Initializing an Optimizer
To initialize an optimizer, simply pass the model's learnable parameters (params) and the hyperparameters (defaults) to its constructor. The core of `Optimizer`'s initialization code is shown below:
```python
class Optimizer(object):
    def __init__(self, params, defaults):
        # dict passed in by the subclass, holding the default hyperparameters for all param groups
        self.defaults = defaults

        if isinstance(params, torch.Tensor):
            raise TypeError("params argument given to the optimizer should be "
                            "an iterable of Tensors or dicts, but got " +
                            torch.typename(params))

        self.param_groups = []

        param_groups = list(params)
        if not isinstance(param_groups[0], dict):
            param_groups = [{'params': param_groups}]

        for param_group in param_groups:
            self.add_param_group(param_group)
```
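As the code above shows, either a plain iterable of Tensors or a list of dicts is accepted, and both end up as entries of `self.param_groups`. A small check (toy model, arbitrary values):

```python
import torch
from torch import nn
from torch.optim import SGD

model = nn.Linear(3, 1)

# case 1: a plain iterable of Tensors -> wrapped into a single param group
opt1 = SGD(model.parameters(), lr=0.1)
print(len(opt1.param_groups))       # 1
print(opt1.param_groups[0]['lr'])   # 0.1 (taken from the defaults)

# case 2: a list of dicts -> one param group per dict
opt2 = SGD([{'params': [model.weight]},
            {'params': [model.bias], 'lr': 0.01}], lr=0.1)
print(len(opt2.param_groups))       # 2
print(opt2.param_groups[1]['lr'])   # 0.01 (group-specific override)
```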
1.2.2 add_param_group
This method is called during initialization; it adds a group of model parameters to `self.param_groups`.
```python
def add_param_group(self, param_group):
    r"""Add a param group to the :class:`Optimizer` s `param_groups`.

    This can be useful when fine tuning a pre-trained network as frozen layers can be made
    trainable and added to the :class:`Optimizer` as training progresses.

    Arguments:
        param_group (dict): Specifies what Tensors should be optimized along with group
            specific optimization options.
    """
    assert isinstance(param_group, dict), "param group must be a dict"

    params = param_group['params']
    if isinstance(params, torch.Tensor):
        param_group['params'] = [params]
    elif isinstance(params, set):
        raise TypeError('optimizer parameters need to be organized in ordered collections, but '
                        'the ordering of tensors in sets will change between runs. Please use a list instead.')
    else:
        param_group['params'] = list(params)

    for param in param_group['params']:
        if not isinstance(param, torch.Tensor):
            raise TypeError("optimizer can only optimize Tensors, "
                            "but one of the params is " + torch.typename(param))
        if not param.is_leaf:
            raise ValueError("can't optimize a non-leaf Tensor")

    # fill the group with the default hyperparameters it does not override
    for name, default in self.defaults.items():
        if default is required and name not in param_group:
            raise ValueError("parameter group didn't specify a value of required optimization parameter " + name)
        else:
            param_group.setdefault(name, default)

    params = param_group['params']
    if len(params) != len(set(params)):
        warnings.warn("optimizer contains a parameter group with duplicate parameters; "
                      "in future, this will cause an error; "
                      "see github.com/pytorch/pytorch/issues/40967 for more information", stacklevel=3)

    param_set = set()
    for group in self.param_groups:
        param_set.update(set(group['params']))

    if not param_set.isdisjoint(set(param_group['params'])):
        raise ValueError("some parameters appear in more than one parameter group")

    self.param_groups.append(param_group)
```
Using `add_param_group`, different groups of learnable parameters can be given different hyperparameters. The optimizer can be initialized with a list of dicts, where each dict contains the key `params` and, optionally, hyperparameter names such as `lr`. Below is a practical example: setting a different learning rate for the model's `fc` layer.
```python
from torch.optim import SGD
from torch import nn

class DummyModel(nn.Module):
    def __init__(self, class_num=10):
        super(DummyModel, self).__init__()
        self.base = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(128, class_num)

    def forward(self, x):
        x = self.base(x)
        x = self.gap(x)
        x = x.view(x.shape[0], -1)
        x = self.fc(x)
        return x

model = DummyModel().cuda()

optimizer = SGD([
    {'params': model.base.parameters()},
    {'params': model.fc.parameters(), 'lr': 1e-3}  # use a different learning rate for the fc parameters
], lr=1e-2, momentum=0.9)
```
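As the docstring above mentions, `add_param_group` can also be called after the optimizer has been created, for example to unfreeze a pretrained backbone partway through fine tuning. A hedged sketch reusing the `DummyModel` defined above:

```python
# start by optimizing only the fc head; the backbone stays frozen
for p in model.base.parameters():
    p.requires_grad = False
optimizer = SGD([{'params': model.fc.parameters()}], lr=1e-2, momentum=0.9)

# ... later in training, unfreeze the backbone and hand it to the optimizer
for p in model.base.parameters():
    p.requires_grad = True
optimizer.add_param_group({'params': model.base.parameters(), 'lr': 1e-3})

print(len(optimizer.param_groups))  # 2
```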
1.2.3 step
This method performs a single update of the model parameters.
- The base class `Optimizer` only defines the interface of the `step` method, as shown below:
```python
def step(self, closure):
    r"""Performs a single optimization step (parameter update).

    Arguments:
        closure (callable): A closure that reevaluates the model and
            returns the loss. Optional for most optimizers.

    .. note::
        Unless otherwise specified, this function should not modify the
        ``.grad`` field of the parameters.
    """
    raise NotImplementedError
```
- A subclass such as SGD must implement `step` itself, as shown below:
```python
@torch.no_grad()
def step(self, closure=None):
    """Performs a single optimization step.

    Arguments:
        closure (callable, optional): A closure that reevaluates the model
            and returns the loss.
    """
    loss = None
    if closure is not None:
        with torch.enable_grad():
            loss = closure()

    for group in self.param_groups:
        weight_decay = group['weight_decay']
        momentum = group['momentum']
        dampening = group['dampening']
        nesterov = group['nesterov']

        for p in group['params']:
            if p.grad is None:
                continue
            d_p = p.grad
            if weight_decay != 0:
                d_p = d_p.add(p, alpha=weight_decay)
            if momentum != 0:
                param_state = self.state[p]
                if 'momentum_buffer' not in param_state:
                    buf = param_state['momentum_buffer'] = torch.clone(d_p).detach()
                else:
                    buf = param_state['momentum_buffer']
                    buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
                if nesterov:
                    d_p = d_p.add(buf, alpha=momentum)
                else:
                    d_p = buf

            p.add_(d_p, alpha=-group['lr'])

    return loss
```
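With no momentum and no weight decay, the loop above reduces to the vanilla gradient-descent update p ← p − lr · p.grad, which can be verified on a single tensor (values are arbitrary):

```python
import torch
from torch.optim import SGD

p = torch.tensor([1.0, 2.0], requires_grad=True)
opt = SGD([p], lr=0.1)

loss = (p ** 2).sum()  # gradient of the loss is 2 * p = [2.0, 4.0]
loss.backward()
opt.step()

print(p)  # tensor([0.8000, 1.6000], ...) i.e. [1.0, 2.0] - 0.1 * [2.0, 4.0]
```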
- The `step` method can take a closure function (`closure`). This exists mainly to support optimization algorithms such as Conjugate Gradient and LBFGS, which need to re-evaluate the model multiple times per update.
- Recall the closure concept in Python: an inner function that references variables from its enclosing scope (and is usually returned by the outer function) is called a closure; a minimal illustration follows this list.
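A generic Python closure, unrelated to any optimizer (the names here are made up for illustration):

```python
def make_counter():
    count = 0                  # variable in the enclosing scope

    def counter():             # inner function referencing `count` -> a closure
        nonlocal count
        count += 1
        return count

    return counter             # the outer function returns the inner function

tick = make_counter()
print(tick(), tick(), tick())  # 1 2 3
```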
Below is a simple example of passing a `closure` to the optimizer:
```python
from torch.nn import CrossEntropyLoss

dummy_model = DummyModel().cuda()

optimizer = SGD(dummy_model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)
# define the loss
loss_fn = CrossEntropyLoss()
# define the data
batch_size = 2
data = torch.randn(64, 3, 64, 128).cuda()  # fake data of shape 64 x 3 x 64 x 128
data_label = torch.randint(0, 10, size=(64,), dtype=torch.long).cuda()  # fake labels

for batch_index in range(10):
    batch_data = data[batch_index * batch_size: batch_index * batch_size + batch_size]
    batch_label = data_label[batch_index * batch_size: batch_index * batch_size + batch_size]

    def closure():
        optimizer.zero_grad()                # clear the gradients
        output = dummy_model(batch_data)     # forward
        loss = loss_fn(output, batch_label)  # compute the loss
        loss.backward()                      # backward
        print('No.{: 2d} loss: {:.6f}'.format(batch_index, loss.item()))
        return loss

    optimizer.step(closure=closure)          # update the parameters
```
```
No. 0 loss: 2.279336
No. 1 loss: 2.278228
No. 2 loss: 2.291000
No. 3 loss: 2.245984
No. 4 loss: 2.236940
No. 5 loss: 2.104764
No. 6 loss: 2.227481
No. 7 loss: 2.108526
No. 8 loss: 2.254484
No. 9 loss: 2.536439
```
1.2.4 zero_grad
- Before backpropagation computes new gradients, the gradients recorded in the previous iteration must be cleared. When the argument `set_to_none` is set to `True`, the parameter gradients are set to `None` instead of being zeroed, which reduces memory usage; however, this is usually not recommended, because PyTorch treats a gradient of `None` differently from a gradient of `0` (the behaviour is demonstrated after the source code below).
```python
def zero_grad(self, set_to_none: bool = False):
    r"""Sets the gradients of all optimized :class:`torch.Tensor` s to zero.

    Arguments:
        set_to_none (bool): instead of setting to zero, set the grads to None.
            This will in general have lower memory footprint, and can modestly improve performance.
            However, it changes certain behaviors. For example:
            1. When the user tries to access a gradient and perform manual ops on it,
            a None attribute or a Tensor full of 0s will behave differently.
            2. If the user requests ``zero_grad(set_to_none=True)`` followed by a backward pass, ``.grad``s
            are guaranteed to be None for params that did not receive a gradient.
            3. ``torch.optim`` optimizers have a different behavior if the gradient is 0 or None
            (in one case it does the step with a gradient of 0 and in the other it skips
            the step altogether).
    """
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                if set_to_none:
                    p.grad = None
                else:
                    if p.grad.grad_fn is not None:
                        p.grad.detach_()
                    else:
                        p.grad.requires_grad_(False)
                    p.grad.zero_()
```
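The difference between the two modes can be observed directly on a single parameter; a small sketch (passing `set_to_none` explicitly so the behaviour matches the signature shown above regardless of the PyTorch version's default):

```python
import torch
from torch.optim import SGD

p = torch.tensor([1.0], requires_grad=True)
opt = SGD([p], lr=0.1)

(p * 3).sum().backward()
print(p.grad)                     # tensor([3.])

opt.zero_grad(set_to_none=False)  # the gradient tensor is kept but filled with zeros
print(p.grad)                     # tensor([0.])

(p * 3).sum().backward()
opt.zero_grad(set_to_none=True)   # the gradient tensor is released entirely
print(p.grad)                     # None
```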
1.2.5 state_dict() and load_state_dict()
These two methods implement serialization and deserialization.
- state_dict(): returns the parameters and state managed by the optimizer as a dict
- load_state_dict(state_dict): loads a dict previously returned by state_dict() and restores the parameters and their state
- Together they make it possible to resume training after an interruption (see the checkpoint sketch after the source code below)
```python
def state_dict(self):
    r"""Returns the state of the optimizer as a :class:`dict`.

    It contains two entries:

    * state - a dict holding current optimization state. Its content
        differs between optimizer classes.
    * param_groups - a dict containing all parameter groups
    """
    # Save order indices instead of Tensors
    param_mappings = {}
    start_index = 0

    def pack_group(group):
        nonlocal start_index
        packed = {k: v for k, v in group.items() if k != 'params'}
        param_mappings.update({id(p): i for i, p in enumerate(group['params'], start_index)
                               if id(p) not in param_mappings})
        packed['params'] = [param_mappings[id(p)] for p in group['params']]
        start_index += len(packed['params'])
        return packed

    param_groups = [pack_group(g) for g in self.param_groups]
    # Remap state to use order indices as keys
    packed_state = {(param_mappings[id(k)] if isinstance(k, torch.Tensor) else k): v
                    for k, v in self.state.items()}
    return {
        'state': packed_state,
        'param_groups': param_groups,
    }
```
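A typical use of these two methods is checkpointing: save both the model's and the optimizer's state_dict together, then load them back when resuming training. A minimal sketch reusing `dummy_model` and `optimizer` from the closure example above (the file name is arbitrary):

```python
import torch

# save a checkpoint
checkpoint = {
    'model': dummy_model.state_dict(),
    'optimizer': optimizer.state_dict(),
}
torch.save(checkpoint, 'checkpoint.pth')

# ... later, resume from the checkpoint
checkpoint = torch.load('checkpoint.pth')
dummy_model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
```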