Setting up data shuffling with DDP
When using DDP you need to pass a sampler to the DataLoader (torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False)). shuffle defaults to True, but look at how PyTorch's DistributedSampler is implemented:
def __iter__(self) -> Iterator[T_co]:
    if self.shuffle:
        # deterministically shuffle based on epoch and seed
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)
        indices = torch.randperm(len(self.dataset), generator=g).tolist()  # type: ignore
    else:
        indices = list(range(len(self.dataset)))  # type: ignore
The seed used to generate the random indices depends on the current epoch, so you have to set the epoch by hand during training to get a genuinely different shuffle each epoch:
for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)
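Putting the two pieces together, a minimal sketch (dataset, train, start_epoch and n_epochs are placeholders from the snippets above; the process group is assumed to be initialized already):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# the sampler does the shuffling, so the DataLoader itself must not
# shuffle (passing shuffle=True together with a sampler raises an error)
sampler = DistributedSampler(dataset, shuffle=True)
loader = DataLoader(dataset, batch_size=64, shuffle=False, sampler=sampler)

for epoch in range(start_epoch, n_epochs):
    sampler.set_epoch(epoch)  # re-seeds the shuffle for this epoch
    train(loader)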
Why a larger batch size can make results worse with DDP
Large batch size:
Theoretical advantage:
The influence of noise in the data may shrink, which may make it easier to approach the optimum.
Drawbacks and problems:
It lowers the variance of the gradients. (In theory, for convex optimization problems, lower gradient variance gives better optimization; in practice, however, Keskar et al. showed that increasing the batch size leads to worse generalization.)
For non-convex problems, the loss function has many local optima; with a small batch size the gradient noise may help training escape a local optimum, while with a large batch size training may get stuck in one.
Remedies:
Increase the learning rate. This can cause problems of its own, though: a very large learning rate right from the start of training may keep the model from converging (https://arxiv.org/abs/1609.04836).
Use warmup (https://arxiv.org/abs/1706.02677); that paper pairs warmup with the linear scaling rule, sketched below.
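As a rough sketch of the linear scaling rule from Goyal et al.: when the effective batch size grows by a factor of k, scale the base learning rate by k as well (all values below are illustrative placeholders):

reference_batch_size = 256   # batch size the base LR was tuned for
base_lr = 0.1                # LR known to work at that batch size
effective_batch_size = per_gpu_batch_size * num_gpus
lr = base_lr * effective_batch_size / reference_batch_size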
Warmup
Using a very large learning rate at the start of training can prevent convergence. The idea of warmup is to begin with a small learning rate, gradually increase it as training progresses until the base learning rate is reached, and then continue with some other decay schedule (e.g. CosineAnnealingLR).
# copy from https://github.com/ildoonet/pytorch-gradual-warmup-lr/blob/master/warmup_scheduler/scheduler.py
from torch.optim.lr_scheduler import _LRScheduler
from torch.optim.lr_scheduler import ReduceLROnPlateau

class GradualWarmupScheduler(_LRScheduler):
    """Gradually warm up (increase) the learning rate in the optimizer.
    Proposed in 'Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour'.

    Args:
        optimizer (Optimizer): Wrapped optimizer.
        multiplier: target learning rate = base lr * multiplier if multiplier > 1.0. If multiplier = 1.0, lr starts from 0 and ends up at base_lr.
        total_epoch: target learning rate is reached at total_epoch, gradually
        after_scheduler: after total_epoch, use this scheduler (e.g. ReduceLROnPlateau)
    """

    def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
        self.multiplier = multiplier
        if self.multiplier < 1.:
            raise ValueError('multiplier should be greater than or equal to 1.')
        self.total_epoch = total_epoch
        self.after_scheduler = after_scheduler
        self.finished = False
        super(GradualWarmupScheduler, self).__init__(optimizer)

    def get_lr(self):
        if self.last_epoch > self.total_epoch:
            if self.after_scheduler:
                if not self.finished:
                    self.after_scheduler.base_lrs = [base_lr * self.multiplier for base_lr in self.base_lrs]
                    self.finished = True
                return self.after_scheduler.get_last_lr()
            return [base_lr * self.multiplier for base_lr in self.base_lrs]

        if self.multiplier == 1.0:
            return [base_lr * (float(self.last_epoch) / self.total_epoch) for base_lr in self.base_lrs]
        else:
            return [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]

    def step_ReduceLROnPlateau(self, metrics, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
        self.last_epoch = epoch if epoch != 0 else 1  # ReduceLROnPlateau is called at the end of epoch, whereas others are called at beginning
        if self.last_epoch <= self.total_epoch:
            warmup_lr = [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
            for param_group, lr in zip(self.optimizer.param_groups, warmup_lr):
                param_group['lr'] = lr
        else:
            if epoch is None:
                self.after_scheduler.step(metrics, None)
            else:
                self.after_scheduler.step(metrics, epoch - self.total_epoch)

    def step(self, epoch=None, metrics=None):
        if type(self.after_scheduler) != ReduceLROnPlateau:
            if self.finished and self.after_scheduler:
                if epoch is None:
                    self.after_scheduler.step(None)
                else:
                    self.after_scheduler.step(epoch - self.total_epoch)
                self._last_lr = self.after_scheduler.get_last_lr()
            else:
                return super(GradualWarmupScheduler, self).step(epoch)
        else:
            self.step_ReduceLROnPlateau(metrics, epoch)
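A usage sketch along the lines of that repository's README (the optimizer settings and epoch counts are placeholders): warm up for 5 epochs to the base learning rate, then hand over to cosine annealing.

import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# with multiplier=1.0 the LR ramps from 0 up to the base LR (0.1) over the
# first 5 epochs, then CosineAnnealingLR takes over for the remaining epochs
scheduler_cosine = CosineAnnealingLR(optimizer, T_max=95)
scheduler = GradualWarmupScheduler(optimizer, multiplier=1.0, total_epoch=5,
                                   after_scheduler=scheduler_cosine)

for epoch in range(100):
    train_one_epoch(model, optimizer)  # placeholder training function
    scheduler.step()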
Pitfalls of multi-GPU distributed training with DistributedDataParallel
I recently spent some time looking into multi-GPU training. I expected it to be easy, but there turned out to be plenty of pitfalls to step through one by one. Distributed training generally comes in two flavors, single-machine multi-GPU and multi-machine multi-GPU, and there are two main ways to implement it:
1. DataParallel: Parameter Server mode, with one GPU acting as the reducer. It is extremely simple to use, a single line of code.
Because DataParallel is based on the parameter-server algorithm, load imbalance is a serious problem; with a large model (e.g. bert-large), the reducer GPU can use 3-4 GB more memory than the others.
2. DistributedDataParallel: the newer DDP, which is what the official docs recommend. It uses an all-reduce algorithm and was designed primarily for multi-machine multi-GPU training, but it also works on a single machine.
Why train distributed at all?
You can use multiple GPUs, so overall training is faster.
You can use a larger batch size.
Some distributed setups achieve better results.
The rest of this post covers:
Single machine, multi-GPU: DataParallel (most common, simplest)
Single machine, multi-GPU: DistributedDataParallel (more advanced); multi-machine, multi-GPU: DistributedDataParallel (*)
How to launch training
Saving and loading models
Caveats
1. Single machine, multi-GPU (DataParallel)
from torch.nn import DataParallel

device = torch.device("cuda")
# or: device = torch.device("cuda:0" if True else "cpu")
model = MyModel()
model = model.to(device)
model = DataParallel(model)
# or: model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
Quite simple: it only takes one extra line of code, model = DataParallel(model).
2. Multi-machine multi-GPU and single-machine multi-GPU (DistributedDataParallel)
Read the caveats at the end before modifying your code, to avoid mysterious bugs. Change the training code as follows. opt.local_rank must be parsed as a command-line argument early in the code; see the caveats section below.
from torch.utils.data.distributed import DistributedSampler
import torch.distributed as dist
import torch

# Initialize the process group
dist_backend = 'nccl'
print('args.local_rank: ', opt.local_rank)
torch.cuda.set_device(opt.local_rank)
dist.init_process_group(backend=dist_backend)

model = yourModel()  # your own model
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # wrap the model
    # model = torch.nn.parallel.DistributedDataParallel(model,
    #                                                   device_ids=[opt.local_rank],
    #                                                   output_device=opt.local_rank)
    model = torch.nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[opt.local_rank])
device = torch.device(opt.local_rank)
model.to(device)

# your own data-loading code
dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training,
                      img_size=opt.img_size, normalized_labels=True)
world_size = torch.cuda.device_count()
datasampler = DistributedSampler(dataset, num_replicas=dist.get_world_size(), rank=opt.local_rank)
# just add the sampler argument to your existing DataLoader
dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=opt.batch_size,
    shuffle=False,
    num_workers=opt.n_cpu,
    pin_memory=True,
    collate_fn=dataset.collate_fn,
    sampler=datasampler,
)

# .....
# during training, move the data to the GPU:
imgs = imgs.to(device)
targets = targets.to(device)
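One thing the snippet above glosses over: each process only sees its own shard of the data, so metrics computed locally differ from GPU to GPU. A minimal sketch for averaging a scalar across ranks for logging (average_across_ranks is a hypothetical helper; it assumes the process group initialized above):

import torch
import torch.distributed as dist

def average_across_ranks(value, device):
    """Average a Python scalar over all DDP processes (for logging only)."""
    t = torch.tensor([float(value)], device=device)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return (t / dist.get_world_size()).item()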
3. How to launch training
1. DataParallel
Just train as normal:
python3 train.py
2. DistributedDataParallel
Training must be launched through torch.distributed.launch. For a single node:
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 train.py
CUDA_VISIBLE_DEVICES selects the GPUs to use, and --nproc_per_node is the number of processes per node; usually you launch one process per GPU.
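On recent PyTorch releases, torch.distributed.launch has been deprecated in favor of torchrun, which takes the same style of arguments (e.g. torchrun --nproc_per_node=2 train.py) and exposes the local rank through the LOCAL_RANK environment variable rather than a --local_rank argument (see the caveats section).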
For multiple nodes:
python3 -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE --nnodes=2 --node_rank=0
# two nodes; this command is run on node 0
If training starts successfully, several messages are printed, one per GPU.
4. Saving and loading models
The a/b variants below pair up: if you save with method a, load with method a.
1. Saving
a. Save only the parameters:
torch.save(model.module.state_dict(), path)
b. Save the parameters together with the network:
torch.save(model.module, path)
2. Loading
a. Loading pretrained weights for multi-GPU training:
model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights))
    else:
        model.load_darknet_weights(opt.pretrained_weights)
When loading on a single GPU you have to specify the device to read the model onto. The 'cuda:0' here depends on whether the model was trained on device 0 or 1 (otherwise you get RuntimeError: Attempting to deserialize object on CUDA device 1 but torch.cuda.device_count() is 1. Please use torch.load with map_location to map your storages to an existing device). Adjust it to your own setup:
model = Yourmodel()
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights, map_location="cuda:0"))
    else:
        model.load_darknet_weights(opt.pretrained_weights)
b. Loading the whole model on a single GPU:
Again, the device to read the model onto must be specified.
model = torch.load(opt.weights_path, map_location="cuda:0")
I have not managed to get multi-GPU loading of a pretrained model to work with method b.
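A related pitfall worth knowing about: if a checkpoint was saved from the wrapped model (model.state_dict() rather than model.module.state_dict()), every key carries a module. prefix, and load_state_dict on an unwrapped model fails with missing/unexpected keys. A minimal sketch for stripping the prefix when loading (the file path and model are placeholders):

state_dict = torch.load(opt.pretrained_weights, map_location="cuda:0")
# drop the "module." prefix that DataParallel/DistributedDataParallel adds
state_dict = {k[len("module."):] if k.startswith("module.") else k: v
              for k, v in state_dict.items()}
model.load_state_dict(state_dict)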
5. Caveats
1. Add .module after model
After wrapping the network with a parallel method and moving the model and parameters to the GPU, note that whenever you need to access a submodule or a parameter of the model, you must add .module after model, otherwise you get an error. For example:
model.img_size must be changed to model.module.img_size
2. .cuda vs. .to(device) issues
device is something you set yourself; if .cuda raises an error, move things to the corresponding device instead (a sketch follows this list). This applies to:
the model (e.g. model.to(device))
the input (older code wraps it in Variable, e.g. input = Variable(input).to(device))
the target (likewise wrapped in Variable in older code)
nn.CrossEntropyLoss() (e.g. criterion = nn.CrossEntropyLoss().to(device))
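A compact sketch of the pattern as it looks on current PyTorch (MyModel and the loop variables are placeholders; since PyTorch 0.4 the Variable wrapper is a no-op and can simply be dropped):

import torch
import torch.nn as nn

device = torch.device("cuda", opt.local_rank)
model = MyModel().to(device)
criterion = nn.CrossEntropyLoss().to(device)

for input, target in dataloader:
    input = input.to(device)    # no Variable wrapping needed anymore
    target = target.to(device)
    loss = criterion(model(input), target)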
3. The args.local_rank argument
When training is launched through torch.distributed.launch, the launcher passes each process an args.local_rank argument, so the training code must parse it. You can also get the process id via torch.distributed.get_rank().
parser.add_argument("--local_rank", type=int, default=-1, help="local rank passed by torch.distributed.launch")
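Alternatively, with the newer launchers mentioned above (torchrun, or torch.distributed.launch --use_env), the local rank arrives as an environment variable rather than a command-line flag:

import os

# LOCAL_RANK is set by torchrun / --use_env; -1 here means "not distributed"
local_rank = int(os.environ.get("LOCAL_RANK", -1))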
The above is my personal experience; I hope it can serve as a useful reference.
Original post: https://blog.csdn.net/cuifan0814/article/details/115861782