类ChatGPT项目的部署与微调(中):ChatLLaMA和ColossalChat

时间:2022-09-01 01:28:58

第四部分 LLaMA的RLHF版:ChatLLaMA和ColossalChat

4.1 ChatLLaMA(英文版):类似SFT、RM、RL/PPO训练三步骤

由于LLaMA没有使用RLHF方法,初创公司 Nebuly AI开源了RLHF版的LLaMA,即ChatLLaMA

其训练过程类似 ChatGPT,而通过本博客内的《ChatGPT技术原理解析》3.1节可知,训练三个模型(SFT、RM、RL/PPO)得先准备三套数据集(这三套数据的格式校验草图见下文列表之后)

  1. actor_training_data,即用于监督微调 actor 模型所用的数据(类比 ChatGPT 三步骤中用于微调 GPT3 的 SFT 数据),比如
    [
      {
          "user_input": "here the input of the user",
          "completion": "here the model completion"
      }
    ]


    actor_training_data如何而来呢?有以下 4 种途径
    ①使用 100% 合成数据,可以通过运行以下命令生成数据集:
    python artifacts/generate_actor_dataset.py
    注:此命令需要调用 OpenAI 的付费 API,用 davinci-003 生成完整数据集的成本约为 200 美元(当然,也有免费的途径)

    ②使用带有助手交互的开源数据集之一,目前支持:
    Anthropic HH RLHF:该数据集由结构化的 {question/answer pairs} 组成,每个问题都带有被选中(chosen)和被拒绝(rejected)的两个回答;
    Stanford Human Preferences Dataset (SHP):该数据集取自精选的“提问类”subreddit,包含话题广泛的 {question/answer pairs},其答案依据获得支持最多的回复筛选得到

    可以运行以下命令下载数据集:
    python artifacts/download_dataset.py <dataset_name> --path <path_to_folder_for_download> --number_of_samples <N>
    其中:
    <dataset_name>:StanfordNLP/SHP 数据集对应“SHP”,Anthropic/hh-rlhf 数据集对应“ARLHF”;
    <path_to_folder_for_download>是要创建数据集的文件夹路径;
    <N>是组成 reward_dataset.json 的样本数

    ③使用 100% 个性化数据集
    用户提供自己的个性化完整数据集,数据集必须是具有以下格式的 JSON 文件:

    [
        {
            "user_input": "here the input of the user",
            "completion": "here the model completion"
        }
    ]

    其中列表包含多个 dictionary,每个 dictionary 对应一个数据样本,建议使用超过 1000 个数据样本对 actor 进行训练
     

    ④混合方式:用户先提供少量自定义的提示+响应示例(少量即可,约 10 个),再在此基础上合成生成完整数据集
  2. reward_training_data,用于训练一个奖励模型的数据,包含三部分的数据: 
    i) prompts,
    ii) completion
    iii) score:根据用户反馈给 completion 打的分数(即 RLHF 中的 Human Feedback)

    示例如下
    [{
        "user_input": "...",
        "completion": "...",
        "score": 1
    },
        ...
    ]


    同样的,奖励数据怎么来呢?有以下三种方式
    ①用 LLM 代替人工反馈进行合成打分(be synthetically scored using a LLM as Human Feedback)
    即用一个 LLM 模型为每个 entry 计算分数
    为此,LLM 需要一个提示模板,其中包含评估生成文本的全部说明(比如奖励规则:什么情况下该给分、什么情况下不该给分,都得十分明确)。您应该将 key reward 添加到 templates.json 文件中,比如:
    {

        "reward": "Here is the template for the reward model. The rules are:\n\n1.Rule 1\n\n2. Rule 2"
    }

    如果未提供模板,则使用 artifacts/generate_rewards.py 中的默认模板。注:所有模板都必须保存在名为 templates.json 的 JSON 文件中

    获得 unlabelled dataset 后,您可以通过运行以下命令生成分数:

    python artifacts/generate_rewards.py <dataset_path> --model <model_to_use> --temperature <t> --max_tokens <n> --reward_template <path_to_file.json>
    其中,<dataset_path>是待评分的 reward dataset 的路径;
    <model_to_use>是用于打分的模型,默认建议使用 text-davinci-003;
    <temperature>是打分时采样所用的 temperature,建议设为 0.1;
    <max_tokens>是打分时允许生成的最大 token 数;
    <reward_template>是包含奖励模板的文件路径,如果未提供,将使用默认模板

    ②用户提供他们个性化的完整数据集(至少需要 100 个数据样本),数据集必须是以下格式的 JSON 文件,取名为 reward_training_data.json:
    [
        {
            "user_input": "here type the user input",
            "completion": "here type the completion",
            "score": 4.0
        },
        {
            "user_input": "here type the user input",
            "completion": "random garbage",
            "score": 0.0
        }
    ]
    ③用户提供少量示例,再通过 LLM 合成扩展数据集(通过 self-instruct 的方式提示 LLM 产生更多所需的指令数据)
  3. rlhf_training_data,通过RL方法不断优化迭代最优策略的数据
    它可以通过以下 2 种方式提供(It can be provided in 2 different ways):
    ①用户提供少量示例,再用 LLM 合成扩展数据集(依然可以继续通过 self-instruct 的方式提示 LLM 产生更多所需的指令数据)
    需要将 key rlhf 添加到 templates.json 文件中,其中包含要执行任务的说明,以及供 LLM 生成所用的额外上下文。下面是模板示例(所有模板都必须保存在名为 templates.json 的文件中):

    {
      "rlhf": "Here is the template for the generating RLHF prompts. The task we want to perform is ..."
    }

    ②用户直接提供与模型可能交互的完整数据集(The user provides the full dataset with possible interactions with the model)
    数据集需要包含超过 1000 个提示示例(文件命名为rlhf_training_data.json):

    [
        {
            "user_input": "here the example of user input"
        }
    ]
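
上述三套数据集本质上都是由 dictionary 组成的 JSON 列表。正式训练前,可以先对文件做一次简单的格式校验,下面是一个最小的校验草图(仅用 Python 标准库实现,字段名按上文约定,文件名属于笔者的假设,并非 ChatLLaMA 官方脚本):

import json

# 每套数据集要求包含的字段(score 仅奖励数据需要);文件名此处为假设,请按实际命名修改
REQUIRED_KEYS = {
    "actor_training_data.json": {"user_input", "completion"},
    "reward_training_data.json": {"user_input", "completion", "score"},
    "rlhf_training_data.json": {"user_input"},
}

def check_dataset(path: str, required: set) -> None:
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    assert isinstance(data, list), f"{path} 必须是一个 JSON 列表"
    for i, sample in enumerate(data):
        missing = required - set(sample)
        assert not missing, f"{path} 第 {i} 条样本缺少字段: {missing}"
    print(f"{path}: 共 {len(data)} 条样本,格式 OK")

if __name__ == "__main__":
    for name, keys in REQUIRED_KEYS.items():
        check_dataset(name, keys)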

以下是其主函数的代码,代码结构还是很清晰的

import argparse

from chatllama.rlhf.actor import ActorTrainer
from chatllama.rlhf.config import Config
from chatllama.rlhf.dataset import BaseDataset
from chatllama.rlhf.reward import RewardTrainer
from chatllama.rlhf.trainer import RLTrainer


# Setup argument parser
parser = argparse.ArgumentParser(
    prog="main.py", description="RLHF Training of ChatBots"
)

parser.add_argument("configfile", help="Path to config.yaml file")
parser.add_argument(
    "-t",
    "--type",
    help=(
        "Specify the training type. RL: Training of the model using RL."
        "ACTOR: Training of the actor model. "
        "REWARD: Training of the reward model."
        "RL: The whole pipeline with the three training steps"
    ),
    default="ALL",
    choices=["ALL", "RL", "ACTOR", "REWARD"],
)
parser.add_argument(
    "-a", "--actor", help="Specify actor model by name", default=None
)
parser.add_argument(
    "-r", "--reward", help="Specify reward model by name", default=None
)

# parse arguments
args = parser.parse_args()

# load config.yaml with all the project informations
config = Config(args.configfile)

# overwrite config if specified differently
if args.actor is not None:
    config.actor.model = args.actor
if args.reward is not None:
    config.reward.model = args.reward

# perform the desired training
if args.type == "RL":
    max_seq = min(
        config.actor.max_sequence_length,
        config.reward.max_sequence_length,
        config.critic.max_sequence_length,
    )
    config.actor.max_sequence_length = max_seq
    BaseDataset.clean_dataset(config)
    rlhf_trainer = RLTrainer(config)
    rlhf_trainer.train()
elif args.type == "ACTOR":
    BaseDataset.clean_dataset(config.actor)
    actor_trainer = ActorTrainer(config.actor)
    actor_trainer.train()
elif args.type == "REWARD":
    BaseDataset.clean_dataset(config.reward)
    reward_trainer = RewardTrainer(config.reward)
    reward_trainer.train()
elif args.type == "ALL":
    reward_trainer = RewardTrainer(config.reward)
    reward_trainer.train()
    actor_trainer = ActorTrainer(config.actor)
    actor_trainer.train()
    rlhf_trainer = RLTrainer(config)
    rlhf_trainer.train()
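
准备好 config.yaml 和上述三套数据集后,即可按需要选择训练阶段运行该脚本,比如(main.py 与 config.yaml 的具体路径以实际仓库为准):

# 按 REWARD -> ACTOR -> RL 的顺序跑完整个流水线
python main.py path/to/config.yaml --type ALL

# 只训练奖励模型,并临时指定奖励模型名(模型名仅为示意)
python main.py path/to/config.yaml --type REWARD --reward <reward_model_name>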

4.2 ColossalChat:通过self-instruct技术指令微调LLaMA且加上RLHF

据介绍(见其官方介绍页面及代码仓库),Colossal-AI 开源了基于 LLaMA-7B 模型、包含完整 RLHF 流程的类 ChatGPT 模型复现方案 ColossalChat

  • 关于数据集:包含 10.4 万条问答的中、英双语数据集(数据已开源)
    该数据集收集并清洗了社交平台上人们的真实提问场景作为种子数据集,并利用 self-instruct 技术(通过 prompt OpenAI API)扩充数据,整个标注过程花费约 900 美元
    对比其他 self-instruct 方法生成的数据集,该数据集的种子数据更加真实、丰富,生成的数据涵盖的话题也更多。该数据可以同时用于微调和 RLHF 训练;凭借高质量的数据,ColossalChat 能够进行更好的对话交互,同时支持中文


  • 关于训练方式:类似 instructGPT/ChatGPT 的训练三步骤(如果忘了,务必复习下此文的3.1节)
    Stage1 是 supervised fine-tuning,即使用上文提到的数据集进行监督微调
    Stage2 是训练一个奖励模型(初始化为阶段1的SFT模型):让模型对同一个 prompt 生成不同输出并进行人工排序,再根据排序结果监督训练出一个奖励模型
    Stage3 是利用阶段2训练出的奖励模型,通过 RL 微调出最终模型;微调过程中用 PPO 算法限制策略的更新范围(以阶段1的 SFT 模型的策略为参考基准,避免与该基准策略偏离过远)

    具体而言,这一 PPO 训练过程分为两个部分交替进行:


    如训练流程图底部所示,首先是 Make Experience 部分:利用 SFT、Actor、RM、Critic 模型计算生成 Experience 并存入 buffer 中;之后是参数更新部分,利用 Experience 计算价值损失(value loss),形式类似

    $L^{value}=\mathbb{E}_t\Big[\max\big((V_\theta(s_t)-R_t)^2,\ (\mathrm{clip}(V_\theta(s_t),V_t^{old}-\varepsilon_v,V_t^{old}+\varepsilon_v)-R_t)^2\big)\Big]$

    和策略损失(policy loss),形式类似

    $L^{policy}=-\mathbb{E}_t\Big[\min\big(r_t(\theta)\hat A_t,\ \mathrm{clip}(r_t(\theta),1-\varepsilon,1+\varepsilon)\hat A_t\big)\Big],\quad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{old}}(a_t\mid s_t)}$

    如流程图顶部所示的则是 PTX 部分(即 InstructGPT 目标函数中加在最后的预训练损失项):
    ColossalChat 计算 Actor 当前输出的 response 与预训练语料中回答部分的交叉熵损失(calculates the cross-entropy loss between the Actor’s output response and the response part of the input corpus),
    用来在 PPO 梯度中加入预训练梯度(add pre-training gradients to the PPO gradient),
    以保持语言模型(比如 GPT2)原有的核心性能、防止遗忘(maintain the language model’s original performance and prevent forgetting),即不忘最早是从哪里出发的(GPT2 → SFT → RM → RLHF)

    最后将策略损失、价值损失和 PTX 损失加和(the policy loss, value loss, and PTX loss are summed up),进行反向传播和参数更新
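
    为便于与后文 coati 代码中的 PolicyLoss / ValueLoss 对照,下面给出带裁剪(clip)的策略损失与价值损失的一个最小 PyTorch 草图(函数名与签名为笔者假设,仅作示意,并非 coati 的原始实现):

    import torch

    def policy_loss(log_probs, old_log_probs, advantages, eps_clip=0.2):
        # r_t = pi_theta / pi_old,由新旧 log 概率之差取指数得到
        ratio = (log_probs - old_log_probs).exp()
        surr1 = ratio * advantages
        surr2 = ratio.clamp(1 - eps_clip, 1 + eps_clip) * advantages
        # PPO-clip:取两项中较小者再取负,作为需要最小化的策略损失
        return -torch.min(surr1, surr2).mean()

    def value_loss(values, old_values, returns, value_clip=0.4):
        # 对 critic 的新输出做裁剪,再取两种平方误差中较大者,防止价值函数更新过猛
        values_clipped = old_values + (values - old_values).clamp(-value_clip, value_clip)
        surr1 = (values - returns) ** 2
        surr2 = (values_clipped - returns) ** 2
        return torch.max(surr1, surr2).mean()
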
  • 关于代码实现
    首先通过ColossalAI/applications/Chat/coati/trainer/sft.py,训练一个SFT模型
    import math
    import time
    from abc import ABC
    from typing import Optional
    
    import loralib as lora
    import torch
    import torch.distributed as dist
    import wandb
    from coati.models.loss import GPTLMLoss
    from torch import nn
    from torch.optim import Adam, Optimizer
    from torch.optim.lr_scheduler import LambdaLR
    from torch.utils.data import DataLoader
    from torch.utils.data.distributed import DistributedSampler
    from tqdm import tqdm
    from transformers.tokenization_utils_base import PreTrainedTokenizerBase
    from transformers.trainer import get_scheduler
    
    from colossalai.logging import get_dist_logger
    
    from .strategies import Strategy
    from .utils import is_rank_0
    
    
    class SFTTrainer(ABC):
        """
            Trainer to use while training the SFT (supervised fine-tuning) model.
    
        Args:
            model (torch.nn.Module): the model to train
            strategy (Strategy): the strategy to use for training
            optim(Optimizer): the optimizer to use for training
            train_dataloader: the dataloader to use for training
            eval_dataloader: the dataloader to use for evaluation
            batch_size (int, defaults to 1): the batch size while training
            max_epochs (int, defaults to 2): the number of epochs to train
            optim_kwargs (dict, defaults to {'lr':1e-4}): the kwargs to use while initializing optimizer
        """
    
        def __init__(
            self,
            model,
            strategy: Strategy,
            optim: Optimizer,
            train_dataloader: DataLoader,
            eval_dataloader: DataLoader = None,
            batch_size: int = 1,
            max_epochs: int = 2,
            accimulation_steps: int = 8,
        ) -> None:
            super().__init__()
            self.strategy = strategy
            self.epochs = max_epochs
            self.train_dataloader = train_dataloader
            self.eval_dataloader = eval_dataloader
    
            self.model = strategy.setup_model(model)
            if "DDP" in str(self.strategy):
                self.model = self.model.module
            self.optimizer = strategy.setup_optimizer(optim, self.model)
    
            self.accimulation_steps = accimulation_steps
            num_update_steps_per_epoch = len(train_dataloader) // self.accimulation_steps
            max_steps = math.ceil(self.epochs * num_update_steps_per_epoch)
    
            self.scheduler = get_scheduler("cosine",
                                           self.optimizer,
                                           num_warmup_steps=math.ceil(max_steps * 0.03),
                                           num_training_steps=max_steps)
    
        def fit(self, logger, log_interval=10):
            wandb.init(project="Coati", name=time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
            wandb.watch(self.model)
            total_loss = 0
            # epoch_bar = tqdm(range(self.epochs), desc='Epochs', disable=not is_rank_0())
            step_bar = tqdm(range(len(self.train_dataloader) // self.accimulation_steps * self.epochs),
                            desc=f'steps',
                            disable=not is_rank_0())
            for epoch in range(self.epochs):
    
                # process_bar = tqdm(range(len(self.train_dataloader)), desc=f'Train process for{epoch}', disable=not is_rank_0())
                # train
                self.model.train()
                for batch_id, batch in enumerate(self.train_dataloader):
    
                    prompt_ids = batch["input_ids"].to(torch.cuda.current_device())
                    p_mask = batch["attention_mask"].to(torch.cuda.current_device())
                    labels = batch["labels"].to(torch.cuda.current_device())
                    # prompt_ids = prompt_ids.squeeze(1).cuda()
                    # p_mask = p_mask.squeeze(1).cuda()
                    # prompt_logits = self.model(prompt_ids, attention_mask=p_mask, labels=labels)
    
                    outputs = self.model(prompt_ids, attention_mask=p_mask, labels=labels)
    
                    loss = outputs.loss
                    prompt_logits = outputs.logits
    
                    if loss >= 2.5:
                        logger.warning(f"batch_id:{batch_id}, abnormal loss: {loss}")
    
                    loss = loss / self.accimulation_steps
    
                    self.strategy.backward(loss, self.model, self.optimizer)
    
                    total_loss += loss.item()
    
                    # gradient accumulation
                    if (batch_id + 1) % self.accimulation_steps == 0:
                        self.strategy.optimizer_step(self.optimizer)
                        self.optimizer.zero_grad()
                        self.scheduler.step()
                        wandb.log({
                            "loss": total_loss / self.accimulation_steps,
                            "lr": self.scheduler.get_last_lr()[0],
                            "epoch": epoch,
                            "batch_id": batch_id
                        })
                        total_loss = 0
                        step_bar.update()
    
                    # if batch_id % log_interval == 0:
                    # logger.info(f'Train Epoch {epoch}/{self.epochs} Batch {batch_id} Rank {dist.get_rank()} loss {loss.item()}')
                    # wandb.log({"loss": loss.item()})
    
                    # process_bar.update()
    
                # eval
                if self.eval_dataloader is not None:
                    self.model.eval()
                    with torch.no_grad():
                        loss_sum = 0
                        num_seen = 0
                        for batch in self.eval_dataloader:
                            prompt_ids = batch["input_ids"].to(torch.cuda.current_device())
                            p_mask = batch["attention_mask"].to(torch.cuda.current_device())
                            labels = batch["labels"].to(torch.cuda.current_device())
                            # prompt_ids = prompt_ids.squeeze(1).cuda()
                            # p_mask = p_mask.squeeze(1).cuda()
    
                            outputs = self.model(prompt_ids, attention_mask=p_mask, labels=labels)
                            loss = outputs.loss
                            # prompt_logits = outputs.logits
    
                            loss_sum += loss.item()
                            num_seen += prompt_ids.size(0)
    
                        loss_mean = loss_sum / num_seen
                        if dist.get_rank() == 0:
                            logger.info(f'Eval Epoch {epoch}/{self.epochs} loss {loss_mean}')
    
                # epoch_bar.update()
    
        def save_model(self,
                       path: str,
                       only_rank0: bool = False,
                       tokenizer: Optional[PreTrainedTokenizerBase] = None) -> None:
            self.strategy.save_model(model=self.model, path=path, only_rank0=only_rank0, tokenizer=tokenizer)
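
    上面 SFTTrainer 训练循环的核心是“梯度累积 + 3% warmup 的 cosine 学习率调度”。剥离分布式 strategy 与 wandb 之后,其结构大致等价于下面这个纯 PyTorch/transformers 草图(模型与数据均为随机占位,属于笔者的示意,并非 coati 代码):

    import math
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import get_scheduler

    # 用一个线性层和随机数据代替真实的 LLaMA 模型与 SFT 数据,只演示训练循环的结构
    model = torch.nn.Linear(16, 16)
    dataset = TensorDataset(torch.randn(64, 16), torch.randn(64, 16))
    dataloader = DataLoader(dataset, batch_size=4)

    epochs, accumulation_steps = 2, 8
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    max_steps = math.ceil(epochs * (len(dataloader) // accumulation_steps))
    # 与 SFTTrainer 相同:3% warmup 的 cosine 学习率调度
    scheduler = get_scheduler("cosine", optimizer,
                              num_warmup_steps=math.ceil(max_steps * 0.03),
                              num_training_steps=max_steps)

    for epoch in range(epochs):
        for batch_id, (x, y) in enumerate(dataloader):
            loss = torch.nn.functional.mse_loss(model(x), y)
            (loss / accumulation_steps).backward()        # 累积梯度
            if (batch_id + 1) % accumulation_steps == 0:  # 每 accumulation_steps 个 batch 更新一次参数
                optimizer.step()
                optimizer.zero_grad()
                scheduler.step()
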
    其次,通过ColossalAI/applications/Chat/coati/trainer/rm.py 训练一个奖励模型
    from abc import ABC
    from datetime import datetime
    from typing import Optional
    
    import pandas as pd
    import torch
    import torch.distributed as dist
    from torch.optim import Optimizer, lr_scheduler
    from torch.utils.data import DataLoader, Dataset, DistributedSampler
    from tqdm import tqdm
    from transformers.tokenization_utils_base import PreTrainedTokenizerBase
    
    from .strategies import Strategy
    from .utils import is_rank_0
    
    
    class RewardModelTrainer(ABC):
        """
            Trainer to use while training reward model.
    
        Args:
            model (torch.nn.Module): the model to train
            strategy (Strategy): the strategy to use for training
            optim(Optimizer): the optimizer to use for training
            loss_fn (callable): the loss function to use for training
            train_dataset (Dataset): the dataset to use for training
            valid_dataset (Dataset): the dataset to use for validation
            eval_dataset (Dataset): the dataset to use for evaluation
            batch_size (int, defaults to 1): the batch size while training
            max_epochs (int, defaults to 2): the number of epochs to train
        """
    
        def __init__(
            self,
            model,
            strategy: Strategy,
            optim: Optimizer,
            loss_fn,
            train_dataset: Dataset,
            valid_dataset: Dataset,
            eval_dataset: Dataset,
            batch_size: int = 1,
            max_epochs: int = 1,
        ) -> None:
            super().__init__()
            self.strategy = strategy
            self.epochs = max_epochs
            train_sampler = None
    
            if dist.is_initialized() and dist.get_world_size() > 1:
                train_sampler = DistributedSampler(train_dataset, shuffle=True, seed=42, drop_last=True)
            self.train_dataloader = DataLoader(train_dataset,
                                               shuffle=(train_sampler is None),
                                               sampler=train_sampler,
                                               batch_size=batch_size)
            self.valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=True)
            self.eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size, shuffle=True)
    
            self.model = strategy.setup_model(model)
            self.loss_fn = loss_fn
            self.optimizer = strategy.setup_optimizer(optim, self.model)
            self.scheduler = lr_scheduler.CosineAnnealingLR(self.optimizer, self.train_dataloader.__len__() // 100)
    
        def eval_acc(self, dataloader):
            dist = 0
            on = 0
            cnt = 0
            self.model.eval()
            with torch.no_grad():
                for chosen_ids, c_mask, reject_ids, r_mask in dataloader:
                    chosen_ids = chosen_ids.squeeze(1).to(torch.cuda.current_device())
                    c_mask = c_mask.squeeze(1).to(torch.cuda.current_device())
                    reject_ids = reject_ids.squeeze(1).to(torch.cuda.current_device())
                    r_mask = r_mask.squeeze(1).to(torch.cuda.current_device())
                    chosen_reward = self.model(chosen_ids, attention_mask=c_mask)
                    reject_reward = self.model(reject_ids, attention_mask=r_mask)
                    for i in range(len(chosen_reward)):
                        cnt += 1
                        if chosen_reward[i] > reject_reward[i]:
                            on += 1
                    dist += (chosen_reward - reject_reward).mean().item()
                dist_mean = dist / len(dataloader)
                acc = on / cnt
            self.model.train()
            return dist_mean, acc
    
        def fit(self):
            time = datetime.now()
            epoch_bar = tqdm(range(self.epochs), desc='Train epoch', disable=not is_rank_0())
            for epoch in range(self.epochs):
                step_bar = tqdm(range(self.train_dataloader.__len__()),
                                desc='Train step of epoch %d' % epoch,
                                disable=not is_rank_0())
                # train
                self.model.train()
                cnt = 0
                acc = 0
                dist = 0
                for chosen_ids, c_mask, reject_ids, r_mask in self.train_dataloader:
                    chosen_ids = chosen_ids.squeeze(1).to(torch.cuda.current_device())
                    c_mask = c_mask.squeeze(1).to(torch.cuda.current_device())
                    reject_ids = reject_ids.squeeze(1).to(torch.cuda.current_device())
                    r_mask = r_mask.squeeze(1).to(torch.cuda.current_device())
                    chosen_reward = self.model(chosen_ids, attention_mask=c_mask)
                    reject_reward = self.model(reject_ids, attention_mask=r_mask)
                    loss = self.loss_fn(chosen_reward, reject_reward)
                    self.strategy.backward(loss, self.model, self.optimizer)
                    self.strategy.optimizer_step(self.optimizer)
                    self.optimizer.zero_grad()
                    cnt += 1
                    if cnt == 100:
                        self.scheduler.step()
                        dist, acc = self.eval_acc(self.valid_dataloader)
                        cnt = 0
                        if is_rank_0():
                            log = pd.DataFrame([[step_bar.n, loss.item(), dist, acc]],
                                               columns=['step', 'loss', 'dist', 'acc'])
                            log.to_csv('log_%s.csv' % time, mode='a', header=False, index=False)
                    step_bar.update()
                    step_bar.set_postfix({'dist': dist, 'acc': acc})
    
                # eval
                dist, acc = self.eval_acc(self.eval_dataloader)
                if is_rank_0():
                    log = pd.DataFrame([[step_bar.n, loss.item(), dist, acc]], columns=['step', 'loss', 'dist', 'acc'])
                    log.to_csv('log.csv', mode='a', header=False, index=False)
                epoch_bar.update()
                step_bar.set_postfix({'dist': dist, 'acc': acc})
                step_bar.close()
    
        def save_model(self,
                       path: str,
                       only_rank0: bool = False,
                       tokenizer: Optional[PreTrainedTokenizerBase] = None) -> None:
            self.strategy.save_model(model=self.model, path=path, only_rank0=only_rank0, tokenizer=tokenizer)
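
    上面 RewardModelTrainer 中的 loss_fn 是从外部传入的;在这类 RLHF 实现中,常见的选择是“被选中回答的得分应高于被拒绝回答”的 pairwise ranking loss。下面是一个最小草图(笔者的示意写法,并非 coati 中 loss 的原始实现):

    import torch
    import torch.nn.functional as F

    def pairwise_ranking_loss(chosen_reward: torch.Tensor, reject_reward: torch.Tensor) -> torch.Tensor:
        # 鼓励 chosen 的得分高于 rejected:最小化 -log sigmoid(r_chosen - r_reject)
        return -F.logsigmoid(chosen_reward - reject_reward).mean()

    # 用法示意:chosen_reward / reject_reward 即上文模型对 chosen_ids / reject_ids 的打分
    loss = pairwise_ranking_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.1, 0.5]))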

    最后,通过 ColossalAI/applications/Chat/coati/trainer/ppo.py 进行 PPO 训练
    from typing import Any, Callable, Dict, List, Optional
    
    import torch
    import torch.nn as nn
    from coati.experience_maker import Experience, NaiveExperienceMaker
    from coati.models.base import Actor, Critic
    from coati.models.generation_utils import update_model_kwargs_fn
    from coati.models.loss import PolicyLoss, ValueLoss
    from coati.replay_buffer import NaiveReplayBuffer
    from torch.optim import Optimizer
    from transformers.tokenization_utils_base import PreTrainedTokenizerBase
    
    from .base import Trainer
    from .callbacks import Callback
    from .strategies import Strategy
    
    
    class PPOTrainer(Trainer):
        """
            Trainer for PPO algorithm.
    
        Args:
            strategy (Strategy): the strategy to use for training
            actor (Actor): the actor model in ppo algorithm
            critic (Critic): the critic model in ppo algorithm
            reward_model (nn.Module): the reward model in rlhf algorithm to make reward of sentences
            initial_model (Actor): the initial model in rlhf algorithm to generate reference logits to limit the update of actor
            actor_optim (Optimizer): the optimizer to use for actor model
            critic_optim (Optimizer): the optimizer to use for critic model
            kl_coef (float, defaults to 0.1): the coefficient of kl divergence loss
            train_batch_size (int, defaults to 8): the batch size to use for training
            buffer_limit (int, defaults to 0): the max_size limitation of replay buffer
            buffer_cpu_offload (bool, defaults to True): whether to offload replay buffer to cpu
            eps_clip (float, defaults to 0.2): the clip coefficient of policy loss
            value_clip (float, defaults to 0.4): the clip coefficient of value loss
            experience_batch_size (int, defaults to 8): the batch size to use for experience generation
            max_epochs (int, defaults to 1): the number of epochs of training process
            tokenizer (Callable, optional): the tokenizer to use for tokenizing the input
            sample_replay_buffer (bool, defaults to False): whether to sample from replay buffer
            dataloader_pin_memory (bool, defaults to True): whether to pin memory for data loader
            callbacks (List[Callback], defaults to []): the callbacks to call during training process
            generate_kwargs (dict, optional): the kwargs to use while model generating
        """
    
        def __init__(self,
                     strategy: Strategy,
                     actor: Actor,
                     critic: Critic,
                     reward_model: nn.Module,
                     initial_model: Actor,
                     actor_optim: Optimizer,
                     critic_optim: Optimizer,
                     kl_coef: float = 0.1,
                     ptx_coef: float = 0.9,
                     train_batch_size: int = 8,
                     buffer_limit: int = 0,
                     buffer_cpu_offload: bool = True,
                     eps_clip: float = 0.2,
                     value_clip: float = 0.4,
                     experience_batch_size: int = 8,
                     max_epochs: int = 1,
                     tokenizer: Optional[Callable[[Any], dict]] = None,
                     sample_replay_buffer: bool = False,
                     dataloader_pin_memory: bool = True,
                     callbacks: List[Callback] = [],
                     **generate_kwargs) -> None:
            experience_maker = NaiveExperienceMaker(actor, critic, reward_model, initial_model, kl_coef)
            replay_buffer = NaiveReplayBuffer(train_batch_size, buffer_limit, buffer_cpu_offload)
            generate_kwargs = _set_default_generate_kwargs(strategy, generate_kwargs, actor)
            super().__init__(strategy, experience_maker, replay_buffer, experience_batch_size, max_epochs, tokenizer,
                             sample_replay_buffer, dataloader_pin_memory, callbacks, **generate_kwargs)
            self.actor = actor
            self.critic = critic
    
            self.actor_loss_fn = PolicyLoss(eps_clip)
            self.critic_loss_fn = ValueLoss(value_clip)
            self.ptx_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
            self.ptx_coef = ptx_coef
            self.actor_optim = actor_optim
            self.critic_optim = critic_optim
    
        def training_step(self, experience: Experience) -> Dict[str, float]:
            self.actor.train()
            self.critic.train()
            # policy loss
            num_actions = experience.action_mask.size(1)
            action_log_probs = self.actor(experience.sequences, num_actions, attention_mask=experience.attention_mask)
            actor_loss = self.actor_loss_fn(action_log_probs,
                                            experience.action_log_probs,
                                            experience.advantages,
                                            action_mask=experience.action_mask)
    
            # ptx loss
            if self.ptx_coef != 0:
                # 同一个预训练 batch 只取一次,保证 input_ids / labels / attention_mask 彼此对齐
                batch = next(iter(self.pretrain_dataloader))
                ptx = batch['input_ids'].to(torch.cuda.current_device())
                label = batch['labels'].to(torch.cuda.current_device())[:, 1:]
                attention_mask = batch['attention_mask'].to(torch.cuda.current_device())
                ptx_log_probs = self.actor.get_base_model()(ptx, attention_mask=attention_mask)['logits'][..., :-1, :]
                ptx_loss = self.ptx_loss_fn(ptx_log_probs.view(-1, ptx_log_probs.size(-1)), label.view(-1))
                actor_loss = ptx_loss * self.ptx_coef + actor_loss * (1 - self.ptx_coef)
    
            self.strategy.backward(actor_loss, self.actor, self.actor_optim)
            self.strategy.optimizer_step(self.actor_optim)
            self.actor_optim.zero_grad()
    
            # value loss
            values = self.critic(experience.sequences,
                                 action_mask=experience.action_mask,
                                 attention_mask=experience.attention_mask)
            critic_loss = self.critic_loss_fn(values,
                                              experience.values,
                                              experience.reward,
                                              action_mask=experience.action_mask)
            self.strategy.backward(critic_loss, self.critic, self.critic_optim)
            self.strategy.optimizer_step(self.critic_optim)
            self.critic_optim.zero_grad()
    
            return {'reward': experience.reward.mean().item()}
    
    
        def save_model(self,
                       path: str,
                       only_rank0: bool = False,
                       tokenizer: Optional[PreTrainedTokenizerBase] = None) -> None:
            # 只保存 actor(即最终的策略模型)
            self.strategy.save_model(model=self.actor, path=path, only_rank0=only_rank0, tokenizer=tokenizer)
    
    
    def _set_default_generate_kwargs(strategy: Strategy, generate_kwargs: dict, actor: Actor) -> dict:
        origin_model = strategy._unwrap_actor(actor)
        new_kwargs = {**generate_kwargs}
        # use huggingface models method directly
        if 'prepare_inputs_fn' not in generate_kwargs and hasattr(origin_model, 'prepare_inputs_for_generation'):
            new_kwargs['prepare_inputs_fn'] = origin_model.prepare_inputs_for_generation
    
        if 'update_model_kwargs_fn' not in generate_kwargs:
            new_kwargs['update_model_kwargs_fn'] = update_model_kwargs_fn
    
        return new_kwargs

    在获得最终模型权重后,还可通过量化降低推理硬件成本,并启动在线推理服务,仅需单张约 4GB 显存的 GPU 即可完成 70 亿参数模型推理服务部署 
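
    ColossalChat 实际采用的量化推理方案以其仓库提供的推理脚本为准;作为通用示意,下面给出一个用 transformers + bitsandbytes 做 4bit 加载推理的最小草图(模型路径为占位,显存占用随模型与量化配置而变,这并非 ColossalChat 官方的量化实现):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_path = "path/to/your-7b-model"  # 占位:替换为实际训练得到的 7B 模型权重路径
    quant_config = BitsAndBytesConfig(load_in_4bit=True,
                                      bnb_4bit_compute_dtype=torch.float16)

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 quantization_config=quant_config,
                                                 device_map="auto")  # 需要安装 accelerate 与 bitsandbytes

    inputs = tokenizer("你好,请介绍一下你自己", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))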

更多请见下文:类ChatGPT项目的部署与微调(下):从ChatGLM-6b到ChatDoctor