Part 4  LLaMA with RLHF: ChatLLaMA and ColossalChat

4.1 ChatLLaMA (English version): three training steps analogous to SFT, RM, and RL/PPO

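LLaMA itself was not trained with RLHF, so the startup Nebuly AI open-sourced ChatLLaMA, an RLHF version of LLaMA.

Its training process resembles ChatGPT's. As section 3.1 of the article "ChatGPT技术原理解析" on this blog explains, training the three models (SFT, RM, RL/PPO) first requires three datasets.

actor_training_data: the data used for supervised fine-tuning of the actor (the counterpart of the data used to fine-tune GPT-3 in the ChatGPT pipeline), for example: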
[
{
"user_input": "here the input of the user",
"completion": "here the model completion"
}
]

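Where does actor_training_data come from? There are four options.
① Use 100% synthetic data. The dataset can be generated synthetically by running:
python artifacts/generate_actor_dataset.py
Note: this command requires paid OpenAI API access; generating the full dataset with davinci-003 costs about 200 USD (free alternatives also exist).

② Use one of the open-source datasets that contain assistant interactions. Currently supported:
Anthropic HH RLHF: structured {question/answer pairs} from conversations with a chatbot, including both the chosen and the rejected answers;
Stanford Human Preferences Dataset (SHP): curated from selected "ask" subreddits, it covers a wide range of questions as {question/answer pairs}, built from the most upvoted answers.

The datasets can be downloaded with the following command: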
```
python artifacts/download_dataset.py <dataset_name> --path <path_to_folder_for_download> --number_of_samples <N>
```
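where:
<dataset_name> is "SHP" for the StanfordNLP/SHP dataset or "ARLHF" for the Anthropic/hh-rlhf dataset;
<path_to_folder_for_download> is the path of the folder in which the dataset is created;
<N> is the number of samples that make up reward_dataset.json.

③ Use a 100% personalized dataset
The user provides a complete dataset of their own, which must be a JSON file in the following format: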
[
{
"user_input": "here the input of the user",
"completion": "here the model completion"
}
]
The list contains multiple dictionaries, one per data sample. More than 1000 data samples are recommended for training the actor.
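As a rough illustration of the format above, the following sketch checks the contents of a candidate file before training. The function name validate_actor_dataset and its min_samples parameter are hypothetical and not part of ChatLLaMA; only the two keys and the "more than 1000 samples" recommendation come from the description above.

```python
import json
from typing import Any, List


def validate_actor_dataset(records: Any, min_samples: int = 1000) -> List[str]:
    """Return a list of problems; an empty list means the data looks usable.

    Hypothetical helper for illustration, not part of ChatLLaMA.
    """
    if not isinstance(records, list):
        return ["top-level JSON value must be a list of dictionaries"]
    problems: List[str] = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"sample {i}: not a dictionary")
            continue
        for key in ("user_input", "completion"):
            value = rec.get(key)
            # Each key must be present and hold a non-blank string.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"sample {i}: missing or empty '{key}'")
    if len(records) <= min_samples:
        problems.append(
            f"only {len(records)} samples; more than {min_samples} are recommended"
        )
    return problems


raw = (
    '[{"user_input": "What is RLHF?", '
    '"completion": "Reinforcement learning from human feedback."}, '
    '{"user_input": "Hi"}]'
)
# min_samples is lowered here only so the tiny demo does not trip the size check.
issues = validate_actor_dataset(json.loads(raw), min_samples=1)
print(issues)
```

In this demo the second sample has no completion, so the returned list contains a single message pointing at sample 1.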

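④ Create the full dataset by augmenting a few custom data samples: it can be generated synthetically from a small number of prompt + response examples provided by the user (few => 10).

reward_training_data: the data used to train the reward model. Each entry has three parts:
i) prompts,
ii) completion
iii) score of the completion assigned accordingly to the user feedback (the Human Feedback in RLHF, i.e. a score for each answer)

An example: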
[{
   "user_input": "...",
   "completion": "...",
   "score": 1
},
   ...
]

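Likewise, where does the reward data come from? There are three ways.
1 be synthetically scored using a LLM as Human Feedback
An LLM computes the score for each entry.
To do so, the LLM needs a prompt template that contains all the instructions for evaluating the generated text (for example the reward rules, which must state precisely what should and should not be rewarded). Add the key reward to the templates.json file, for example: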
{
   "reward": "Here is the template for the reward model. The rules are:\n\n1.Rule 1\n\n2. Rule 2"
}
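If no template is provided, the default one in artifacts/generate_rewards.py is used. Note: all templates must be saved in a single JSON file named templates.json.

Once you have the unlabelled dataset, generate the scores by running: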
```
python artifacts/generate_rewards.py <dataset_path> --model <model_to_use> --temperature <t> --max_tokens <n> --reward_template <path_to_file.json>
```
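where <dataset_path> is the path of the reward dataset to be scored;
<model_to_use> is the model used for scoring; text-davinci-003 is the suggested default;
<temperature> is the temperature used when scoring, temperature = 0.1;
<max_tokens> is the max_tokens of the generation (the cap on generated tokens);
<reward_template> is the path to the file containing the template used to generate rewards; if no path is given, the default template is used.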
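To make the synthetic-scoring idea concrete, here is a minimal sketch of how a reward template could be combined with one entry and how a numeric score could be read back from the model's reply. It does not reproduce the logic of artifacts/generate_rewards.py: build_scoring_prompt, parse_score, score_dataset, the prompt layout, and the llm callable are all hypothetical, and fake_llm stands in for a real API call.

```python
import re
from typing import Callable, Dict, List, Optional


def build_scoring_prompt(template: str, entry: Dict[str, str]) -> str:
    """Append one unlabelled entry to the reward template (hypothetical layout)."""
    return (
        f"{template}\n\n"
        f"User input: {entry['user_input']}\n"
        f"Completion: {entry['completion']}\n"
        "Score:"
    )


def parse_score(reply: str) -> Optional[float]:
    """Return the first number found in the reply, or None if there is none."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None


def score_dataset(
    entries: List[Dict[str, str]],
    template: str,
    llm: Callable[[str], str],
) -> List[Dict[str, object]]:
    """Return copies of the entries with a 'score' key; unparsable replies are skipped."""
    scored: List[Dict[str, object]] = []
    for entry in entries:
        score = parse_score(llm(build_scoring_prompt(template, entry)))
        if score is not None:
            scored.append({**entry, "score": score})
    return scored


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: penalize the obviously bad completion.
    return "0" if "random garbage" in prompt else "4.0"


entries = [
    {"user_input": "What is PPO?", "completion": "A policy-gradient RL algorithm."},
    {"user_input": "What is PPO?", "completion": "random garbage"},
]
scored = score_dataset(entries, "Rate the completion from zero to five.", fake_llm)
print([item["score"] for item in scored])
```

Taking the first number in the reply is fragile when the model restates numbered rules before giving its verdict, so a real pipeline would likely constrain the output format more tightly.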

2 The user provides a complete personalized dataset (at least 100 data samples are required). It must be a JSON file in the following format, named reward_training_data.json:
```
[
    {
        "user_input": "here type the user input",
        "completion": "here type the completion",
        "score": 4.0
    },
    {
        "user_input": "here type the user input",
        "completion": "random garbage",
        "score": 0.0
    }
]
```
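3 A few examples provided by the user, with the dataset expanded synthetically by an LLM (prompting the LLM in the self-instruct manner to produce more of the required instruction data).

rlhf_training_data: the prompts on which the RL stage iteratively optimizes the policy.
It can be provided in 2 different ways:
① Few examples provided by the user and dataset synthetically expanded using LLM (again, self-instruct-style prompting can be used to have the LLM produce more instruction data)
Add the key rlhf to the templates.json file; it contains information about the task to perform and any extra context the LLM needs for generation. Here is an example template (all templates must be saved in a single file named templates.json):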
{
"rlhf": "Here is the template for the generating RLHF prompts. The task we want to perform is ..."
}

② The user provides the full dataset with possible interactions with the model
The dataset needs to contain more than 1000 prompt examples (in a file named rlhf_training_data.json):

[
{
"user_input": "here the example of user input"
}
]

Below is the code of its main script; the structure is quite clear.

import argparse

from chatllama.rlhf.actor import ActorTrainer
from chatllama.rlhf.config import Config
from chatllama.rlhf.dataset import BaseDataset
from chatllama.rlhf.reward import RewardTrainer
from chatllama.rlhf.trainer import RLTrainer


# Setup argument parser
parser = argparse.ArgumentParser(
    prog="main.py", description="RLHF Training of ChatBots"
)

parser.add_argument("configfile", help="Path to config.yaml file")
parser.add_argument(
    "-t",
    "--type",
    help=(
        "Specify the training type. RL: Training of the model using RL. "
        "ACTOR: Training of the actor model. "
        "REWARD: Training of the reward model. "
        "ALL: The whole pipeline with the three training steps"
    ),
    default="ALL",
    choices=["ALL", "RL", "ACTOR", "REWARD"],
)
parser.add_argument(
    "-a", "--actor", help="Specify actor model by name", default=None
)
parser.add_argument(
    "-r", "--reward", help="Specify reward model by name", default=None
)

# parse arguments
args = parser.parse_args()

# load config.yaml with all the project informations
config = Config(args.configfile)

# overwrite config if specified differently
if args.actor is not None:
    config.actor.model = args.actor
if args.reward is not None:
    config.reward.model = args.reward

# perform the desired training
if args.type == "RL":
    max_seq = min(
        config.actor.max_sequence_length,
        config.reward.max_sequence_length,
        config.critic.max_sequence_length,
    )
    config.actor.max_sequence_length = max_seq
    BaseDataset.clean_dataset(config)
    rlhf_trainer = RLTrainer(config)
    rlhf_trainer.train()
elif args.type == "ACTOR":
    BaseDataset.clean_dataset(config.actor)
    actor_trainer = ActorTrainer(config.actor)
    actor_trainer.train()
elif args.type == "REWARD":
    BaseDataset.clean_dataset(config.reward)
    reward_trainer = RewardTrainer(config.reward)
    reward_trainer.train()
elif args.type == "ALL":
    reward_trainer = RewardTrainer(config.reward)
    reward_trainer.train()
    actor_trainer = ActorTrainer(config.actor)
    actor_trainer.train()
    rlhf_trainer = RLTrainer(config)
    rlhf_trainer.train()
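The dispatch at the end of the script can be summarized as a mapping from the -t/--type value to the stages that run. The sketch below rebuilds only the argument parser and that mapping, so it runs without chatllama installed; STAGES, build_parser, and stages_for are names introduced here for illustration.

```python
import argparse
from typing import Dict, List

# Order of the stages executed for each training type, read off the if/elif chain above.
STAGES: Dict[str, List[str]] = {
    "ALL": ["REWARD", "ACTOR", "RL"],
    "RL": ["RL"],
    "ACTOR": ["ACTOR"],
    "REWARD": ["REWARD"],
}


def build_parser() -> argparse.ArgumentParser:
    """Recreate the command-line interface of main.py (help texts shortened)."""
    parser = argparse.ArgumentParser(prog="main.py", description="RLHF Training of ChatBots")
    parser.add_argument("configfile", help="Path to config.yaml file")
    parser.add_argument("-t", "--type", default="ALL", choices=list(STAGES))
    parser.add_argument("-a", "--actor", default=None, help="Specify actor model by name")
    parser.add_argument("-r", "--reward", default=None, help="Specify reward model by name")
    return parser


def stages_for(argv: List[str]) -> List[str]:
    """Return the training stages that a given command line would trigger."""
    args = build_parser().parse_args(argv)
    return STAGES[args.type]


print(stages_for(["config.yaml"]))
print(stages_for(["config.yaml", "-t", "ACTOR"]))
```

With the default type ALL, the reward model is trained before the actor, and the RL stage runs last.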

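4.2 ColossalChat: instruction-tuning LLaMA with self-instruct data, plus RLHF

According to the project's introduction, Colossal-AI open-sourced ColossalChat, a reproduction of a Chat-style model that is based on LLaMA-7B and includes the complete RLHF pipeline.

About the dataset: a bilingual Chinese-English dataset of 104K question-answer pairs, which has also been open-sourced.
It uses real questions that people asked on social platforms, collected and cleaned, as the seed dataset, and expands them with the self-instruct technique (by prompting the OpenAI API), at an annotation cost of about 900 USD.
Compared with datasets generated by other self-instruct methods, its seed data is more realistic and varied, and the generated data covers more topics. The data can be used for both fine-tuning and RLHF training. With this higher-quality data, ColossalChat holds better conversations and also supports Chinese.
About the training method: the same three steps as InstructGPT/ChatGPT (if you have forgotten them, be sure to review section 3.1 of the article mentioned above).
Stage1 is supervised fine-tuning (SFT) on the dataset described above.
Stage2 trains a reward model (initialized from the Stage1 SFT model): human annotators rank different outputs of the model for the same prompt, and the rankings supervise the training of the reward model.
Stage3 uses the reward model from Stage2 to fine-tune an RL model. During fine-tuning, PPO limits how far each update can move the RL model's parameters (the Stage1 SFT policy serves as the reference, and the RL policy is kept from drifting too far from it).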
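One common way to keep the RL policy close to the SFT reference is to subtract a KL penalty from the reward-model score. The sketch below illustrates that idea in plain Python. The function kl_penalized_reward and its simple KL estimator are assumptions of this sketch, not ColossalChat's exact code; kl_coef mirrors the trainer parameter of the same name, with 0.1 as its default.

```python
from typing import Sequence


def kl_penalized_reward(
    rm_score: float,
    actor_log_probs: Sequence[float],
    sft_log_probs: Sequence[float],
    kl_coef: float = 0.1,
) -> float:
    """Reward-model score minus a penalty for drifting away from the SFT policy.

    The KL term is estimated as the mean per-token difference of log-probabilities
    on the sampled response (a simple estimator, assumed for this sketch).
    """
    if len(actor_log_probs) != len(sft_log_probs) or not actor_log_probs:
        raise ValueError("need two equally long, non-empty log-prob sequences")
    diffs = [a - s for a, s in zip(actor_log_probs, sft_log_probs)]
    approx_kl = sum(diffs) / len(diffs)
    return rm_score - kl_coef * approx_kl


# The actor assigns higher probability to its own response than the SFT model does,
# so part of the reward-model score is taken away.
print(kl_penalized_reward(1.0, [-1.0, -1.0], [-2.0, -3.0]))
# Identical policies: no penalty.
print(kl_penalized_reward(1.0, [-1.0, -1.0], [-1.0, -1.0]))
```

A larger kl_coef pulls the policy harder toward the SFT model, at the price of slower reward improvement.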

Specifically, Stage3 runs in two parts:
① The first is the Make Experience part (the bottom of the workflow figure): the SFT, Actor, RM, and Critic models are used to compute and generate Experience, which is stored in a buffer. Then comes the parameter-update part, which uses the Experience to compute the value loss and the policy loss.
② The second is the PTX part (the top of the workflow figure), which corresponds to the pretraining term added at the end of the RLHF objective function.
ColossalChat calculates the cross-entropy loss between the Actor's output response and the response part of the input corpus,
which adds pre-training gradients to the PPO gradient
so as to maintain the language model's original performance (for example GPT2's) and prevent it from forgetting where training started (GPT2 → SFT → RM → RLHF).

Finally, the policy loss, value loss, and PTX loss are summed up for backpropagation and the parameter update.
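For reference, the three losses can be written in their commonly used textbook forms. These are standard PPO and InstructGPT-style expressions and may differ in detail from what ColossalChat implements. The symbols $\epsilon$, $\epsilon_v$, $\hat{A}_t$ (advantage estimate), $R_t$ (return target), and $D_{\text{pretrain}}$ (pretraining corpus) are notation chosen here.

Clipped policy loss, with probability ratio $r_t(\theta)$ between the current and the old policy:

$$L^{\text{policy}}(\theta) = -\,\mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

Clipped value loss, where $V_{\text{old}}$ is the value recorded when the experience was made:

$$L^{\text{value}}(\phi) = \mathbb{E}_t\left[\max\left(\big(V_\phi(s_t) - R_t\big)^2,\ \big(V^{\text{clip}}_t - R_t\big)^2\right)\right], \qquad V^{\text{clip}}_t = V_{\text{old}}(s_t) + \operatorname{clip}\big(V_\phi(s_t) - V_{\text{old}}(s_t),\,-\epsilon_v,\,\epsilon_v\big)$$

PTX loss, the language-modeling cross-entropy on pretraining data:

$$L^{\text{ptx}}(\theta) = -\,\mathbb{E}_{x \sim D_{\text{pretrain}}}\big[\log \pi_\theta(x)\big]$$

The clipping in the first two expressions is what bounds the size of each update, while the third term pulls the actor back toward its language-modeling behavior.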

About the code implementation

First, train an SFT model with ColossalAI/applications/Chat/coati/trainer/sft.py

import math
import time
from abc import ABC
from typing import Optional

import loralib as lora
import torch
import torch.distributed as dist
import wandb
from coati.models.loss import GPTLMLoss
from torch import nn
from torch.optim import Adam, Optimizer
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm
from transformers.tokenization_utils_base import PreTrainedTokenizerBase
from transformers.trainer import get_scheduler

from colossalai.logging import get_dist_logger

from .strategies import Strategy
from .utils import is_rank_0


class SFTTrainer(ABC):
    """
        Trainer to use while training the SFT model.

    Args:
        model (torch.nn.Module): the model to train
        strategy (Strategy): the strategy to use for training
        optim(Optimizer): the optimizer to use for training
        train_dataloader: the dataloader to use for training
        eval_dataloader: the dataloader to use for evaluation
        batch_size (int, defaults to 1): the batch size while training
        max_epochs (int, defaults to 2): the number of epochs to train
        accimulation_steps (int, defaults to 8): the number of batches whose gradients are accumulated before each optimizer step
    """

    def __init__(
        self,
        model,
        strategy: Strategy,
        optim: Optimizer,
        train_dataloader: DataLoader,
        eval_dataloader: DataLoader = None,
        batch_size: int = 1,
        max_epochs: int = 2,
        accimulation_steps: int = 8,
    ) -> None:
        super().__init__()
        self.strategy = strategy
        self.epochs = max_epochs
        self.train_dataloader = train_dataloader
        self.eval_dataloader = eval_dataloader

        self.model = strategy.setup_model(model)
        if "DDP" in str(self.strategy):
            self.model = self.model.module
        self.optimizer = strategy.setup_optimizer(optim, self.model)

        self.accimulation_steps = accimulation_steps
        num_update_steps_per_epoch = len(train_dataloader) // self.accimulation_steps
        max_steps = math.ceil(self.epochs * num_update_steps_per_epoch)

        self.scheduler = get_scheduler("cosine",
                                       self.optimizer,
                                       num_warmup_steps=math.ceil(max_steps * 0.03),
                                       num_training_steps=max_steps)

    def fit(self, logger, log_interval=10):
        wandb.init(project="Coati", name=time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
        wandb.watch(self.model)
        total_loss = 0
        # epoch_bar = tqdm(range(self.epochs), desc='Epochs', disable=not is_rank_0())
        step_bar = tqdm(range(len(self.train_dataloader) // self.accimulation_steps * self.epochs),
                        desc=f'steps',
                        disable=not is_rank_0())
        for epoch in range(self.epochs):

            # process_bar = tqdm(range(len(self.train_dataloader)), desc=f'Train process for{epoch}', disable=not is_rank_0())
            # train
            self.model.train()
            for batch_id, batch in enumerate(self.train_dataloader):

                prompt_ids = batch["input_ids"].to(torch.cuda.current_device())
                p_mask = batch["attention_mask"].to(torch.cuda.current_device())
                labels = batch["labels"].to(torch.cuda.current_device())
                # prompt_ids = prompt_ids.squeeze(1).cuda()
                # p_mask = p_mask.squeeze(1).cuda()
                # prompt_logits = self.model(prompt_ids, attention_mask=p_mask, labels=labels)

                outputs = self.model(prompt_ids, attention_mask=p_mask, labels=labels)

                loss = outputs.loss
                prompt_logits = outputs.logits

                if loss >= 2.5:
                    logger.warning(f"batch_id:{batch_id}, abnormal loss: {loss}")

                loss = loss / self.accimulation_steps

                self.strategy.backward(loss, self.model, self.optimizer)

                total_loss += loss.item()

                # gradient accumulation
                if (batch_id + 1) % self.accimulation_steps == 0:
                    self.strategy.optimizer_step(self.optimizer)
                    self.optimizer.zero_grad()
                    self.scheduler.step()
                    wandb.log({
                        "loss": total_loss / self.accimulation_steps,
                        "lr": self.scheduler.get_last_lr()[0],
                        "epoch": epoch,
                        "batch_id": batch_id
                    })
                    total_loss = 0
                    step_bar.update()

                # if batch_id % log_interval == 0:
                # logger.info(f'Train Epoch {epoch}/{self.epochs} Batch {batch_id} Rank {dist.get_rank()} loss {loss.item()}')
                # wandb.log({"loss": loss.item()})

                # process_bar.update()

            # eval
            if self.eval_dataloader is not None:
                self.model.eval()
                with torch.no_grad():
                    loss_sum = 0
                    num_seen = 0
                    for batch in self.eval_dataloader:
                        prompt_ids = batch["input_ids"].to(torch.cuda.current_device())
                        p_mask = batch["attention_mask"].to(torch.cuda.current_device())
                        labels = batch["labels"].to(torch.cuda.current_device())
                        # prompt_ids = prompt_ids.squeeze(1).cuda()
                        # p_mask = p_mask.squeeze(1).cuda()

                        outputs = self.model(prompt_ids, attention_mask=p_mask, labels=labels)
                        loss = outputs.loss
                        # prompt_logits = outputs.logits

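                        # weight the batch-mean loss by the batch size so that
                        # loss_sum / num_seen below is a per-sample mean
                        loss_sum += loss.item() * prompt_ids.size(0)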
                        num_seen += prompt_ids.size(0)

                    loss_mean = loss_sum / num_seen
                    if dist.get_rank() == 0:
                        logger.info(f'Eval Epoch {epoch}/{self.epochs} loss {loss_mean}')

            # epoch_bar.update()

    def save_model(self,
                   path: str,
                   only_rank0: bool = False,
                   tokenizer: Optional[PreTrainedTokenizerBase] = None) -> None:
        self.strategy.save_model(model=self.model, path=path, only_rank0=only_rank0, tokenizer=tokenizer)
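The scheduler arithmetic in SFTTrainer.__init__ is easy to misread, so here is a small sketch that reproduces only that calculation. The function name schedule_lengths is made up for this example; the floor division, the ceil calls, and the 3% warmup ratio are taken from the code above.

```python
import math
from typing import Tuple


def schedule_lengths(
    num_batches: int,
    accumulation_steps: int = 8,
    epochs: int = 2,
    warmup_ratio: float = 0.03,
) -> Tuple[int, int]:
    """Return (optimizer steps, warmup steps) as computed in SFTTrainer.__init__."""
    # Batches left over at the end of an epoch do not trigger their own optimizer step.
    updates_per_epoch = num_batches // accumulation_steps
    max_steps = math.ceil(epochs * updates_per_epoch)
    warmup_steps = math.ceil(max_steps * warmup_ratio)
    return max_steps, warmup_steps


print(schedule_lengths(1000))
print(schedule_lengths(1003))
```

Both calls give the same result, because 1003 batches still yield only 125 full accumulation groups per epoch.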

Next, train a reward model with ColossalAI/applications/Chat/coati/trainer/rm.py

from abc import ABC
from datetime import datetime
from typing import Optional

import pandas as pd
import torch
import torch.distributed as dist
from torch.optim import Optimizer, lr_scheduler
from torch.utils.data import DataLoader, Dataset, DistributedSampler
from tqdm import tqdm
from transformers.tokenization_utils_base import PreTrainedTokenizerBase

from .strategies import Strategy
from .utils import is_rank_0


class RewardModelTrainer(ABC):
    """
        Trainer to use while training reward model.

    Args:
        model (torch.nn.Module): the model to train
        strategy (Strategy): the strategy to use for training
        optim(Optimizer): the optimizer to use for training
        loss_fn (callable): the loss function to use for training
        train_dataset (Dataset): the dataset to use for training
        valid_dataset (Dataset): the dataset to use for validation
        eval_dataset (Dataset): the dataset to use for evaluation
        batch_size (int, defaults to 1): the batch size while training
        max_epochs (int, defaults to 1): the number of epochs to train
    """

    def __init__(
        self,
        model,
        strategy: Strategy,
        optim: Optimizer,
        loss_fn,
        train_dataset: Dataset,
        valid_dataset: Dataset,
        eval_dataset: Dataset,
        batch_size: int = 1,
        max_epochs: int = 1,
    ) -> None:
        super().__init__()
        self.strategy = strategy
        self.epochs = max_epochs
        train_sampler = None

        if dist.is_initialized() and dist.get_world_size() > 1:
            train_sampler = DistributedSampler(train_dataset, shuffle=True, seed=42, drop_last=True)
        self.train_dataloader = DataLoader(train_dataset,
                                           shuffle=(train_sampler is None),
                                           sampler=train_sampler,
                                           batch_size=batch_size)
        self.valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=True)
        self.eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size, shuffle=True)

        self.model = strategy.setup_model(model)
        self.loss_fn = loss_fn
        self.optimizer = strategy.setup_optimizer(optim, self.model)
        self.scheduler = lr_scheduler.CosineAnnealingLR(self.optimizer, self.train_dataloader.__len__() // 100)

    def eval_acc(self, dataloader):
        dist = 0
        on = 0
        cnt = 0
        self.model.eval()
        with torch.no_grad():
            for chosen_ids, c_mask, reject_ids, r_mask in dataloader:
                chosen_ids = chosen_ids.squeeze(1).to(torch.cuda.current_device())
                c_mask = c_mask.squeeze(1).to(torch.cuda.current_device())
                reject_ids = reject_ids.squeeze(1).to(torch.cuda.current_device())
                r_mask = r_mask.squeeze(1).to(torch.cuda.current_device())
                chosen_reward = self.model(chosen_ids, attention_mask=c_mask)
                reject_reward = self.model(reject_ids, attention_mask=r_mask)
                for i in range(len(chosen_reward)):
                    cnt += 1
                    if chosen_reward[i] > reject_reward[i]:
                        on += 1
                dist += (chosen_reward - reject_reward).mean().item()
            dist_mean = dist / len(dataloader)
            acc = on / cnt
        self.model.train()
        return dist_mean, acc

    def fit(self):
        time = datetime.now()
        epoch_bar = tqdm(range(self.epochs), desc='Train epoch', disable=not is_rank_0())
        for epoch in range(self.epochs):
            step_bar = tqdm(range(self.train_dataloader.__len__()),
                            desc='Train step of epoch %d' % epoch,
                            disable=not is_rank_0())
            # train
            self.model.train()
            cnt = 0
            acc = 0
            dist = 0
            for chosen_ids, c_mask, reject_ids, r_mask in self.train_dataloader:
                chosen_ids = chosen_ids.squeeze(1).to(torch.cuda.current_device())
                c_mask = c_mask.squeeze(1).to(torch.cuda.current_device())
                reject_ids = reject_ids.squeeze(1).to(torch.cuda.current_device())
                r_mask = r_mask.squeeze(1).to(torch.cuda.current_device())
                chosen_reward = self.model(chosen_ids, attention_mask=c_mask)
                reject_reward = self.model(reject_ids, attention_mask=r_mask)
                loss = self.loss_fn(chosen_reward, reject_reward)
                self.strategy.backward(loss, self.model, self.optimizer)
                self.strategy.optimizer_step(self.optimizer)
                self.optimizer.zero_grad()
                cnt += 1
                if cnt == 100:
                    self.scheduler.step()
                    dist, acc = self.eval_acc(self.valid_dataloader)
                    cnt = 0
                    if is_rank_0():
                        log = pd.DataFrame([[step_bar.n, loss.item(), dist, acc]],
                                           columns=['step', 'loss', 'dist', 'acc'])
                        log.to_csv('log_%s.csv' % time, mode='a', header=False, index=False)
                step_bar.update()
                step_bar.set_postfix({'dist': dist, 'acc': acc})

            # eval
            dist, acc = self.eval_acc(self.eval_dataloader)
            if is_rank_0():
                log = pd.DataFrame([[step_bar.n, loss.item(), dist, acc]], columns=['step', 'loss', 'dist', 'acc'])
                log.to_csv('log.csv', mode='a', header=False, index=False)
            epoch_bar.update()
            step_bar.set_postfix({'dist': dist, 'acc': acc})
            step_bar.close()

    def save_model(self,
                   path: str,
                   only_rank0: bool = False,
                   tokenizer: Optional[PreTrainedTokenizerBase] = None) -> None:
        self.strategy.save_model(model=self.model, path=path, only_rank0=only_rank0, tokenizer=tokenizer)
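The trainer above receives loss_fn as an argument and only requires that it compare chosen_reward with reject_reward. One common choice for this, used in InstructGPT-style reward modeling, is the pairwise log-sigmoid ranking loss. The sketch below shows it in plain Python on lists of floats rather than tensors; the function names are illustrative, and whether ColossalChat passes exactly this loss is an assumption.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_ranking_loss(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over the batch."""
    if len(chosen) != len(rejected) or not chosen:
        raise ValueError("need two equally long, non-empty reward sequences")
    return -sum(log_sigmoid(c - r) for c, r in zip(chosen, rejected)) / len(chosen)


def pairwise_accuracy(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Fraction of pairs where the chosen answer gets the higher reward (cf. eval_acc)."""
    return sum(c > r for c, r in zip(chosen, rejected)) / len(chosen)


print(pairwise_ranking_loss([2.0], [0.0]))   # correct ordering: small loss
print(pairwise_ranking_loss([0.0], [2.0]))   # wrong ordering: large loss
print(pairwise_accuracy([1.0, 0.0], [0.0, 1.0]))
```

The loss depends only on the difference between the two rewards, so the reward model learns a relative ranking rather than an absolute scale.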

Finally, start PPO training with ColossalAI/applications/Chat/coati/trainer/ppo.py

from typing import Any, Callable, Dict, List, Optional

import torch
import torch.nn as nn
from coati.experience_maker import Experience, NaiveExperienceMaker
from coati.models.base import Actor, Critic
from coati.models.generation_utils import update_model_kwargs_fn
from coati.models.loss import PolicyLoss, ValueLoss
from coati.replay_buffer import NaiveReplayBuffer
from torch.optim import Optimizer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase

from .base import Trainer
from .callbacks import Callback
from .strategies import Strategy


class PPOTrainer(Trainer):
    """
        Trainer for PPO algorithm.

    Args:
        strategy (Strategy): the strategy to use for training
        actor (Actor): the actor model in ppo algorithm
        critic (Critic): the critic model in ppo algorithm
        reward_model (nn.Module): the reward model in rlhf algorithm to make reward of sentences
        initial_model (Actor): the initial model in rlhf algorithm to generate reference logits to limit the update of actor
        actor_optim (Optimizer): the optimizer to use for actor model
        critic_optim (Optimizer): the optimizer to use for critic model
        kl_coef (float, defaults to 0.1): the coefficient of kl divergence loss
        ptx_coef (float, defaults to 0.9): the weight of the ptx (pretraining) loss when it is mixed with the policy loss
        train_batch_size (int, defaults to 8): the batch size to use for training
        buffer_limit (int, defaults to 0): the max_size limitation of replay buffer
        buffer_cpu_offload (bool, defaults to True): whether to offload replay buffer to cpu
        eps_clip (float, defaults to 0.2): the clip coefficient of policy loss
        value_clip (float, defaults to 0.4): the clip coefficient of value loss
        experience_batch_size (int, defaults to 8): the batch size to use for experience generation
        max_epochs (int, defaults to 1): the number of epochs of training process
        tokenizer (Callable, optional): the tokenizer to use for tokenizing the input
        sample_replay_buffer (bool, defaults to False): whether to sample from replay buffer
        dataloader_pin_memory (bool, defaults to True): whether to pin memory for data loader
        callbacks (List[Callback], defaults to []): the callbacks to call during training process
        generate_kwargs (dict, optional): the kwargs to use while model generating
    """

    def __init__(self,
                 strategy: Strategy,
                 actor: Actor,
                 critic: Critic,
                 reward_model: nn.Module,
                 initial_model: Actor,
                 actor_optim: Optimizer,
                 critic_optim: Optimizer,
                 kl_coef: float = 0.1,
                 ptx_coef: float = 0.9,
                 train_batch_size: int = 8,
                 buffer_limit: int = 0,
                 buffer_cpu_offload: bool = True,
                 eps_clip: float = 0.2,
                 value_clip: float = 0.4,
                 experience_batch_size: int = 8,
                 max_epochs: int = 1,
                 tokenizer: Optional[Callable[[Any], dict]] = None,
                 sample_replay_buffer: bool = False,
                 dataloader_pin_memory: bool = True,
                 callbacks: List[Callback] = [],
                 **generate_kwargs) -> None:
        experience_maker = NaiveExperienceMaker(actor, critic, reward_model, initial_model, kl_coef)
        replay_buffer = NaiveReplayBuffer(train_batch_size, buffer_limit, buffer_cpu_offload)
        generate_kwargs = _set_default_generate_kwargs(strategy, generate_kwargs, actor)
        super().__init__(strategy, experience_maker, replay_buffer, experience_batch_size, max_epochs, tokenizer,
                         sample_replay_buffer, dataloader_pin_memory, callbacks, **generate_kwargs)
        self.actor = actor
        self.critic = critic

        self.actor_loss_fn = PolicyLoss(eps_clip)
        self.critic_loss_fn = ValueLoss(value_clip)
        self.ptx_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        self.ptx_coef = ptx_coef
        self.actor_optim = actor_optim
        self.critic_optim = critic_optim

    def training_step(self, experience: Experience) -> Dict[str, float]:
        self.actor.train()
        self.critic.train()
        # policy loss
        num_actions = experience.action_mask.size(1)
        action_log_probs = self.actor(experience.sequences, num_actions, attention_mask=experience.attention_mask)
        actor_loss = self.actor_loss_fn(action_log_probs,
                                        experience.action_log_probs,
                                        experience.advantages,
                                        action_mask=experience.action_mask)

        # ptx loss
        if self.ptx_coef != 0:
            # draw one pretraining batch and reuse it, so that input_ids, labels and
            # attention_mask are guaranteed to come from the same samples
            # (self.pretrain_dataloader must be set before training_step is called)
            batch = next(iter(self.pretrain_dataloader))
            ptx = batch['input_ids'].to(torch.cuda.current_device())
            label = batch['labels'].to(torch.cuda.current_device())[:, 1:]
            attention_mask = batch['attention_mask'].to(torch.cuda.current_device())
            ptx_log_probs = self.actor.get_base_model()(ptx, attention_mask=attention_mask)['logits'][..., :-1, :]
            # reshape instead of view: the shifted slices are not contiguous
            ptx_loss = self.ptx_loss_fn(ptx_log_probs.reshape(-1, ptx_log_probs.size(-1)), label.reshape(-1))
            actor_loss = ptx_loss * self.ptx_coef + actor_loss * (1 - self.ptx_coef)

        self.strategy.backward(actor_loss, self.actor, self.actor_optim)
        self.strategy.optimizer_step(self.actor_optim)
        self.actor_optim.zero_grad()

        # value loss
        values = self.critic(experience.sequences,
                             action_mask=experience.action_mask,
                             attention_mask=experience.attention_mask)
        critic_loss = self.critic_loss_fn(values,
                                          experience.values,
                                          experience.reward,
                                          action_mask=experience.action_mask)
        self.strategy.backward(critic_loss, self.critic, self.critic_optim)
        self.strategy.optimizer_step(self.critic_optim)
        self.critic_optim.zero_grad()

        return {'reward': experience.reward.mean().item()}


    def save_model(self, path: str, only_rank0: bool = False, tokenizer: Optional[PreTrainedTokenizerBase] = None) -> None:
        # method of PPOTrainer: saves the trained actor
        self.strategy.save_model(model=self.actor, path=path, only_rank0=only_rank0, tokenizer=tokenizer)


def _set_default_generate_kwargs(strategy: Strategy, generate_kwargs: dict, actor: Actor) -> dict:
    origin_model = strategy._unwrap_actor(actor)
    new_kwargs = {**generate_kwargs}
    # use huggingface models method directly
    if 'prepare_inputs_fn' not in generate_kwargs and hasattr(origin_model, 'prepare_inputs_for_generation'):
        new_kwargs['prepare_inputs_fn'] = origin_model.prepare_inputs_for_generation

    if 'update_model_kwargs_fn' not in generate_kwargs:
        new_kwargs['update_model_kwargs_fn'] = update_model_kwargs_fn

    return new_kwargs
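To see what training_step does numerically, the sketch below works through the actor side on plain floats. clipped_policy_loss follows the standard PPO clipped surrogate and is only assumed to match what coati's PolicyLoss computes, since that class is not shown here; mixed_actor_loss copies the weighting line from training_step. Both function names are made up for this example.

```python
import math
from typing import Sequence


def clipped_policy_loss(
    log_probs: Sequence[float],
    old_log_probs: Sequence[float],
    advantages: Sequence[float],
    eps_clip: float = 0.2,
) -> float:
    """Standard PPO clipped surrogate, averaged over tokens (assumed form)."""
    total = 0.0
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - old_lp)                      # pi_new / pi_old
        clipped = min(max(ratio, 1 - eps_clip), 1 + eps_clip)
        total += -min(ratio * adv, clipped * adv)          # pessimistic bound
    return total / len(log_probs)


def mixed_actor_loss(policy_loss: float, ptx_loss: float, ptx_coef: float = 0.9) -> float:
    """Same weighting as training_step: ptx_loss * ptx_coef + policy_loss * (1 - ptx_coef)."""
    return ptx_loss * ptx_coef + policy_loss * (1 - ptx_coef)


# Unchanged policy: ratio is 1, the loss is just the negative advantage.
print(clipped_policy_loss([-1.0], [-1.0], [1.0]))
# Probability rose sharply on a good action: the gain is capped at 1 + eps_clip.
print(clipped_policy_loss([0.0], [-1.0], [1.0]))
# With the default ptx_coef of 0.9, the policy term contributes only 10%.
print(mixed_actor_loss(policy_loss=1.0, ptx_loss=2.0))
```

Note that the actor loss here is a weighted mix rather than a plain sum, and that the critic is updated in a separate backward pass with its own optimizer.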

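Once the final model weights are obtained, quantization can further reduce the hardware cost of inference, and an online inference service can be launched: a single GPU with about 4GB of memory is enough to deploy inference for the 7-billion-parameter model.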
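A back-of-the-envelope calculation shows why a figure of about 4GB is plausible. The text does not state the bit width used, so the 4-bit case below is an assumption; the helper weight_memory_gb is written for this example and counts the weights only.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes).

    Ignores activations, the KV cache, and quantization metadata.
    """
    return num_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions, 7 billion weights at 4 bits take 3.5 GB, which leaves a small margin for runtime buffers on a 4GB card, whereas 16-bit weights alone would need 14 GB.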
