[Natural Language Processing (NLP)] Short-Text Similarity Computation with PaddleNLP
Preface
(1) Task Description
Short-text semantic matching (SimilarityNet, SimNet) is a framework for computing the similarity of short texts: given two texts entered by the user, it produces a similarity score. SimNet is widely used across Baidu products. Its core network structures include BOW, CNN, RNN, and MMDNN, and it provides a training and prediction framework for semantic similarity computation, suitable for scenarios such as information retrieval, news recommendation, and intelligent customer service, helping enterprises solve semantic matching problems. You can try it online via the short-text similarity service on Baidu's AI Open Platform.
(2) Tool Description
In this practice we call the models built into Seq2Vec to perform sequence modeling and obtain a vector representation of each sentence. These range from the simplest bag-of-words model to a series of classic RNN-family models.
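As a minimal sketch of the Seq2Vec interface (the vocabulary size, embedding width, and batch shapes below are illustrative assumptions, not values from this project), both a bag-of-words encoder and an LSTM encoder map a batch of embedded token ids to one vector per sentence:
import paddle
from paddlenlp.seq2vec import BoWEncoder, LSTMEncoder

# Illustrative sizes: a 10000-word vocab, 128-d embeddings, 4 sequences of length 16.
embedder = paddle.nn.Embedding(num_embeddings=10000, embedding_dim=128)
bow_encoder = BoWEncoder(emb_dim=128)
lstm_encoder = LSTMEncoder(input_size=128, hidden_size=128)

token_ids = paddle.randint(low=0, high=10000, shape=[4, 16])
seq_lens = paddle.full(shape=[4], fill_value=16, dtype="int64")

embedded = embedder(token_ids)                               # [4, 16, 128]
bow_vec = bow_encoder(embedded)                              # [4, 128], sums over time steps
lstm_vec = lstm_encoder(embedded, sequence_length=seq_lens)  # [4, 128], final hidden state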
1. Preparing the LCQMC Dataset
(1) Importing LCQMC
In this practice we use LCQMC, a dataset built into PaddleNLP. It is a question semantic-matching dataset constructed by Harbin Institute of Technology for COLING 2018, an international conference on natural language processing; the task is to judge whether two questions have the same meaning. The code below loads its train/dev/test splits:
from paddlenlp.datasets import load_dataset

train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"])
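Each sample is a dict with a "query", a "title", and a binary "label" (1 means the two questions have the same meaning); printing one example is a quick way to check the format:
# Show the first training pair, e.g. {'query': '...', 'title': '...', 'label': 1}
print(train_ds[0])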
(2) Building the DataLoader
Once the data is downloaded, we need to build a dataloader that produces one batch of data at a time and feeds it to the model for training.
def create_dataloader(dataset,
                      trans_fn=None,
                      mode='train',
                      batch_size=1,
                      use_gpu=False,
                      batchify_fn=None):
    """
    Creates a dataloader.
    Args:
        dataset(obj:`paddle.io.Dataset`): Dataset instance.
        trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
        mode(obj:`str`, optional, defaults to `train`): If mode is 'train', it will shuffle the dataset randomly.
        batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
        use_gpu(obj:`bool`, optional, defaults to `False`): Whether to use gpu to run.
        batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging
            the sample list; None for only stacking each field of the samples on axis
            0 (same as `np.stack(..., axis=0)`).
    Returns:
        dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
    """
    if trans_fn:
        dataset = dataset.map(trans_fn)

    if mode == 'train' and use_gpu:
        sampler = paddle.io.DistributedBatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=True)
    else:
        shuffle = (mode == 'train')
        sampler = paddle.io.BatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=shuffle)
    dataloader = paddle.io.DataLoader(
        dataset,
        batch_sampler=sampler,
        return_list=True,
        collate_fn=batchify_fn)
    return dataloader
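A hedged usage sketch (it assumes trans_fn and batchify_fn defined as in the full training script below) shows what one mini-batch from the resulting loader looks like:
# Illustrative only -- trans_fn and batchify_fn come from the full script below.
# train_loader = create_dataloader(train_ds, trans_fn=trans_fn, mode='train',
#                                  batch_size=64, use_gpu=False, batchify_fn=batchify_fn)
# query_ids, title_ids, query_lens, title_lens, labels = next(iter(train_loader))
# print(query_ids.shape, title_ids.shape, labels.shape)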
2. PaddleNLP Model Configuration
# Constructs the network.
model = ppnlp.models.SimNet(
    network=args.network,
    vocab_size=len(vocab),
    num_classes=len(train_ds.label_list))
model = paddle.Model(model)
We also need to define the optimization algorithm and the loss function. Here we use the Adam optimizer with the learning rate args.lr. The loss function is cross-entropy, which is commonly used for classification tasks; the loss is averaged over each batch, since it is computed on a batch of samples (paddle.nn.CrossEntropyLoss uses mean reduction by default). In addition, we define an accuracy metric so that classification accuracy is reported during training.
optimizer = paddle.optimizer.Adam(
    parameters=model.parameters(), learning_rate=args.lr)

# Defines loss and metric.
criterion = paddle.nn.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
model.prepare(optimizer, criterion, metric)

# Loads pre-trained parameters.
if args.init_from_ckpt:
    model.load(args.init_from_ckpt)
    print("Loaded checkpoint from %s" % args.init_from_ckpt)
3. Model Training
Before training the model, we first download the vocabulary file simnet_vocab.txt, which is used to build the token-to-id mapping. The choice of vocabulary depends on the application data, so pick one that matches your actual data. After that, the model can be trained and evaluated.
!wget https://paddlenlp.bj.bcebos.com/data/simnet_vocab.txt
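As a quick sanity check (a minimal sketch; the example sentence is an arbitrary assumption), the downloaded vocabulary can be loaded and used to tokenize a sentence exactly as the training script does below:
from paddlenlp.data import JiebaTokenizer, Vocab

# Build the token-to-id mapping from the downloaded file and wrap it
# in a jieba-based tokenizer; out-of-vocabulary words map to [UNK].
vocab = Vocab.load_vocabulary("./simnet_vocab.txt", unk_token='[UNK]', pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)
print(tokenizer.encode("世界上什么东西最小"))  # a list of token ids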
from functools import partial
import argparse
import os
import random
import time
import paddle
import paddlenlp as ppnlp
from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
from paddlenlp.datasets import load_dataset
import numpy as np
def convert_example(example, tokenizer, is_test=False):
    """
    Builds model inputs from a sequence for sequence classification tasks.
    It uses `jieba.cut` to tokenize the text.
    Args:
        example(obj:`dict`): A sample of the input data, containing the text and, unless is_test, a label.
        tokenizer(obj:`paddlenlp.data.JiebaTokenizer`): It uses jieba to cut the Chinese string into tokens.
        is_test(obj:`bool`, defaults to `False`): Whether the example omits the label.
    Returns:
        query_ids(obj:`list[int]`): The list of query ids.
        title_ids(obj:`list[int]`): The list of title ids.
        query_seq_len(obj:`int`): The length of the query sequence.
        title_seq_len(obj:`int`): The length of the title sequence.
        label(obj:`numpy.array`, data type of int64, optional): The label, returned only if not is_test.
    """
    query, title = example["query"], example["title"]
    query_ids = np.array(tokenizer.encode(query), dtype="int64")
    query_seq_len = np.array(len(query_ids), dtype="int64")
    title_ids = np.array(tokenizer.encode(title), dtype="int64")
    title_seq_len = np.array(len(title_ids), dtype="int64")

    if not is_test:
        label = np.array(example["label"], dtype="int64")
        return query_ids, title_ids, query_seq_len, title_seq_len, label
    else:
        return query_ids, title_ids, query_seq_len, title_seq_len
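# Illustrative usage of convert_example (a hypothetical sample, not taken from LCQMC):
#   sample = {"query": "世界上什么东西最小", "title": "世界上最小的东西是什么", "label": 1}
#   convert_example(sample, tokenizer)
#   -> (query_ids, title_ids, query_seq_len, title_seq_len, label)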
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--epochs", type=int, default=10, help="Number of epochs for training.")
parser.add_argument("--use_gpu", type=eval, default=False, help="Whether to use GPU for training; input should be True or False.")
parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate used to train.")
parser.add_argument("--save_dir", type=str, default='checkpoints/', help="Directory to save model checkpoints.")
parser.add_argument("--batch_size", type=int, default=64, help="Total number of examples in a training batch.")
parser.add_argument("--vocab_path", type=str, default="./simnet_vocab.txt", help="The path to the vocabulary file.")
parser.add_argument("--network", type=str, default="lstm", help="Which network to use: bow, cnn, lstm or gru.")
parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of the checkpoint to be loaded.")
args = parser.parse_args()
# yapf: enable
def create_dataloader(dataset,
                      trans_fn=None,
                      mode='train',
                      batch_size=1,
                      use_gpu=False,
                      batchify_fn=None):
    """
    Creates a dataloader.
    Args:
        dataset(obj:`paddle.io.Dataset`): Dataset instance.
        trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
        mode(obj:`str`, optional, defaults to `train`): If mode is 'train', it will shuffle the dataset randomly.
        batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
        use_gpu(obj:`bool`, optional, defaults to `False`): Whether to use gpu to run.
        batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging
            the sample list; None for only stacking each field of the samples on axis
            0 (same as `np.stack(..., axis=0)`).
    Returns:
        dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
    """
    if trans_fn:
        dataset = dataset.map(trans_fn)

    if mode == 'train' and use_gpu:
        sampler = paddle.io.DistributedBatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=True)
    else:
        shuffle = (mode == 'train')
        sampler = paddle.io.BatchSampler(
            dataset=dataset, batch_size=batch_size, shuffle=shuffle)
    dataloader = paddle.io.DataLoader(
        dataset,
        batch_sampler=sampler,
        return_list=True,
        collate_fn=batchify_fn)
    return dataloader
if __name__ == "__main__":
    paddle.set_device('gpu') if args.use_gpu else paddle.set_device('cpu')

    # Loads vocab.
    if not os.path.exists(args.vocab_path):
        raise RuntimeError('The vocab_path can not be found in the path %s' %
                           args.vocab_path)
    vocab = Vocab.load_vocabulary(
        args.vocab_path, unk_token='[UNK]', pad_token='[PAD]')

    # Loads dataset.
    train_ds, dev_ds, test_ds = load_dataset(
        "lcqmc", splits=["train", "dev", "test"])

    # Constructs the network.
    model = ppnlp.models.SimNet(
        network=args.network,
        vocab_size=len(vocab),
        num_classes=len(train_ds.label_list))
    model = paddle.Model(model)

    # Reads data and generates mini-batches.
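    # `Tuple` applies one collate function per field of each sample: the two `Pad`
    # entries pad the variable-length query/title id lists to the longest in the
    # batch, and `Stack` stacks the scalar lengths and labels into arrays.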
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=vocab.token_to_idx.get('[PAD]', 0)),  # query_ids
        Pad(axis=0, pad_val=vocab.token_to_idx.get('[PAD]', 0)),  # title_ids
        Stack(dtype="int64"),  # query_seq_lens
        Stack(dtype="int64"),  # title_seq_lens
        Stack(dtype="int64")   # label
    ): [data for data in fn(samples)]

    tokenizer = ppnlp.data.JiebaTokenizer(vocab)
    trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False)
    train_loader = create_dataloader(
        train_ds,
        trans_fn=trans_fn,
        batch_size=args.batch_size,
        mode='train',
        use_gpu=args.use_gpu,
        batchify_fn=batchify_fn)
    dev_loader = create_dataloader(
        dev_ds,
        trans_fn=trans_fn,
        batch_size=args.batch_size,
        mode='validation',
        use_gpu=args.use_gpu,
        batchify_fn=batchify_fn)
    test_loader = create_dataloader(
        test_ds,
        trans_fn=trans_fn,
        batch_size=args.batch_size,
        mode='test',
        use_gpu=args.use_gpu,
        batchify_fn=batchify_fn)

    optimizer = paddle.optimizer.Adam(
        parameters=model.parameters(), learning_rate=args.lr)

    # Defines loss and metric.
    criterion = paddle.nn.CrossEntropyLoss()
    metric = paddle.metric.Accuracy()
    model.prepare(optimizer, criterion, metric)

    # Loads pre-trained parameters.
    if args.init_from_ckpt:
        model.load(args.init_from_ckpt)
        print("Loaded checkpoint from %s" % args.init_from_ckpt)

    # Starts training and evaluating.
    model.fit(
        train_loader,
        dev_loader,
        epochs=args.epochs,
        save_dir=args.save_dir)

    # Finally, tests the model.
    results = model.evaluate(test_loader)
    print("Final test acc: %.5f" % results['acc'])
Summary
The content of this series consists of my notes and reflections on *Natural Language Processing in Practice*, published by Tsinghua University Press; all of the code is developed on Baidu PaddlePaddle. If anything here is improper or infringes on any rights, please message me privately and I will actively cooperate to resolve it. I reply to everything I see!
Finally, let me quote a line from this event as the closing words of this article ~( ̄▽ ̄~)~:
**The greatest reason to learn is to escape mediocrity: one day sooner means one more day of brilliance in life; one day later means one more day troubled by the mundane.**