pytorch做seq2seq注意力模型的翻译

以下是对pytorch 1.0版本的seq2seq+注意力模型做法语--英语翻译的理解（这个代码在pytorch0.4上也可以正常跑）：
 # -*- coding: utf-8 -*-

 """

 Translation with a Sequence to Sequence Network and Attention

 *************************************************************

 **Author**: `Sean Robertson <https://github.com/spro/practical-pytorch>`_

 In this project we will be teaching a neural network to translate from

 French to English.

 ::

     [KEY: > input, = target, < output]

     > il est en train de peindre un tableau .

     = he is painting a picture .

     < he is painting a picture .

     > pourquoi ne pas essayer ce vin delicieux ?

     = why not try that delicious wine ?

     < why not try that delicious wine ?

     > elle n est pas poete mais romanciere .

     = she is not a poet but a novelist .

     < she not not a poet but a novelist .

     > vous etes trop maigre .

     = you re too skinny .

     < you re all alone .

 ... to varying degrees of success.

 This is made possible by the simple but powerful idea of the `sequence

 to sequence network <http://arxiv.org/abs/1409.3215>`__, in which two

 recurrent neural networks work together to transform one sequence to

 another. An encoder network condenses an input sequence into a vector,

 and a decoder network unfolds that vector into a new sequence.

 .. figure:: /_static/img/seq-seq-images/seq2seq.png

    :alt:

 To improve upon this model we'll use an `attention

 mechanism <https://arxiv.org/abs/1409.0473>`__, which lets the decoder

 learn to focus over a specific range of the input sequence.

 **Recommended Reading:**

 I assume you have at least installed PyTorch, know Python, and

 understand Tensors:

 -  https://pytorch.org/ For installation instructions

 -  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general

 -  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview

 -  :doc:`/beginner/former_torchies_tutorial` if you are former Lua Torch user

 It would also be useful to know about Sequence to Sequence networks and

 how they work:

 -  `Learning Phrase Representations using RNN Encoder-Decoder for

    Statistical Machine Translation <http://arxiv.org/abs/1406.1078>`__

 -  `Sequence to Sequence Learning with Neural

    Networks <http://arxiv.org/abs/1409.3215>`__

 -  `Neural Machine Translation by Jointly Learning to Align and

    Translate <https://arxiv.org/abs/1409.0473>`__

 -  `A Neural Conversational Model <http://arxiv.org/abs/1506.05869>`__

 You will also find the previous tutorials on

 :doc:`/intermediate/char_rnn_classification_tutorial`

 and :doc:`/intermediate/char_rnn_generation_tutorial`

 helpful as those concepts are very similar to the Encoder and Decoder

 models, respectively.

 And for more, read the papers that introduced these topics:

 -  `Learning Phrase Representations using RNN Encoder-Decoder for

    Statistical Machine Translation <http://arxiv.org/abs/1406.1078>`__

 -  `Sequence to Sequence Learning with Neural

    Networks <http://arxiv.org/abs/1409.3215>`__

 -  `Neural Machine Translation by Jointly Learning to Align and

    Translate <https://arxiv.org/abs/1409.0473>`__

 -  `A Neural Conversational Model <http://arxiv.org/abs/1506.05869>`__

 **Requirements**

 """

 from __future__ import unicode_literals, print_function, division

 from io import open

 import unicodedata

 import string

 import re

 import random

 import torch

 import torch.nn as nn

 from torch import optim

 import torch.nn.functional as F

 device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

 ######################################################################

 # Loading data files

 # ==================

 #

 # The data for this project is a set of many thousands of English to

 # French translation pairs.

 #

 # `This question on Open Data Stack

 # Exchange <http://opendata.stackexchange.com/questions/3888/dataset-of-sentences-translated-into-many-languages>`__

 # pointed me to the open translation site http://tatoeba.org/ which has

 # downloads available at http://tatoeba.org/eng/downloads - and better

 # yet, someone did the extra work of splitting language pairs into

 # individual text files here: http://www.manythings.org/anki/

 #

 # The English to French pairs are too big to include in the repo, so

 # download to ``data/eng-fra.txt`` before continuing. The file is a tab

 # separated list of translation pairs:

 #

 # ::

 #

 #     I am cold.    J'ai froid.

 #

 # .. Note::

 #    Download the data from

 #    `here <https://download.pytorch.org/tutorial/data.zip>`_

 #    and extract it to the current directory.

 ######################################################################

 # Similar to the character encoding used in the character-level RNN

 # tutorials, we will be representing each word in a language as a one-hot

 # vector, or giant vector of zeros except for a single one (at the index

 # of the word). Compared to the dozens of characters that might exist in a

 # language, there are many many more words, so the encoding vector is much

 # larger. We will however cheat a bit and trim the data to only use a few

 # thousand words per language.

 #

 # .. figure:: /_static/img/seq-seq-images/word-encoding.png

 #    :alt:

 #

 #

 ######################################################################

 # We'll need a unique index per word to use as the inputs and targets of

 # the networks later. To keep track of all this we will use a helper class

 # called ``Lang`` which has word → index (``word2index``) and index → word

 # (``index2word``) dictionaries, as well as a count of each word

 # ``word2count`` to use to later replace rare words.

 #

 SOS_token = 0

 EOS_token = 1

 # 每个单词需要对应唯一的索引作为稍后的网络输入和目标.为了追踪这些索引

 # 则使用一个帮助类 Lang ，类中有 词 → 索引 (word2index) 和 索引 → 词

 # (index2word) 的字典, 以及每个词word2count 用来替换稀疏词汇.

 # 此处创建的Lang 对象来表示源/目标语言，它包含三部分：word2index、

 # index2word 和word2count，分别表示单词到id、id 到单词和单词的词频。

 # word2count的作用是用于过滤一些低频词（把它变成unknown）

 class Lang:

     def __init__(self, name):

         self.name = name

         self.word2index = {}

         self.word2count = {}

         self.index2word = {0: "SOS", 1: "EOS"}

         self.n_words = 2  # Count SOS and EOS

     def addSentence(self, sentence):

         for word in sentence.split(' '):

             self.addWord(word)  # 用于添加单词

     def addWord(self, word):

         if word not in self.word2index:  # 是不是新的词

             # 如果不在word2index里，则需要新的定义字典

             self.word2index[word] = self.n_words

             self.word2count[word] = 1

             self.index2word[self.n_words] = word

             self.n_words += 1  # 相当于每次index+1

         else:

             self.word2count[word] += 1  # 计算每次词的个数

 ######################################################################

 # The files are all in Unicode, to simplify we will turn Unicode

 # characters to ASCII, make everything lowercase, and trim most

 # punctuation.

 #

 # Turn a Unicode string to plain ASCII, thanks to

 # http://*.com/a/518232/2809427

 # 此处是为了将Unicode字符串转换为纯ASCII

 # 原文件是Unicode编码

 def unicodeToAscii(s):

     return ''.join(

         c for c in unicodedata.normalize('NFD', s)

         if unicodedata.category(c) != 'Mn'

     )

 # Lowercase, trim, and remove non-letter characters

 # 小写,修剪和删除非字母字符

 def normalizeString(s):

     s = unicodeToAscii(s.lower().strip())

     s = re.sub(r"([.!?])", r" \1", s)

     s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)

     return s

 ######################################################################

 # To read the data file we will split the file into lines, and then split

 # lines into pairs. The files are all English → Other Language, so if we

 # want to translate from Other Language → English I added the ``reverse``

 # flag to reverse the pairs.

 #

 # 要读取数据文件,我们将把文件分成行,然后将行成对分开. 这些文件

 # 都是英文→其他语言,所以如果我们想从其他语言翻译→英文,我们添加了

 # 翻转标志 reverse来翻转词语对.

 def readLangs(lang1, lang2, reverse=False):

     print("Reading lines...")

     # Read the file and split into lines

     # 读取文件并按行分开

     lines = open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8'). \

         read().strip().split('\n')

     # Split every line into pairs and normalize

     # 将每一行分成两列并进行标准化

     pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

     # Reverse pairs, make Lang instances

     # 翻转对,Lang实例化

     if reverse:

         pairs = [list(reversed(p)) for p in pairs]

         input_lang = Lang(lang2)

         output_lang = Lang(lang1)

     else:

         input_lang = Lang(lang1)

         output_lang = Lang(lang2)

     return input_lang, output_lang, pairs

 ######################################################################

 # Since there are a *lot* of example sentences and we want to train

 # something quickly, we'll trim the data set to only relatively short and

 # simple sentences. Here the maximum length is 10 words (that includes

 # ending punctuation) and we're filtering to sentences that translate to

 # the form "I am" or "He is" etc. (accounting for apostrophes replaced

 # earlier).

 #

 # 由于例句较多,为了方便快速训练,则会将数据集裁剪为相对简短的句子.

 # 这里的单词的最大长度是10词(包括结束标点符号),

 # 保留”I am” 和”He is” 开头的数据

 MAX_LENGTH = 10

 eng_prefixes = (

     "i am ", "i m ",

     "he is", "he s ",

     "she is", "she s",

     "you are", "you re ",

     "we are", "we re ",

     "they are", "they re "

 )

 def filterPair(p):

     return len(p[0].split(' ')) < MAX_LENGTH and \

            len(p[1].split(' ')) < MAX_LENGTH and \

            p[1].startswith(eng_prefixes)

     # 是否满足长度

 def filterPairs(pairs):

     return [pair for pair in pairs if filterPair(pair)]

 ######################################################################

 # The full process for preparing the data is:

 #

 # -  Read text file and split into lines, split lines into pairs

 # -  Normalize text, filter by length and content

 # -  Make word lists from sentences in pairs

 #

 def prepareData(lang1, lang2, reverse=False):

     input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)

     # 读入数据lang1,lang2,并翻转

     print("Read %s sentence pairs" % len(pairs))

     # 一共读入了多少对

     pairs = filterPairs(pairs)

     # 符合条件的配对有多少对

     print("Trimmed to %s sentence pairs" % len(pairs))

     print("Counting words...")

     for pair in pairs:

         input_lang.addSentence(pair[0])

         output_lang.addSentence(pair[1])

     print("Counted words:")

     print(input_lang.name, input_lang.n_words)

     print(output_lang.name, output_lang.n_words)

     return input_lang, output_lang, pairs

 # 对数据进行预处理

 input_lang, output_lang, pairs = prepareData('eng', 'fra', True)

 print(random.choice(pairs))  # 随机展示一对

 ######################################################################

 # The Seq2Seq Model

 # =================

 #

 # A Recurrent Neural Network, or RNN, is a network that operates on a

 # sequence and uses its own output as input for subsequent steps.

 #

 # A `Sequence to Sequence network <http://arxiv.org/abs/1409.3215>`__, or

 # seq2seq network, or `Encoder Decoder

 # network <https://arxiv.org/pdf/1406.1078v3.pdf>`__, is a model

 # consisting of two RNNs called the encoder and decoder. The encoder reads

 # an input sequence and outputs a single vector, and the decoder reads

 # that vector to produce an output sequence.

 #

 # .. figure:: /_static/img/seq-seq-images/seq2seq.png

 #    :alt:

 #

 # Unlike sequence prediction with a single RNN, where every input

 # corresponds to an output, the seq2seq model frees us from sequence

 # length and order, which makes it ideal for translation between two

 # languages.

 #

 # Consider the sentence "Je ne suis pas le chat noir" → "I am not the

 # black cat". Most of the words in the input sentence have a direct

 # translation in the output sentence, but are in slightly different

 # orders, e.g. "chat noir" and "black cat". Because of the "ne/pas"

 # construction there is also one more word in the input sentence. It would

 # be difficult to produce a correct translation directly from the sequence

 # of input words.

 #

 # With a seq2seq model the encoder creates a single vector which, in the

 # ideal case, encodes the "meaning" of the input sequence into a single

 # vector — a single point in some N dimensional space of sentences.

 #

 ######################################################################

 # The Encoder

 # -----------

 #

 # The encoder of a seq2seq network is a RNN that outputs some value for

 # every word from the input sentence. For every input word the encoder

 # outputs a vector and a hidden state, and uses the hidden state for the

 # next input word.

 #

 # .. figure:: /_static/img/seq-seq-images/encoder-network.png

 #    :alt:

 #

 #

 class EncoderRNN(nn.Module):

     def __init__(self, input_size, hidden_size):

         super(EncoderRNN, self).__init__()

         self.hidden_size = hidden_size

         # 定义隐藏层

         self.embedding = nn.Embedding(input_size, hidden_size)

         # word embedding的定义可以这么理解，例如nn.Embedding(2, 4)

         # 2表示有2个词，4表示4维度，其实也就是一个2x4的矩阵，

         # 如果有100个词，每个词10维，就可以写为nn.Embedding(100, 10)

         # 注意这里的词向量的建立只是初始的词向量，并没有经过任何修改优化

         # 需要建立神经网络通过learning的办法修改word embedding里面的参数

         # 使得word embedding每一个词向量能够表示每一个不同的词。

         self.gru = nn.GRU(hidden_size, hidden_size)  # 用到了上面提到的GRU模型

     def forward(self, input, hidden):

         embedded = self.embedding(input).view(1, 1, -1)  # -1是指自适应，view相当于reshape函数

         output = embedded

         output, hidden = self.gru(output, hidden)

         return output, hidden

     def initHidden(self):  # 初始化

         return torch.zeros(1, 1, self.hidden_size, device=device)

 ######################################################################

 # The Decoder

 # -----------

 #

 # The decoder is another RNN that takes the encoder output vector(s) and

 # outputs a sequence of words to create the translation.

 #

 ######################################################################

 # Simple Decoder

 # ^^^^^^^^^^^^^^

 #

 # In the simplest seq2seq decoder we use only last output of the encoder.

 # This last output is sometimes called the *context vector* as it encodes

 # context from the entire sequence. This context vector is used as the

 # initial hidden state of the decoder.

 #

 # At every step of decoding, the decoder is given an input token and

 # hidden state. The initial input token is the start-of-string ``<SOS>``

 # token, and the first hidden state is the context vector (the encoder's

 # last hidden state).

 #

 # .. figure:: /_static/img/seq-seq-images/decoder-network.png

 #    :alt:

 #

 #

 class DecoderRNN(nn.Module):

     # DecoderRNN与encoderRNN结构类似，结合图片即可搞清逻辑

     def __init__(self, hidden_size, output_size):

         super(DecoderRNN, self).__init__()

         self.hidden_size = hidden_size

         self.embedding = nn.Embedding(output_size, hidden_size)

         self.gru = nn.GRU(hidden_size, hidden_size)

         self.out = nn.Linear(hidden_size, output_size)

         self.softmax = nn.LogSoftmax(dim=1)

     def forward(self, input, hidden):

         output = self.embedding(input).view(1, 1, -1)  # -1是指自适应，view相当于reshape函数

         output = F.relu(output)

         output, hidden = self.gru(output, hidden)  # 此处使用gru神经网络

         # 对上述结果使用softmax,就是图片中左边倒数第二个

         output = self.softmax(self.out(output[0]))

         return output, hidden

     def initHidden(self):

         return torch.zeros(1, 1, self.hidden_size, device=device)

 ######################################################################

 # I encourage you to train and observe the results of this model, but to

 # save space we'll be going straight for the gold and introducing the

 # Attention Mechanism.

 #

 ######################################################################

 # Attention Decoder

 # ^^^^^^^^^^^^^^^^^

 #

 # If only the context vector is passed betweeen the encoder and decoder,

 # that single vector carries the burden of encoding the entire sentence.

 #

 # Attention allows the decoder network to "focus" on a different part of

 # the encoder's outputs for every step of the decoder's own outputs. First

 # we calculate a set of *attention weights*. These will be multiplied by

 # the encoder output vectors to create a weighted combination. The result

 # (called ``attn_applied`` in the code) should contain information about

 # that specific part of the input sequence, and thus help the decoder

 # choose the right output words.

 #

 # .. figure:: https://i.imgur.com/1152PYf.png

 #    :alt:

 #

 # Calculating the attention weights is done with another feed-forward

 # layer ``attn``, using the decoder's input and hidden state as inputs.

 # Because there are sentences of all sizes in the training data, to

 # actually create and train this layer we have to choose a maximum

 # sentence length (input length, for encoder outputs) that it can apply

 # to. Sentences of the maximum length will use all the attention weights,

 # while shorter sentences will only use the first few.

 #

 # .. figure:: /_static/img/seq-seq-images/attention-decoder-network.png

 #    :alt:

 #

 #

 class AttnDecoderRNN(nn.Module):

     def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):

         super(AttnDecoderRNN, self).__init__()

         self.hidden_size = hidden_size

         self.output_size = output_size

         self.dropout_p = dropout_p

         self.max_length = max_length

         self.embedding = nn.Embedding(self.output_size, self.hidden_size)

         self.attn = nn.Linear(self.hidden_size * 2, self.max_length)

         self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)

         self.dropout = nn.Dropout(self.dropout_p)

         self.gru = nn.GRU(self.hidden_size, self.hidden_size)

         self.out = nn.Linear(self.hidden_size, self.output_size)

     def forward(self, input, hidden, encoder_outputs):

         # 对于输入的input内容进行embedding和dropout操作

         # dropout是指随机丢弃一些神经元

         embedded = self.embedding(input).view(1, 1, -1)

         embedded = self.dropout(embedded)

         # 此处相当于学出来了attention的权重

         # 需要注意的是torch的concatenate函数是torch.cat，是在已有的维度上拼接，

         # 而stack是建立一个新的维度，然后再在该纬度上进行拼接。

         attn_weights = F.softmax(

             self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)

         # 将attention权重作用在encoder_outputs上

         # 对存储在两个批batch1和batch2内的矩阵进行批矩阵乘操作。

         # batch1和 batch2都为包含相同数量矩阵的3维张量。

         # 如果batch1是形为b×n×m的张量，batch1是形为b×m×p的张量，

         # 则out和mat的形状都是n×p

         attn_applied = torch.bmm(attn_weights.unsqueeze(0),

                                  encoder_outputs.unsqueeze(0))

         # 拼接操作，将embedded和attn_Applied拼接起来

         output = torch.cat((embedded[0], attn_applied[0]), 1)

         # 返回一个新的张量，对输入的制定位置插入维度 1

         output = self.attn_combine(output).unsqueeze(0)

         output = F.relu(output)

         output, hidden = self.gru(output, hidden)

         output = F.log_softmax(self.out(output[0]), dim=1)

         return output, hidden, attn_weights

     def initHidden(self):

         return torch.zeros(1, 1, self.hidden_size, device=device)

 ######################################################################

 # .. note:: There are other forms of attention that work around the length

 #   limitation by using a relative position approach. Read about "local

 #   attention" in `Effective Approaches to Attention-based Neural Machine

 #   Translation <https://arxiv.org/abs/1508.04025>`__.

 #

 # Training

 # ========

 #

 # Preparing Training Data

 # -----------------------

 #

 # To train, for each pair we will need an input tensor (indexes of the

 # words in the input sentence) and target tensor (indexes of the words in

 # the target sentence). While creating these vectors we will append the

 # EOS token to both sequences.

 #

 def indexesFromSentence(lang, sentence):

     return [lang.word2index[word] for word in sentence.split(' ')]

 def tensorFromSentence(lang, sentence):

     # 获得词的索引

     indexes = indexesFromSentence(lang, sentence)

     # 将EOS标记添加到两个序列中

     indexes.append(EOS_token)

     return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

 def tensorsFromPair(pair):

     # 每一对为需要输入的张量（输入句子中的词的索引）和目标张量

     # （目标语句中的词的索引）

     input_tensor = tensorFromSentence(input_lang, pair[0])

     target_tensor = tensorFromSentence(output_lang, pair[1])

     return (input_tensor, target_tensor)

 ######################################################################

 # Training the Model

 # ------------------

 #

 # To train we run the input sentence through the encoder, and keep track

 # of every output and the latest hidden state. Then the decoder is given

 # the ``<SOS>`` token as its first input, and the last hidden state of the

 # encoder as its first hidden state.

 #

 # "Teacher forcing" is the concept of using the real target outputs as

 # each next input, instead of using the decoder's guess as the next input.

 # Using teacher forcing causes it to converge faster but `when the trained

 # network is exploited, it may exhibit

 # instability <http://minds.jacobs-university.de/sites/default/files/uploads/papers/ESNTutorialRev.pdf>`__.

 #

 # You can observe outputs of teacher-forced networks that read with

 # coherent grammar but wander far from the correct translation -

 # intuitively it has learned to represent the output grammar and can "pick

 # up" the meaning once the teacher tells it the first few words, but it

 # has not properly learned how to create the sentence from the translation

 # in the first place.

 #

 # Because of the freedom PyTorch's autograd gives us, we can randomly

 # choose to use teacher forcing or not with a simple if statement. Turn

 # ``teacher_forcing_ratio`` up to use more of it.

 #

 teacher_forcing_ratio = 0.5

 # teacher forcing即指使用教师强迫其能够更快的收敛

 # 不过当训练好的网络被利用时，容易表现出不稳定性

 # teacher_forcing_ratio即指教师训练比率

 # 用于训练的函数

 def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion,

           max_length=MAX_LENGTH):

     # encoder即指EncoderRNN(input_lang.n_words, hidden_size)

     # attn_decoder即指 AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1)

     # hidden=256

     encoder_hidden = encoder.initHidden()

     # encoder_optimizer 即指optim.SGD(encoder.parameters(), lr=learning_rate)

     # decoder_optimizer 即指optim.SGD(decoder.parameters(), lr=learning_rate)

     # nn.Parameter()是Variable的一种，常被用于模块参数(module parameter)。

     # Parameters 是 Variable 的子类。Paramenters和Modules一起使用的时候会有一些特殊的属性，

     # 即：当Paramenters赋值给Module的属性的时候，他会自动的被加到 Module的 参数列表中

     # (即：会出现在 parameters() 迭代器中)。将Varibale赋值给Module属性则不会有这样的影响。

     # 这样做的原因是：我们有时候会需要缓存一些临时的状态(state), 比如：模型中RNN的最后一个隐状态。

     # 如果没有Parameter这个类的话，那么这些临时变量也会注册成为模型变量。

     encoder_optimizer.zero_grad()

     decoder_optimizer.zero_grad()

     # 得到长度

     input_length = input_tensor.size(0)

     target_length = target_tensor.size(0)

     # 初始化outour值

     encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

     loss = 0

     # 以下循环是学习过程

     for ei in range(input_length):

         encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)

         encoder_outputs[ei] = encoder_output[0, 0]  # 这里为什么取 0,0

     # 定义decoder的Input值

     decoder_input = torch.tensor([[SOS_token]], device=device)

     decoder_hidden = encoder_hidden

     use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

     if use_teacher_forcing:

         # Teacher forcing: Feed the target as the next input

         # 教师强制: 将目标作为下一个输入

         # 你观察教师强迫网络的输出,这些网络是用连贯的语法阅读的,但却远离了正确的翻译 -

         # 直观地来看它已经学会了代表输出语法,并且一旦老师告诉它前几个单词,就可以"拾取"它的意思,

         # 但它没有适当地学会如何从翻译中创建句子.

         for di in range(target_length):

             # 通过decoder得到输出值

             decoder_output, decoder_hidden, decoder_attention = decoder(

                 decoder_input, decoder_hidden, encoder_outputs)

             # 定义损失函数并计算

             loss += criterion(decoder_output, target_tensor[di])

             decoder_input = target_tensor[di]  # Teacher forcing

     else:

         # Without teacher forcing: use its own predictions as the next input

         # 没有教师强迫: 使用自己的预测作为下一个输入

         for di in range(target_length):

             # 通过decoder得到输出值

             decoder_output, decoder_hidden, decoder_attention = decoder(

                 decoder_input, decoder_hidden, encoder_outputs)

             # topk：第k个最小元素,返回第k个最小元素

             # 返回前k个最大元素，注意是前k个，largest=False，返回前k个最小元素

             # 此函数的功能是求取1-D 或N-D Tensor的最低维度的前k个最大的值，返回值为两个Tuple

             # 其中values是前k个最大值的Tuple，indices是对应的下标，默认返回结果是从大到小排序的。

             topv, topi = decoder_output.topk(1)

             decoder_input = topi.squeeze().detach()  # detach from history as input

             loss += criterion(decoder_output, target_tensor[di])

             if decoder_input.item() == EOS_token:

                 break

     # 反向传播

     loss.backward()

     # 更新参数

     encoder_optimizer.step()

     decoder_optimizer.step()

     return loss.item() / target_length

 ######################################################################

 # This is a helper function to print time elapsed and estimated time

 # remaining given the current time and progress %.

 #

 import time

 import math

 # 根据当前时间和进度百分比,这是一个帮助功能,用于打印经过的时间和估计的剩余时间.

 def asMinutes(s):

     m = math.floor(s / 60)

     s -= m * 60

     return '%dm %ds' % (m, s)

 def timeSince(since, percent):

     now = time.time()

     s = now - since

     es = s / (percent)

     rs = es - s

     return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

 ######################################################################

 # The whole training process looks like this:

 #

 # -  Start a timer

 # -  Initialize optimizers and criterion

 # -  Create set of training pairs

 # -  Start empty losses array for plotting

 #

 # Then we call ``train`` many times and occasionally print the progress (%

 # of examples, time so far, estimated time) and average loss.

 #

 def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):

     start = time.time()

     plot_losses = []

     print_loss_total = 0  # Reset every print_every

     plot_loss_total = 0  # Reset every plot_every

     encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)

     decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)

     # 获取训练的一对样本

     training_pairs = [tensorsFromPair(random.choice(pairs))

                       for i in range(n_iters)]

     # 定义出的损失函数

     criterion = nn.NLLLoss()

     for iter in range(1, n_iters + 1):

         training_pair = training_pairs[iter - 1]

         input_tensor = training_pair[0]

         target_tensor = training_pair[1]

         # 训练的过程并用于当损失函数

         loss = train(input_tensor, target_tensor, encoder,

                      decoder, encoder_optimizer, decoder_optimizer, criterion)

         print_loss_total += loss

         plot_loss_total += loss

         if iter % print_every == 0:

             print_loss_avg = print_loss_total / print_every

             print_loss_total = 0

             # 打印进度(样本的百分比,到目前为止的时间,估计的时间)和平均损失.

             print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),

                                          iter, iter / n_iters * 100, print_loss_avg))

         if iter % plot_every == 0:

             plot_loss_avg = plot_loss_total / plot_every

             plot_losses.append(plot_loss_avg)

             plot_loss_total = 0

     # 绘制图像

     showPlot(plot_losses)

 ######################################################################

 # Plotting results

 # ----------------

 #

 # Plotting is done with matplotlib, using the array of loss values

 # ``plot_losses`` saved while training.

 #

 import matplotlib.pyplot as plt

 plt.switch_backend('agg')

 import matplotlib.ticker as ticker

 import numpy as np

 # 使用matplotlib进行绘图，使用训练时保存的损失值plot_losses数组.

 def showPlot(points):

     plt.figure()

     fig, ax = plt.subplots()

     # this locator puts ticks at regular intervals

     # 这个定位器会定期发出提示信息

     loc = ticker.MultipleLocator(base=0.2)

     ax.yaxis.set_major_locator(loc)

     plt.plot(points)

 ######################################################################

 # Evaluation

 # ==========

 #

 # Evaluation is mostly the same as training, but there are no targets so

 # we simply feed the decoder's predictions back to itself for each step.

 # Every time it predicts a word we add it to the output string, and if it

 # predicts the EOS token we stop there. We also store the decoder's

 # attention outputs for display later.

 #

 def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):

     with torch.no_grad():

         # 从sentence中得到对应的变量

         input_tensor = tensorFromSentence(input_lang, sentence)

         # 长度

         input_length = input_tensor.size()[0]

         # encoder即指EncoderRNN(input_lang.n_words, hidden_size)

         # attn_decoder即指 AttnDecoderRNN(hidden_size,

         # output_lang.n_words, dropout_p=0.1)

         # hidden=256

         encoder_hidden = encoder.initHidden()

         # 初始化outputs值

         encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

         # 以下是学习过程

         for ei in range(input_length):

             encoder_output, encoder_hidden = encoder(input_tensor[ei],

                                                      encoder_hidden)

             encoder_outputs[ei] += encoder_output[0, 0]

         # 定义好decoder部分的input值

         decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

         # 设置好隐藏层

         decoder_hidden = encoder_hidden

         decoded_words = []

         decoder_attentions = torch.zeros(max_length, max_length)

         for di in range(max_length):

             # 得到结果

             decoder_output, decoder_hidden, decoder_attention = decoder(decoder_input, decoder_hidden, encoder_outputs)

             # attention部分的数据

             decoder_attentions[di] = decoder_attention.data

             # 选择output中的第一个值

             topv, topi = decoder_output.data.topk(1)

             if topi.item() == EOS_token:

                 decoded_words.append('<EOS>')

                 break

             else:

                 decoded_words.append(output_lang.index2word[topi.item()])  # 将output_lang添加到decoded

             decoder_input = topi.squeeze().detach()

         return decoded_words, decoder_attentions[:di + 1]

 ######################################################################

 # We can evaluate random sentences from the training set and print out the

 # input, target, and output to make some subjective quality judgements:

 #

 # 从训练集中评估随机的句子并打印出输入,目标和输出以作出一些主观质量判断

 def evaluateRandomly(encoder, decoder, n=10):

     for i in range(n):

         pair = random.choice(pairs)

         print('>', pair[0])

         print('=', pair[1])

         output_words, attentions = evaluate(encoder, decoder, pair[0])

         output_sentence = ' '.join(output_words)

         print('<', output_sentence)

         print('')

 ######################################################################

 # Training and Evaluating

 # =======================

 #

 # With all these helper functions in place (it looks like extra work, but

 # it makes it easier to run multiple experiments) we can actually

 # initialize a network and start training.

 #

 # Remember that the input sentences were heavily filtered. For this small

 # dataset we can use relatively small networks of 256 hidden nodes and a

 # single GRU layer. After about 40 minutes on a MacBook CPU we'll get some

 # reasonable results.

 #

 # .. Note::

 #    If you run this notebook you can train, interrupt the kernel,

 #    evaluate, and continue training later. Comment out the lines where the

 #    encoder and decoder are initialized and run ``trainIters`` again.

 #

 hidden_size = 256

 # 编码部分

 encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)

 # 加入了attention机制的解码部分

 attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)

 # 训练部分

 trainIters(encoder1, attn_decoder1, 75000, print_every=5000)

 ######################################################################

 # 随机生成一组结果

 evaluateRandomly(encoder1, attn_decoder1)

 ######################################################################

 # Visualizing Attention

 # ---------------------

 #

 # A useful property of the attention mechanism is its highly interpretable

 # outputs. Because it is used to weight specific encoder outputs of the

 # input sequence, we can imagine looking where the network is focused most

 # at each time step.

 #

 # You could simply run ``plt.matshow(attentions)`` to see attention output

 # displayed as a matrix, with the columns being input steps and rows being

 # output steps:

 #

 output_words, attentions = evaluate(encoder1, attn_decoder1, "je suis trop froid .")

 plt.matshow(attentions.numpy())

 ######################################################################

 # For a better viewing experience we will do the extra work of adding axes

 # and labels:

 def showAttention(input_sentence, output_words, attentions):

     # Set up figure with colorbar

     fig = plt.figure()

     ax = fig.add_subplot(111)

     cax = ax.matshow(attentions.numpy(), cmap='bone')

     fig.colorbar(cax)

     # Set up axes

     ax.set_xticklabels([''] + input_sentence.split(' ') +

                        ['<EOS>'], rotation=90)

     ax.set_yticklabels([''] + output_words)

     # Show label at every tick

     ax.xaxis.set_major_locator(ticker.MultipleLocator(1))

     ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

     plt.show()

 def evaluateAndShowAttention(input_sentence):

     output_words, attentions = evaluate(

         encoder1, attn_decoder1, input_sentence)

     print('input =', input_sentence)

     print('output =', ' '.join(output_words))

     showAttention(input_sentence, output_words, attentions)

 evaluateAndShowAttention("elle a cinq ans de moins que moi .")

 evaluateAndShowAttention("elle est trop petit .")

 evaluateAndShowAttention("je ne crains pas de mourir .")

 evaluateAndShowAttention("c est un jeune directeur plein de talent .")

 ######################################################################

 # Exercises

 # =========

 #

 # -  Try with a different dataset

 #

 #    -  Another language pair

 #    -  Human → Machine (e.g. IOT commands)

 #    -  Chat → Response

 #    -  Question → Answer

 #

 # -  Replace the embeddings with pre-trained word embeddings such as word2vec or

 #    GloVe

 # -  Try with more layers, more hidden units, and more sentences. Compare

 #    the training time and results.

 # -  If you use a translation file where pairs have two of the same phrase

 #    (``I am test \t I am test``), you can use this as an autoencoder. Try

 #    this:

 #

 #    -  Train as an autoencoder

 #    -  Save only the Encoder network

 #    -  Train a new Decoder for translation from there

 #
秒客网

pytorch做seq2seq注意力模型的翻译

相关文章