Learning TensorFlow's LSTM RNN Example

16 Nov 2016

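A few days ago I wrote up the Embeddings example, and working through every detail taught me a lot. So here is the next tutorial, on LSTM.

RNNs are a wonderful technique; perhaps they have already shown us something about what it means to be "alive". I have tried to learn RNNs several times, including in an earlier note of mine, so let's go straight to reading the code.

Reading the example program:

1. Import libraries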

# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import os
import numpy as np
import random
import string
import tensorflow as tf
import zipfile
from six.moves import range
from six.moves.urllib.request import urlretrieve

2. Download the data

Next, download the data. If you already downloaded it earlier, just copy text8.zip over and use it.

url = 'http://mattmahoney.net/dc/'

def maybe_download(filename, expected_bytes):
  """Download a file if not present, and make sure it's the right size."""
  if not os.path.exists(filename):
    filename, _ = urlretrieve(url + filename, filename)
  statinfo = os.stat(filename)
  if statinfo.st_size == expected_bytes:
    print('Found and verified %s' % filename)
  else:
    print(statinfo.st_size)
    raise Exception(
      'Failed to verify ' + filename + '. Can you get to it with a browser?')
  return filename

filename = maybe_download('text8.zip', 31344016)

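3. Read the text

Reading the file is slightly different this time: instead of turning it into a list, it is read directly into a single string, because the later steps work on string data.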

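def read_data(filename):
  # Use a context manager so the archive is closed even though we return early.
  with zipfile.ZipFile(filename) as f:
    for name in f.namelist():
      # Return the contents of the first file in the archive as one string.
      return tf.compat.as_str(f.read(name))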

text = read_data(filename)
print('Data size %d' % len(text))

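4. Split into training and validation sets

Split the text: keep the first 1000 characters for validation and use the remaining 99999000 characters for training.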

valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])

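5. Two utility functions

Define two functions, char2id and id2char, that map between characters and numbers.

This program only considers the 26 lowercase letters plus the space character; every other character is treated as a space. So the two functions can compute the id or the character directly by adding or subtracting ASCII codes.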

vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0

def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '

print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))
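
As a quick illustration of the mapping, here is a minimal sketch that encodes a short string and decodes it again. The two functions are condensed copies of the ones above (without the warning print); the sample string 'hello world' is only an example.

```python
import string

first_letter = ord(string.ascii_lowercase[0])

def char2id(char):
    """Map 'a'..'z' to 1..26; a space (or anything else) maps to 0."""
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    return 0

def id2char(dictid):
    """Inverse of char2id: 0 becomes a space, 1..26 become 'a'..'z'."""
    return chr(dictid + first_letter - 1) if dictid > 0 else ' '

# Encode a sample string to ids, then decode the ids back to text.
ids = [char2id(c) for c in 'hello world']
decoded = ''.join(id2char(i) for i in ids)
print(ids)
print(decoded)
```

Because unknown characters collapse to id 0, the round trip is only lossless for lowercase letters and spaces.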

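6. The function that generates training batches

This BatchGenerator is written more carefully than the one from a couple of days ago: it keeps track of position in member variables rather than in global variables.

Calling BatchGenerator.next() returns one batch of substrings for training.

batch_size is the number of substrings per batch, and num_unrollings is the length of each substring. (In fact the last character from the previous call is prepended, so each substring is actually num_unrollings + 1 characters long.)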

batch_size=64
num_unrollings=10

class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [ offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()

  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=float)  # np.float was removed in NumPy 1.24
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch

  def next(self):
    """Generate the next array of batches from the data. The array consists of the last batch of the previous array, followed by num_unrollings new ones. """
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    self._last_batch = batches[-1]
    return batches

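Code written by an excellent programmer indeed; this class taught me something again!

At initialization it first divides the text into segments according to batch_size, and then sets up a group of cursors in _cursor: a whole list of them, not just one! It then stores _last_batch to remember where the previous fetch ended.

When strings are requested, the characters are fetched one batch at a time: each call to _next_batch reads one character at every cursor.

This way the whole text is sampled evenly, which avoids learning some regions in too much detail while never seeing others.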
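
The cursor idea can be sketched on a toy string. This is a simplified stand-in for BatchGenerator that returns plain characters instead of one-hot rows; the helper names make_cursors and next_chars and the 24-character text are made up for illustration.

```python
def make_cursors(text_size, batch_size):
    """Spread batch_size cursors evenly over the text, as BatchGenerator.__init__ does."""
    segment = text_size // batch_size
    return [offset * segment for offset in range(batch_size)]

def next_chars(text, cursors):
    """Read one character at each cursor, then advance every cursor by one (wrapping around)."""
    chars = [text[c] for c in cursors]
    for b in range(len(cursors)):
        cursors[b] = (cursors[b] + 1) % len(text)
    return chars

text = 'abcdefghijklmnopqrstuvwx'   # 24 characters of toy data
cursors = make_cursors(len(text), batch_size=4)
print(cursors)

# Each step yields one character per batch row; row b walks through its own segment.
steps = [next_chars(text, cursors) for _ in range(3)]
for chars in steps:
    print(chars)
```

Reading down one column of the printed steps gives consecutive characters from one segment, which is the substring that a single batch row sees.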

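It is worth noting that when preparing data for an RNN, it is very easy to get the structure of the fed data wrong. In the earlier blog post, many readers could not understand why transpose was used there. So this deserves a detailed record.

The data returned by BatchGenerator.next() is a list of length num_unrollings + 1. Each element is an array of shape (batch_size, 27), where 27 is vocabulary_size; each 27-dimensional vector represents one character in one-hot encoding.

So when this data is fed into the network, conceptually the first characters of the whole batch go in first, then the second characters of the same batch, then the third characters, and so on.

In other words, the figure below shows the real structure of an RNN: what we have to do is feed things in one at a time, in order. I have seen this figure called RNN-rolled:

[Figure: RNN-rolled]

In the rightward-unrolled RNN diagrams we usually see, the horizontal direction (which I drew with dashed lines) represents order in time (and also carries the memory data flow); its meaning differs from the vertical direction. Has anyone misread that row as things that can all be fed in at the same time? I have seen this figure called RNN-unrolled:

[Figure: RNN-unrolled]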

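7. Two more utility functions

Next, define two functions that convert training data back into displayable strings.

characters first converts a one-hot encoding back into ids, then uses id2char to turn them into characters.

batches2string turns the training data into displayable strings. The expert processes the data a whole batch at a time with fairly convoluted logic, rather than one item at a time the way ordinary people would; it left me a bit breathless and doubting my own IQ.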

def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]

def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s

train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))
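
The zip trick in batches2string may be easier to follow without the one-hot decoding. The sketch below uses lists of characters in place of the decoded batches; batches_to_strings and the toy data are illustrative names, not part of the original program.

```python
def batches_to_strings(char_batches):
    """Join time-major batches of characters into one string per batch row."""
    strings = [''] * len(char_batches[0])
    for chars in char_batches:
        # zip pairs each partial string with the next character of the same row.
        strings = [''.join(pair) for pair in zip(strings, chars)]
    return strings

# Three time steps, three batch rows: element [t][b] is row b's character at step t.
char_batches = [['a', 'g', 'm'], ['b', 'h', 'n'], ['c', 'i', 'o']]
rows = batches_to_strings(char_batches)
print(rows)
```

The data arrives grouped by time step, and the loop regroups it by batch row, which is effectively a transpose.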

8. Four more utility functions

These four functions are used for printing summaries during training.

def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized probabilities. """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1

def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=float)  # np.float was removed in NumPy 1.24
  p[0, sample_distribution(prediction[0])] = 1.0
  return p

def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b/np.sum(b, 1)[:,None]

logprob measures how well the predictions are doing.

First, recall cross_entropy, summed over the N rows of the batch (N is labels.shape[0]):

$$CrossEntropy = -\sum_{i}^{N} labels_i \cdot \log(predictions_i)$$

Then,

$$logprob = \frac{CrossEntropy}{N}$$
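
A small worked example may help here. The sketch below reimplements logprob (using np.maximum instead of in-place assignment, a minor change of mine) and applies it to two made-up rows where the true class always receives probability 0.5, so the result should equal log 2.

```python
import numpy as np

def logprob(predictions, labels):
    """Average negative log-probability assigned to the true labels."""
    predictions = np.maximum(predictions, 1e-10)  # avoid log(0) without mutating the input
    return np.sum(labels * -np.log(predictions)) / labels.shape[0]

# Two one-hot labels and two predicted distributions (made-up numbers).
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
predictions = np.array([[0.5, 0.25, 0.25],
                        [0.25, 0.5, 0.25]])

value = logprob(predictions, labels)
print(value, np.log(2.0))
```

Since the labels are one-hot, only the predicted probability of the correct character contributes to each row's term.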

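The last three functions, sample_distribution, sample, and random_distribution, are used together.

random_distribution generates an array of values drawn uniformly at random and normalized so that they sum to 1. I am not sure why it is written in such a fancy way, though; after experimenting for a while, b/np.sum(b, 1)[:,None] and b/np.sum(b) seem to give the same result here, since b has only one row.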
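
The two normalizations can be compared directly. In this sketch (toy arrays, helper names of my own choosing) they agree for a single row and differ once there is more than one row, which is presumably why the per-row form was used.

```python
import numpy as np

def normalize_rows(b):
    """Divide every row by its own sum, as b/np.sum(b, 1)[:,None] does."""
    return b / np.sum(b, 1)[:, None]

def normalize_all(b):
    """Divide every entry by the sum over the whole array."""
    return b / np.sum(b)

one_row = np.array([[1.0, 3.0]])
two_rows = np.array([[1.0, 3.0],
                     [2.0, 2.0]])

# Identical for one row ...
print(normalize_rows(one_row), normalize_all(one_row))
# ... but only the per-row version makes each row sum to 1 when there are several rows.
print(normalize_rows(two_rows))
print(normalize_all(two_rows))
```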

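sample relies on sample_distribution to pick one dimension at random, weighted by the probabilities in the prediction passed in, sets it to 1 and sets all the others to 0. In other words, it draws a random letter according to the predicted probabilities. (Why not just take the letter with the highest probability? Is all this complexity really worth it?)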
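
One plausible answer, which is my own reading rather than something the example states, is that always taking the most likely letter produces the same continuation every time, whereas sampling keeps the generated text varied. The sketch below contrasts the two on a made-up three-way distribution; the explicit rng argument and the fixed seed are additions for repeatability.

```python
import random
from collections import Counter

def sample_distribution(distribution, rng):
    """Pick index i with probability distribution[i] by walking the cumulative sum."""
    r = rng.uniform(0, 1)
    s = 0.0
    for i, p in enumerate(distribution):
        s += p
        if s >= r:
            return i
    return len(distribution) - 1

distribution = [0.5, 0.3, 0.2]
rng = random.Random(0)   # fixed seed, only to make the sketch repeatable

# Sampling visits every index roughly in proportion to its probability.
counts = Counter(sample_distribution(distribution, rng) for _ in range(10000))
# Greedy selection always returns the same index.
greedy = max(range(len(distribution)), key=lambda i: distribution[i])
print(counts, greedy)
```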

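9. Define the TensorFlow model

It has several parts: define the variables, define the LSTM cell, define the input interface, run the LSTM cell in a loop, define the loss, define the optimizer, and define the predictions.

num_nodes is the number of hidden units in the LSTM cell layer of this network.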

num_nodes = 64

graph = tf.Graph()
with graph.as_default():

1) Define the variables


  # Parameters:
  # Input gate: input, previous output, and bias.
  ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ib = tf.Variable(tf.zeros([1, num_nodes]))
  # Forget gate: input, previous output, and bias.
  fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  fb = tf.Variable(tf.zeros([1, num_nodes]))
  # Memory cell: input, state and bias. 
  cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  cb = tf.Variable(tf.zeros([1, num_nodes]))
  # Output gate: input, previous output, and bias.
  ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
  om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
  ob = tf.Variable(tf.zeros([1, num_nodes]))
  # Variables saving state across unrollings.
  saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
  # Classifier weights and biases.
  w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
  b = tf.Variable(tf.zeros([vocabulary_size]))

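The LSTM cell has three gates: input, output, and forget.

Memory cell: I did not yet know what this was when I wrote this note. Judging from the lstm_cell code, cx, cm, and cb compute the candidate update that gets written into the state.

saved_output is the output passed upward, and saved_state is the cell's own state memory.

w and b form a standard fully connected layer at the end, which turns the result into one of vocabulary_size classes.

2) Define the LSTM cell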

  # Definition of the cell computation.
  def lstm_cell(i, o, state):
    """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf Note that in this formulation, we omit the various connections between the previous state and the gates."""
    input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
    forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
    update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
    state = forget_gate * state + input_gate * tf.tanh(update)
    output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
    return output_gate * tf.tanh(state), state

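The LSTM cell defined here does not seem to be the kind we usually see; it follows the figure below (from http://arxiv.org/pdf/1402.1128v1.pdf):

[Figure: LSTM cell diagram from the paper]

The figure may be hard to understand at first glance, so I redrew it:

[Figure: my hand-drawn version of the LSTM cell]

Legend for my hand-drawn figure:

(1) ⊗ means two data sources are each multiplied by parameters and then added. ⊕ means two data sources are added.

(2) ⊗ with a decorative border means two data sources are multiplied and then passed through sigmoid.

(3) A circle containing g means tanh is applied.

(4) In State−1, the subscript -1 means this is the result from the previous iteration.

Recall that the sigmoid function produces a number in (0,1) and the tanh function produces a number in (-1,1).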
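
To see the cell computation outside TensorFlow, here is a minimal NumPy sketch of one step. It mirrors lstm_cell line by line; the parameter dictionary p, the uniform initialization in [-0.1, 0.1], and the fixed seed are assumptions made for illustration and are not how the graph initializes its variables.

```python
import numpy as np

def sigmoid(x):
    """Logistic function; squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(i, o, state, p):
    """One LSTM step; p maps the parameter names used in the graph to arrays."""
    input_gate = sigmoid(i @ p['ix'] + o @ p['im'] + p['ib'])
    forget_gate = sigmoid(i @ p['fx'] + o @ p['fm'] + p['fb'])
    update = i @ p['cx'] + o @ p['cm'] + p['cb']
    state = forget_gate * state + input_gate * np.tanh(update)
    output_gate = sigmoid(i @ p['ox'] + o @ p['om'] + p['ob'])
    return output_gate * np.tanh(state), state

vocabulary_size, num_nodes, batch_size = 27, 64, 1
rng = np.random.default_rng(0)
p = {}
for gate in 'ifco':  # input gate, forget gate, memory cell, output gate
    p[gate + 'x'] = rng.uniform(-0.1, 0.1, (vocabulary_size, num_nodes))
    p[gate + 'm'] = rng.uniform(-0.1, 0.1, (num_nodes, num_nodes))
    p[gate + 'b'] = np.zeros((1, num_nodes))

x = np.zeros((batch_size, vocabulary_size))
x[0, 1] = 1.0                                   # one-hot encoding of 'a'
output = np.zeros((batch_size, num_nodes))      # previous output starts at zero
state = np.zeros((batch_size, num_nodes))       # previous state starts at zero
output, state = lstm_cell(x, output, state, p)
print(output.shape, state.shape)
```

Because the output is a sigmoid gate times a tanh, every entry of output stays strictly between -1 and 1.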

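For comparison, here is the standard LSTM cell diagram that I consider the most perfectly drawn, from Colah's blog:

[Figure: standard LSTM cell from Colah's blog]

Legend for Colah's figure:

(1) A rectangle containing σ means two data sources are concatenated, multiplied by parameters, and then passed through sigmoid. (Hmm, there is a difference here: in the standard LSTM cell on Colah's blog, the inputs are concatenated first and then multiplied by parameters, whereas here each is multiplied by its own parameters and the results are added.)

(2) A rectangle containing tanh means two data sources are concatenated, multiplied by parameters, and then passed through tanh. (Same difference here.)

(3) An ellipse containing tanh means tanh is applied directly.

(4) ⊗ means two data sources are multiplied. ⊕ means two data sources are added.

(5) The two horizontal black arrows running from the past (−1) through the current cell to the future (+1): the upper one carries the state and the lower one carries the output.

So, as the paper points out, does the LSTM cell implemented here have more parameters and work better? That comparison is beyond my current understanding; I will look into it later.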
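
On the concatenate-versus-add difference noted in the legend, a small NumPy check suggests the two forms are algebraically the same when the single weight matrix is just the two separate matrices stacked, so this difference alone should not add parameters. The shapes and random values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, num_nodes = 27, 64
x = rng.normal(size=(1, vocabulary_size))           # current input
h = rng.normal(size=(1, num_nodes))                 # previous output
wx = rng.normal(size=(vocabulary_size, num_nodes))  # weights applied to the input
wm = rng.normal(size=(num_nodes, num_nodes))        # weights applied to the previous output

# Style used in this example: multiply separately, then add.
separate = x @ wx + h @ wm
# Style in Colah's diagram: concatenate first, then multiply by one stacked matrix.
stacked_weights = np.concatenate([wx, wm], axis=0)
stacked = np.concatenate([x, h], axis=1) @ stacked_weights

same = np.allclose(separate, stacked)
print(same, wx.size + wm.size, stacked_weights.size)
```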

3) Define the input interface

  # Input data.
  train_data = list()
  for _ in range(num_unrollings + 1):
    train_data.append(
      tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
  train_inputs = train_data[:num_unrollings]
  train_labels = train_data[1:]  # labels are inputs shifted by one time step.

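Here, too, a whole batch is processed at once. To make it easier to understand, though, let me assume batch_size=1 and suppose we want to train on the string abcdefg.

Then train_inputs is abcdef and train_labels is bcdefg.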
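
The one-step shift can be shown with plain characters standing in for the placeholders. This is only a sketch of the slicing; in the real graph each list element is a [batch_size, vocabulary_size] placeholder.

```python
text = 'abcdefg'
num_unrollings = len(text) - 1

train_data = list(text)                  # stands in for the num_unrollings + 1 placeholders
train_inputs = train_data[:num_unrollings]
train_labels = train_data[1:]            # labels are inputs shifted by one time step

# Each input character is paired with the character that follows it.
pairs = list(zip(train_inputs, train_labels))
print(pairs)
```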

4) Run the LSTM cell in a loop

  # Unrolled LSTM loop.
  outputs = list()
  output = saved_output
  state = saved_state
  for i in train_inputs:
    output, state = lstm_cell(i, output, state)
    outputs.append(output)

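As specified when the variables were defined, saved_output and saved_state are both initially all zeros.

Feed in a, b, c, d, e, f one after another, and collect each step's output into a list; that list is outputs.

5) Define the loss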

  # State saving across unrollings.
  with tf.control_dependencies([saved_output.assign(output),
                                saved_state.assign(state)]):
    # Classifier.
    logits = tf.nn.xw_plus_b(tf.concat(0, outputs), w, b)
    loss = tf.reduce_mean(
      tf.nn.softmax_cross_entropy_with_logits(
        logits, tf.concat(0, train_labels)))

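Because this is not a sequentially executed language, statements in the graph that do not depend on each other have no guaranteed execution order. control_dependencies establishes an order: it guarantees that the two assignments listed are executed before the code inside the block.

Here that means saved_output and saved_state are saved first, and only then are logits and loss computed. Otherwise, since the computation below does not reference saved_output or saved_state, without control_dependencies the two save operations would never be triggered by the optimization step.

tf.concat(0, values) concatenates values along dimension 0. outputs is a list in which each element is the LSTM output for one time step, a num_nodes-dimensional (64-dimensional) vector (still assuming batch_size=1); it is not yet a 27-dimensional character vector.

tf.concat joins these into a single matrix with one row per time step, which can be multiplied by w and added to b, that is, passed through the fully connected layer, to obtain logits with 27 values per row.

Note: in newer TensorFlow versions (this note was checked against 1.3.0; the change dates from 1.0) the argument order of concat is tf.concat(values, axis).

Then softmax_cross_entropy_with_logits compares the logits (the concatenated outputs after the fully connected layer) with the concatenated train_labels to obtain the loss.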
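
The shapes involved can be checked with NumPy stand-ins for the tensors. The zero-filled arrays below are placeholders of my own, used only to trace shapes with the batch_size=64, num_unrollings=10 settings from this example.

```python
import numpy as np

num_unrollings, batch_size, num_nodes, vocabulary_size = 10, 64, 64, 27

# One [batch_size, num_nodes] output and one [batch_size, vocabulary_size] label per time step.
outputs = [np.zeros((batch_size, num_nodes)) for _ in range(num_unrollings)]
train_labels = [np.zeros((batch_size, vocabulary_size)) for _ in range(num_unrollings)]
w = np.zeros((num_nodes, vocabulary_size))
b = np.zeros(vocabulary_size)

# Concatenating along axis 0 stacks the time steps on top of each other.
stacked_outputs = np.concatenate(outputs, axis=0)
logits = stacked_outputs @ w + b                      # same arithmetic as xw_plus_b
stacked_labels = np.concatenate(train_labels, axis=0)
print(stacked_outputs.shape, logits.shape, stacked_labels.shape)
```

Outputs and labels are concatenated in the same order, so row k of logits lines up with row k of the labels.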

6) Define the optimizer

  # Optimizer.
  global_step = tf.Variable(0)
  learning_rate = tf.train.exponential_decay(
    10.0, global_step, 5000, 0.1, staircase=True)
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  gradients, v = zip(*optimizer.compute_gradients(loss))
  gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
  optimizer = optimizer.apply_gradients(
    zip(gradients, v), global_step=global_step)

tf.train.exponential_decay implements exponential decay of learning_rate: the further training goes, the smaller learning_rate becomes. (It relies on global_step being updated later.)
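
The schedule can be reproduced in plain Python from the documented formula, learning_rate * decay_rate ^ (global_step / decay_steps), with integer division when staircase=True. The function below is a stand-in written for illustration, not TensorFlow's implementation.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate, staircase=False):
    """Plain-Python version of the decay formula documented for tf.train.exponential_decay."""
    exponent = global_step // decay_steps if staircase else global_step / decay_steps
    return learning_rate * decay_rate ** exponent

# Same arguments as in the graph: start at 10.0, multiply by 0.1 every 5000 steps.
rates = {step: exponential_decay(10.0, step, 5000, 0.1, staircase=True)
         for step in (0, 4999, 5000, 7000)}
print(rates)
```

With num_steps = 7001, the rate therefore drops exactly once during training, at step 5000.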

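optimizer is defined to use standard gradient descent. Every optimizer has a few standard interfaces. The one we have usually used before is minimize, which automatically adjusts the trainable Variables in the whole Graph to try to minimize the loss. In fact minimize is just two steps combined: compute_gradients and apply_gradients. First compute the gradients, then subtract the gradients, scaled by the learning rate, from the parameters. Here the two steps are separated so that the gradients can be processed before they are applied; the TensorFlow documentation for compute_gradients explains this in detail.

compute_gradients returns a list of (gradient, variable) pairs, giving the gradient for each trainable variable.

clip_by_global_norm keeps the gradients from growing too large and causing the exploding gradients problem. The video has a figure illustrating this:

[Figure: gradient clipping illustration from the video]

Concretely, clip_by_global_norm first computes global_norm, the L2 norm of all the gradient tensors taken together. If this norm is greater than the 1.25 used here, the result is gradients * 1.25 / global_norm; otherwise the gradients are left unchanged.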
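
The clipping rule just described can be sketched in NumPy. This is an illustrative reimplementation with made-up gradients, chosen so that the global norm comes out to 5.

```python
import numpy as np

def clip_by_global_norm(gradients, clip_norm):
    """Scale all gradients by one shared factor so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    # The factor is 1 when global_norm <= clip_norm, so small gradients pass through unchanged.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in gradients], global_norm

# Two gradient tensors of different shapes; their joint norm is sqrt(3^2 + 4^2) = 5.
gradients = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, global_norm = clip_by_global_norm(gradients, 1.25)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(global_norm, clipped_norm)
```

Because every tensor is scaled by the same factor, the direction of the overall gradient is preserved and only its length changes.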

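Finally, apply_gradients. The global_step passed in here gets modified: it is incremented by one on every call, so the next time learning_rate is computed it uses the new global_step value.

7) Define the predictions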

  # Predictions.
  train_prediction = tf.nn.softmax(logits)

  # Sampling and validation eval: batch 1, no unrolling.
  sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
  saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
  saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
  reset_sample_state = tf.group(
    saved_sample_output.assign(tf.zeros([1, num_nodes])),
    saved_sample_state.assign(tf.zeros([1, num_nodes])))
  sample_output, sample_state = lstm_cell(
    sample_input, saved_sample_output, saved_sample_state)
  with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                saved_sample_state.assign(sample_state)]):
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))

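sample_input is a single 1-hot encoded character.

Set up an initial state and output, pass the input through the same LSTM cell, and obtain the prediction for the next character, sample_prediction.

10. Start training

1) Training

Note that the number of characters fed in here is exactly num_unrollings + 1, which matches the string length returned by BatchGenerator.next() and also matches the one-character offset between train_inputs and train_labels in the model definition.

mean_loss accumulates the loss of each step for later printing. (I would still suggest calling it subtotal_loss.)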

num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.initialize_all_variables().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l

2) Print a summary periodically

Why compute this with numpy rather than with TensorFlow? That seems odd. Let's take a closer look.

    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last few batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))

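At every step that is a multiple of summary_frequency, print the average loss and the learning_rate. With the exponential_decay settings above (staircase=True, decaying by a factor of 0.1 every 5000 steps), learning_rate is 10.0 for steps 0 to 4999 and 1.0 from step 5000 on; it is not affected by gradient clipping. Then compute the perplexity on this part of the training set.

At every step that is a multiple of summary_frequency * 10, try to print some generated text.

Here it produces 5 sentences of 80 characters each.

First, a random character is drawn and used as the first character of sentence.

Then reset_sample_state is run to make sure the initial state and output are both set to 0.

Next, the first character is passed in as input to get prediction, the predicted probabilities for the next character. sample collapses these into one definite character, feed, which is appended to sentence and passed to the model as the next input.

This yields one 80-character sentence. Repeating the process 5 times gives 5 sentences.

After that, again at every multiple of summary_frequency (not well organized; related code really should be kept together), valid_text is used to compute the average validation set perplexity.

According to information theory (see the Wikipedia definition of perplexity), perplexity and cross_entropy are related as follows:

$$perplexity = e^{cross\_entropy}$$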
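
A handy sanity check on this relation: a model that guesses uniformly over the 27 characters should have a perplexity of about 27. The sketch below works that out with the standard library only.

```python
import math

vocabulary_size = 27
uniform = [1.0 / vocabulary_size] * vocabulary_size

# Cross-entropy of a one-hot label against the uniform guess is -log(1/27).
cross_entropy = -math.log(uniform[0])
perplexity = math.exp(cross_entropy)
print(perplexity)
```

So a perplexity well below 27 in the training log means the model is doing better than blind guessing.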

The end

Thanks for reading; comments are welcome.

Original post: https://liusida.github.io/2016/11/16/study-lstm/
