[Paper notes] Attention Summary, Part 2: the essential idea of attention + hard/soft/global/local attention

Papers covered:

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (uses hard/soft attention)

Effective Approaches to Attention-based Neural Machine Translation (proposes global/local attention)

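Articles referenced in this post:

"Attention, Part 2"
"Five attention model methods you need to know, and their applications"
"A survey of attention model methods"
"Attention mechanism paper reading: global attention and local attention"
"Global Attention / Local Attention"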

Outline of this post

The essential idea of the attention mechanism
Summary of the attention variants (hard/soft/global/local attention)
Other attention-related topics

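1 The essential idea of the attention mechanism

For the essential idea, see this article, which also covers self-attention.
Simply put, attention operates on a (query, key, value) triple: the query is compared with each key to produce weights, and the weights are used to sum the values. In machine translation the key and the value are the same.
PS: for the basic idea of attention as applied in NMT, see the paper summary "Attention Summary, Part 1".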
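
As a rough illustration of the (query, key, value) view, here is a minimal NumPy sketch. It assumes dot-product similarity and a softmax over the scores; the function name `attend` and the toy shapes are made up for this example and do not come from the papers. Because the keys and values coincide in machine translation, the same array of encoder states is passed for both.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(query, keys, values):
    """Return (context, weights) for a single query.

    query:  (d,)    e.g. the current decoder state
    keys:   (n, d)  one key per source position
    values: (n, dv) one value per source position
    """
    scores = keys @ query        # (n,) similarity of the query to each key
    weights = softmax(scores)    # (n,) non-negative, sums to 1
    context = weights @ values   # (dv,) weighted sum of the values
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))   # toy source sentence of length 5
decoder_state = rng.normal(size=4)

# In NMT the keys and the values are both the encoder states.
context, weights = attend(decoder_state, encoder_states, encoder_states)
```

The returned `weights` say how much each source position contributes, and `context` is the resulting summary vector.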

2 Types of attention

The attention variants covered here:

hard attention
soft attention
global attention
local attention
self-attention: target = source -> multi-head attention (left for Attention Summary, Part 3)

2.1 hard attention

Paper: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
Notes based on: "A survey of attention model methods"

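Soft attention keeps all components and takes a weighted combination of them; hard attention selects a subset of the components according to some policy, i.e. it attends to only part of the input.
Soft attention is differentiable, so it can be trained with ordinary backpropagation.

Characteristics of hard attention: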
the hard attention model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train

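Details

The encoder uses a CNN (VGGNet) to extract L vectors of dimension D from the image, a_i, i = 1, 2, …, L, each representing one part of the image.
The decoder is an LSTM whose input at each time step t has three parts: z_t, h_{t-1} and y_{t-1}, where z_t is computed from the a_i and the weights \alpha_{ti}.
The weights \alpha_{ti} are computed by the attention model f_att.
In the paper, f_att is a multilayer perceptron conditioned on the previous hidden state, followed by a softmax:
e_{ti} = f_att(a_i, h_{t-1})
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}
from which z_t can be computed:
z_t = \phi(\{a_i\}, \{\alpha_{ti}\})
There are two variants of the attention model f_att: stochastic attention and deterministic attention.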
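
The scoring and normalization steps described above can be sketched in NumPy as follows. This assumes a one-hidden-layer perceptron for f_att; the weight names `W_a`, `W_h`, `v` and all sizes are placeholders chosen for illustration, since these notes do not give the exact parameterization.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def f_att(a, h_prev, W_a, W_h, v):
    """Score every annotation vector a_i against the previous hidden state.

    a: (L, D) annotation vectors, h_prev: (H,) previous LSTM hidden state.
    Returns the unnormalized scores e_t with shape (L,).
    """
    hidden = np.tanh(a @ W_a + h_prev @ W_h)   # (L, K); the h_prev term broadcasts over L
    return hidden @ v                          # (L,)

rng = np.random.default_rng(0)
L, D, H, K = 6, 8, 5, 7                        # toy sizes (assumed)
a = rng.normal(size=(L, D))
h_prev = rng.normal(size=H)
W_a = rng.normal(size=(D, K))
W_h = rng.normal(size=(H, K))
v = rng.normal(size=K)

e_t = f_att(a, h_prev, W_a, W_h, v)            # scores e_{ti}
alpha_t = softmax(e_t)                         # attention weights alpha_{ti}
```

How `alpha_t` is turned into z_t is what distinguishes the stochastic and deterministic variants.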

2.1.2 Stochastic “Hard” Attention

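s_t is the index of the location that the attention focuses on at decoder time step t. s_{t,i}, i = 1, 2, …, L, indicates whether the attention at time t focuses on location i, so [s_{t,1}, s_{t,2}, …, s_{t,L}] is a one-hot encoding. Focusing on exactly one location at each step is where the name "hard" comes from.
The model generates the sequence y = (y_1, …, y_C) from a = (a_1, a_2, …, a_L). Here s = {s_1, s_2, …, s_C} is the sequence of focus locations along the time axis; in theory there are L^C such sequences.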
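
To make the one-hot selection concrete, here is a small NumPy sketch that samples one location from the attention weights and uses the corresponding annotation vector as z_t. The weights in `alpha_t` are made-up numbers, and drawing the location from a categorical distribution over the weights is my understanding of the stochastic variant, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 4, 3
a = rng.normal(size=(L, D))                 # annotation vectors a_1..a_L
alpha_t = np.array([0.1, 0.6, 0.2, 0.1])    # made-up attention weights, sum to 1

idx = rng.choice(L, p=alpha_t)              # sampled location index
s_t = np.zeros(L)
s_t[idx] = 1.0                              # one-hot indicator s_{t,i}

z_t = s_t @ a                               # equals a[idx]: exactly one location is used

# Number of possible focus sequences for a caption of length C
C = 5
num_sequences = L ** C
```

The count `num_sequences` grows exponentially in the caption length, which is why the sum over all s is approximated by sampling rather than computed exactly.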

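PS: the usual deep-learning approach is to study the objective function and then its gradient with respect to the parameters.
The objective is to maximize log p(y|a). Because s is latent and never observed explicitly, the objective is transformed with Jensen's inequality into a lower bound L_s:
L_s = \sum_s p(s|a) \log p(y|s,a) \le \log \sum_s p(s|a) p(y|s,a) = \log p(y|a)

L_s is then optimized in place of the original objective log p(y|a): the gradient is taken with respect to the model parameters W, and it is estimated by Monte Carlo sampling of s.
Further details involve reinforcement learning.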
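
For reference, the Monte Carlo estimate can be written out as below, where L_s denotes the lower bound on log p(y|a). This is my reading of the sampling-based gradient in the Show, Attend and Tell paper, so treat it as a sketch and check the paper for the exact form. N (the number of samples) and \tilde{s}^n (the n-th sampled location sequence) are symbols introduced here for illustration.

```latex
\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N}
  \left[ \frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W}
       + \log p(y \mid \tilde{s}^n, a)\,
         \frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W} \right],
\qquad \tilde{s}_t^n \sim \mathrm{Multinoulli}_L(\{\alpha_{ti}\})
```

The second term has the form of the REINFORCE rule, which is where the connection to reinforcement learning comes from.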

2.1.3 Deterministic “Soft” Attention

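The whole model is smooth and differentiable under the deterministic attention, so learning end-to-end is trivial by using standard backpropagation.
That is, the objective function (the LSTM's objective) is differentiable with respect to the weights \alpha_{ti}. The reason is simple: the objective is differentiable in z_t, and z_t is differentiable in \alpha_{ti}, so by the chain rule the objective is differentiable in \alpha_{ti}.

In hard attention, exactly one entry of [s_{t,1}, …, s_{t,L}] is 1 at each time step t and all others are 0, so only one location is attended to at a time. Soft attention instead takes every location into account at each step, just with different weights. z_t is the weighted sum of the a_i:
z_t = \sum_{i=1}^{L} \alpha_{ti} a_i

Refinement: a gating scalar \beta_t = \sigma(f_\beta(h_{t-1})) is added, giving z_t = \beta_t \sum_{i=1}^{L} \alpha_{ti} a_i.

It adjusts the weight of the context vector in the LSTM relative to h_{t-1} and y_{t-1}.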
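
A minimal NumPy sketch of the deterministic soft context vector together with the gating scalar β from the paper. It assumes f_β is a single linear layer, which is a simplification of mine; `w_beta` and `b_beta` are hypothetical parameter names, and the attention weights are made-up numbers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
L, D, H = 4, 3, 5
a = rng.normal(size=(L, D))                 # annotation vectors
alpha_t = np.array([0.1, 0.6, 0.2, 0.1])    # made-up soft attention weights
h_prev = rng.normal(size=H)                 # previous LSTM hidden state
w_beta = rng.normal(size=H)                 # hypothetical parameters of f_beta
b_beta = 0.0

z_soft = alpha_t @ a                        # sum_i alpha_ti * a_i, uses every location
beta_t = sigmoid(w_beta @ h_prev + b_beta)  # scalar gate in (0, 1)
z_t = beta_t * z_soft                       # gated context vector fed to the LSTM
```

Every operation here is differentiable, which is why this variant can be trained end-to-end with backpropagation.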

2.1.4 Training procedure

Both attention models are trained with SGD (stochastic gradient descent).

2.2 The Global/Local Attention paper

Paper: Effective Approaches to Attention-based Neural Machine Translation

Notes based on:

"Attention mechanism paper reading: global attention and local attention"

"Global Attention / Local Attention"

How the paper computes the context vector:

h_t -> a_t -> c_t -> \tilde{h}_t

Global Attention


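When computing the context vector c_t, global attention takes into account all of the hidden states produced by the encoder.

This also shows that global attention is similar to the attention in Attention Summary, Part 1, but simpler. For the differences between the two, see this article.
[Figure: notes on the differences between the two; image not available]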

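Let h_t denote the target hidden state of the decoder at time t, and \bar{h}_s, s = 1, 2, …, n, the hidden states of the encoder. (The attentional hidden state is \tilde{h}_t, which is computed from c_t and h_t further below.)

The alignment vector a_t has variable length, equal to the number of time steps on the encoder side. Its entry a_t(s) for each \bar{h}_s is obtained by comparing the current decoder hidden state h_t with the encoder hidden state \bar{h}_s:
a_t(s) = align(h_t, \bar{h}_s) = \frac{\exp(score(h_t, \bar{h}_s))}{\sum_{s'} \exp(score(h_t, \bar{h}_{s'}))}

score is a content-based function; the paper gives three ways to compute it, which it compares as alternative alignment functions:
score(h_t, \bar{h}_s) = h_t^T \bar{h}_s                        (dot)
score(h_t, \bar{h}_s) = h_t^T W_a \bar{h}_s                    (general)
score(h_t, \bar{h}_s) = v_a^T \tanh(W_a [h_t; \bar{h}_s])      (concat)
According to the paper, dot works better for global attention and general works better for local attention.

Another, location-based scheme uses only h_t: the alignment scores for all source positions are produced by a single weight matrix W_a applied to h_t:
a_t = softmax(W_a h_t)

Taking the weighted average of the \bar{h}_s with weights a_t gives the context vector c_t, after which the remaining steps proceed.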
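
The global attention step can be sketched in NumPy as follows: the three content-based scores, the softmax that turns scores into the alignment vector, and the weighted average that gives c_t. All weights are random placeholders and the names `W_general`, `W_concat`, `W_loc` are mine, so this is an illustration under those assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def score_dot(h_t, h_s):
    return h_s @ h_t                                   # h_t^T h_s for every s, shape (n,)

def score_general(h_t, h_s, W_a):
    return h_s @ (W_a.T @ h_t)                         # h_t^T W_a h_s for every s

def score_concat(h_t, h_s, W_a, v_a):
    n = h_s.shape[0]
    pairs = np.concatenate([np.tile(h_t, (n, 1)), h_s], axis=1)   # rows are [h_t; h_s]
    return np.tanh(pairs @ W_a.T) @ v_a                # v_a^T tanh(W_a [h_t; h_s])

rng = np.random.default_rng(0)
n, d, k = 6, 4, 5                                      # toy sizes (assumed)
h_s = rng.normal(size=(n, d))                          # encoder hidden states
h_t = rng.normal(size=d)                               # current decoder hidden state
W_general = rng.normal(size=(d, d))
W_concat = rng.normal(size=(k, 2 * d))
v_a = rng.normal(size=k)
W_loc = rng.normal(size=(n, d))                        # location-based: one row per position

a_t = softmax(score_general(h_t, h_s, W_general))      # alignment vector, length n
c_t = a_t @ h_s                                        # context vector, shape (d,)
a_loc = softmax(W_loc @ h_t)                           # location-based alignment, uses h_t only
```

Swapping `score_general` for `score_dot` or `score_concat` changes only how the scores are computed; the softmax and the weighted average stay the same.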

[Figure: the global attention model; image not available]

Local Attention

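Global attention has to attend to every encoder input when computing each decoder state, which is computationally expensive.
Local attention can be seen as a blend of hard attention and soft attention that combines their advantages: it is cheaper to compute than global attention or soft attention, and unlike hard attention it is differentiable almost everywhere and therefore easy to train.

The local attention mechanism selectively focuses on a small window of context (only a small subset of the source positions at each step), which reduces the computational cost.

In this model, for each target word at time t, the model first generates an aligned position p_t.
The context vector c_t is then computed as a weighted average over the set of encoder hidden states that fall within the window [p_t - D, p_t + D], where D is chosen empirically.

The global and local models differ in how c_t is formed; see "Global vs Local Attention" below.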

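Back to local attention: p_t is a source position index that can be understood as the focus of the attention. There are two schemes for computing p_t:

Monotonic alignment (local-m)

Set p_t = t, assuming that the source and target sequences are roughly monotonically aligned. The alignment vector a_t is then defined as in global attention, restricted to the window:
a_t(s) = align(h_t, \bar{h}_s)

Predictive alignment (local-p)

Instead of assuming monotonic alignment, the model predicts the aligned position:
p_t = S \cdot sigmoid(v_p^T \tanh(W_p h_t))

W_p and v_p are model parameters that are learned in order to predict positions. S is the length of the source sentence, so p_t ∈ [0, S].
To favor alignment points near p_t, a Gaussian distribution centered at p_t is placed over the window, so the alignment weights a_t(s) become:
a_t(s) = align(h_t, \bar{h}_s) \exp(-\frac{(s - p_t)^2}{2\sigma^2}), with \sigma = D/2

The align function here is the same as in global attention. The farther a position is from the center p_t, the more strongly the weight of its source hidden state is shrunk.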
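
A NumPy sketch of predictive local attention (local-p) under a few assumptions of mine: source positions are 0-based, the window is clipped at the sentence boundaries, the dot score is used inside the window, and the Gaussian width is D/2. The function name and the extra return values are for illustration only.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_p_attention(h_t, h_s, W_p, v_p, D):
    """Local attention with a predicted, Gaussian-weighted window.

    h_t: (d,) decoder state, h_s: (S, d) encoder states.
    Returns (c_t, a_t, p_t, positions); positions are the source indices in the window.
    """
    S = h_s.shape[0]
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))      # real-valued position in (0, S)
    center = min(int(round(p_t)), S - 1)             # nearest valid index
    lo, hi = max(0, center - D), min(S - 1, center + D)
    positions = np.arange(lo, hi + 1)                # at most 2D + 1 positions
    window = h_s[positions]
    align = softmax(window @ h_t)                    # dot score, normalized inside the window
    sigma = D / 2.0
    a_t = align * np.exp(-((positions - p_t) ** 2) / (2 * sigma ** 2))
    c_t = a_t @ window                               # weighted sum over the window only
    return c_t, a_t, p_t, positions

rng = np.random.default_rng(0)
S, d, k, D = 10, 4, 3, 2                             # toy sizes (assumed)
h_s = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_p = rng.normal(size=(k, d))
v_p = rng.normal(size=k)

c_t, a_t, p_t, positions = local_p_attention(h_t, h_s, W_p, v_p, D)
```

Note that after multiplying by the Gaussian term the weights in `a_t` no longer sum to 1; positions far from `p_t` are simply damped.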

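Once c_t is available, \tilde{h}_t is computed by a concatenation layer that combines the context vector c_t and h_t:
\tilde{h}_t = \tanh(W_c [c_t; h_t])
\tilde{h}_t is the attentional vector; it produces the probability distribution over the predicted output word as follows:
p(y_t | y_{<t}, x) = softmax(W_s \tilde{h}_t)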
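
These last two steps, the concatenation layer and the softmax output layer with weight matrix W_s, can be sketched in NumPy as follows. The vectors and weights are random placeholders, and the vocabulary size of 10 is a toy value.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d, vocab_size = 4, 10                      # toy sizes (assumed)
c_t = rng.normal(size=d)                   # context vector from the attention step
h_t = rng.normal(size=d)                   # decoder hidden state
W_c = rng.normal(size=(d, 2 * d))
W_s = rng.normal(size=(vocab_size, d))

h_tilde = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # attentional hidden state
p_y = softmax(W_s @ h_tilde)                          # distribution over target words
```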

[Figure: the local attention model; image not available]

2.2.1 Global vs Local Attention

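The difference between global and local attention is therefore:

In the former, the size of the alignment vector a_t is variable and depends on the length of the input sequence on the encoder side;
in the latter, the size of the alignment vector a_t is fixed, a_t ∈ R^{2D+1}.

Global and local attention each have their pros and cons. In practice global attention is used more often, because:

When the encoder sequence is short, local attention does not actually reduce the amount of computation.
The prediction of the position p_t is not very accurate, which directly affects the accuracy of local attention.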

2.2.2 Input-feeding Approach

Input-feeding approach: attentional vectors \tilde{h}_t are fed as inputs to the next time steps to inform the model about past alignment decisions. The effect is twofold:

make the model fully aware of previous alignment choice
we create a very deep network spanning both horizontally and vertically
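
A rough NumPy sketch of the input-feeding idea: at each step the previous attentional vector is concatenated with the current input embedding. To keep it short, a plain tanh recurrence stands in for the LSTM and `attention_step` is a placeholder for the real attention computation; both are simplifications of mine and not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, d, T = 3, 4, 5                        # toy sizes (assumed)
embeddings = rng.normal(size=(T, emb_dim))     # embeddings of the previous target words
W_in = rng.normal(size=(d, emb_dim + d)) * 0.1
W_rec = rng.normal(size=(d, d)) * 0.1

def attention_step(h_t):
    """Placeholder for attention followed by tanh(W_c [c_t; h_t])."""
    return np.tanh(h_t)

h = np.zeros(d)
h_tilde_prev = np.zeros(d)                     # no alignment history before the first step
inputs_seen = []
for t in range(T):
    x_t = np.concatenate([embeddings[t], h_tilde_prev])   # input feeding
    h = np.tanh(W_in @ x_t + W_rec @ h)                    # stand-in for the LSTM cell
    h_tilde_prev = attention_step(h)
    inputs_seen.append(x_t)
```

From the second step on, every input carries information about the previous attention decision, which is the point of the approach.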

2.2.3 Summary of the techniques used in this paper:

global/local attention
input-feeding approach
better alignment function

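2.2.4 Implementation tips from the paper

Ideas and techniques involved in the implementation:
Build up step by step: start from a base model, then add + reverse, + dropout, + global attention, + feed input, + unk replace, and look at how much the score improves at each step.
"reverse" means reversing the source sentence.
The known techniques among these are source reversing, dropout and the unknown-word replacement technique.
Ensemble several models with different settings, e.g. 8 of them, differing for instance in the attention approach used and in whether dropout is used.

Vocabulary size: e.g. the top 50K words for each language.
Unknown words are replaced with <unk>.
Sentence-pair padding, the number of LSTM layers, parameter initialization (e.g. uniform in [-0.1, 0.1]), and gradient rescaling: the normalized gradient is rescaled whenever its norm exceeds 5.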
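
The initialization range and the gradient rescaling rule can be sketched in NumPy as follows. This assumes the norm is the global L2 norm over all parameter gradients, which is a common reading but an assumption on my part; the function name `rescale_gradients` is hypothetical.

```python
import numpy as np

def rescale_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

rng = np.random.default_rng(0)
W = rng.uniform(-0.1, 0.1, size=(3, 3))            # parameters initialized in [-0.1, 0.1]

grads = [np.full((2, 2), 3.0), np.full(3, 4.0)]    # toy gradients with a large norm
clipped, norm_before = rescale_gradients(grads)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
```

Gradients whose norm is already below the threshold are returned unchanged, so the rule only affects unusually large updates.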

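Training method: SGD
Hyperparameter choices:
The number of LSTM layers, the number of units per layer (e.g. 1000 cells), the dimensionality of the word embeddings, the number of epochs, and the mini-batch size (e.g. 128).
The learning rate can follow a schedule: e.g. start at 1 and, after 5 epochs, halve it after every epoch. Dropout: e.g. 0.2.
With dropout, train for 12 epochs and start halving the learning rate after 8 epochs.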
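
The learning-rate schedule can be written as a small Python function. This is one possible reading of the rule: epochs are counted from 1, the rate stays at its initial value up to and including the threshold epoch, and it is halved for every epoch after that. The 10-epoch length of the run without dropout is assumed here as an example.

```python
def learning_rate(epoch, initial_lr=1.0, start_halving_after=5):
    """Learning rate for a 1-based epoch: constant at first, then halved every epoch."""
    if epoch <= start_halving_after:
        return initial_lr
    return initial_lr * 0.5 ** (epoch - start_halving_after)

# Without dropout: halving starts after epoch 5 (10 epochs assumed).
plain_schedule = [learning_rate(e) for e in range(1, 11)]

# With dropout: 12 epochs, halving starts after epoch 8.
dropout_schedule = [learning_rate(e, start_halving_after=8) for e in range(1, 13)]
```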

Experimental analysis:

learning curves (watch how the cost decreases)
effects of long sentences
attentional architectures
alignment quality

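3 Other related topics

3.1 Designing attention

location-based attention

Location-based means that the attention has no additional object to attend to: the attention score is computed from h_i alone.
s_i = f(h_i) = activation(W^T h_i + b)

general attention (uncommon)

concatenation-based attention

Concatenation-based means that the attention attends with respect to another object, here h_t.
f is a function designed to measure the relevance between h_i and h_t.
s_i = f(h_i, h_t) = v^T activation(W_1 h_i + W_2 h_t + b)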
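
The two scoring designs can be compared side by side in NumPy. The sketch assumes tanh as the activation and random placeholder weights; the function names are mine.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def location_based_scores(h, w, b):
    """s_i = activation(w^T h_i + b): depends on h_i only. h: (n, d), w: (d,)."""
    return np.tanh(h @ w + b)

def concat_based_scores(h, h_t, W1, W2, b, v):
    """s_i = v^T activation(W1 h_i + W2 h_t + b): relates each h_i to h_t."""
    return np.tanh(h @ W1.T + W2 @ h_t + b) @ v

rng = np.random.default_rng(0)
n, d, d_t, k = 5, 4, 3, 6                  # toy sizes (assumed)
h = rng.normal(size=(n, d))                # vectors being attended over
h_t = rng.normal(size=d_t)                 # the other object, e.g. a decoder state
w = rng.normal(size=d)
W1 = rng.normal(size=(k, d))
W2 = rng.normal(size=(k, d_t))
b_vec = np.zeros(k)
v = rng.normal(size=k)

loc_weights = softmax(location_based_scores(h, w, 0.0))
cat_weights = softmax(concat_based_scores(h, h_t, W1, W2, b_vec, v))
pooled = cat_weights @ h                   # attention-weighted summary of h
```

Only the concatenation-based weights change when `h_t` changes; the location-based weights are a function of `h` alone.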

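3.2 Extensions of attention

A document consists of k_2 sentences, and each sentence consists of k_1 words (k_1 differs from sentence to sentence).

Level 1: word-level attention
Each sentence has k_1 words and hence k_1 corresponding vectors w_i. Applying the approach described in Section 2 gives a representation vector for each sentence, denoted st_i.
Level 2: sentence-level attention
The first level of attention yields k_2 vectors st_i. Applying the approach described in Section 2 again gives a representation vector d_i for each document, together with the weight \alpha_i of each st_i. How these are then used depends on the specific task.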
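
The two-level scheme can be sketched in NumPy with one pooling function applied twice. Location-based scoring with tanh is assumed at both levels, the word vectors are random placeholders, and any recurrent encoders that a full hierarchical model would use are left out.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(vectors, w, b=0.0):
    """Score each row, normalize with softmax, and return (weighted sum, weights)."""
    scores = np.tanh(vectors @ w + b)
    weights = softmax(scores)
    return weights @ vectors, weights

rng = np.random.default_rng(0)
d = 4
# A toy document: k_2 = 3 sentences with different numbers of words k_1.
document = [rng.normal(size=(k1, d)) for k1 in (5, 2, 7)]
w_word = rng.normal(size=d)       # hypothetical word-level attention parameters
w_sent = rng.normal(size=d)       # hypothetical sentence-level attention parameters

# Level 1: word-level attention gives one vector per sentence.
sentence_vectors = np.stack([attention_pool(words, w_word)[0] for words in document])

# Level 2: sentence-level attention gives the document vector and the sentence weights.
doc_vector, sentence_weights = attention_pool(sentence_vectors, w_sent)
```

The resulting `doc_vector` can be fed to a task-specific layer such as a classifier, and `sentence_weights` shows which sentences contributed most.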