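Attention and Transformer Structures in Neural Networks
Exploring the structure of key NLP modules through a few core questions
Question 1: write the code for an Attention block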
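import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads):
        super().__init__()
        # By default nn.MultiheadAttention expects (sequence_length, batch_size, input_dim);
        # pass batch_first=True to use (batch_size, sequence_length, input_dim) instead.
        self.attention = nn.MultiheadAttention(input_dim, num_heads)
        self.fc = nn.Linear(input_dim, hidden_dim)

    def forward(self, input, mask=None):
        # forward() has no `mask` keyword; the mask is passed as attn_mask.
        # Note: in a boolean attn_mask, True marks positions that may NOT be attended to.
        attention_output, attention_weights = self.attention(input, input, input, attn_mask=mask)
        output = self.fc(attention_output)
        return output, attention_weights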
This answer relies on an existing PyTorch module, torch.nn.MultiheadAttention, which makes it hard to explore the internal structure of Attention, so we ask a second question.
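Before moving on, it can help to see what the built-in module returns. The following is a minimal sketch; the tensor sizes are illustrative assumptions, and batch_first=True is chosen here only so the layout matches the (batch_size, sequence_length, feature) convention used in the hand-written version later on.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes, not taken from the original post.
batch_size, sequence_length, input_dim, num_heads = 2, 5, 16, 4

# batch_first=True makes the module accept (batch, seq, feature) tensors.
mha = nn.MultiheadAttention(input_dim, num_heads, batch_first=True)
x = torch.randn(batch_size, sequence_length, input_dim)

# Self-attention: the same tensor is used as query, key and value.
output, weights = mha(x, x, x)

print(output.shape)   # one output vector per input position
print(weights.shape)  # one row of weights per query position, averaged over heads
```

The second return value is the attention_weights tensor discussed below: for each position it holds a distribution over all positions of the sequence.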
Question 2: write an implementation that does not use that module
The code is a little longer:
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim  # kept for the scaling step in forward()
        self.query = nn.Linear(input_dim, hidden_dim)
        self.key = nn.Linear(input_dim, hidden_dim)
        self.value = nn.Linear(input_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, input, mask=None):
        # input: (batch_size, sequence_length, input_dim)
        query = self.query(input)  # (batch_size, sequence_length, hidden_dim)
        key = self.key(input)      # (batch_size, sequence_length, hidden_dim)
        value = self.value(input)  # (batch_size, sequence_length, hidden_dim)

        # Calculate attention scores: (batch_size, sequence_length, sequence_length)
        attention_weights = torch.matmul(query, key.transpose(1, 2))
        # Scale by sqrt(hidden_dim); the original used an undefined local `hidden_dim`
        attention_weights = attention_weights / (self.hidden_dim ** 0.5)
        if mask is not None:
            # Positions where mask == 0 (e.g. padding) get a very negative score
            attention_weights = attention_weights.masked_fill(mask == 0, -1e9)
        # Normalize each row into a probability distribution
        attention_weights = torch.softmax(attention_weights, dim=-1)

        # Calculate attention-weighted output
        attention_output = torch.matmul(attention_weights, value)  # (batch_size, sequence_length, hidden_dim)
        attention_output = self.fc(attention_output)               # (batch_size, sequence_length, hidden_dim)
        return attention_output, attention_weights
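As the code shows, the attention block is built from 4 fully-connected (FC) layers.
The input is fed in parallel into 3 of the FC layers, which output three values, Query, Key and Value, abbreviated Q, K and V. The matrix computation is softmax(Q K^T / sqrt(d)) V: Q is multiplied by the transpose of K, the result is divided by sqrt(d), where d is the feature dimension (hidden_dim here), a softmax normalizes each row, and the result is multiplied by V. The scaling keeps the dot products from growing with d, which would otherwise saturate the softmax and leave very small gradients. Finally, the result goes through the last FC layer to produce the output.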
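The effect of the scaling step can be checked numerically. The sketch below is an illustration under assumed conditions (random unit-variance vectors and an arbitrary hidden_dim of 512), not part of the original answer: it compares the spread of raw and scaled dot products, and how peaked the softmax is in each case.

```python
import torch

torch.manual_seed(0)

hidden_dim = 512  # illustrative size
query = torch.randn(1000, hidden_dim)
key = torch.randn(1000, hidden_dim)

# Dot products of unit-variance vectors have a standard deviation near sqrt(hidden_dim).
raw_scores = (query * key).sum(dim=-1)
scaled_scores = raw_scores / hidden_dim ** 0.5

raw_std = raw_scores.std().item()
scaled_std = scaled_scores.std().item()
print(raw_std, scaled_std)

# One query scored against 8 keys: without scaling the softmax is more peaked.
scores = query[:1] @ key[:8].T
peak_raw = torch.softmax(scores, dim=-1).max().item()
peak_scaled = torch.softmax(scores / hidden_dim ** 0.5, dim=-1).max().item()
print(peak_raw, peak_scaled)
```

With scaling, the scores stay at roughly unit spread regardless of hidden_dim, so the softmax is less likely to put almost all of its mass on a single position.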
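The other key point, and the essence of Attention, is a dynamically computed quantity: attention_weights. It is not a learned parameter; it is recomputed from Q and K for every input. It was originally designed to handle long-range context in long sentences in machine translation, and it lets the model focus on different parts of the input at each position (each time step, in the original sequence-to-sequence setting).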
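To make the "dynamic" part concrete, here is a small sketch that computes only the weights for two different inputs. The helper name attention_weights, the tensor sizes and the padding mask are assumptions made for illustration; the mask convention (0 means ignore) follows the implementation above.

```python
import torch

def attention_weights(query, key, mask=None):
    """Scaled dot-product attention weights; mask == 0 marks positions to ignore."""
    scores = torch.matmul(query, key.transpose(-2, -1)) / key.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    return torch.softmax(scores, dim=-1)

torch.manual_seed(0)
x1 = torch.randn(1, 4, 8)  # (batch_size, sequence_length, hidden_dim)
x2 = torch.randn(1, 4, 8)

# The last position is treated as padding; shape (batch, 1, seq) broadcasts over query positions.
mask = torch.tensor([[[1, 1, 1, 0]]])

w1 = attention_weights(x1, x1, mask)
w2 = attention_weights(x2, x2, mask)

print(w1[0])  # each row sums to 1, and the last column is zero
print(w2[0])  # different input, different weights
```

No trainable parameter is involved in this function: the weights change only because the input changes.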
Question 3: write a Transformer implementation
The chatbot answered with the following code:
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads, dropout_p=0.1):
        super().__init__()
        # Uses the AttentionBlock from Question 1 (the version that takes num_heads).
        # Its output size is set to input_dim so the residual addition below is shape-compatible.
        self.attention = AttentionBlock(input_dim, input_dim, num_heads)
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim)
        )
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input, mask=None):
        # Attention sub-layer: attention -> dropout -> residual add -> LayerNorm
        residual = input
        input, attention_weights = self.attention(input, mask=mask)
        input = self.dropout(input)
        input = self.norm1(input + residual)

        # Feed-forward sub-layer: 2-layer FC -> dropout -> residual add -> LayerNorm
        residual = input
        input = self.feed_forward(input)
        input = self.dropout(input)
        input = self.norm2(input + residual)
        return input, attention_weights
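As the code shows, a Transformer block consists of an attention layer followed by dropout, a residual connection and a norm, then a 2-layer FC network (called feed_forward here), again followed by dropout, a residual connection and a norm. Strictly speaking this is a single encoder-style block; a full Transformer stacks several of them. Setting aside the usual auxiliary operations (activation functions, normalization and so on):
The structure of a Transformer block is one attention (3 parallel FCs and 1 FC) plus one feed_forward (2 FC layers).
Note three details: the 3 parallel FCs in attention produce Q, K and V, which are used to compute the association weights; the 2 FC layers have a ReLU between them; and dropout and norm sit between attention and feed_forward.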
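As a cross-check, PyTorch ships a built-in layer with the same building blocks, nn.TransformerEncoderLayer. The sketch below only lists its sub-modules and runs one forward pass; the sizes are illustrative assumptions, and dim_feedforward plays the role of hidden_dim in the code above. With its default settings the built-in layer also applies the norm after each residual addition.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_dim, hidden_dim, num_heads = 16, 64, 4  # illustrative sizes

layer = nn.TransformerEncoderLayer(
    d_model=input_dim,
    nhead=num_heads,
    dim_feedforward=hidden_dim,
    dropout=0.1,
    batch_first=True,  # (batch_size, sequence_length, input_dim)
)

# List the sub-modules: one attention, two FC layers, two norms, plus dropouts.
for name, module in layer.named_children():
    print(name, type(module).__name__)

x = torch.randn(2, 5, input_dim)
y = layer(x)
print(y.shape)  # same shape as the input, so blocks can be stacked
```

Because the output has the same shape as the input, several such blocks can be chained one after another to form a deeper encoder.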
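Supplementary notes:
Question 4: what is the difference between self-attention and attention?
Self-attention is a special case of attention; the difference lies in where Q, K and V come from. In general attention, the query can come from one sequence and the keys and values from another: in machine translation, for example, each word being generated in the target sentence attends to the words of the source sentence. In self-attention, Q, K and V are all computed from the same sequence, so the weights describe how each word in a sentence relates to every other word in that same sentence. The AttentionBlock in Question 2 is self-attention, since query, key and value are all projections of the same input.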
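The distinction shows up directly in the shapes. The following sketch reuses one nn.MultiheadAttention module for both cases, purely for illustration; the sequence lengths and the names source and target are assumptions, loosely standing for a source sentence and a partial translation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads = 16, 4  # illustrative sizes
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

source = torch.randn(1, 7, embed_dim)  # e.g. a 7-token source sentence
target = torch.randn(1, 3, embed_dim)  # e.g. 3 tokens generated so far

# Self-attention: query, key and value all come from the same sequence.
self_out, self_weights = attention(source, source, source)

# Cross-attention: queries come from the target, keys and values from the source.
cross_out, cross_weights = attention(target, source, source)

print(self_weights.shape)   # square: every source token against every source token
print(cross_weights.shape)  # rectangular: every target token against every source token
```

In a real encoder-decoder model the two cases would normally use separate modules with their own parameters; sharing one here just keeps the example short.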
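Question 5: what does Multi-head mean in Multi-head attention?
This is also a variant of Attention. The idea is to run several attention computations (heads) in parallel so that each head can learn a different kind of relationship between positions. The split is along the feature dimension, not along the sentence: after the Q, K and V projections, hidden_dim is divided into num_heads chunks, each head computes attention over the full sequence using its own chunk, and the head outputs are concatenated before the final FC. The split is not hand-designed (for example by part of speech); what each head attends to is learned. The code can be obtained by adding a reshape before and after the attention computation in the implementation from Question 2.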
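One way to add heads to the hand-written attention from Question 2 is sketched below. The class name MultiHeadAttentionBlock and the helper split_heads are made up for this example, and the mask is assumed to have shape (batch_size, sequence_length, sequence_length) with 0 marking ignored positions; this is a minimal sketch, not the exact internals of torch.nn.MultiheadAttention.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads):
        super().__init__()
        assert hidden_dim % num_heads == 0, "hidden_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads
        self.query = nn.Linear(input_dim, hidden_dim)
        self.key = nn.Linear(input_dim, hidden_dim)
        self.value = nn.Linear(input_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, hidden_dim)

    def split_heads(self, x):
        # (batch, seq, hidden_dim) -> (batch, num_heads, seq, head_dim)
        batch_size, sequence_length, _ = x.shape
        x = x.view(batch_size, sequence_length, self.num_heads, self.head_dim)
        return x.transpose(1, 2)

    def forward(self, input, mask=None):
        batch_size, sequence_length, _ = input.shape
        query = self.split_heads(self.query(input))
        key = self.split_heads(self.key(input))
        value = self.split_heads(self.value(input))

        # Each head attends over the full sequence using its own slice of features.
        scores = torch.matmul(query, key.transpose(-2, -1)) / self.head_dim ** 0.5
        if mask is not None:
            # Add a head axis so the same mask applies to every head.
            scores = scores.masked_fill(mask.unsqueeze(1) == 0, -1e9)
        weights = torch.softmax(scores, dim=-1)  # (batch, num_heads, seq, seq)

        heads = torch.matmul(weights, value)     # (batch, num_heads, seq, head_dim)
        # Concatenate the heads back into one hidden_dim-sized vector per position.
        merged = heads.transpose(1, 2).reshape(batch_size, sequence_length, -1)
        return self.fc(merged), weights

torch.manual_seed(0)
block = MultiHeadAttentionBlock(input_dim=16, hidden_dim=32, num_heads=4)
x = torch.randn(2, 5, 16)
output, weights = block(x)
print(output.shape, weights.shape)
```

Each head keeps its own sequence_length x sequence_length weight matrix, which is why the returned weights have an extra num_heads axis compared with the single-head version.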