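Attention and Transformer Structures in Neural Networks

Exploring the structure of key NLP modules through a few core questions

Question 1: Write the code for an attention module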

import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads):
        super().__init__()
        self.attention = nn.MultiheadAttention(input_dim, num_heads)
        self.fc = nn.Linear(input_dim, hidden_dim)

    def forward(self, input, mask=None):
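        # nn.MultiheadAttention.forward has no `mask` keyword, so the mask is passed as attn_mask.
        # For a boolean attn_mask, True marks positions that must NOT be attended to.
        # By default the module expects input shaped (sequence_length, batch_size, input_dim).
        attention_output, attention_weights = self.attention(input, input, input, attn_mask=mask)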
        output = self.fc(attention_output)
        return output, attention_weights

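This answer relies on an existing PyTorch module, `torch.nn.MultiheadAttention`, which makes it hard to explore the internal structure of attention, so we ask a second question.

Question 2: Write an implementation that does not use that module

The code is a bit longer: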

import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.query = nn.Linear(input_dim, hidden_dim)
        self.key = nn.Linear(input_dim, hidden_dim)
        self.value = nn.Linear(input_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, input, mask=None):
        batch_size, sequence_length, _ = input.shape
        query = self.query(input)  # (batch_size, sequence_length, hidden_dim)
        key = self.key(input)  # (batch_size, sequence_length, hidden_dim)
        value = self.value(input)  # (batch_size, sequence_length, hidden_dim)

        # Calculate attention weights
        attention_weights = torch.matmul(query, key.transpose(1, 2))  # (batch_size, sequence_length, sequence_length)
        attention_weights = attention_weights / (query.size(-1) ** 0.5)  # Scale by sqrt(hidden_dim); hidden_dim is not in scope here, so read it from the tensor
        if mask is not None:
            attention_weights = attention_weights.masked_fill(mask == 0, -1e9)  # Mask padded positions
        attention_weights = torch.softmax(attention_weights, dim=-1)  # Normalize attention weights

        # Calculate attention-weighted output
        attention_output = torch.matmul(attention_weights, value)  # (batch_size, sequence_length, hidden_dim)
        attention_output = self.fc(attention_output)  # (batch_size, sequence_length, hidden_dim)
        return attention_output, attention_weights

As the code shows, the attention block is built from four fully connected (FC) layers.

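The input is fed in parallel into three of the FC layers, which output three tensors: Query, Key, and Value, abbreviated Q, K, and V. These are combined by matrix operations, roughly softmax(Q·K^T / sqrt(d))·V. Q is first multiplied by the transpose of K. The scores are then divided by sqrt(d), the square root of the hidden dimension, so that large dot products do not push softmax into a saturated region where gradients vanish; the scale factor grows with the dimension. Softmax normalizes each row, and the result is multiplied by V. Finally, the result passes through the last FC layer to produce the output.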
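The Q, K, V computation can be sketched as a standalone function. This is a minimal sketch rather than the code from the answer above: the function name `scaled_dot_product` and the toy tensor sizes are assumptions chosen for illustration, and the mask convention (0 means "masked out") follows the Question 2 code.

```python
import math
import torch

def scaled_dot_product(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d)) V for tensors shaped (batch, seq, d)."""
    d = q.size(-1)
    # Raw similarity scores between every query and every key, scaled by sqrt(d)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)  # (batch, seq, seq)
    if mask is not None:
        # Positions where mask == 0 get a very negative score, so softmax sends them to ~0
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return torch.matmul(weights, v), weights

torch.manual_seed(0)
q = torch.randn(2, 5, 8)
k = torch.randn(2, 5, 8)
v = torch.randn(2, 5, 8)
out, weights = scaled_dot_product(q, k, v)
print(out.shape, weights.shape)
```

The output keeps the shape of V, while the weights form one (seq, seq) matrix per batch element.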

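Another key point, and the essence of attention, is the dynamic quantity attention_weights. Unlike the learned weights of the FC layers, it is recomputed from the input on every forward pass, so it changes with the data. It was originally designed to handle context dependencies across long sentences in machine translation, and its role is to let the model focus on different parts of the data at different steps.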
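To illustrate that attention_weights depends on the data rather than being fixed, the sketch below applies the same query and key layers to two different random inputs. The names `query_proj`, `key_proj`, and `compute_weights` and the tensor sizes are hypothetical, chosen only for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 4
query_proj = nn.Linear(dim, dim)  # learned parameters, shared by both inputs
key_proj = nn.Linear(dim, dim)

def compute_weights(x):
    """Return the (batch, seq, seq) attention weights for input x."""
    q, k = query_proj(x), key_proj(x)
    scores = q @ k.transpose(-2, -1) / dim ** 0.5
    return torch.softmax(scores, dim=-1)

x1 = torch.randn(1, 3, dim)
x2 = torch.randn(1, 3, dim)
w1 = compute_weights(x1)
w2 = compute_weights(x2)

# The layers are identical, yet the weights differ because the inputs differ
print(torch.allclose(w1, w2))
```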

Question 3: Write a Transformer implementation

The chatbot answered with the following code:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_heads, dropout_p=0.1):
        super().__init__()
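        # Uses the three-argument AttentionBlock from Question 1. Its output size is set to
        # input_dim so that the residual additions in forward() have matching shapes.
        self.attention = AttentionBlock(input_dim, input_dim, num_heads)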
        self.norm1 = nn.LayerNorm(input_dim)
        self.norm2 = nn.LayerNorm(input_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim)
        )
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input, mask=None):
        residual = input
        input, attention_weights = self.attention(input, mask=mask)
        input = self.dropout(input)
        input = self.norm1(input + residual)
        residual = input
        input = self.feed_forward(input)
        input = self.dropout(input)
        input = self.norm2(input + residual)
        return input, attention_weights

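As the code shows, this Transformer block consists of an attention block followed by dropout and norm, then a two-layer FC network (called feed_forward here), again followed by dropout and norm. Each norm is applied to the sum of the sub-layer output and its input, that is, a residual connection.

Leaving aside the usual data-constraining operations (activation functions, layer normalization, and so on), the structure is: one attention block (three parallel FC layers plus one FC layer) and one feed_forward block (two FC layers).

Note that the three parallel FC layers in attention produce Q, K, and V, which are used to compute the association weights; that the two feed_forward FC layers have a ReLU between them; and that dropout and norm sit between attention and feed_forward.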
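For comparison, PyTorch ships a built-in layer with the same overall layout (self-attention, dropout, residual plus LayerNorm, a two-layer feed-forward network with ReLU, then dropout, residual, and LayerNorm again). The sketch below is only a shape check under assumed toy sizes (`d_model=16`, `nhead=4`, `dim_feedforward=32`); it is not the chatbot's implementation.

```python
import torch
import torch.nn as nn

# Built-in encoder layer; batch_first=True means inputs are (batch, seq, d_model)
layer = nn.TransformerEncoderLayer(
    d_model=16,          # plays the role of input_dim
    nhead=4,             # number of attention heads
    dim_feedforward=32,  # plays the role of hidden_dim in feed_forward
    dropout=0.1,
    batch_first=True,
)
layer.eval()  # turn dropout off so the output is deterministic

x = torch.randn(2, 5, 16)
with torch.no_grad():
    y = layer(x)

# A Transformer block maps a sequence to a sequence of the same shape
print(y.shape)
```

Because the output shape equals the input shape, such blocks can be stacked one after another.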

Supplementary notes:

Question 4: What is the difference between self-attention and attention?

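Self-attention is just a special case of attention; the difference lies in where the inputs come from and what the weights are for. Taking NLP as an example, in general attention the queries come from one sequence and the keys and values from another, for instance a decoder attending to the encoder output in machine translation, so the weights relate each target word to the words of the source sentence. In self-attention, Q, K, and V are all computed from the same sequence, so the weights relate each word of a sentence to the other words of that same sentence. The AttentionBlock in Question 2 is self-attention, because query, key, and value are all computed from the same input.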
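In the standard formulation, the only thing that changes between the two cases is which sequence supplies the keys and values. The sketch below makes that explicit; `SimpleAttention` and its `context` argument are hypothetical names introduced for this illustration.

```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    """Single-head attention with separate query and key/value sources."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x, context):
        q = self.query(x)        # queries come from x:       (batch, len_x, dim)
        k = self.key(context)    # keys come from context:    (batch, len_ctx, dim)
        v = self.value(context)  # values come from context:  (batch, len_ctx, dim)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        weights = torch.softmax(scores, dim=-1)  # (batch, len_x, len_ctx)
        return weights @ v, weights

torch.manual_seed(0)
attn = SimpleAttention(8)
source = torch.randn(1, 6, 8)  # e.g. a 6-token source sentence
target = torch.randn(1, 4, 8)  # e.g. a 4-token target sentence

_, self_w = attn(source, source)   # self-attention: x and context are the same sequence
_, cross_w = attn(target, source)  # general (cross) attention: two different sequences
print(self_w.shape, cross_w.shape)
```

The self-attention weights are square (6 x 6), while the cross-attention weights have one row per target token and one column per source token (4 x 6).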

Question 5: What does "multi-head" mean in multi-head attention?

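This is another variant of attention, designed to process different aspects of the input in parallel. After the Q, K, and V projections, the feature dimension is split into num_heads smaller parts (heads). Each head runs the attention computation over the whole sequence independently, so different heads can learn to capture different kinds of relationships between words. The head outputs are then concatenated and passed through the final FC layer. In code, this only requires adding the split and concatenation steps to the attention implementation above.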
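As a minimal sketch of the standard multi-head formulation, the class below extends the Question 2 structure by splitting the feature dimension of Q, K, and V into heads before the attention computation and concatenating the heads afterwards. The class name `MultiHeadSelfAttention`, the helper `split`, and the toy sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, dim = x.shape

        def split(t):
            # (b, n, dim) -> (b, heads, n, head_dim): each head sees the whole sequence
            return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.query(x)), split(self.key(x)), split(self.value(x))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        weights = torch.softmax(scores, dim=-1)  # (b, heads, n, n)
        # Concatenate the heads back into one feature dimension
        out = (weights @ v).transpose(1, 2).reshape(b, n, dim)
        return self.fc(out), weights

torch.manual_seed(0)
mha = MultiHeadSelfAttention(dim=16, num_heads=4)
x = torch.randn(2, 5, 16)
out, weights = mha(x)
print(out.shape, weights.shape)
```

Each of the 4 heads produces its own 5 x 5 weight matrix, while the output keeps the input shape.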
