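Exploring and Building the LLaMA 3 Architecture: A Deep Dive into Components, Coding, and Inference Techniques (Part 7): The Feed-Forward Network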
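In the Transformer architecture, the feed-forward layer plays a crucial role. Within each block it typically comes after the attention layer and normalization. In LLaMA 3, the feed-forward layer consists of three linear transformations.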
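Written as a formula, the three linear transformations combine as shown here. This is a sketch that reuses the weight names w1, w2, and w3 from the FeedForward class in this post. It assumes bias-free layers, with \odot denoting element-wise multiplication and \sigma the logistic sigmoid.

```latex
\mathrm{FFN}(x) = W_2\bigl(\mathrm{SiLU}(W_1 x) \odot W_3 x\bigr),
\qquad
\mathrm{SiLU}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}
```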
import torch
import torch.nn as nn
import torch.nn.functional as F

# ModelArgs is the configuration class defined earlier in this series;
# only dim, multiple_of, and ffn_dim_multiplier are used here.
class FeedForward(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        # Start from 4 * dim, then scale by 2/3 so the three-matrix gated FFN
        # has roughly the same parameter count as a two-matrix FFN.
        hidden_dim = 4 * args.dim
        hidden_dim = int(2 * hidden_dim / 3)
        # Optional model-specific multiplier.
        if args.ffn_dim_multiplier is not None:
            hidden_dim = int(args.ffn_dim_multiplier * hidden_dim)
        # Round up to the nearest multiple of args.multiple_of.
        hidden_dim = args.multiple_of * ((hidden_dim + args.multiple_of - 1) // args.multiple_of)
        self.w1 = nn.Linear(args.dim, hidden_dim, bias=False)  # gate projection
        self.w2 = nn.Linear(hidden_dim, args.dim, bias=False)  # down projection back to dim
        self.w3 = nn.Linear(args.dim, hidden_dim, bias=False)  # up projection (name matches the checkpoint)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        swish = F.silu(self.w1(x))  # SiLU (Swish) applied to the gate projection
        x_V = self.w3(x)  # linear up projection, no activation
        x = swish * x_V  # element-wise gating; width is still hidden_dim
        x = self.w2(x)  # project back to the model dimension
        return x
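To make the hidden_dim arithmetic concrete, here is a small standalone sketch of the same calculation. The helper name ffn_hidden_dim is hypothetical, and the argument values are assumptions: the first call uses what I believe to be the published Llama 3 8B settings, while the second simply illustrates the case with no multiplier.

```python
from typing import Optional


def ffn_hidden_dim(dim: int, multiple_of: int,
                   ffn_dim_multiplier: Optional[float] = None) -> int:
    """Mirror the hidden_dim computation in FeedForward.__init__ (hypothetical helper)."""
    hidden_dim = 4 * dim
    hidden_dim = int(2 * hidden_dim / 3)
    if ffn_dim_multiplier is not None:
        hidden_dim = int(ffn_dim_multiplier * hidden_dim)
    # Round up to the nearest multiple of multiple_of.
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)


# Assumed Llama 3 8B style settings: 16384 -> 10922 -> 14198 -> rounded up to 14336
print(ffn_hidden_dim(dim=4096, multiple_of=1024, ffn_dim_multiplier=1.3))  # 14336

# Illustrative settings without a multiplier: 16384 -> 10922 -> rounded up to 11008
print(ffn_hidden_dim(dim=4096, multiple_of=256))  # 11008
```

Rounding up to a multiple of multiple_of keeps the hidden width aligned to a convenient size, which is why the result is not exactly 8/3 of dim.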
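During the forward pass, the input tensor x passes through these linear transformations. w1 and w3 each project x to hidden_dim. SiLU is applied to the w1 output, and the result is multiplied element-wise by the w3 output. Together these steps form the SwiGLU activation, which increases the model's expressive power. The final transformation, w2, maps the tensor back to its original dimension. This combination of SwiGLU gating with three linear layers is intended to improve the model's performance.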
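The data flow described above can be traced by hand with a dependency-free sketch. It uses plain Python lists instead of PyTorch tensors, and the toy weights and sizes (dim = 2, hidden_dim = 3) are hypothetical values chosen only for illustration. It is not the model's implementation.

```python
import math


def silu(v: float) -> float:
    """SiLU (Swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu_ffn(x, W1, W2, W3):
    gate = [silu(v) for v in matvec(W1, x)]     # w1 followed by SiLU
    up = matvec(W3, x)                          # w3, no activation
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(W2, hidden)                   # w2 maps back to dim


# Hypothetical toy weights: W1 and W3 are hidden_dim x dim, W2 is dim x hidden_dim.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W3 = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]

x = [1.0, -2.0]
y = swiglu_ffn(x, W1, W2, W3)
print(len(x), len(y))  # 2 2
```

The output has the same length as the input, matching the role of w2 in the class above. Because the layers have no bias and SiLU(0) = 0, an all-zero input also produces an all-zero output.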
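Blog series
Exploring and Building the LLaMA 3 Architecture: A Deep Dive into Components, Coding, and Inference Techniques (Part 1): Llama3 Model Architecture
https://duanzhihua.blog.****.net/article/details/138208650
Exploring and Building the LLaMA 3 Architecture: A Deep Dive into Components, Coding, and Inference Techniques (Part 2): RoPE Positional Encoding
https://duanzhihua.blog.****.net/article/details/138212328
Exploring and Building the LLaMA 3 Architecture: A Deep Dive into Components, Coding, and Inference Techniques (Part 3): KV Cache
https://duanzhihua.blog.****.net/article/details/138213306
Exploring and Building the LLaMA 3 Architecture: A Deep Dive into Components, Coding, and Inference Techniques (Part 4): Grouped Multi-Query Attention
https://duanzhihua.blog.****.net/article/details/138216050
Exploring and Building the LLaMA 3 Architecture: A Deep Dive into Components, Coding, and Inference Techniques (Part 5): RMS (Root Mean Square) Normalization
https://duanzhihua.blog.****.net/article/details/138216630
Exploring and Building the LLaMA 3 Architecture: A Deep Dive into Components, Coding, and Inference Techniques (Part 6): SwiGLU Activation Function
https://duanzhihua.blog.****.net/article/details/138217261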