Mojo中的矩阵乘法

本文介绍了如何在 Mojo 中编写矩阵乘法 (matmul) 算法。我们将从纯 Python 实现开始，过渡到本质上是 Python 实现副本的简单实现，然后添加类型，然后通过矢量化、平铺和并行化实现继续进行优化。

首先，让我们定义矩阵乘法。给定两个密度矩阵A和B的维数分别是MXK和KXN。我们要计算它们的点积C= A·B(也被称为matmul),点积定义为：C+=A·B, C

在这里插入图片描述

请查看我们关于matmul的博客文章，以及为什么它对于ML和DL工作负载很重要。

该notebook的格式是从一个与Python相同的实现开始(实际上是重命名文件扩展名)，然后在通过利用现代硬件上可用的矢量化和并行化能力扩展实现之前，看看向实现中添加类型如何帮助性能。在整个执行过程中，我们报告实现的GFlops。

Python 实现

让我们首先在Python中直接从定义中实现matmul。

%%python
def matmul_python(C, A, B):
    for m in range(C.rows):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]

让我们使用128乘128的方阵对我们的实现进行基准测试，并报告实现的GFLops。

安装numpy(如果还没有安装的话):

%%python
from importlib.util import find_spec
import shutil
import subprocess

fix = """
-------------------------------------------------------------------------
fix following the steps here:
    /modularml/mojo/issues/1085#issuecomment-1771403719
-------------------------------------------------------------------------
"""

def install_if_missing(name: str):
    if find_spec(name):
        return

    print(f"{name} not found, installing...")
    try:
        if shutil.which('python3'): python = "python3"
        elif shutil.which('python'): python = "python"
        else: raise ("python not on path" + fix)
        subprocess.check_call([python, "-m", "pip", "install", name])
    except:
        raise ImportError(f"{name} not found" + fix)

install_if_missing("numpy")

%%python
from timeit import timeit
import numpy as np

class Matrix:
    def __init__(self, value, rows, cols):
        self.value = value
        self.rows = rows
        self.cols = cols

    def __getitem__(self, idxs):
        return self.value[idxs[0]][idxs[1]]

    def __setitem__(self, idxs, value):
        self.value[idxs[0]][idxs[1]] = value

def benchmark_matmul_python(M, N, K):
    A = Matrix(list(np.random.rand(M, K)), M, K)
    B = Matrix(list(np.random.rand(K, N)), K, N)
    C = Matrix(list(np.zeros((M, N))), M, N)
    secs = timeit(lambda: matmul_python(C, A, B), number=2)/2
    gflops = ((2*M*N*K)/secs) / 1e9
    print(gflops, "GFLOP/s")
    return gflops

python_gflops = benchmark_matmul_python(128, 128, 128).to_float64()

输出：

0.0022564560694965834 GFLOP/s

将 Python 实现导入 Mojo

使用Mojo和使用Python一样简单。首先，我们从Mojo的stdlib中引入要用到的模块:

import benchmark
from memory import memset_zero
from random import rand, random_float64

然后，我们可以复制粘贴我们的Python代码。Mojo是Python的超集，因此将运行与Mojo相同的Python代码

# This exactly the same Python implementation, but is infact Mojo code!
def matmul_untyped(C, A, B):
    for m in range(C.rows):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]

运行

然后我们可以对实现进行基准测试。和之前一样，我们使用一个128 × 128的矩阵

fn matrix_getitem(self: object, i: object) raises -> object:
    return self.value[i]


fn matrix_setitem(self: object, i: object, value: object) raises -> object:
    self.value[i] = value
    return None


fn matrix_append(self: object, value: object) raises -> object:
    self.value.append(value)
    return None


fn matrix_init(rows: Int, cols: Int) raises -> object:
    var value = object([])
    return object(
        Attr("value", value), Attr("__getitem__", matrix_getitem), Attr("__setitem__", matrix_setitem),
        Attr("rows", rows), Attr("cols", cols), Attr("append", matrix_append),
    )

def benchmark_matmul_untyped(M: Int, N: Int, K: Int, python_gflops: Float64):
    C = matrix_init(M, N)
    A = matrix_init(M, K)
    B = matrix_init(K, N)
    for i in range(M):
        c_row = object([])
        b_row = object([])
        a_row = object([])
        for j in range(N):
            c_row.append(0.0)
            b_row.append(random_float64(-5, 5))
            a_row.append(random_float64(-5, 5))
        C.append(c_row)
        B.append(b_row)
        A.append(a_row)

    @parameter
    fn test_fn():
        try:
            _ = matmul_untyped(C, A, B)
        except:
            pass

    var secs = benchmark.run[test_fn](max_runtime_secs=0.5).mean()
    _ = (A, B, C)
    var gflops = ((2*M*N*K)/secs) / 1e9
    var speedup : Float64 = gflops / python_gflops
    print(gflops, "GFLOP/s, a", speedup, "x speedup over Python")

benchmark_matmul_untyped(128, 128, 128, python_gflops)

输出:

0.019322323885887414 GFLOP/s, a 8.563128769530282 x speedup over Python

请注意，我们毫不费力就获得了巨大的速度提升。

向 Python 实现添加类型

上述程序虽然实现了比Python更好的性能，但仍然不是我们从Mojo中可以得到的最好结果。如果我们告诉Mojo输入的类型，它可以优化大部分代码并减少分派成本(不像Python只使用类型进行类型检查，Mojo也利用类型信息进行性能优化)。

为此，我们首先定义一个矩阵结构体。矩阵结构体包含一个数据指针和大小字段。虽然Matrix结构体可以对任何数据类型进行参数化，但为了简洁，这里我们将数据类型设置为Float32。

alias type = DType.float32

struct Matrix[rows: Int, cols: Int]:
    var data: DTypePointer[type]

    # Initialize zeroeing all values
    fn __init__(inout self):
        self.data = DTypePointer[type].alloc(rows * cols)
        memset_zero(self.data, rows * cols)

    # Initialize taking a pointer, don't set any elements
    fn __init__(inout self, data: DTypePointer[type]):
        self.data = data

    # Initialize with random values
    @staticmethod
    fn rand() -> Self:
        var data = DTypePointer[type].alloc(rows * cols)
        rand(data, rows * cols)
        return Self(data)

    fn __getitem__(self, y: Int, x: Int) -> Scalar[type]:
        return self.load[1](y, x)

    fn __setitem__(self, y: Int, x: Int, val: Scalar[type]):
        self.store[1](y, x, val)

    fn load[nelts: Int](self, y: Int, x: Int) -> SIMD[type, nelts]:
        return self.data.load[width=nelts](y * self.cols + x)

    fn store[nelts: Int](self, y: Int, x: Int, val: SIMD[type, nelts]):
        return self.data.store[width=nelts](y * self.cols + x, val)

注意，我们根据load和store实现了getitem和setitem。对于matmul的原始实现来说，这没有什么区别，但我们将在后面更优化的matmul向量化版本中使用这一点。

使用上面的矩阵类型，我们可以有效地复制和粘贴Python实现，并添加类型注释:

# Note that C, A, and B have types.
fn matmul_naive(C: Matrix, A: Matrix, B: Matrix):
    for m in range(C.rows):
        for k in range(A.cols):
            for n in range(C.cols):
                C[m, n] += A[m, k] * B[k, n]

随着我们的改进，我们将对实现进行基准测试，所以让我们编写一个辅助函数来为我们完成此工作:

alias M = 1024
alias N = 1024
alias K = 1024

@always_inline
fn bench[
    func: fn (Matrix, Matrix, Matrix) -> None](base_gflops: Float64):
    var C = Matrix[M, N]()
    var A = Matrix[M, K].rand()
    var B = Matrix[K, N].rand()

    @always_inline
    @parameter
    fn test_fn():
        _ = func(C, A, B)

    var secs = benchmark.run[test_fn](max_runtime_secs=1).mean()

    A.data.free()
    B.data.free()
    C.data.free()

    var gflops = ((2 * M * N * K) / secs) / 1e9
    var speedup: Float64 = gflops / base_gflops

    print(gflops, "GFLOP/s, a", speedup, "x speedup over Python")

基准测试显示了显著的速度提升。我们将矩阵的大小增加到512 * 512，因为Mojo比Python快得多。

bench[matmul_naive](python_gflops)

输出：

5.3335685138294373 GFLOP/s, a 2363.6926000599515 x speedup over Python

与原始的无类型版本相比，添加类型注解带来了巨大的改进。

对最内层循环进行矢量化

通过利用向量指令，我们可以做得比上面的实现更好。我们使用simdwidthof查询指定dtype的simd宽度，而不是假设向量的宽度。这使得我们的代码在迁移到其他硬件时具有可移植性。利用SIMD指令非常简单:

# simdwidthof = number of float32 elements that fit into a single SIMD register
# using a 2x multiplier allows some SIMD operations to run in the same cycle
alias nelts = simdwidthof[DType.float32]() * 2

fn matmul_vectorized_0(C: Matrix, A: Matrix, B: Matrix):
    for m in range(C.rows):
        for k in range(A.cols):
            for nv in range(0, C.cols - nelts + 1, nelts):
                C.store(m, nv, C.load[nelts](m, nv) + A[m, k] * B.load[nelts](k, nv))

            # Handle remaining elements with scalars.
            for n in range(nelts * (C.cols // nelts), C.cols):
                C[m, n] += A[m, k] * B[k, n]

我们可以对上述实现进行基准测试。请注意，许多编译器可以检测简单循环并对其进行优化。然而，Mojo允许你明确和精确地控制应用什么优化。

bench[matmul_vectorized_0](python_gflops)

输出：

27.943656514151261 GFLOP/s, a 12383.869064371152 x speedup over Python

向量化是一种常见的优化，Mojo提供了一个高阶函数来执行向量化。vectorize函数接受一个向量宽度和一个以向量宽度为参数的函数，并将以矢量化的方式进行计算。

# Simplify the code by using the builtin vectorize function
from algorithm import vectorize

fn matmul_vectorized_1(C: Matrix, A: Matrix, B: Matrix):
    for m in range(C.rows):
        for k in range(A.cols):
            @parameter
            fn dot[nelts: Int](n: Int):
                C.store(m, n, C.load[nelts](m, n) + A[m, k] * B.load[nelts](k, n))
            vectorize[dot, nelts, size = C.cols]()

这两种实现在性能上只有细微的差别:

bench[matmul_vectorized_1](python_gflops)

输出：

29.94507372793699 GFLOP/s, a 13270.842775422503 x speedup over Python

并行化 Matmul

使用Mojo，我们可以使用parallelize函数轻松地并行运行代码。

让我们修改matmul的实现，使其成为多线程的(为简单起见，我们只在M维上并行)。

在下面的parallelize中，我们通过在处理器之间更均匀地分配工作来实现过度分区。这确保了即使有些任务先于其他任务完成，或者有些处理器掉队，它们也都有工作可做。英特尔和苹果现在有独立的性能和效率核心，这缓解了可能导致的问题。

# Parallelize the code by using the builtin parallelize function
from algorithm import parallelize

fn matmul_parallelized(C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        for k in range(A.cols):
            @parameter
            fn dot[nelts : Int](n : Int):
                C.store[nelts](m,n, C.load[nelts](m,n) + A[m,k] * B.load[nelts](k,n))
            vectorize[dot, nelts, size = C.cols]()
    parallelize[calc_row](C.rows, C.rows)

我们可以对matmul的并行实现进行基准测试。

bench[matmul_parallelized](python_gflops)

输出：

860.6608444221323 GFLOP/s, a 381421.49366734456 x speedup over Python

平铺矩阵乘法

分块是matmul为提高缓存局部性而进行的一种优化。其思想是将子矩阵驻留在缓存中，以提高重用性。tile函数本身在Mojo中可以写成:

from algorithm import Static2DTileUnitFunc as Tile2DFunc

# Perform 2D tiling on the iteration space defined by end_x and end_y.
fn tile[tiled_fn: Tile2DFunc, tile_x: Int, tile_y: Int](end_x: Int, end_y: Int):
    # Note: this assumes that ends are multiples of the tiles.
    for y in range(0, end_y, tile_y):
        for x in range(0, end_x, tile_x):
            tiled_fn[tile_x, tile_y](x, y)

# Use the above tile function to perform tiled matmul.
fn matmul_tiled_parallelized(C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        @parameter
        fn calc_tile[tile_x: Int, tile_y: Int](x: Int, y: Int):
            for k in range(y, y + tile_y):
                @parameter
                fn dot[nelts: Int](n: Int):
                    C.store(m, n + x, C.load[nelts](m, n + x) + A[m, k] * B.load[nelts](k, n + x))
                vectorize[dot, nelts, size = tile_x]()

        # We hardcode the tile factor to be 4.
        alias tile_size = 4
        tile[calc_tile, nelts * tile_size, tile_size](A.cols, C.cols)

    parallelize[calc_row](C.rows, C.rows)

我们可以对分块并行matmul实现进行基准测试

bench[matmul_tiled_parallelized](python_gflops)

输出：

898.79112716147336 GFLOP/s, a 398319.79860436375 x speedup over Python

在上面的实现中，开销的一个来源是我们没有展开dot函数的vectorize引入的循环。我们可以通过Mojo中的vectorize_unroll高阶函数来实现:

# Unroll the vectorized loop by a constant factor.
fn matmul_tiled_unrolled_parallelized(C: Matrix, A: Matrix, B: Matrix):
    @parameter
    fn calc_row(m: Int):
        @parameter
        fn calc_tile[tile_x: Int, tile_y: Int](x: Int, y: Int):
            for k in range(y, y + tile_y):
                @parameter
                fn dot[nelts: Int](n: Int):
                    C.store(m, n + x, C.load[nelts](m, n + x) + A[m, k] * B.load[nelts](k, n + x))

                # Vectorize by nelts and unroll by tile_x/nelts
                # Here unroll factor is 4
                alias unroll_factor = tile_x // nelts
                vectorize[dot, nelts, size=tile_x, unroll_factor=unroll_factor]()

        alias tile_size = 4
        tile[calc_tile, nelts * tile_size, tile_size](A.cols, C.cols)

    parallelize[calc_row](C.rows, C.rows)

同样，我们可以使用展开和向量化的内层循环对新的分块并行matmul实现进行基准测试:

bench[matmul_tiled_unrolled_parallelized](python_gflops)

输出：

1026.9730406876936 GFLOP/s, a 455126.53872176283 x speedup over Python

秒客网