tesseract++ 0.0.1
N-dimensional tensor library for embedded systems
Register-blocked GEMM for optimized 2D matrix multiplication.
#include "config.h"
#include "fused/microkernels/microkernel_base.h"
#include "fused/kernel_ops/kernel_helpers.h"

Classes

    struct detail::KernelGemm< T, Bits, Arch >

Namespaces

    namespace detail
Register-blocked GEMM for optimized 2D matrix multiplication.
Implements C[M,N] = A[M,K] × B[K,N] using an outer-product micro-kernel formulation that maximizes register reuse and SIMD throughput.
The naive approach computes each C[i,j] as an independent dot product: one row of A dotted with one column of B. For M=N=100, that's 10,000 dot products, each reloading the same data — terrible cache utilization.
This GEMM instead computes an MR × NR tile of C simultaneously using the outer-product formulation:
For each k in 0..K-1:
    Load NR_VECS vectors from B's row k                   (NR_VECS SIMD loads)
    For each of MR rows of A:
        Broadcast A[row, k] to a SIMD register            (1 broadcast)
        FMA broadcast × each B vector into accumulators   (NR_VECS FMAs)
Key insight: each B vector is loaded once and reused across MR rows of A. Each A broadcast is loaded once and reused across NR_VECS B vectors. This gives MR × NR_VECS × simdWidth multiply-adds per k step from only NR_VECS loads + MR broadcasts.
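The reuse pattern above can be sketched in plain C++, modeling each SIMD register as a fixed-size array. This is a minimal scalar sketch: the constants and the `micro_kernel` name are illustrative stand-ins, not this library's actual API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Illustrative constants (not the library's): 4-wide "SIMD", MR=4, NR_VECS=3.
constexpr std::size_t W = 4, MR = 4, NR_VECS = 3, NR = NR_VECS * W;
using Vec = std::array<float, W>;  // stand-in for one SIMD register

// Outer-product micro-kernel: accumulate an MR x NR tile of C entirely in
// "registers", storing to memory only once after the k-loop completes.
void micro_kernel(const float* A, const float* B, float* C,
                  std::size_t K, std::size_t strideA,
                  std::size_t strideB, std::size_t strideC) {
    Vec acc[MR][NR_VECS] = {};                // MR x NR_VECS accumulators
    for (std::size_t k = 0; k < K; ++k) {
        Vec b[NR_VECS];                       // NR_VECS loads from B's row k
        for (std::size_t v = 0; v < NR_VECS; ++v)
            for (std::size_t l = 0; l < W; ++l)
                b[v][l] = B[k * strideB + v * W + l];
        for (std::size_t r = 0; r < MR; ++r) {
            const float a = A[r * strideA + k];   // one broadcast per row of A
            for (std::size_t v = 0; v < NR_VECS; ++v)
                for (std::size_t l = 0; l < W; ++l)
                    acc[r][v][l] += a * b[v][l];  // FMA: broadcast x B vector
        }
    }
    for (std::size_t r = 0; r < MR; ++r)          // single store pass at the end
        for (std::size_t v = 0; v < NR_VECS; ++v)
            for (std::size_t l = 0; l < W; ++l)
                C[r * strideC + v * W + l] = acc[r][v][l];
}
```

Note how each `b[v]` is loaded once per k step and reused across all MR rows, while each `a` broadcast is reused across all NR_VECS B vectors, exactly as described above.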
The micro-kernel holds all accumulators in registers for the entire k-loop, storing to memory only once at the end. The register budget is:
    MR × NR_VECS   accumulators (one SIMD register each)
    NR_VECS        B vectors (loaded each k step)
    1              A broadcast (reused across NR_VECS FMAs)
    ─────────────────────────────────────────────────────
    MR × NR_VECS + NR_VECS + 1  ≤  num_registers
The optimal tile size saturates available registers without spilling. MR, NR_VECS, and num_registers are defined per-architecture in the Microkernel specializations (e.g., AVX2: 16 YMM → MR=4, NR_VECS=3).
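The budget arithmetic can be checked directly against the AVX2 numbers quoted above; `registers_used` is a hypothetical helper, not part of the library.

```cpp
#include <cassert>

// Register-budget formula from the text: MR x NR_VECS accumulators
// + NR_VECS B vectors + 1 A broadcast must fit in the register file.
constexpr int registers_used(int mr, int nr_vecs) {
    return mr * nr_vecs + nr_vecs + 1;
}
```

For AVX2's 16 YMM registers, MR=4 with NR_VECS=3 uses 4·3 + 3 + 1 = 16 registers: the tile exactly saturates the register file, and any larger tile would spill.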
The output matrix C is covered by nested loops:
i-loop (rows): steps by MR; remainder rows are handled one at a time.
j-loop (columns): three passes per row band:
    1. wide: full NR-column tiles via the wide micro-kernel
    2. narrow: simdWidth-column strips via the narrow micro-kernel
    3. scalar: leftover columns, one at a time
Remainder rows (< MR) use the same three-pass column strategy with single-row variants.
    ┌──────────┬──────────┬───────┬────┐
    │ NR       │ NR       │simdW  │scl │  ← j passes
    │ wide     │ wide     │narrow │    │
  ──┼──────────┼──────────┼───────┼────┤
  MR│micro_    │micro_    │micro_ │scl │
    │kernel_   │kernel_   │kernel_│col │
    │wide      │wide      │narrow │MR  │
  ──┼──────────┼──────────┼───────┼────┤
  MR│ ...      │ ...      │ ...   │    │
  ──┼──────────┼──────────┼───────┼────┤
 rem│single_   │single_   │single_│scl │  ← remainder rows
    │row_wide  │row_wide  │row_   │    │
    │          │          │narrow │    │
    └──────────┴──────────┴───────┴────┘
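The traversal above can be sketched as a scalar driver. In this sketch a single `block_kernel` stands in for all the kernel variants (the real code dispatches to the wide/narrow micro-kernels and single-row variants); names and constants are illustrative, not the library's.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative tiling constants (not the library's).
constexpr std::size_t W = 4, MR = 4, NR_VECS = 3, NR = NR_VECS * W;

// Scalar stand-in for a kernel computing a rows x cols block of C.
static void block_kernel(const float* A, const float* B, float* C,
                         std::size_t rows, std::size_t cols, std::size_t K,
                         std::size_t lda, std::size_t ldb, std::size_t ldc) {
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] = acc;
        }
}

// Driver: i-loop steps by MR; each row band makes three column passes
// (wide NR tiles, narrow simd-width strips, scalar leftovers); remainder
// rows reuse the same column strategy one row at a time.
void gemm(const float* A, const float* B, float* C,
          std::size_t M, std::size_t N, std::size_t K) {
    auto band = [&](std::size_t i0, std::size_t rows) {
        std::size_t j = 0;
        for (; j + NR <= N; j += NR)   // pass 1: wide micro-kernel tiles
            block_kernel(A + i0 * K, B + j, C + i0 * N + j, rows, NR, K, K, N, N);
        for (; j + W <= N; j += W)     // pass 2: narrow simd-width strips
            block_kernel(A + i0 * K, B + j, C + i0 * N + j, rows, W, K, K, N, N);
        for (; j < N; ++j)             // pass 3: scalar columns
            block_kernel(A + i0 * K, B + j, C + i0 * N + j, rows, 1, K, K, N, N);
    };
    std::size_t i = 0;
    for (; i + MR <= M; i += MR) band(i, MR);   // full MR row bands
    for (; i < M; ++i) band(i, 1);              // remainder rows, one at a time
}
```

Every cell of the diagram maps to one `block_kernel` call; only the block shape changes between passes.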
The micro-kernel assumes the "favorable" layout, in which consecutive columns within each row of B are contiguous in memory, so every SIMD load of a B row segment is unit-stride.
The einsum caller is responsible for materializing transposed copies when the original layout is unfavorable. The O(N²) transpose cost is negligible against the O(N³) multiply.
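A quick sanity check of that cost argument, assuming the usual 2N³ FLOPs for the multiply and N² element moves for the transpose (a hypothetical helper for illustration):

```cpp
#include <cassert>

// Ratio of O(N^2) transpose traffic to O(N^3) multiply work (2*N^3 FLOPs).
// Simplifies to 1 / (2N), so the transpose vanishes as N grows.
constexpr double transpose_fraction(double n) {
    return (n * n) / (2.0 * n * n * n);  // = 1 / (2n)
}
```

At N=1000 the transpose is 1/2000 of the multiply's operation count, i.e. 0.05%.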
Computing a 4×12 tile of C (MR=4, NR_VECS=3, simdWidth=4):
k=0:
  b_vec[0] = load(B + 0*strideB + 0)  → [b00, b01, b02, b03]
  b_vec[1] = load(B + 0*strideB + 4)  → [b04, b05, b06, b07]
  b_vec[2] = load(B + 0*strideB + 8)  → [b08, b09, b0A, b0B]

  a_bcast = broadcast(A[0, 0])        → [a00, a00, a00, a00]
  acc[0][0] += a00 * [b00..b03]
  acc[0][1] += a00 * [b04..b07]
  acc[0][2] += a00 * [b08..b0B]

  a_bcast = broadcast(A[1, 0])        → [a10, a10, a10, a10]
  acc[1][0] += a10 * [b00..b03]
  acc[1][1] += a10 * [b04..b07]
  acc[1][2] += a10 * [b08..b0B]

  ... (rows 2, 3)
Total per k step: 3 loads + 4 broadcasts + 12 FMAs = 48 scalar multiply-adds (12 FMAs × 4 lanes). After K steps, 12 stores write the completed 4×12 tile to C.
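Those counts follow directly from the tile parameters; a small hypothetical helper (not part of the library) makes the arithmetic explicit:

```cpp
#include <cassert>

// Per-k-step operation counts for an MR x (NR_VECS * W) tile:
// NR_VECS B-vector loads, MR broadcasts, MR*NR_VECS FMAs,
// each FMA performing W scalar multiply-adds.
struct StepCounts { int loads, broadcasts, fmas, scalar_madds; };

constexpr StepCounts step_counts(int mr, int nr_vecs, int w) {
    return { nr_vecs, mr, mr * nr_vecs, mr * nr_vecs * w };
}
```

Plugging in the 4×12 example (MR=4, NR_VECS=3, W=4) reproduces the totals above: 3 loads, 4 broadcasts, 12 FMAs, 48 scalar multiply-adds per k step.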