tesseract++ 0.0.1
N-dimensional tensor library for embedded systems
Register-blocked GEMM for optimized 2D matrix multiplication.
#include "config.h"
#include "fused/microkernels/microkernel_base.h"
#include "fused/kernel_ops/kernel_helpers.h"

Classes

    struct detail::KernelGemm< T, Bits, Arch >

Namespaces

    namespace detail
Register-blocked GEMM for optimized 2D matrix multiplication.
Implements C[M,N] = A[M,K] × B[K,N] using an outer-product micro-kernel formulation that maximizes register reuse and SIMD throughput.
The naive approach computes each C[i,j] as an independent dot product: one row of A dotted with one column of B. For M=N=100, that's 10,000 dot products, each reloading the same data — terrible cache utilization.
This GEMM instead computes an MR × NR tile of C simultaneously using the outer-product formulation:
For each k in 0..K-1:
    Load NR_VECS vectors from B's row k                   (NR_VECS SIMD loads)
    For each of MR rows of A:
        Broadcast A[row, k] to a SIMD register            (1 broadcast)
        FMA broadcast × each B vector into accumulators   (NR_VECS FMAs)
Key insight: each B vector is loaded once and reused across MR rows of A. Each A broadcast is loaded once and reused across NR_VECS B vectors. This gives MR × NR_VECS × simdWidth multiply-adds per k step from only NR_VECS loads + MR broadcasts.
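The reuse pattern above can be sketched in plain C++, modeling each SIMD register as a fixed-size array. This is a minimal scalar sketch: the constants and the `micro_kernel` name are illustrative stand-ins, not this library's actual API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Illustrative constants (not the library's): 4-wide "SIMD", MR=4, NR_VECS=3.
constexpr std::size_t W = 4, MR = 4, NR_VECS = 3, NR = NR_VECS * W;
using Vec = std::array<float, W>;  // stand-in for one SIMD register

// Outer-product micro-kernel: accumulate an MR x NR tile of C entirely in
// "registers", storing to memory only once after the k-loop completes.
void micro_kernel(const float* A, const float* B, float* C,
                  std::size_t K, std::size_t strideA,
                  std::size_t strideB, std::size_t strideC) {
    Vec acc[MR][NR_VECS] = {};                // MR x NR_VECS accumulators
    for (std::size_t k = 0; k < K; ++k) {
        Vec b[NR_VECS];                       // NR_VECS loads from B's row k
        for (std::size_t v = 0; v < NR_VECS; ++v)
            for (std::size_t l = 0; l < W; ++l)
                b[v][l] = B[k * strideB + v * W + l];
        for (std::size_t r = 0; r < MR; ++r) {
            const float a = A[r * strideA + k];   // one broadcast per row of A
            for (std::size_t v = 0; v < NR_VECS; ++v)
                for (std::size_t l = 0; l < W; ++l)
                    acc[r][v][l] += a * b[v][l];  // FMA: broadcast x B vector
        }
    }
    for (std::size_t r = 0; r < MR; ++r)          // single store pass at the end
        for (std::size_t v = 0; v < NR_VECS; ++v)
            for (std::size_t l = 0; l < W; ++l)
                C[r * strideC + v * W + l] = acc[r][v][l];
}
```

Note how each `b[v]` is loaded once per k step and reused across all MR rows, while each `a` broadcast is reused across all NR_VECS B vectors, exactly as described above.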
The micro-kernel holds all accumulators in registers for the entire k-loop, storing to memory only once at the end. The register budget is:
    MR × NR_VECS   accumulators (one SIMD register each)
    NR_VECS        B vectors (loaded each k step)
    1              A broadcast (reused across NR_VECS FMAs)
    ─────────────────────────────────────────────────────
    MR × NR_VECS + NR_VECS + 1  ≤  num_registers
The optimal tile size saturates available registers without spilling. MR, NR_VECS, and num_registers are defined per-architecture in the Microkernel specializations (e.g., AVX2: 16 YMM → MR=4, NR_VECS=3).
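The budget arithmetic can be checked directly against the AVX2 numbers quoted above; `registers_used` is a hypothetical helper, not part of the library.

```cpp
#include <cassert>

// Register-budget formula from the text: MR x NR_VECS accumulators
// + NR_VECS B vectors + 1 A broadcast must fit in the register file.
constexpr int registers_used(int mr, int nr_vecs) {
    return mr * nr_vecs + nr_vecs + 1;
}
```

For AVX2's 16 YMM registers, MR=4 with NR_VECS=3 uses 4·3 + 3 + 1 = 16 registers: the tile exactly saturates the register file, and any larger tile would spill.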
The output matrix C is covered by nested loops:
i-loop (rows): steps by MR; remainder rows are handled one at a time.
j-loop (columns): three passes per row band:
    1. wide: full NR-column tiles via the wide micro-kernel
    2. narrow: simdWidth-column strips via the narrow micro-kernel
    3. scalar: leftover columns, one at a time
Remainder rows (< MR) use the same three-pass column strategy with single-row variants.
    ┌──────────┬──────────┬───────┬────┐
    │ NR       │ NR       │simdW  │scl │  ← j passes
    │ wide     │ wide     │narrow │    │
  ──┼──────────┼──────────┼───────┼────┤
  MR│micro_    │micro_    │micro_ │scl │
    │kernel_   │kernel_   │kernel_│col │
    │wide      │wide      │narrow │MR  │
  ──┼──────────┼──────────┼───────┼────┤
  MR│ ...      │ ...      │ ...   │    │
  ──┼──────────┼──────────┼───────┼────┤
 rem│single_   │single_   │single_│scl │  ← remainder rows
    │row_wide  │row_wide  │row_   │    │
    │          │          │narrow │    │
    └──────────┴──────────┴───────┴────┘
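The traversal above can be sketched as a scalar driver. In this sketch a single `block_kernel` stands in for all the kernel variants (the real code dispatches to the wide/narrow micro-kernels and single-row variants); names and constants are illustrative, not the library's.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative tiling constants (not the library's).
constexpr std::size_t W = 4, MR = 4, NR_VECS = 3, NR = NR_VECS * W;

// Scalar stand-in for a kernel computing a rows x cols block of C.
static void block_kernel(const float* A, const float* B, float* C,
                         std::size_t rows, std::size_t cols, std::size_t K,
                         std::size_t lda, std::size_t ldb, std::size_t ldc) {
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * lda + k] * B[k * ldb + j];
            C[i * ldc + j] = acc;
        }
}

// Driver: i-loop steps by MR; each row band makes three column passes
// (wide NR tiles, narrow simd-width strips, scalar leftovers); remainder
// rows reuse the same column strategy one row at a time.
void gemm(const float* A, const float* B, float* C,
          std::size_t M, std::size_t N, std::size_t K) {
    auto band = [&](std::size_t i0, std::size_t rows) {
        std::size_t j = 0;
        for (; j + NR <= N; j += NR)   // pass 1: wide micro-kernel tiles
            block_kernel(A + i0 * K, B + j, C + i0 * N + j, rows, NR, K, K, N, N);
        for (; j + W <= N; j += W)     // pass 2: narrow simd-width strips
            block_kernel(A + i0 * K, B + j, C + i0 * N + j, rows, W, K, K, N, N);
        for (; j < N; ++j)             // pass 3: scalar columns
            block_kernel(A + i0 * K, B + j, C + i0 * N + j, rows, 1, K, K, N, N);
    };
    std::size_t i = 0;
    for (; i + MR <= M; i += MR) band(i, MR);   // full MR row bands
    for (; i < M; ++i) band(i, 1);              // remainder rows, one at a time
}
```

Every cell of the diagram maps to one `block_kernel` call; only the block shape changes between passes.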
The micro-kernel assumes the "favorable" layout, in which consecutive columns within each row of B are contiguous in memory, so every SIMD load of a B row segment is unit-stride.
The einsum caller is responsible for materializing transposed copies when the original layout is unfavorable. The O(N²) transpose cost is negligible against the O(N³) multiply.
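A quick sanity check of that cost argument, assuming the usual 2N³ FLOPs for the multiply and N² element moves for the transpose (a hypothetical helper for illustration):

```cpp
#include <cassert>

// Ratio of O(N^2) transpose traffic to O(N^3) multiply work (2*N^3 FLOPs).
// Simplifies to 1 / (2N), so the transpose vanishes as N grows.
constexpr double transpose_fraction(double n) {
    return (n * n) / (2.0 * n * n * n);  // = 1 / (2n)
}
```

At N=1000 the transpose is 1/2000 of the multiply's operation count, i.e. 0.05%.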
Computing a 4×12 tile of C (MR=4, NR_VECS=3, simdWidth=4):
k=0:
  b_vec[0] = load(B + 0*strideB + 0)  → [b00, b01, b02, b03]
  b_vec[1] = load(B + 0*strideB + 4)  → [b04, b05, b06, b07]
  b_vec[2] = load(B + 0*strideB + 8)  → [b08, b09, b0A, b0B]

  a_bcast = broadcast(A[0, 0])        → [a00, a00, a00, a00]
  acc[0][0] += a00 * [b00..b03]
  acc[0][1] += a00 * [b04..b07]
  acc[0][2] += a00 * [b08..b0B]

  a_bcast = broadcast(A[1, 0])        → [a10, a10, a10, a10]
  acc[1][0] += a10 * [b00..b03]
  acc[1][1] += a10 * [b04..b07]
  acc[1][2] += a10 * [b08..b0B]

  ... (rows 2, 3)
Total per k step: 3 loads + 4 broadcasts + 12 FMAs = 48 scalar multiply-adds (12 FMAs × 4 lanes). After K steps, 12 stores write the completed 4×12 tile to C.
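Those counts follow directly from the tile parameters; a small hypothetical helper (not part of the library) makes the arithmetic explicit:

```cpp
#include <cassert>

// Per-k-step operation counts for an MR x (NR_VECS * W) tile:
// NR_VECS B-vector loads, MR broadcasts, MR*NR_VECS FMAs,
// each FMA performing W scalar multiply-adds.
struct StepCounts { int loads, broadcasts, fmas, scalar_madds; };

constexpr StepCounts step_counts(int mr, int nr_vecs, int w) {
    return { nr_vecs, mr, mr * nr_vecs, mr * nr_vecs * w };
}
```

Plugging in the 4×12 example (MR=4, NR_VECS=3, W=4) reproduces the totals above: 3 loads, 4 broadcasts, 12 FMAs, 48 scalar multiply-adds per k step.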