tesseract++ 0.0.1
N-dimensional tensor library for embedded systems
Loading...
Searching...
No Matches
Static Public Member Functions | Static Public Attributes | List of all members
SimdPaddingPolicyBase< T, SIMDWidth, Dims > Struct Template Reference

Padding policy that pads the last dimension for SIMD alignment. More...

#include <simd_padding_policy.h>

Collaboration diagram for SimdPaddingPolicyBase< T, SIMDWidth, Dims >:
Collaboration graph
[legend]

Static Public Member Functions

static constexpr Array< my_size_t, NumDimscomputeLogicalDims ()
 
static constexpr my_size_t pad (my_size_t n)
 
static constexpr my_size_t computePhysicalSize ()
 
static constexpr Array< my_size_t, NumDimscomputePhysicalDims ()
 

Static Public Attributes

static constexpr my_size_t SimdWidth = SIMDWidth
 
static constexpr my_size_t NumDims = sizeof...(Dims)
 
static constexpr Array< my_size_t, NumDimsLogicalDims = computeLogicalDims()
 
static constexpr my_size_t LastDim = LogicalDims[NumDims - 1]
 
static constexpr my_size_t PaddedLastDim = pad(LastDim)
 
static constexpr my_size_t LogicalSize = (Dims * ...)
 
static constexpr my_size_t PhysicalSize = computePhysicalSize()
 
static constexpr Array< my_size_t, NumDimsPhysicalDims = computePhysicalDims()
 

Detailed Description

template<typename T, my_size_t SIMDWidth, my_size_t... Dims>
struct SimdPaddingPolicyBase< T, SIMDWidth, Dims >

Padding policy that pads the last dimension for SIMD alignment.

Template Parameters
TElement type (float, double, Complex<float>, etc.)
DimsLogical dimensions of the tensor (e.g., 8, 6 for an 8x6 matrix)

All computations happen at compile-time. No runtime overhead.


THE PROBLEM THIS SOLVES

SIMD load instructions (e.g., _mm256_load_pd) require memory addresses to be aligned to SimdWidth * sizeof(T) bytes. For AVX with double:

For a tensor A[8, 6] stored in row-major order WITHOUT padding:

Memory layout (each cell = 1 double = 8 bytes):

Index: 0 1 2 3 4 5 | 6 7 8 9 10 11 | 12 13 14 15 16 17 | ... Row: |-— row 0 -—| |-— row 1 --—| |-— row 2 ---—| ... Address: 0 8 16 24 32 40 48 56 64 72 80 88 96 ... ^ Row 1 starts at byte 48 48 % 32 = 16 ≠ 0 → NOT ALIGNED!

Strides: [6, 1] Row bases: 0, 6, 12, 18, 24, 30, 36, 42

Alignment check (need base % SimdWidth == 0): Row 0: base=0, 0 % 4 = 0 ✓ Row 1: base=6, 6 % 4 = 2 ✗ MISALIGNED → SEGFAULT! Row 2: base=12, 12 % 4 = 0 ✓ Row 3: base=18, 18 % 4 = 2 ✗ MISALIGNED → SEGFAULT!


THE SOLUTION: PAD THE LAST DIMENSION

Pad the last dimension to the next multiple of SimdWidth.

For A[8, 6] with SimdWidth=4:

Memory layout WITH padding:

Index: 0 1 2 3 4 5 P P | 8 9 10 11 12 13 P P | 16 17 ... Row: |-— row 0 -—| pad | |-— row 1 --—| pad | ... Address: 0 8 16 24 32 40 48 56 64 72 80 88 96 104 ... ^ Row 1 starts at byte 64 64 % 32 = 0 → ALIGNED!

P = padding slots (allocated but unused, zero-initialized)

Strides: [8, 1] ← computed from PADDED dimensions! Row bases: 0, 8, 16, 24, 32, 40, 48, 56

Alignment check: Row 0: base=0, 0 % 4 = 0 ✓ Row 1: base=8, 8 % 4 = 0 ✓ Row 2: base=16, 16 % 4 = 0 ✓ Row 3: base=24, 24 % 4 = 0 ✓ ... ALL ALIGNED!


EXAMPLES FOR DIFFERENT TYPES AND DIMENSIONS

Example 1: FusedTensorND<double, 8, 6> (AVX, SimdWidth=4)

Example 2: FusedTensorND<float, 8, 6> (AVX, SimdWidth=8)

Example 3: FusedTensorND<float, 5, 10> (AVX, SimdWidth=8)

Example 4: FusedTensorND<double, 4, 4> (AVX, SimdWidth=4, already aligned)

Example 5: FusedTensorND<float, 2, 3, 5> (AVX, SimdWidth=8, 3D tensor)

Example 6: FusedTensorND<Complex<double>, 8, 6> (AVX, SimdWidth=2)

Example 7: FusedTensorND<double, 8, 6> (GENERICARCH, SimdWidth=1)


WHY ONLY PAD THE LAST DIMENSION?

In row-major storage, the last dimension is contiguous in memory. When we iterate over the last axis (stride=1), we access consecutive elements.

For A[M, N]:

Padding the last dimension ensures that EVERY "contiguous slice" starts at an aligned address:


EMBEDDED SYSTEMS CONSIDERATION

On modern desktop CPUs (Intel Haswell+, AMD Zen+), unaligned loads (loadu) have essentially zero penalty when data doesn't cross cache line boundaries.

However, on embedded systems and older architectures:

Padding trades memory for guaranteed alignment, which is often the right choice for embedded systems where predictable performance matters.


DESIGN: MICROKERNEL AS SINGLE SOURCE OF TRUTH

The SimdWidth is obtained from Microkernel<T, BITS, DefaultArch>::SimdWidth.

Why not compute it as DATA_ALIGNAS / sizeof(T)?

  1. Microkernel is the hardware abstraction layer - it owns SIMD configuration.
  2. Avoids duplication: SimdWidth is defined once, in the microkernel.
  3. Handles edge cases correctly:
    • GENERICARCH: BITS=0, DATA_ALIGNAS=0, but Microkernel::SimdWidth=1
    • Complex types: SimdWidth depends on how the microkernel packs them
  4. Custom types just need to define their Microkernel specialization. The padding policy automatically picks up the correct SimdWidth.

Dependency chain (no cycles): Microkernel (defines SimdWidth) ↓ PaddingPolicy (reads SimdWidth, computes PhysicalSize) ↓ Storage (allocates PhysicalSize elements) ↓ Tensor (uses storage)

Member Function Documentation

◆ computeLogicalDims()

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
static constexpr Array< my_size_t, NumDims > SimdPaddingPolicyBase< T, SIMDWidth, Dims >::computeLogicalDims ( )
inlinestaticconstexpr

Array of logical dimensions.

For FusedTensorND<T, 8, 6>: LogicalDims = {8, 6} For FusedTensorND<T, 2, 3, 4>: LogicalDims = {2, 3, 4}

◆ computePhysicalDims()

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
static constexpr Array< my_size_t, NumDims > SimdPaddingPolicyBase< T, SIMDWidth, Dims >::computePhysicalDims ( )
inlinestaticconstexpr

Compile-time array of physical dimensions.

Physical dims = [dim0, dim1, ..., dimN-2, PaddedLastDim]

For FusedTensorND<double, 8, 6> with SimdWidth=4:

  • Logical dims: [8, 6]
  • Physical dims: [8, 8]

For FusedTensorND<float, 2, 3, 5> with SimdWidth=8:

  • Logical dims: [2, 3, 5]
  • Physical dims: [2, 3, 8]

◆ computePhysicalSize()

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
static constexpr my_size_t SimdPaddingPolicyBase< T, SIMDWidth, Dims >::computePhysicalSize ( )
inlinestaticconstexpr

Compute total physical storage size.

Physical size = (product of all LogicalDims except last) × PaddedLastDim

For FusedTensorND<double, 8, 6> with SimdWidth=4:

  • LogicalDims = {8, 6}
  • PaddedLastDim = 8
  • PhysicalSize = 8 * 8 = 64

For FusedTensorND<float, 2, 3, 5> with SimdWidth=8:

  • LogicalDims = {2, 3, 5}
  • PaddedLastDim = 8
  • PhysicalSize = 2 * 3 * 8 = 48

For FusedTensorND<double, 8, 6> with SimdWidth=1 (GENERICARCH):

  • LogicalDims = {8, 6}
  • PaddedLastDim = 6 (no padding)
  • PhysicalSize = 8 * 6 = 48 (no overhead)

◆ pad()

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
static constexpr my_size_t SimdPaddingPolicyBase< T, SIMDWidth, Dims >::pad ( my_size_t  n)
inlinestaticconstexpr

Round up n to the next multiple of SimdWidth.

Formula: ceil(n / SimdWidth) * SimdWidth Implemented as: ((n + SimdWidth - 1) / SimdWidth) * SimdWidth

Examples with SimdWidth=4: pad(1) = 4 pad(2) = 4 pad(3) = 4 pad(4) = 4 (already aligned) pad(5) = 8 pad(6) = 8 pad(7) = 8 pad(8) = 8 (already aligned)

Examples with SimdWidth=8: pad(6) = 8 pad(10) = 16

With SimdWidth=1 (GENERICARCH): pad(n) = n (no padding needed)

Member Data Documentation

◆ LastDim

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
constexpr my_size_t SimdPaddingPolicyBase< T, SIMDWidth, Dims >::LastDim = LogicalDims[NumDims - 1]
staticconstexpr

Original (logical) last dimension.

For FusedTensorND<T, 8, 6>: LastDim = 6

◆ LogicalDims

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
constexpr Array<my_size_t, NumDims> SimdPaddingPolicyBase< T, SIMDWidth, Dims >::LogicalDims = computeLogicalDims()
staticconstexpr

◆ LogicalSize

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
constexpr my_size_t SimdPaddingPolicyBase< T, SIMDWidth, Dims >::LogicalSize = (Dims * ...)
staticconstexpr

Logical size = product of all logical dimensions.

For FusedTensorND<double, 8, 6>: LogicalSize = 48

◆ NumDims

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
constexpr my_size_t SimdPaddingPolicyBase< T, SIMDWidth, Dims >::NumDims = sizeof...(Dims)
staticconstexpr

Number of dimensions (e.g., 2 for a matrix, 3 for a 3D tensor)

◆ PaddedLastDim

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
constexpr my_size_t SimdPaddingPolicyBase< T, SIMDWidth, Dims >::PaddedLastDim = pad(LastDim)
staticconstexpr

Padded last dimension.

For FusedTensorND<float, 8, 6> with SimdWidth=8: PaddedLastDim = 8 For FusedTensorND<double, 8, 6> with SimdWidth=4: PaddedLastDim = 8 For FusedTensorND<double, 8, 6> with SimdWidth=1: PaddedLastDim = 6 (no change)

◆ PhysicalDims

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
constexpr Array<my_size_t, NumDims> SimdPaddingPolicyBase< T, SIMDWidth, Dims >::PhysicalDims = computePhysicalDims()
staticconstexpr

Physical dimensions array - all computation happens at compile-time

◆ PhysicalSize

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
constexpr my_size_t SimdPaddingPolicyBase< T, SIMDWidth, Dims >::PhysicalSize = computePhysicalSize()
staticconstexpr

Total number of elements in physical storage (including padding)

◆ SimdWidth

template<typename T , my_size_t SIMDWidth, my_size_t... Dims>
constexpr my_size_t SimdPaddingPolicyBase< T, SIMDWidth, Dims >::SimdWidth = SIMDWidth
staticconstexpr

SimdWidth from the Microkernel - the single source of truth.

Examples:

  • Microkernel<double, 256, X86_AVX>::SimdWidth = 4
  • Microkernel<float, 256, X86_AVX>::SimdWidth = 8
  • Microkernel<double, 128, X86_SSE>::SimdWidth = 2
  • Microkernel<Complex<double>, 256, X86_AVX>SimdWidth = 2
  • Microkernel<float, 0, GENERICARCH>::SimdWidth = 1

The documentation for this struct was generated from the following file: