tesseract++ 0.0.1
N-dimensional tensor library for embedded systems
Padding policy that pads the last dimension for SIMD alignment.
#include <simd_padding_policy.h>

Static Public Member Functions

  static constexpr Array<my_size_t, NumDims> computeLogicalDims()
  static constexpr my_size_t                 pad(my_size_t n)
  static constexpr my_size_t                 computePhysicalSize()
  static constexpr Array<my_size_t, NumDims> computePhysicalDims()
Static Public Attributes

  static constexpr my_size_t                 SimdWidth     = SIMDWidth
  static constexpr my_size_t                 NumDims       = sizeof...(Dims)
  static constexpr Array<my_size_t, NumDims> LogicalDims   = computeLogicalDims()
  static constexpr my_size_t                 LastDim       = LogicalDims[NumDims - 1]
  static constexpr my_size_t                 PaddedLastDim = pad(LastDim)
  static constexpr my_size_t                 LogicalSize   = (Dims * ...)
  static constexpr my_size_t                 PhysicalSize  = computePhysicalSize()
  static constexpr Array<my_size_t, NumDims> PhysicalDims  = computePhysicalDims()
Padding policy that pads the last dimension for SIMD alignment.
Template Parameters
  T     Element type (float, double, Complex<float>, etc.)
  Dims  Logical dimensions of the tensor (e.g., 8, 6 for an 8x6 matrix)
All computations happen at compile-time. No runtime overhead.
SIMD load instructions (e.g., _mm256_load_pd) require memory addresses to be aligned to SimdWidth * sizeof(T) bytes. For AVX with double, that is 4 × 8 = 32 bytes.
For a tensor A[8, 6] stored in row-major order WITHOUT padding:
Memory layout (each cell = 1 double = 8 bytes):
Index:    0   1   2   3   4   5 |  6   7   8   9  10  11 | 12  13  14  15  16  17 | ...
Row:     |------- row 0 -------| |------- row 1 -------| |------- row 2 -------| ...
Address:  0   8  16  24  32  40   48  56  64  72  80  88   96 ...
                                  ^
                                  Row 1 starts at byte 48
                                  48 % 32 = 16 ≠ 0 → NOT ALIGNED!
Strides:   [6, 1]
Row bases: 0, 6, 12, 18, 24, 30, 36, 42
Alignment check (need base % SimdWidth == 0):
  Row 0: base=0,  0 % 4 = 0  ✓
  Row 1: base=6,  6 % 4 = 2  ✗ MISALIGNED → SEGFAULT!
  Row 2: base=12, 12 % 4 = 0 ✓
  Row 3: base=18, 18 % 4 = 2 ✗ MISALIGNED → SEGFAULT!
Pad the last dimension to the next multiple of SimdWidth.
For A[8, 6] with SimdWidth=4:
Memory layout WITH padding:
Index:    0   1   2   3   4   5   P   P |  8   9  10  11  12  13   P   P | 16  17 ...
Row:     |------- row 0 -------| | pad | |------- row 1 -------| | pad | ...
Address:  0   8  16  24  32  40  48  56   64  72  80  88  96 104 ...
                                          ^
                                          Row 1 starts at byte 64
                                          64 % 32 = 0 → ALIGNED!
P = padding slots (allocated but unused, zero-initialized)
Strides:   [8, 1]  ← computed from PADDED dimensions!
Row bases: 0, 8, 16, 24, 32, 40, 48, 56
Alignment check:
  Row 0: base=0,  0 % 4 = 0  ✓
  Row 1: base=8,  8 % 4 = 0  ✓
  Row 2: base=16, 16 % 4 = 0 ✓
  Row 3: base=24, 24 % 4 = 0 ✓
  ...
  ALL ALIGNED!
Example 1: FusedTensorND<double, 8, 6> (AVX, SimdWidth=4): PaddedLastDim = 8, PhysicalSize = 8 × 8 = 64
Example 2: FusedTensorND<float, 8, 6> (AVX, SimdWidth=8): PaddedLastDim = 8, PhysicalSize = 8 × 8 = 64
Example 3: FusedTensorND<float, 5, 10> (AVX, SimdWidth=8): PaddedLastDim = 16, PhysicalSize = 5 × 16 = 80
Example 4: FusedTensorND<double, 4, 4> (AVX, SimdWidth=4, already aligned): PaddedLastDim = 4, PhysicalSize = 16
Example 5: FusedTensorND<float, 2, 3, 5> (AVX, SimdWidth=8, 3D tensor): PaddedLastDim = 8, PhysicalSize = 2 × 3 × 8 = 48
Example 6: FusedTensorND<Complex<double>, 8, 6> (AVX, SimdWidth=2): PaddedLastDim = 6 (already a multiple of 2), PhysicalSize = 48
Example 7: FusedTensorND<double, 8, 6> (GENERICARCH, SimdWidth=1): PaddedLastDim = 6 (no padding), PhysicalSize = 48
In row-major storage, the last dimension is contiguous in memory: iterating over the last axis (stride = 1) accesses consecutive elements. For A[M, N], row i begins at element offset i × N. Padding the last dimension to a multiple of SimdWidth therefore ensures that EVERY "contiguous slice" (each row) starts at an aligned address.
On modern desktop CPUs (Intel Haswell+, AMD Zen+), unaligned loads (loadu) have essentially zero penalty when data doesn't cross cache line boundaries.
However, on embedded systems and older architectures, unaligned accesses can cost extra cycles, be split into multiple loads, or fault outright.
Padding trades memory for guaranteed alignment, which is often the right choice for embedded systems where predictable performance matters.
The SimdWidth is obtained from Microkernel<T, BITS, DefaultArch>::SimdWidth.
Why not compute it as DATA_ALIGNAS / sizeof(T)? Because the Microkernel is the single source of truth for the SIMD width; deriving the value independently would risk the two definitions drifting apart.
Dependency chain (no cycles):

  Microkernel   (defines SimdWidth)
       ↓
  PaddingPolicy (reads SimdWidth, computes PhysicalSize)
       ↓
  Storage       (allocates PhysicalSize elements)
       ↓
  Tensor        (uses storage)
computeLogicalDims() [inline, static, constexpr]
Array of logical dimensions.
For FusedTensorND<T, 8, 6>:    LogicalDims = {8, 6}
For FusedTensorND<T, 2, 3, 4>: LogicalDims = {2, 3, 4}
computePhysicalDims() [inline, static, constexpr]
Compile-time array of physical dimensions.
Physical dims = [dim0, dim1, ..., dimN-2, PaddedLastDim]
For FusedTensorND<double, 8, 6> with SimdWidth=4:   PhysicalDims = {8, 8}
For FusedTensorND<float, 2, 3, 5> with SimdWidth=8: PhysicalDims = {2, 3, 8}
computePhysicalSize() [inline, static, constexpr]
Compute total physical storage size.
Physical size = (product of all LogicalDims except last) × PaddedLastDim
For FusedTensorND<double, 8, 6> with SimdWidth=4:   8 × pad(6) = 8 × 8 = 64
For FusedTensorND<float, 2, 3, 5> with SimdWidth=8: 2 × 3 × pad(5) = 2 × 3 × 8 = 48
For FusedTensorND<double, 8, 6> with SimdWidth=1 (GENERICARCH): 8 × 6 = 48 (no padding)
pad() [inline, static, constexpr]
Round up n to the next multiple of SimdWidth.
Formula: ceil(n / SimdWidth) * SimdWidth
Implemented as: ((n + SimdWidth - 1) / SimdWidth) * SimdWidth

Examples with SimdWidth=4:
  pad(1) = 4
  pad(2) = 4
  pad(3) = 4
  pad(4) = 4   (already aligned)
  pad(5) = 8
  pad(6) = 8
  pad(7) = 8
  pad(8) = 8   (already aligned)

Examples with SimdWidth=8:
  pad(6)  = 8
  pad(10) = 16

With SimdWidth=1 (GENERICARCH):
  pad(n) = n   (no padding needed)
LastDim [static, constexpr]
Original (logical) last dimension.
For FusedTensorND<T, 8, 6>: LastDim = 6
LogicalDims [static, constexpr]
Array of logical dimensions, computed at compile time via computeLogicalDims().
LogicalSize [static, constexpr]
Logical size = product of all logical dimensions.
For FusedTensorND<double, 8, 6>: LogicalSize = 48
NumDims [static, constexpr]
Number of dimensions (e.g., 2 for a matrix, 3 for a 3D tensor)
PaddedLastDim [static, constexpr]
Padded last dimension.
For FusedTensorND<float, 8, 6> with SimdWidth=8:  PaddedLastDim = 8
For FusedTensorND<double, 8, 6> with SimdWidth=4: PaddedLastDim = 8
For FusedTensorND<double, 8, 6> with SimdWidth=1: PaddedLastDim = 6 (no change)
PhysicalDims [static, constexpr]
Physical dimensions array - all computation happens at compile-time
PhysicalSize [static, constexpr]
Total number of elements in physical storage (including padding)
SimdWidth [static, constexpr]
SimdWidth from the Microkernel - the single source of truth.
Examples:
  AVX, double:          SimdWidth = 4
  AVX, float:           SimdWidth = 8
  AVX, Complex<double>: SimdWidth = 2
  GENERICARCH:          SimdWidth = 1