Work Division Planning
This document describes the multi-dimensional parallelization planning in Torch-Spyre, which determines how computational work is distributed across multiple cores for parallel execution.
Motivation
Spyre provides multiple processing cores that can execute operations in parallel. To maximize performance, the compiler must decide how to divide tensor operations across these cores. The challenges are to:
Maximize parallelism by using as many cores as possible
Ensure balanced workloads across all cores
Respect hardware memory constraints per core
Maintain correctness by respecting operation semantics
The work division planning phase analyzes each operation in the computation graph and determines a parallelization strategy based on the operation type, tensor dimensions, device layouts, and available hardware resources. In the future we wish to combine it with LX scratchpad optimization and consider optimal work divisions beyond a single operation.
Iteration Space
Each operation has an iteration space: the set of loop variables and their
ranges that together enumerate all output elements (for pointwise ops) or all
input elements (for reductions). For example, a 2D pointwise op over an output
of shape [M, N] has iteration space {c0: M, c1: N}.
Stick variables — iteration variables whose range maps to the innermost (stick)
device dimension of some tensor — are converted from element counts to stick
counts before planning. This ensures core splits always land on stick
boundaries, since each core must receive a whole number of sticks. When
multiple tensors of different dtypes share a stick variable, the conversion
uses the largest elems_per_stick across those tensors (conservative: fewer
sticks → smaller adjusted size → fewer cores assigned to that dimension).
Hardware Memory Span Constraint
Each Spyre core has a 256 MB limit on the memory span it can access. The
per-core span for a tensor is the contiguous range of device memory (in
sticks) that a single core must read or write, given a particular split
assignment. It is determined by the outermost device dimension that a core
touches: per_core_size * stride, where per_core_size is the number of
positions along that dimension each core covers.
If splitting is not applied, a large tensor may violate this limit. The planner detects violations and computes the minimum number of slices required on the responsible iteration variables to bring each tensor’s span within the limit.
For stick variables, valid slice counts are restricted to divisors of the stick count, so each core always receives a whole number of sticks. If the same iteration variable is a stick variable for one tensor and a span variable for another, and no valid slice count satisfies both constraints simultaneously, the compiler raises an error at compile time.
Planning Algorithm
Work division is implemented as two sequential compiler passes over all
operations, both in
work_division.py
and called from CustomPreSchedulingPasses in
passes.py.
Pass 1 — Span Reduction (span_reduction)
Mandatory. Runs first over all operations.
For each operation, span_reduction_pass computes the minimum splits required
to keep every tensor’s per-core memory span within 256 MB (must_split_vars).
must_split_vars processes tensors one at a time. For each tensor whose
per-core span exceeds 256 MB, it iterates over device dimensions outer to inner
and searches for the best split combination (Cartesian product of valid
divisors for the variables contributing to that dimension) that satisfies the
hardware limit. The search applies a two-tier selection: among combinations
whose total core count does not exceed max_cores, prefer the one with the
largest span that still fits within the limit (fewest cores used); if no
combination brings the span within the limit, fall back to the one with the
smallest span (most progress). Previously committed splits are carried
forward as lower bounds, narrowing the search for subsequent tensors.
The resulting minimum splits are written to op.op_it_space_splits via
apply_splits. If no span violation exists, op_it_space_splits is left
unset.
Pass 2 — Work Distribution (work_distribution)
Optional (future: graph-aware). Runs after Pass 1 has completed for all operations.
For each operation, work_distribution_pass:
Recovers the splits committed by Pass 1 by reading
op.op_it_space_splitsviaapply_splits_from_index_coeff. This uses the same coeff-keyed encoding that codegen uses, ensuring stability across compiler passes even as sympy symbols are renamed.Ranks the remaining dimensions (those not already committed by Pass 1) for additional core assignment (
prioritize_dimensions): output dimensions first by decreasing stick-adjusted size, reduction dimensions last. At most one reduction dimension is eligible for splitting — the one that maximisescore_split(size, remaining_cores)after output dimensions have absorbed their share of cores. If Pass 1 already committed a reduction split, no further reduction dimensions are eligible.Distributes all
max_coresacross committed and priority dimensions (multi_dim_iteration_space_split): first applies the committed splits as minimum requirements, then greedily assigns the largest valid divisor of each remaining dimension to the leftover core budget.
The final splits overwrite op.op_it_space_splits. The attribute is a dict
keyed by the index coefficients of the buffer’s read and write index
expressions (computed by splits_by_index_coeff in
pass_utils.py),
with each coefficient mapping to its slice count. Downstream passes can recover
an iteration-variable view by calling
apply_splits_from_index_coeff(splits, write_index, read_index, it_space).
Note
Two distinct memory limits. The 256 MB span limit in step 1 is a per-core addressable device memory constraint, set by how much DDR each core can reach in its address space. It is not the same thing as the 2 MB on-core LX scratchpad. Scratchpad allocation is a separate decision, made by the scratchpad_planning pass when LX_PLANNING is enabled (see scratchpad.py).
Operation-Specific Strategies
Pointwise Operations
The iteration space is that of the output tensor. All output dimensions are candidates for splitting. There is no reduction dimension. Span-required splits are computed jointly over all input and output tensors.
Reduction Operations
Output dimensions are split first, by decreasing size. After output dimensions
have been assigned cores, at most one reduction dimension may also be split: the
one whose size has the most useful divisors for the remaining core budget (i.e.
maximises core_split(size, remaining_cores)). If Pass 1 already committed a
reduction split to satisfy the span limit, no further reduction dimension is
split in Pass 2.
Span-required splits may include at most one reduction variable; if more than one reduction variable must be split to satisfy the 256 MB limit, the compiler raises an error.
For matrix multiplication the reduction dimension is K. Since all matrix multiply variants (mm, bmm) have exactly one K dimension, K is treated as any other reduction dimension: output dimensions (batch, M, N) take priority by decreasing size, and K is only split when the output dimensions cannot utilise all available cores.
Configuration
Work division is controlled by the SENCORES environment variable, which
specifies the maximum number of cores available for parallelization. Valid
values range from 1 (no parallelization) to 32 (maximum supported cores).
Limitations and Future Work
Current limitations:
Dimensions must divide evenly by the slice count (no uneven splits)
Only
PointwiseandReductionIR nodes are dispatched for work division;ExternKernelandFallbackKernelnodes are skipped
Potential future enhancements:
Retrieving correct padding instead of simplifying assumption
Cross-operation optimization considering data reuse and memory hierarchy
Integration with LX scratchpad memory planning
See Also
Tensor Layouts — device layouts and the stick memory model