# Coarse-Tiling Loop IR for the Spyre Backend

## Background

Spyre's compilation pipeline runs a sequence of optimization passes over `ir.Operation` objects in `CustomPreSchedulingPasses`, before Inductor's `Scheduler` is constructed. One planned optimization is **coarse-level tiling**: take a sequence of operations that share an iteration space dimension, split that dimension into K chunks (where K may be a symbolic shape), and emit the body operations inside a counted outer loop. This is the key program transformation for working set reduction: a tiling of the computation in the time domain that enables effective scratchpad utilization by reshaping the computation so that most tensors can be allocated to the scratchpad.

The output of this pass needs to survive through:

1. Inductor's `Scheduler` (which wraps each `ir.Operation` in a `SchedulerNode`)
2. Spyre's `SuperDSCScheduling.codegen_node()` (which drives `SpyreKernel` to produce `OpSpec` objects)
3. Downstream SDSC compilation (which needs an explicit loop count to generate correct hardware instructions)

This document describes how that loop structure is represented, transported, and consumed.

**Quick navigation:**

- [Design Overview](#design-overview)
- [Small Example](#small-example)
- [Layer 1 — IR pass & `coarse_tile()` API](#layer-1--pre-scheduling-ir-pass)
- [Layer 2 — `CountedLoopSchedulerNode`](#layer-2--countedloopschedulernode)
- [Layer 3 — `LoopSpec` & codegen](#layer-3--loopspec-and-codegen)
- [Files changed](#files-changed)
- [Invariants](#invariants-and-failure-modes)
- [Rejected alternatives](#rejected-design-alternatives)

## Design Overview

The tiling loop structure must be created early (before work division sees the iteration space) and preserved intact through scheduling and codegen so that the hardware executes the reduced per-iteration working set, not the full pre-tiling range. The design has three layers that correspond to the three pipeline stages above.
```
Pre-scheduling IR pass  (CustomPreSchedulingPasses)
  └─ stamps loop_group_id + loop_count on each ir.Operation
  └─ rewrites each op's ranges (divides the tiled dimension by K)
        ↓  Inductor Scheduler wraps each ir.Operation → SchedulerNode
        ↓  CustomPostFusionPasses fires
Post-fusion scheduler pass  (build_loop_scheduler_nodes)
  └─ runs BEFORE spyre_fuse_nodes
  └─ scans list[BaseSchedulerNode] for runs sharing a loop_group_id
  └─ wraps each run in a CountedLoopSchedulerNode(count=K, snodes=[...])
  └─ spyre_fuse_nodes runs after; CountedLoopSchedulerNode.can_fuse=False
     prevents cross-group merging
        ↓  Scheduler calls SuperDSCScheduling.codegen_node()
codegen_node
  └─ receives CountedLoopSchedulerNode
  └─ drives SpyreKernel for the inner ops, collecting inner OpSpecs
  └─ wraps them in LoopSpec(count=K, body=[OpSpec, ...])
  └─ LoopSpec is serialized alongside OpSpec in codegen_kernel()
```

## Small Example

Consider two chained pointwise operations over `[1024, 4096]` tensors, where `A=1024` names the row dimension and `B=4096` names the column dimension:

```python
import torch

from torch_spyre._inductor import spyre_hint
from torch_spyre._inductor.propagate_named_dims import declare_tensor_dim, name_tensor_dims

A, B = 1024, 4096
declare_tensor_dim("A", A)
declare_tensor_dim("B", B)

a = torch.randn(A, B, dtype=torch.float16).to("spyre")
b = torch.randn(A, B, dtype=torch.float16).to("spyre")
c = torch.randn(A, B, dtype=torch.float16).to("spyre")
name_tensor_dims(a, ["A", "B"])
name_tensor_dims(b, ["A", "B"])
name_tensor_dims(c, ["A", "B"])

def f(a, b, c):
    with spyre_hint(slices={"A": 2}):      # outer loop: 2 iterations over rows
        with spyre_hint(slices={"B": 4}):  # inner loop: 4 iterations over cols
            y = a + b
            z = y * c
    return z
```

Both operations are placed in a single tiling group with **K=2 in the outer loop** (splitting the 1024 rows into 2 groups of 512) and **M=4 in the inner loop** (splitting the 4096 columns into 4 groups of 1024).
Each inner-loop iteration processes a 512 × 1024 tile (1/8th of the full tensor), enabling the intermediate result `y` to remain in scratchpad across both operations within the tile. This example is the canonical small example tested by `test_hint_nested_loop_with_scratchpad` in `tests/inductor/test_coarse_tile_e2e.py`.

### What the coarse-tiling pass stamps

`coarse_tile()` sees this as a nested group spec and stamps the following attributes on **both** `ir.Operation` objects:

```python
op.loop_group_id   = (0, 0)       # depth-2 path: group 0, inner slot 0
op.loop_count      = [2, 4]       # [K_outer, M_inner]
op.loop_tiled_dims = [[0], [1]]   # outer loop tiles dim 0; inner tiles dim 1
```

`_divide_ranges` is applied once per level in outermost-first order:

1. Outer level `(K=2, [dim 0])`: `data.ranges [1024, 4096] → [512, 4096]`
2. Inner level `(M=4, [dim 1])`: `data.ranges [512, 4096] → [512, 1024]`

The per-inner-iteration `data.ranges` for both ops is `[512, 1024]`.

### LoopLevel IR after CustomPreSchedulingPasses

After `coarse_tile`, `span_reduction`, `work_distribution`, and `scratchpad_planning` have all run, the two `ComputedBuffer` objects look like this (the `_format_operations` representation with loop attributes added):

```
buf0: ComputedBuffer                  # y = a + b
  layout = FixedTiledLayout(size=[512, 1024], stride=[1024, 1],
                            device_size=[16, 512, 64])  # per-tile shape
  op_it_space_splits = {1024: 32}     # work division: 32 cores along dim 1
  loop_group_id   = (0, 0)
  loop_count      = [2, 4]
  loop_tiled_dims = [[0], [1]]
  Pointwise(
    ranges=[512, 1024],               # per-tile iteration space
    inner_fn:
      load(a, i1 + 4096*i0)
      load(b, i1 + 4096*i0)
      return a + b
  )

buf1: ComputedBuffer                  # z = y * c
  layout = FixedTiledLayout(size=[512, 1024], stride=[1024, 1],
                            device_size=[16, 512, 64])  # per-tile shape
  op_it_space_splits = {1024: 32}
  loop_group_id   = (0, 0)
  loop_count      = [2, 4]
  loop_tiled_dims = [[0], [1]]
  Pointwise(
    ranges=[512, 1024],
    inner_fn:
      load(buf0, i1 + 4096*i0)        # reads y
      load(c, i1 + 4096*i0)
      return y * c
  )
```

Key points:

- Both ops share the same `loop_group_id = (0, 0)`, `loop_count = [2, 4]`, and `loop_tiled_dims = [[0], [1]]`; this is what `build_loop_scheduler_nodes` uses to wrap them together in a `CountedLoopSchedulerNode`.
- `ranges = [512, 1024]` is the *per-tile* iteration space (1/8th of the full tensor). Work division and codegen see only this reduced space; the loop trip counts carry the information needed to reconstruct the full addressing.
- `layout.size = [512, 1024]` matches the per-tile `ranges`. The layout describes the smaller per-tile output buffer allocated for each loop iteration. Per-iteration addressing into the full HBM region is handled by `tiled_symbols` / `affine.apply` in `bundle.mlir` at runtime.
- `op_it_space_splits = {1024: 32}` is stamped by `work_distribution`: the coefficient `1024` identifies the per-tile stride-1 dimension (columns after tiling), and `32` is the number of cores dividing that dimension's work.
- `buf0` (`y`) is the intermediate result. At this point its layout is a `FixedTiledLayout` with `size=[512, 1024]`; `scratchpad_planning` later assigns it `allocation={'lx': 0}`, placing it in LX scratchpad memory at address 0. Because `y` is produced and fully consumed within the same tile iteration and its per-tile size fits in scratchpad, no HBM allocation is needed for it at all.

### Generated OpSpec (Python wrapper source)

The Python wrapper emitted by `codegen_kernel()` contains both ops inside a single nested `LoopSpec`.
Below is the actual output produced by running the e2e test `test_hint_nested_loop_with_scratchpad` (which uses `spyre_hint(slices=...)` / `declare_tensor_dim` / `name_tensor_dims` with `lx_planning=True` and `allow_all_ops_in_lx_planning=True`; concrete HBM addresses replaced with symbolic names for readability):

```python
sdsc_fused_add_mul_0 = async_compile.sdsc('sdsc_fused_add_mul_0', [
    LoopSpec(
        count=sympify('2'),  # outer K=2 loop
        body=[
            LoopSpec(
                count=sympify('4'),  # inner M=4 loop
                body=[
                    OpSpec(
                        op='add',
                        is_reduction=False,
                        iteration_space={
                            sympify('c0'): (sympify('512'), 32),
                            sympify('c1'): (sympify('1024'), 1),
                        },
                        op_info={},
                        tiled_symbols=[sympify('c0'), sympify('c1')],
                        args=[
                            TensorArg(  # input a
                                is_input=True,
                                arg_index=0,
                                device_dtype=DataFormats.SEN169_FP16,
                                device_size=[64, 1024, 64],
                                device_coordinates=[
                                    sympify('floor(c1/64)'),
                                    sympify('c0'),
                                    sympify('Mod(c1, 64)'),
                                ],
                                allocation={'hbm': },
                            ),
                            TensorArg(  # input b
                                is_input=True,
                                arg_index=1,
                                device_dtype=DataFormats.SEN169_FP16,
                                device_size=[64, 1024, 64],
                                device_coordinates=[
                                    sympify('floor(c1/64)'),
                                    sympify('c0'),
                                    sympify('Mod(c1, 64)'),
                                ],
                                allocation={'hbm': },
                            ),
                            TensorArg(  # output y (LX scratchpad)
                                is_input=False,
                                arg_index=-1,
                                device_dtype=DataFormats.SEN169_FP16,
                                device_size=[16, 512, 64],
                                device_coordinates=[
                                    sympify('floor(c1/64)'),
                                    sympify('c0'),
                                    sympify('Mod(c1, 64)'),
                                ],
                                allocation={'lx': 0},
                            ),
                        ]
                    ),
                    OpSpec(
                        op='mul',
                        is_reduction=False,
                        iteration_space={
                            sympify('c0'): (sympify('512'), 32),
                            sympify('c1'): (sympify('1024'), 1),
                        },
                        op_info={},
                        tiled_symbols=[sympify('c0'), sympify('c1')],
                        args=[
                            TensorArg(  # input y (LX scratchpad)
                                is_input=True,
                                arg_index=-1,
                                device_dtype=DataFormats.SEN169_FP16,
                                device_size=[16, 512, 64],
                                device_coordinates=[
                                    sympify('floor(c1/64)'),
                                    sympify('c0'),
                                    sympify('Mod(c1, 64)'),
                                ],
                                allocation={'lx': 0},
                            ),
                            TensorArg(  # input c
                                is_input=True,
                                arg_index=2,
                                device_dtype=DataFormats.SEN169_FP16,
                                device_size=[64, 1024, 64],
                                device_coordinates=[
                                    sympify('floor(c1/64)'),
                                    sympify('c0'),
                                    sympify('Mod(c1, 64)'),
                                ],
                                allocation={'hbm': },
                            ),
                            TensorArg(  # output z (HBM, per-tile)
                                is_input=False,
                                arg_index=3,
                                device_dtype=DataFormats.SEN169_FP16,
                                device_size=[16, 512, 64],
                                device_coordinates=[
                                    sympify('floor(c1/64)'),
                                    sympify('c0'),
                                    sympify('Mod(c1, 64)'),
                                ],
                                allocation={'hbm': },
                            ),
                        ]
                    ),
                ],
            ),
        ],
    ),
]
)
```

Key observations:

- `c0` and `c1` are Inductor's iteration-space symbols for the two dimensions. `iteration_space` reflects the per-inner-iteration tile size `[512, 1024]`.
- `tiled_symbols=[c0, c1]` records, outermost first, which symbols correspond to the tiled dimensions: `c0` drives the outer `scf.for`, `c1` the inner one.
- The intermediate tensor `y` (output of `add`, input to `mul`) has `allocation={'lx': 0}`: it lives in LX scratchpad memory at address 0. Its `device_size=[16, 512, 64]` reflects the per-tile shape `[512, 1024]`. Because `y` is produced and fully consumed within the same tile iteration, no HBM allocation is needed and its address is a fixed scratchpad offset that does not change between loop iterations (no `affine.apply` needed).
- The final output `z` (output of `mul`) has `allocation={'hbm': ...}` and `arg_index=3`: it lives in HBM. Its `device_size=[16, 512, 64]` also reflects the per-tile shape; the per-iteration write address into the full HBM buffer is computed by `affine.apply` in `bundle.mlir` (see next section).
- HBM inputs `a`, `b`, `c` have `device_size=[64, 1024, 64]`, the full tensor shape `[1024, 4096]` in Spyre stick layout. Their `device_coordinates` use `c0` and `c1` to index the per-iteration tile window into the full tensor. The per-tile output buffers (`y`, `z`) have `device_size=[16, 512, 64]`, the stick-layout shape for `[512, 1024]` fp16: 16 sticks of 64 columns across 512 rows.

### Generated `bundle.mlir`

The SDSC compiler (`compile_op_spec`) translates `tiled_symbols` into per-loop byte strides, producing a 2-dimensional `affine_map`.
For this `[1024, 4096]` fp16 tensor with Spyre stick layout (128 bytes/stick, 64 elements/stick):

- Outer stride: 512 rows × 64 sticks/row × 128 bytes/stick = 4,194,304 bytes
- Inner stride: 1024 columns / 64 elements/stick × 128 bytes/stick = 2,048 bytes

```none
#map_0 = affine_map<(d0, d1)[s0] -> (s0 + 4194304*d0 + 2048*d1)>
module {
  func.func @sdsc_bundle() {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %loop_bound_0 = arith.constant 2 : index
    %loop_bound_1 = arith.constant 4 : index
    %sym_1 = arith.constant : index
    %sym_2 = arith.constant : index
    %sym_3 = arith.constant : index
    %sym_4 = arith.constant : index
    scf.for %i_0 = %c0 to %loop_bound_0 step %c1 {
      scf.for %i_1 = %c0 to %loop_bound_1 step %c1 {
        %addr_0 = affine.apply #map_0(%i_0, %i_1)[%sym_1]
        %addr_1 = affine.apply #map_0(%i_0, %i_1)[%sym_2]
        sdscbundle.sdsc_execute (%addr_0, %addr_1) {sdsc_filename="sdsc_0.json", ...}  // add: a+b→y(lx)
        %addr_2 = affine.apply #map_0(%i_0, %i_1)[%sym_3]
        %addr_3 = affine.apply #map_0(%i_0, %i_1)[%sym_4]
        sdscbundle.sdsc_execute (%addr_2, %addr_3) {sdsc_filename="sdsc_1.json", ...}  // mul: y(lx)*c→z
      }
    }
    return
  }
}
```

Both operations share the same affine map because they operate on tensors of the same shape and stride structure. The scratchpad tensor `y` does not appear as a symbol: it has a fixed `lx` address that does not change between iterations. Each inner-loop iteration dispatches `add` then `mul` at tile `(i_0, i_1)`, keeping the intermediate result `y` in scratchpad between the two dispatches.

## Layer 1 — Pre-scheduling IR pass

### Attribute contract on `ir.Operation`

The coarse-tiling pass stamps three attributes onto each `ir.Operation` that participates in a loop group. These attributes are plain Python values attached with `setattr`; no Inductor base class is modified.

| Attribute | Type | Meaning |
|---|---|---|
| `loop_group_id` | `tuple[int, ...]` | Nesting-path tuple identifying which loop group this op belongs to. Its length equals the nesting depth. All ops sharing the same tuple form the body of the innermost counted loop at that path. |
| `loop_count` | `list[sympy.Expr]` | Trip counts, one per nesting level from outermost to innermost. For a flat (depth-1) group this is a 1-element list `[K]`. For a two-level nested group it is `[K1, K2]`. All ops sharing the same `loop_group_id` must agree on the count at every level. |
| `loop_tiled_dims` | `list[list[int]]` | Per-level positional indices into `data.ranges` that are divided by the corresponding count. For a flat group: `[[0]]` (tile only dim 0). For a two-level nested group: `[[0], [1]]` (outer loop tiles dim 0, inner loop tiles dim 1). Different ops in the same group may carry different indices if their iteration spaces are shaped differently. |

The pass also **rewrites the op's iteration ranges**: for each level, the dimensions at the corresponding indices in `loop_tiled_dims` are divided by the corresponding count in `loop_count`, so that each inner `OpSpec` describes only the work done per innermost-loop iteration.

`loop_group_id` is a tuple rather than a flat integer to support nested loops. See "Nested loops and the `loop_group_id` tree" below.

### Why these three attributes are sufficient

`loop_count` is redundant across all ops sharing the same `loop_group_id` (they must agree), but keeping it on each op means the post-fusion pass does not need to maintain a separate side table. The `loop_group_id` is the join key.

`loop_tiled_dims` is the bridge between the pre-scheduling pass (which operates on positional `data.ranges` indices) and the codegen phase (which uses named sympy Symbols): it is read by `create_op_spec` to identify, by index, which scheduler-level symbols correspond to the tiled dimensions and should be recorded in `OpSpec.tiled_symbols`. All levels are flattened (outermost first) so that `tiled_symbols` covers every loop variable for the op.
Using a list-of-lists of indices (rather than a count or a flag) allows different ops in the same loop to tile non-contiguous or differently positioned dimensions of their respective iteration spaces.

### `Loops` is a frozen dataclass

Inductor's `ir.Loops` (the base of `Pointwise` and `Reduction`) is declared `@ir_dataclass(frozen=True)`, so `data.ranges = x` raises `FrozenInstanceError`. The tiling pass uses `object.__setattr__` to bypass this:

```python
object.__setattr__(data, "ranges", ranges)
```

### Public API: `coarse_tile()`

```python
def coarse_tile(
    operations: list[Operation],
    groups: list[tuple],
    *,
    tiled_dims: list[int] | None = None,
) -> None:
```

`groups` is a pre-computed list of group tuples supplied by the caller (e.g., `config.coarse_tiling_groups_fn`). Each `ops` list must be a contiguous sub-sequence of `operations`; a gap indicates a data-flow dependency crossing the group boundary and raises `RuntimeError`.

Each group tuple takes one of two forms:

**Flat (single loop):**

```python
(ops, K)          # tile dim 0 by K (default)
(ops, K, [0, 1])  # tile dims 0 and 1 by K
```

The optional third element overrides the `tiled_dims` keyword argument for that specific group. `None` (the default) divides only dimension 0.

**Nested (multiple independent loops on the same ops):**

```python
(ops, [(K1, [0]), (K2, [1])])    # outer K1 on dim 0; inner K2 on dim 1
(ops, [(K1, [0]), (K2, [0,1])])  # outer K1 on dim 0; inner K2 on dims 0 and 1
```

The second element is a list of `(count, tiled_dims)` pairs, outermost first. The ops end up in the innermost loop body; each level's count divides the corresponding dims independently (outermost pass applied first).

`coarse_tile()` normalises flat syntax to the list-of-pairs form internally, so `_stamp_group` always works with the canonical representation.
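
The normalisation and the per-level range division described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual implementation: `normalize_group` and `divide_ranges` are hypothetical stand-ins for the internal logic of `coarse_tile()` and `_divide_ranges`, plain integers stand in for `sympy` expressions, and the divisibility check is an assumption added for the example.

```python
from __future__ import annotations

from typing import Sequence


def normalize_group(group: tuple, default_dims: list[int] | None = None):
    """Hypothetical helper: map flat or nested syntax to (ops, [(count, dims), ...])."""
    ops, spec = group[0], group[1]
    if isinstance(spec, list):
        # Nested form: already a list of (count, tiled_dims) pairs, outermost first.
        return ops, [(count, list(dims)) for count, dims in spec]
    # Flat form: (ops, K) or (ops, K, dims); a per-group dims entry overrides the default.
    dims = group[2] if len(group) > 2 else default_dims
    return ops, [(spec, list(dims) if dims is not None else [0])]


def divide_ranges(ranges: Sequence[int], levels: list[tuple[int, list[int]]]) -> list[int]:
    """Hypothetical helper: divide the tiled dims by each level's count, outermost first."""
    out = list(ranges)
    for count, dims in levels:
        for d in dims:
            if out[d] % count != 0:  # assumption: only exact divisions are accepted
                raise ValueError(f"dim {d} of size {out[d]} is not divisible by {count}")
            out[d] //= count
    return out


ops = ["add", "mul"]  # placeholders for ir.Operation objects
_, flat_levels = normalize_group((ops, 2))
_, nested_levels = normalize_group((ops, [(2, [0]), (4, [1])]))
per_tile = divide_ranges([1024, 4096], nested_levels)
```

With the nested spec from the small example, `per_tile` is the `[512, 1024]` per-iteration range shown earlier, and the flat spec `(ops, 2)` normalises to a single `(2, [0])` level.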
### Feature flag and groups callable

```python
# config.py
coarse_tiling: bool = os.environ.get("COARSE_TILING", "0") == "1"
coarse_tiling_groups_fn: Optional[Callable] = None  # overrides hint-derived groups
```

When `coarse_tiling=True` and `coarse_tiling_groups_fn` is `None`, groups are derived automatically from `spyre_hint(slices=...)` annotations via `hints_to_coarse_tile_groups` (a no-op if no hints are present). Setting `coarse_tiling_groups_fn` to a callable overrides the hint-derived groups entirely; this is intended for interim testing until the annotation framework matures and will be removed once it is complete.

`coarse_tiling_groups_fn` must be a **module-level named function**, not a lambda, because Inductor's FX graph cache pickles the config values.

### Placement in `CustomPreSchedulingPasses`

The coarse-tiling pass runs after layout finalization and before `span_reduction`:

```python
deadcode_elimination(operations)
propagate_spyre_tensor_layouts(operations)
optimize_restickify_locations(operations)
finalize_layouts(operations)
insert_restickify(operations)
insert_bmm_padding(operations)
dedup_and_promote_constants(operations)
if config.chunk_large_tensors:
    chunk_large_tensors(operations)
propagate_named_dims(operations)
assign_dim_hints(operations)
if config.coarse_tiling:
    groups = hints_to_coarse_tile_groups(operations)
    if config.coarse_tiling_groups_fn is not None:
        groups = config.coarse_tiling_groups_fn(operations)
    coarse_tile(operations, groups=groups)
span_reduction(operations)
k_fast_ops = (
    k_fast_division(operations) if config.core_id_k_fast_emission else []
)
work_distribution(operations, k_fast_ops)
if config.lx_planning:
    allocator = (
        StrategyBCoOptimizingAllocator()
        if config.co_optimizing_lx_planning
        else None
    )
    scratchpad_planning(graph, allocator=allocator)
```

This ordering is required by several constraints:

**`propagate_named_dims` and `assign_dim_hints` must run before coarse tiling.** `propagate_named_dims` propagates
`name_tensor_dims()` annotations through the op graph, attaching named dimension metadata to each `ir.Operation`. `assign_dim_hints` then combines those named dimensions with the `spyre_hint` scope annotations (attached to FX nodes as `meta["custom"]`) to produce `op.dim_hints`, a flat list of `DimHint` objects consumed by `hints_to_coarse_tile_groups` to form the coarse tiling groups.

**Must run after stickify and padding.** `propagate_spyre_tensor_layouts`, `insert_restickify`, and `insert_bmm_padding` establish the final tiled memory layout for each tensor. The tiling pass must see the post-stickify, post-padding shapes or it will split on the wrong dimension or produce a non-stick-aligned inner size.

**Must run before `work_distribution`.** `work_distribution` stamps `op_it_space_splits` on each `ir.Operation` to assign per-core work slices. It must see the already-reduced (inner) iteration space so that cores divide the per-iteration work, not the full pre-tiling iteration space. Running coarse tiling after `work_distribution` would produce `op_it_space_splits` values sized for the full range, which would then be wrong relative to the reduced `ranges` written by the tiling pass.

`span_reduction` and `k_fast_division` have the same requirement and already run before `work_distribution`, so placing `coarse_tile` with them is consistent.

`scratchpad_planning` must run after coarse tiling because it sizes scratchpad allocations to fit the per-iteration working set. If it ran before, it would see the full iteration space and allocate too much, defeating the working-set reduction that coarse tiling is designed to achieve. `scratchpad_planning` receives the full `GraphLowering` object (not just `operations`) because it needs access to graph-level metadata for buffer lifetime analysis.

### Buffer propagation: `insert_tiling_propagation`

`coarse_tile()` calls `insert_tiling_propagation(operations, groups)` immediately after stamping all loop attributes.
Its job is to ensure that any op whose result is consumed **outside** the loop (or is a graph output) exposes a complete, fully-sized buffer to its consumers. Ops whose outputs are consumed only inside the loop are marked so the unroller does not advance their base addresses.

#### Use-def analysis

For each `ComputedBuffer` in a loop group the pass asks two questions:

1. **Does this buffer have outside consumers?** A consumer is "outside" if it carries a different `loop_group_id` prefix, or has no `loop_group_id` at all. Graph outputs (recorded in the Inductor buffer's `users`/`get_alias_name` machinery) count as outside consumers.
2. **Does this buffer have inside consumers?** A consumer is "inside" if it shares the same `loop_group_id` tuple (i.e. it is another op in the same innermost loop body).

#### `per_tile_fixed` — loop-internal buffers

If a buffer has **no** outside consumers and is not a graph output, it is a per-tile scratch region that is fully overwritten and read within the same loop iteration. The pass marks it:

```python
if isinstance(op.layout, FixedTiledLayout):
    op.layout.per_tile_fixed = True
```

This flag propagates to `TensorArg.per_tile_fixed` during codegen (in `spyre_kernel.py`). The unroller (`codegen/unroll.py`) then skips two things for these args:

- **Address advance**: the base address is fixed; no per-iteration offset is added.
- **`device_size` update**: the tile geometry is not applied; the hardware uses the original allocation size, which already matches the tile.

#### Case 1 — output used both inside and outside the loop

The tiled op writes into its small, per-iteration buffer as usual. The pass allocates a full-sized HBM buffer (sized to the original pre-division ranges) and inserts a **copy op** immediately after the tiled op in the operations list.

The copy op carries the same `loop_group_id`, `loop_count`, and `loop_tiled_dims` as the original, so the scheduler wraps both ops in the same `CountedLoopSchedulerNode`.
Its `TensorArg` for the destination uses the full buffer's address; the existing `tiled_symbols` / `affine.apply` machinery in `SpyreKernel` and `bundle.py` computes the per-iteration slice offset automatically.

All outside consumers are then patched to read the full buffer instead of the tiled one.

#### Case 2 — output used only outside the loop

When no inside consumer needs the per-iteration buffer, the simplest fix is to rewire the tiled op itself to write directly into the full buffer:

```python
op.layout = MutationLayoutSHOULDREMOVE(TensorBox(StorageBox(full_buf)))
```

`MutationLayoutSHOULDREMOVE` tells Inductor that the op mutates an existing storage in-place. Because the full buffer is pre-allocated and its address is encoded in the `TensorArg` via the `tiled_symbols` offset, no copy op is needed.

#### Reduction safety checks

Before running the propagation logic for a `Reduction` op, the pass calls `_check_reduction_tiling_safety(op)` which raises `RuntimeError` in two unsupported configurations:

- **Matrix multiply** (`reduction_type == "batchmatmul"`) inside a tiling loop: the accumulation semantics are not handled.
- **Tiled reduction dim**: if any entry in `loop_tiled_dims` is `>= len(data.ranges)`, the tiled index falls in `reduction_ranges`. Accumulation-buffer support for this case is not yet implemented.

Both checks happen before any buffer allocation, so the error is clean.

## Layer 2 — `CountedLoopSchedulerNode`

### Class definition

`CountedLoopSchedulerNode` lives in `torch_spyre/_inductor/scheduler.py` alongside `SuperDSCScheduling`. It subclasses Inductor's `FusedSchedulerNode`:

```python
class CountedLoopSchedulerNode(FusedSchedulerNode):
    loop_count: sympy.Expr

    def __init__(
        self,
        scheduler,
        snodes: list[BaseSchedulerNode],
        loop_count: sympy.Expr,
    ) -> None:
        super().__init__(scheduler, snodes)
        self.loop_count = loop_count

    def unpack(self) -> list[BaseSchedulerNode]:
        # CountedLoopSchedulerNode is an atomic codegen unit; do not unpack.
        return [self]

    @classmethod
    def can_fuse(
        cls,
        producer: BaseSchedulerNode,
        consumer: BaseSchedulerNode,
    ) -> bool:
        return False
```

`unpack()` returns `[self]` to prevent Inductor's `Scheduler.process_grouped_nodes()` from dissolving the node back into its constituent `SchedulerNode`s before codegen. `can_fuse` returns `False`: a loop group is atomic; nothing can be fused into it from outside.

### Why `FusedSchedulerNode` is the right base

`CountedLoopSchedulerNode` subclasses `FusedSchedulerNode` rather than `GroupedSchedulerNode` for two reasons:

1. **Dispatch**: `Scheduler._codegen` only dispatches `FusedSchedulerNode | SchedulerNode` to `codegen_node()`. A `GroupedSchedulerNode` subclass falls through to `assert isinstance(node, NopKernelSchedulerNode)` and crashes.
2. **Unpack control**: `GroupedSchedulerNode` is unconditionally unpacked by `Scheduler.process_grouped_nodes()` at the start of codegen. `FusedSchedulerNode` is not subject to that unpack, so overriding `unpack()` is sufficient to keep the node intact.

`FusedSchedulerNode` already merges `unmet_dependencies` across all constituent nodes, exposes `get_nodes()`, and registers all constituent names in `scheduler.name_to_fused_node`. Nothing needs to be reimplemented.

### The post-fusion pass and ordering

`CountedLoopSchedulerNode`s are created by `build_loop_scheduler_nodes`, which is registered in `CustomPostFusionPasses` after `memory_planning` and immediately before `spyre_fuse_nodes`:

```python
class CustomPostFusionPasses(CustomNodePassBase):
    def get_passes(self):
        return [memory_planning, build_loop_scheduler_nodes, spyre_fuse_nodes]
```

**`build_loop_scheduler_nodes` must run before `spyre_fuse_nodes`.** If `spyre_fuse_nodes` ran first, it could merge `SchedulerNode`s from different loop groups into a single `FusedSchedulerNode`.
The loop-group pass would then see one fused node spanning multiple groups rather than the individual nodes with distinct `loop_group_id`s, and would wrap both groups in a single `CountedLoopSchedulerNode` with a single `loop_count`. Running loop grouping first ensures each group is already wrapped and opaque before `spyre_fuse_nodes` runs. `can_fuse = False` then prevents `spyre_fuse_nodes` from merging across group boundaries.

### The grouping algorithm

`build_loop_scheduler_nodes` scans the flat node list and groups contiguous runs sharing the same outermost `loop_group_id` key:

```
result = []
i = 0
while i < len(nodes):
    node = nodes[i]
    gid = _loop_group_id(node)  # reads loop_group_id from the inner ir.Operation
    if gid is None:
        result.append(node)
        i += 1
        continue
    outer_key = gid[0]
    run = [node]
    i += 1
    # Extend the run while the next node belongs to the same outermost group.
    while i < len(nodes):
        next_gid = _loop_group_id(nodes[i])
        if next_gid is None or next_gid[0] != outer_key:
            break
        run.append(nodes[i])
        i += 1
    # Recursively wrap deeper nesting within this run.
    inner = _build_loop_group(run, depth=1)
    loop_count = _loop_count(node, 0)  # outermost trip count; the whole run must agree
    result.append(CountedLoopSchedulerNode.create(inner, loop_count))
return result
```

Key invariant: because the pre-scheduling pass runs in topological order and the scheduler's topological sort preserves that order, a loop group's `SchedulerNode`s will be contiguous in the post-fusion node list. If they are not contiguous it means a data-flow constraint separates them, which is a bug in the tiling pass. The post-fusion pass asserts contiguity.
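
The contiguity invariant can be exercised on a toy model. The sketch below is an illustration only: `FakeNode` and `group_runs` are hypothetical names, the nodes are plain objects rather than Inductor `SchedulerNode`s, and only the outermost grouping level is modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeNode:
    """Stand-in for a SchedulerNode; only the loop_group_id path is modelled."""
    name: str
    loop_group_id: Optional[tuple] = None


def group_runs(nodes):
    """Group contiguous nodes by outermost group id and reject split groups."""
    result, seen, i = [], set(), 0
    while i < len(nodes):
        gid = nodes[i].loop_group_id
        if gid is None:
            # Node outside any loop group: passed through unchanged.
            result.append(nodes[i].name)
            i += 1
            continue
        outer_key = gid[0]
        # A key that already closed a run means other nodes split the group.
        assert outer_key not in seen, f"loop group {outer_key} is not contiguous"
        seen.add(outer_key)
        run = []
        while (i < len(nodes) and nodes[i].loop_group_id is not None
               and nodes[i].loop_group_id[0] == outer_key):
            run.append(nodes[i].name)
            i += 1
        result.append((outer_key, run))
    return result


nodes = [FakeNode("buf0", (0,)), FakeNode("buf1", (0,)),
         FakeNode("buf2"), FakeNode("buf3", (1,))]
grouped = group_runs(nodes)
```

Here `buf0` and `buf1` form one run, `buf2` passes through, and `buf3` forms a second run. Placing an ungrouped node between two members of group 0 would trip the assertion, which mirrors the bug condition described above.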
## Layer 3 — `LoopSpec` and codegen

### `LoopSpec` and `OpSpec.tiled_symbols` in `op_spec.py`

```python
@dataclasses.dataclass
class LoopSpec:
    count: sympy.Expr
    body: list[OpSpec | UnimplementedOp | LoopSpec]

@dataclasses.dataclass
class OpSpec:
    op: str
    is_reduction: bool
    iteration_space: dict[Symbol, tuple[Expr, int]]
    args: Sequence[TensorArg]
    op_info: dict[str, Any]
    tiled_symbols: list[Symbol] = field(default_factory=list)
```

`LoopSpec` is a peer of `OpSpec` and `UnimplementedOp` in the list that `SpyreKernel.codegen_kernel()` serializes. It is not a subclass of `OpSpec` because it has no `iteration_space`, `args`, or `op_info` of its own; those belong to the inner `OpSpec`s. The `body` type is recursive: a `LoopSpec` body may itself contain `LoopSpec` entries, representing nested counted loops.

`OpSpec.tiled_symbols` carries the per-op tiling information: all iteration-space symbols that are divided by any enclosing loop, listed **outermost first**. It is **empty for ops that are not inside a `LoopSpec`**. For a single-level tiled op, `tiled_symbols = [s0]`. For a two-level nested tiled op, `tiled_symbols = [s_outer, s_inner]`. The runtime uses this list together with the enclosing loop variables (also outermost-first) to build the affine address formula: `base + s_outer_stride * i_outer + s_inner_stride * i_inner`.

Tiling information is stored on `OpSpec` rather than on `LoopSpec` because different body ops may tile different iteration-space dimensions. Two ops in the same loop group can have different `tiled_symbols` if, for example, work division or stickification places the batch dimension at different positions in each op's iteration space. A single `int` on `LoopSpec` cannot express this; per-op `list[Symbol]` can.

### Nested loops and the `loop_group_id` tree

Each `ir.Operation` carries a `loop_group_id` that is a **path** rather than a flat integer.
A path is a tuple of integers, one element per nesting level:

| `loop_group_id` | Meaning |
|---|---|
| `(0,)` | outermost loop group 0, not nested |
| `(0, 0)` | single op nested two levels deep inside group 0 |
| `(0, 1)` | ops at depth 2 inside outer group 0, inner group 1 |

`loop_count` is a **list** parallel to the path. For a flat op at `(0,)`, `loop_count = [K]`. For a single op at `(0, 0)`, `loop_count = [K1, K2]`: the scheduler reads `loop_count[0] = K1` when building the outer `CountedLoopSchedulerNode` and `loop_count[1] = K2` when building the inner one. This allows a single op to supply the counts for all its enclosing loops without requiring sibling ops at intermediate depths.

The post-fusion pass (`_build_loop_group`) reconstructs the tree recursively:

1. Group the flat `SchedulerNode` list into runs that share the same `loop_group_id` element at index `depth`.
2. Read the count for this depth from `_loop_count(node, depth)`, which indexes `loop_count[depth - base_depth]`. All nodes in the run must agree on this count.
3. Recursively call `_build_loop_group(run, depth + 1)` to build the inner level.
4. Wrap the result in a `CountedLoopSchedulerNode` carrying the count read in step 2.

Because every op carries the full `loop_count` list, the algorithm works even when a run contains only a single op that spans all nesting levels; there is no need for placeholder ops at intermediate depths.

### Bundle boundary constraint

A `CountedLoopSchedulerNode` (at any nesting depth) and all its descendant `SchedulerNode`s must be codegen'd into a **single SuperDSC bundle**, i.e., a single `codegen_node()` call must produce the entire `LoopSpec` tree.

This is automatically satisfied because Inductor calls `codegen_node()` once per `BaseSchedulerNode` in the topological order, and a `CountedLoopSchedulerNode` is a single node that encapsulates all its children. No loop group can be split across two `codegen_node()` calls.
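
To see why a whole nest reaches `codegen_node()` as one unit, the recursive reconstruction can be modelled with plain tuples and dicts. This is a rough sketch under assumptions: `build_tree` is a hypothetical name, each op is a `(name, loop_group_id, loop_count)` triple, and a dict with `count` and `body` keys stands in for a `CountedLoopSchedulerNode`.

```python
def build_tree(ops, depth=0):
    """Rebuild the loop nest from path tuples; ops are in schedule order."""
    result, i = [], 0
    while i < len(ops):
        name, gid, counts = ops[i]
        if len(gid) <= depth:
            # The op's path ends here: it sits directly in the current body.
            result.append(name)
            i += 1
            continue
        key = gid[depth]
        run = []
        while i < len(ops) and len(ops[i][1]) > depth and ops[i][1][depth] == key:
            # Every op in a run must agree on the trip count at this depth.
            assert ops[i][2][depth] == counts[depth], "loop_count mismatch"
            run.append(ops[i])
            i += 1
        # One wrapper per run; deeper levels are built recursively.
        result.append({"count": counts[depth], "body": build_tree(run, depth + 1)})
    return result


# The two ops from the small example: path (0, 0), counts [2, 4].
ops = [("buf0", (0, 0), [2, 4]), ("buf1", (0, 0), [2, 4])]
tree = build_tree(ops)
```

The result is a single top-level entry with `count` 2 whose body holds one entry with `count` 4 containing both ops, so there is exactly one object to hand to code generation for the whole nest.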
The bundle boundary constraint also forbids a loop group from being split by Inductor fusion: `can_fuse` returns `False` on `CountedLoopSchedulerNode`, so no external node can be merged into or absorb part of a loop group.

In `bundle.py`, `generate_bundle` iterates the `list[OpSpec | LoopSpec]` emitted by `codegen_kernel()`. When it encounters a `LoopSpec` it emits SDSC JSON files for each `OpSpec` in the body (recursively) and wraps those executions in an `scf.for` in `bundle.mlir`.

### Changes to `SuperDSCScheduling.codegen_node()`

`codegen_node` already handles `FusedSchedulerNode | SchedulerNode`. `CountedLoopSchedulerNode` is recognized by an `isinstance` check:

```python
def codegen_node(
    self,
    node: Union[FusedSchedulerNode, SchedulerNode, CountedLoopSchedulerNode],
) -> None:
    if isinstance(node, CountedLoopSchedulerNode):
        self._codegen_counted_loop(node)
        return
    # existing flat-list path unchanged
    ...

def _codegen_counted_loop(self, node: CountedLoopSchedulerNode) -> None:
    # Skip ops that the scheduler has already removed.
    inner_nodes = [
        n for n in node.get_nodes()
        if n.get_name() not in self.scheduler.removed_ops
    ]
    kernel = SpyreKernel()
    all_schedule_nodes = []
    with kernel:
        for inner in inner_nodes:
            if isinstance(inner, CountedLoopSchedulerNode):
                # Nested loop: codegen its body into the same kernel.
                self._codegen_loop_body(inner, kernel, all_schedule_nodes)
            else:
                sched = self.generate_node_schedule([inner])
                all_schedule_nodes.extend(sched)
                for snode in sched:
                    var_ranges = iteration_space(snode)
                    vs = list(var_ranges.keys())
                    n_iter = len(snode._body.iter_vars)
                    index_vars = [vs[:n_iter], vs[n_iter:]]
                    snode.codegen(index_vars)
    # Wrap the collected inner specs in a LoopSpec
    kernel.wrap_op_specs_in_loop(node.loop_count)
    with V.set_kernel_handler(kernel):
        src_code = kernel.codegen_kernel()
    kernel_name = self.define_kernel(src_code, all_schedule_nodes, kernel)
    ...
```

`_codegen_loop_body` handles nested `CountedLoopSchedulerNode`s: it codegens the body ops into the existing kernel, then wraps only the newly added `op_specs` entries in an inner `LoopSpec`. The outer `_codegen_counted_loop` then wraps everything in the outer `LoopSpec` via `wrap_op_specs_in_loop`.

`SpyreKernel.wrap_op_specs_in_loop(count)` replaces the flat `self.op_specs` list with `[LoopSpec(count=count, body=self.op_specs)]`.

`generate_node_schedule` handles `FusedSchedulerNode`s that may appear among the inner nodes (e.g. from earlier passes that fused nodes within the same loop group) by flattening them into their constituent `SchedulerNode`s.

### Serialization in `codegen_kernel()`

`codegen_kernel()` already iterates `self.op_specs` to emit Python source. A `LoopSpec` entry is serialized as:

```python
LoopSpec(
    count=sympify('K'),
    body=[
        OpSpec(
            ...,
            tiled_symbols=[sympify('c0')],  # emitted only when non-empty
        ),
        LoopSpec(  # nested loop
            count=sympify('J'),
            body=[
                OpSpec(..., tiled_symbols=[sympify('c0'), sympify('c1')]),
            ],
        ),
    ],
)
```

`tiled_symbols` is populated by `SpyreKernel.create_op_spec`: it reads `loop_tiled_dims` (a `list[list[int]]`) from the `ir.Operation` (stamped by `coarse_tile()`), flattens all levels outermost-first, and selects the symbols at those indices from the scheduler-level `iteration_space` dict. `MemoryDep.ranges` preserves the `data.ranges` ordering, so this positional correspondence is stable across the pre-scheduling to codegen boundary.

`tiled_symbols` is omitted from the serialized source when empty (i.e. for ops that are not inside a loop), keeping the generated output identical to the pre-tiling baseline for non-tiled kernels.

The generated Python wrapper imports `LoopSpec` from `op_spec.py` so the serialized source is re-loadable from the Inductor cache.

The `arg_index` fixup loop (which maps tensor names to kernel argument positions) runs before serialization. It must walk the `LoopSpec` tree recursively to find all `TensorArg` objects inside nested bodies, not just the top-level `self.op_specs` list.
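The recursive walk that the `arg_index` fixup needs can be sketched as a small generator. This is an illustration on simplified stand-in dataclasses, not the real `op_spec.py` types: the `name` and `arg_index` fields on `TensorArg`, the helper `iter_tensor_args`, and the `arg_positions` mapping are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class TensorArg:  # simplified stand-in; field names are assumed
    name: str
    arg_index: int = -1


@dataclass
class OpSpec:  # simplified stand-in with only the fields the walk needs
    op: str
    args: list[TensorArg] = field(default_factory=list)


@dataclass
class LoopSpec:  # simplified stand-in with a recursive body
    count: int
    body: list[OpSpec | LoopSpec] = field(default_factory=list)


def iter_tensor_args(specs) -> Iterator[TensorArg]:
    """Yield every TensorArg in the spec tree, depth-first."""
    for spec in specs:
        if isinstance(spec, LoopSpec):
            # Descend into the loop body instead of stopping at the top level.
            yield from iter_tensor_args(spec.body)
        else:
            # Entries without an `args` attribute contribute nothing.
            yield from getattr(spec, "args", ())


specs = [
    LoopSpec(2, [
        OpSpec("add", [TensorArg("a"), TensorArg("b")]),
        LoopSpec(4, [OpSpec("mul", [TensorArg("c")])]),
    ]),
]
arg_positions = {"a": 0, "b": 1, "c": 2}  # hypothetical name -> position map
for arg in iter_tensor_args(specs):
    arg.arg_index = arg_positions[arg.name]
```

A loop over only the top-level list would see a single `LoopSpec` here and update no `TensorArg` at all, which is the failure the recursive walk avoids.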
### `bundle.mlir` generation for loops

`generate_bundle` in `bundle.py` emits one `sdscbundle.sdsc_execute` line per `OpSpec`. When a `LoopSpec` is present it emits an `scf.for` block in `bundle.mlir` wrapping the execute calls for the body ops. The loop induction variable is an `index` type running from `0` to `count` with step `1`. For the current prototype, `count` must be a concrete integer; symbolic loop counts raise `NotImplementedError`.

Emitted MLIR for a single-level loop with one body op:

```none
module {
  func.func @sdsc_bundle() {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %loop_bound_0 = arith.constant 4 : index
    scf.for %i_0 = %c0 to %loop_bound_0 step %c1 {
      sdscbundle.sdsc_execute () {sdsc_filename="sdsc_a_0.json"}
    }
    return
  }
}
```

For nested loops, `scf.for` blocks are nested and induction variables are numbered sequentially (`%i_0`, `%i_1`, ...):

```none
%loop_bound_0 = arith.constant 4 : index
%loop_bound_1 = arith.constant 8 : index
scf.for %i_0 = %c0 to %loop_bound_0 step %c1 {
  sdscbundle.sdsc_execute () {sdsc_filename="sdsc_a_0.json"}
  scf.for %i_1 = %c0 to %loop_bound_1 step %c1 {
    sdscbundle.sdsc_execute () {sdsc_filename="sdsc_a_1.json"}
  }
}
```

`generate_bundle` walks the `list[OpSpec | LoopSpec]` recursively, maintaining an indentation level and a counter for SDSC JSON filenames. The filenames are assigned in depth-first traversal order.
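The recursive walk can be sketched as follows. This is a simplified illustration rather than the real `generate_bundle`: `emit_bundle_mlir` and the stand-in dataclasses are hypothetical, the `sdsc_<kernel_name>_<n>.json` naming is inferred from the snippets above, and the sketch assumes that all loop-bound constants are declared ahead of the loop nest, as in the nested example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class OpSpec:  # simplified stand-in
    op: str


@dataclass
class LoopSpec:  # simplified stand-in with a recursive body
    count: object
    body: list[OpSpec | LoopSpec] = field(default_factory=list)


def emit_bundle_mlir(specs, kernel_name="a"):
    """Return bundle.mlir text for a list of OpSpec/LoopSpec stand-ins."""
    bounds = []  # loop-bound constants, declared before the loop nest
    body = []    # scf.for blocks and execute lines
    counters = {"loop": 0, "file": 0}

    def walk(items, indent):
        pad = "  " * indent
        for item in items:
            if isinstance(item, LoopSpec):
                if not isinstance(item.count, int):
                    raise NotImplementedError("symbolic loop counts are not supported")
                i = counters["loop"]  # loops numbered in pre-order
                counters["loop"] += 1
                bounds.append(f"    %loop_bound_{i} = arith.constant {item.count} : index")
                body.append(f"{pad}scf.for %i_{i} = %c0 to %loop_bound_{i} step %c1 {{")
                walk(item.body, indent + 1)
                body.append(f"{pad}}}")
            else:
                j = counters["file"]  # filenames numbered depth-first
                counters["file"] += 1
                body.append(
                    f'{pad}sdscbundle.sdsc_execute () '
                    f'{{sdsc_filename="sdsc_{kernel_name}_{j}.json"}}'
                )

    walk(specs, indent=2)
    lines = [
        "module {",
        "  func.func @sdsc_bundle() {",
        "    %c0 = arith.constant 0 : index",
        "    %c1 = arith.constant 1 : index",
        *bounds,
        *body,
        "    return",
        "  }",
        "}",
    ]
    return "\n".join(lines)


mlir = emit_bundle_mlir([LoopSpec(4, [OpSpec("add"), LoopSpec(8, [OpSpec("mul")])])])
print(mlir)
```

Keeping the counters in one shared dictionary is what makes the numbering independent of nesting depth: the inner loop's body op receives the next filename index even though it is emitted from a deeper recursive call.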
## Files changed

| File | Change |
|---|---|
| `torch_spyre/_inductor/op_spec.py` | Add `LoopSpec` dataclass (recursive body type); add `tiled_symbols: list[Symbol]` to `OpSpec` |
| `torch_spyre/_inductor/spyre_kernel.py` | Add `SpyreKernel.wrap_op_specs_in_loop()`; extend `codegen_kernel()` to serialize `LoopSpec` recursively; populate `OpSpec.tiled_symbols` in `create_op_spec`; fix `arg_index` fixup to walk nested bodies |
| `torch_spyre/_inductor/scheduler.py` | Add `CountedLoopSchedulerNode(FusedSchedulerNode)` with `unpack()` override; add `build_loop_scheduler_nodes` and `_codegen_counted_loop`/`_codegen_loop_body` to `SuperDSCScheduling` |
| `torch_spyre/_inductor/passes.py` | Add `coarse_tile()` call (with `hints_to_coarse_tile_groups` fallback) in `CustomPreSchedulingPasses`; add `propagate_named_dims` and `resolve_hints` calls before coarse tiling; reorder `CustomPostFusionPasses` to `[memory_planning, build_loop_scheduler_nodes, spyre_fuse_nodes]` |
| `torch_spyre/_inductor/config.py` | Add `coarse_tiling: bool` flag, `coarse_tiling_groups_fn` override callable, `bundle_hbm_symbols: bool`, `unroll_loops: bool`, and `allow_all_ops_in_lx_planning: bool` |
| `torch_spyre/_inductor/propagate_hints.py` | Add `spyre_hint(slices=...)` context manager and `get_op_hints()`; `resolve_hints` stamps hint metadata on `ir.Operation` objects; `hints_to_coarse_tile_groups` converts resolved hints to a `coarse_tile` groups list |
| `torch_spyre/_inductor/propagate_named_dims.py` | `propagate_named_dims` propagates `name_tensor_dims()` annotations from FX nodes to `ir.Operation` objects; `assign_dim_hints` combines those named dimensions with `spyre_hint` scope metadata to produce `op.dim_hints`, consumed by `hints_to_coarse_tile_groups` |
| `torch_spyre/_inductor/coarse_tile.py` | New file: `coarse_tile(operations, groups)` pass; stamps `loop_group_id`, `loop_count`, and `loop_tiled_dims` on ops; rewrites `ranges` via `object.__setattr__`; `insert_tiling_propagation` allocates full buffers for outside consumers, marks loop-internal buffers `per_tile_fixed` |
| `torch_spyre/_inductor/ir.py` | Add `per_tile_fixed: bool = False` to `FixedTiledLayout` |
| `torch_spyre/_inductor/op_spec.py` | Add `per_tile_fixed: bool = False` to `TensorArg` |
| `torch_spyre/_inductor/codegen/unroll.py` | Add `_tile_device_size` helper; apply tile-sized `device_size` and skip address advance for `per_tile_fixed` args during loop unrolling |
| `torch_spyre/_inductor/wrapper.py` | Add `LoopSpec` to the generated wrapper's import line |
| `torch_spyre/_inductor/codegen/bundle.py` | Extend `generate_bundle()` to walk `LoopSpec` tree and emit `scf.for` in `bundle.mlir`; number SDSC JSON files in depth-first order |
| `torch_spyre/execution/async_compile.py` | `sdsc()` accepts `Sequence[OpSpec | UnimplementedOp | LoopSpec]`; delegates `_find_unimplemented` to `bundle.py` |
| `tests/inductor/test_coarse_tiling.py` | Consolidated unit test suite: `LoopSpec`/`OpSpec` data structures, `coarse_tile()` IR pass, `insert_tiling_propagation`, `CountedLoopSchedulerNode`, `generate_sdsc`/`compile_op_spec` symbol paths, `generate_bundle` MLIR output and snapshot tests (104 tests) |
| `tests/inductor/test_coarse_tile_e2e.py` | End-to-end compilation tests: baseline, single group, softmax-shaped, two groups, per-group tiled dims, unrolled execution |
| `tests/inductor/test_unroll_loop_specs.py` | Unit tests for `unroll_loop_specs`: address arithmetic, `per_tile_fixed` handling, nested loops, stride computation |

## Invariants and failure modes

**Contiguity invariant**: all `SchedulerNode`s sharing a `loop_group_id` must be contiguous after the scheduler's topological sort. If the tiling pass stamps ops that have a data dependency crossing the group boundary, the post-fusion pass will detect a non-contiguous run and raise a `RuntimeError`.

**Consistent `loop_count`**: all ops sharing a `loop_group_id` must agree on `loop_count` at every depth level.
The post-fusion pass asserts this.

**`tiled_symbols` populated iff inside a loop**: `OpSpec.tiled_symbols` is non-empty exactly when the op was codegen'd inside a `CountedLoopSchedulerNode`. Its elements are the flattened (outermost-first) per-level tiled dims from `loop_tiled_dims` on the corresponding `ir.Operation`, selected from the scheduler-level `iteration_space` keys.

**Pass ordering**: coarse tiling must run after stickify/padding and before `span_reduction`, `k_fast_division`, `work_distribution`, and `scratchpad_planning`. `build_loop_scheduler_nodes` must run before `spyre_fuse_nodes` in `CustomPostFusionPasses`; see the ordering rationale above.

**Cache invalidation**: `coarse_tile.py`, `scratchpad_planning`, and all other pass source files are included in `CustomPreSchedulingPasses.uuid()` so the Inductor FX cache is invalidated when any pass changes. The `coarse_tiling_groups_fn` must be a module-level named function (not a lambda) for Inductor's cache pickling to work.

## Rejected design alternatives

### Inductor's existing loop IR

Inductor has several loop-related constructs, none of which fit the requirement.

**`ir.Loops` / `Pointwise` / `Reduction`** (`torch/_inductor/ir.py`). These have a `ranges: Sequence[Expr]` field that describes the iteration space of a *single* operation. They model per-op loop bounds, not a loop that groups multiple operations together. There is no concept of "execute this sequence of ops N times."

**`ir.WhileLoop`** (`torch/_inductor/ir.py`). A while-loop IR node for data-dependent control flow. Trip count is not statically known; not appropriate for the counted, coarse-tiling use case.

**`GroupedSchedulerNode`** (`torch/_inductor/scheduler.py`). Groups a sequence of `SchedulerNode`s so the scheduler cannot interleave other nodes between them.
This is a pure scheduling constraint: it carries no loop count, does not rewrite iteration spaces, and is **unconditionally unpacked** by `Scheduler.process_grouped_nodes()` before codegen. It also does not appear in the `FusedSchedulerNode | SchedulerNode` isinstance check in `Scheduler._codegen`, so a subclass of `GroupedSchedulerNode` would not be dispatched to `codegen_node()` at all. These limitations make `FusedSchedulerNode` the correct base instead.

**`codegen.cpp.LoopLevel` / `LoopNest`** (`torch/_inductor/codegen/cpp.py`). Codegen-time loop structures used by the C++ backend to emit nested `for` loops. They exist only during C++ code emission and have no presence in the scheduler or IR layers where Spyre's optimization passes run.

### Helion's `ForLoopGraphInfo`

Helion (`helion/_compiler/device_ir.py`) represents loops as `ForLoopGraphInfo` nodes. Each node wraps a nested FX sub-graph (referenced by `graph_id`) and a `block_ids` list that determines which tile dimensions participate in the loop. The FX graph for the outer scope contains a `_for_loop(graph_id, begin, end, args)` node (`helion/language/_tracing_ops.py`) as a placeholder. A companion `ReductionLoopGraphInfo` handles reduction loops.

This design is well-suited to Helion's tile-strategy-driven GPU compilation model, where the loop structure is discovered during tracing and the body is a reusable sub-graph. It is a poor fit for Spyre's pipeline for three reasons:

1. **Wrong representation layer.** Spyre's optimization passes operate on `list[ir.Operation]` before the Inductor `Scheduler` exists. Helion's loop nodes live in an FX graph; adopting that representation would require building and maintaining a parallel FX graph for the pre-scheduling IR, adding substantial complexity.
2. **Tile strategy coupling.** `ForLoopGraphInfo` carries `block_ids` that reference Helion's tile strategy objects. Spyre has no tile strategy layer; loop structure comes from the coarse-tiling pass decision, not from a tiling configuration object.
3. **Sub-graph identity vs. flat sequence.** Helion identifies loop bodies by an opaque `graph_id` and looks them up in a registry. For Spyre's use case (a contiguous run of `SchedulerNode`s that must stay together), a flat ordered list inside `CountedLoopSchedulerNode` is simpler and directly matches what `codegen_node` already iterates.

The key insight borrowed from Helion is that the loop body should be a *separate, named structure* rather than an attribute on individual ops. That insight shaped the decision to make `CountedLoopSchedulerNode` a first-class scheduler node (rather than stamping a loop-count attribute on each `SchedulerNode` and reconstructing the grouping at codegen time).

### Attribute-only approach (Option B)

An earlier candidate design stamped `loop_group_id` and `loop_count` directly onto `ir.Operation` objects and deferred all grouping to `codegen_node()`, which would scan the flat `node_schedule` list and reconstruct loop boundaries at codegen time.

This was rejected because it is too fragile for a transformation that must preserve correctness. If the scheduler ever reorders nodes within what the tiling pass intended to be a loop group, or if a group boundary does not align perfectly with a fused-node boundary, the reconstruction in `codegen_node()` silently produces wrong output: incorrect trip counts or mismatched iteration spaces. With coarse tiling these are correctness bugs, not performance bugs. `CountedLoopSchedulerNode` enforces the grouping structurally: the scheduler cannot split or reorder within it, and a mismatch is caught at post-fusion pass time rather than silently at codegen time.

## Out of scope

- Loops whose trip count is data-dependent (use `ir.WhileLoop` for that).
- Fusing a non-tiled op into the body of a `CountedLoopSchedulerNode`.
- Passing the loop induction variable into an `OpSpec` body (ops inside a loop do not currently use the induction variable; each iteration executes identically on a different slice of the data determined by the reduced iteration space).
- Symbolic loop counts in `bundle.mlir` (these currently raise `NotImplementedError`; supporting them requires runtime shape plumbing into the MLIR function signature).