Inductor Front-End: Deep Dive

This page provides a detailed reference for the Torch-Spyre Inductor front-end compiler. For a high-level overview of the full compilation pipeline, see Compiler Architecture.

Torch-Spyre compilation pipeline showing upstream versus custom components — The Torch-Spyre compilation pipeline. The left end (green) is entirely upstream PyTorch — Dynamo/Autograd and Inductor. The right end (pink) is Torch-Spyre’s custom Inductor backend, which generates OpSpecs, SuperDSCs, and host code. Torch-Spyre also adds configurations and extensions to the upstream stages to tailor them for the Spyre device.

Inductor Backend Registration

At import time the Spyre backend registers three components with Inductor. Together they take the place of the Triton/CUDA codegen path on a GPU:

Component	Module	Role
`SuperDSCScheduling`	`scheduler.py`	Inductor backend scheduling class. Decides how to group and order operations on the LoopLevelIR. Replaces Triton scheduling.
`SpyrePythonWrapperCodegen`	`wrapper.py`	Inductor wrapper-codegen class. Generates the Python wrapper that allocates tiled buffers via `spyre_empty_with_layout()` and dispatches kernels via `async_compile.sdsc()`.
`SpyreDeviceOpOverrides`	`device/op_overrides.py`	Device-specific op overrides surfaced to Inductor.

The Spyre-specific Inductor configuration (decompositions, lowerings, the mm_to_bmm_pass that rewrites 2D matmul into 3D bmm for better core utilization, fusion heuristics, and dataflow-friendly Inductor config overrides) is activated through a single context manager:

from torch_spyre._inductor.patches import enable_spyre_context

with enable_spyre_context(...):
    compiled = torch.compile(model, backend="spyre")

enable_spyre_context is the central entry point that wires everything together. The three registrations above happen earlier, at package import time, in _inductor/__init__.py.

Extending Compilation

The front-end adds compilation passes into upstream Inductor via six extension points, all registered in passes.py:

Extension Point	Stage	Purpose
`CustomPreGradPasses`	Pre-grad FX graph	Reserved for graph rewrites before autograd partitioning. The pipeline is empty today.
`CustomPrePasses`	Post-grad FX graph (early)	`collect_spyre_hints` snapshots `spyre_hint` annotations so they survive AOT re-tracing.
`CustomPostPasses`	Post-grad FX graph (late)	Late post-grad rewrites: `recover_spyre_hints`, `mm_to_bmm_pass`, `mark_direct_unit_bmm_pass`, `bmm_unflatten_pass`.
`CustomPreFusionPasses`	LoopLevelIR (pre-fusion)	Pre-fusion scheduler passes: `propagate_mutation_layouts`, `build_loop_scheduler_nodes`.
`CustomPostFusionPasses`	LoopLevelIR (post-fusion)	Post-fusion scheduler passes: `memory_planning`, `spyre_fuse_nodes`.
`CustomPreSchedulingPasses`	LoopLevelIR (pre-scheduler)	The pre-scheduling pipeline that runs immediately before the Scheduler is constructed (wired in via a `GraphLowering._update_scheduler` monkey-patch in `patches.py`). The full step list is in LoopLevelIR Passes below.

FX Graph Passes

Transformations on the FX Graph tend to be simpler to implement, but happen before the layout of intermediate Tensors in device memory has been computed. Therefore they need to be layout-agnostic. Some examples of passes that are appropriate to perform at this level are:

replacing constants with size 1 tensors
normalizing matrix multiplies to add padding (also applied to mm and bmm)

LoopLevelIR Passes

Passes on the LoopLevelIR run late in compilation. CustomPreSchedulingPasses dispatches them in a fixed order. Each step takes the GraphLowering and mutates graph.operations in place. Steps marked “Gated” are skipped when their config flag is off.

#	Pass	Module	Notes
1	`deadcode_elimination`	deadcode_elimination.py	Drops unreachable ops.
2	`split_multi_ops`	split_multi_ops.py	Splits multi-op loop bodies (e.g. type conversion + arithmetic) into separate single-op buffers and materializes constant args as `SpyreConstantFallback`.
3	`propagate_spyre_tensor_layouts`	propagate_layouts.py	Stamps `FixedTiledLayout` on every `ComputedBuffer`.
4	`validate_ops`	split_multi_ops.py	Checks that each op’s inputs share the same `ElementArrangement`. Runs after layout propagation, when the `SpyreTensorLayout`s are available.
5	`optimize_restickify_locations`	optimize_restickify.py	Moves restickify ops to better placements before the layout is finalized.
6	`finalize_layouts`	insert_restickify.py	Settles tile-structure decisions before any new restickify is inserted.
7	`insert_restickify`	insert_restickify.py	Adds explicit re-tile ops where adjacent ops disagree on layout.
8	`insert_post_mutation_restickify`	insert_restickify.py	Handles restickification for slice-mutation buffers.
9	`insert_bmm_padding`	padding.py	Pads `mm` and `bmm` operands to satisfy hardware alignment.
10	`dedup_and_promote_constants`	dedup_constants.py	Deduplicates identical constants and promotes shared ones.
11	`propagate_named_dims`	propagate_named_dims.py	Propagates `name_tensor_dims` annotations from inputs through the op graph.
12	`assign_dim_hints`	propagate_named_dims.py	Lowers each `spyre_hint` scope to a per-op `DimHint` list.
13	`coarse_tile`	coarse_tile.py	Runs only when hint-derived loop groups are present. Wraps each group in nested counted loops.
14	`span_reduction`	work_division.py	Reduces per-core access spans to fit the hardware memory budget.
15	`cost_model_matmul_division` + `work_distribution`	work_division.py	The cost-model pass claims a subset of matmuls; `work_distribution` covers the rest.
16	`scratchpad_planning`	scratchpad/	Gated by `config.lx_planning`. Allocates the LX scratchpad.

Once stickification has run, every ComputedBuffer carries a FixedTiledLayout, so the later passes can take device layout into account when making decisions.

For deeper treatment of individual passes see Working Set Reduction, Coarse-Tiling Loops, Work Division Planning, and Scratchpad Planning.

Views and Index Translation

Real models lean on views heavily. Here is the RoPE block from Granite:

def rope(cached_freqs, q):
    q_ = q.view(2, 256, 32, 128).view(2, 256, 32, 2, 64)         # B L H 2 D/2
    mul_out = cached_freqs[:, :, None, :, :, :] * q_.unsqueeze(-3)  # B L H 2 2 D/2
    sum_out = mul_out.sum(4, keepdim=True)                        # B L H 2 1 D/2
    return sum_out.flatten(3)                                     # B L H D

Two view calls, an unsqueeze, a reduction, and a flatten, all on the hot path of inference. Materializing a tensor copy at every one of those view boundaries would erase any benefit from tiling. The rest of this section walks through how the compiler keeps that from happening.

PyTorch models are full of view operations: reshape, view, transpose, permute, flatten, unsqueeze, slicing, and so on. A single transformer block in Granite goes through dozens of them. On Spyre we want most of these views to cost nothing at runtime; materializing a copy every time a tensor is reshaped would defeat the point of tiling.

Worked example of how a shared Inductor index becomes per-tensor device coordinates — A worked end-to-end example. Two tensors with different PyTorch shapes (`x` is rank-3, `y` is rank-2) both flow through one shared Inductor index expression. The compiler then lifts that single expression into a distinct device-coordinate vector for each tensor, and finally co-simplifies the iteration space so the integer divisions and modulos collapse into ordinary loop variables. The two tensors end up with different per-argument dim orders, which SuperDSC handles natively.

Inductor gives us a useful starting point. When it lowers a graph that involves views, it normalizes everything onto a shared iteration space and emits a single per-output index expression. For example, this code:

x = torch.rand(50, 10, 200, dtype=torch.float16)
y = torch.rand(500, 200, dtype=torch.float16)

def f(x, y):
    return x.flatten(0, 1) + y

result = torch.compile(f)(x.to("spyre"), y.to("spyre")).cpu()

produces an Inductor body that looks roughly like:

var_ranges = {p0: 500, p1: 200}
index0 = 200*p0 + p1

def body(self, ops):
    get_index = self.get_index('index0')
    load   = ops.load('arg0_1', get_index)
    load_1 = ops.load('arg1_1', get_index)
    add    = ops.add(load, load_1)
    store  = ops.store('buf0', get_index, add, None)
    return store

Both tensors share index0 even though x was rank-3 and y was rank-2 in the original program: the flatten was absorbed into a single linear expression 200*p0 + p1. From here, the Spyre compiler has to turn that one expression into per-tensor device coordinates. There are three steps.

1. Lift index expressions to device coordinates. The host iteration variables (p0, p1) describe positions in the PyTorch shape. They are mapped into per-tensor device coordinate expressions that walk the tiled, padded device shape. Continuing the example:

Host vars	x: device shape `[10, 4, 50, 64]`	y: device shape `[4, 500, 64]`
`{p: 500, s: 200}`, `index0 = 200*p + s`	`[p%10, s//64, p//10, s%64]`	`[s//64, p, s%64]`

The stick dimension always comes out as a pair of expressions, one for the tile index (floor(s/64)) and one for the intra-stick offset (Mod(s, 64)).

2. Co-simplify the iteration space and the per-tensor coordinates. A naive translation leaves expensive integer divisions in place. The front-end factors the iteration space so the divisions and modulos disappear. Splitting p into (q, p) with q = p // 10, p = p % 10 gives:

Iteration space	x	y
`{p: 10, q: 50, s: 200}`	`[p, s//64, q, s%64]`	`[s//64, q, p, s%64]`

Each tensor is now indexed with the same iteration variables, but the expressions inside its coordinate vector are simple and the dim orders are different per tensor. That is exactly what the SuperDSC IR allows (layoutDimOrder_ is per-argument). All of this is held in torch_spyre/_inductor/views.py (compute_coordinates, align_tensors, normalize_coordinates).

3. Emit the OpSpec. The simplified iteration space and per-tensor device coordinates are dropped onto the OpSpec and TensorArgs. The “Example: an add OpSpec” section in the Back-End Compiler doc walks through what one of these artifacts looks like.

The net result for the user is what you would expect: ops on tensor views run without cloning whenever the compiler can express the new layout as a different read pattern over the same storage. When that is not feasible (for example when a downstream op forces a different stick dimension), the insert_restickify pass adds an explicit re-stick operation so the rest of the pipeline still sees a clean layout.

Code Generation

We do code generation in three stages.

LoopLevelIR nodes are fused together to form Kernels.
Each Kernel is processed by spyre_kernel.py to convert it to a list of OpSpec (op_spec.py).
Finally, the codegen/ package translates OpSpec into SuperDSC JSON — the input format for the DeepTools back-end compiler.

Our intent is that the OpSpec will capture all important semantic information about the operation in a more human readable form than the SuperDSC JSON. Therefore, the OpSpec should be the primary artifact used to understand the output of the front-end compiler. Inspecting the SuperDSC JSON should only be necessary when debugging problems in the codegen package of the front-end compiler.

Extending Operations

We extend Inductor to compile Spyre-specific operations by adding Custom Operations. We modify how existing operations are compiled by adding Spyre-specific decompositions and lowerings. See Adding Operations for a step-by-step guide.

Custom Operations

Spyre-specific operations with no ATen equivalent are defined in customops.py using @torch.library.custom_op. Each custom op requires:

A signature definition (@custom_op)
A fake/meta function (@opname.register_fake)
Either a lowering + SpyreOpFuncs entry, or a decomposition that removes it from the graph before lowering

Decompositions

Spyre-specific decompositions are registered with @register_spyre_decomposition in decompositions.py. Decompositions transform complex ATen operations into simpler primitives before the graph is lowered to loop-level IR.

Lowerings

Spyre-specific lowerings to Inductor’s LoopLevelIR are defined in lowering.py using the @register_spyre_lowering decorator. This mechanism supports both the replacement of upstream lowerings and the addition of new lowerings for Spyre-specific custom operations.

Module Reference

The headline modules above are the ones a contributor reaches for first. The front-end is also made up of a number of smaller modules; the table below names each and points to the source.

Module	Purpose
`passes.py`	The six extension-point classes, plus `_format_operations`, which renders the LoopLevelIR before and after the pre-scheduling pipeline.
`temp_passes.py`	Transitional FX-graph rewrites registered in `CustomPostPasses`: `mm_to_bmm_pass`, `mark_direct_unit_bmm_pass`, and `bmm_unflatten_pass`. The “temp” name reflects the plan to retire them as upstream Inductor evolves.
`propagate_layouts.py`	`propagate_spyre_tensor_layouts` and `propagate_mutation_layouts`. Assigns `FixedTiledLayout` to every `ComputedBuffer`.
`split_multi_ops.py`	`split_multi_ops` and `validate_ops`. Splits multi-op loop bodies into single-op buffers (materializing constant args as `SpyreConstantFallback`) and validates that op inputs share the same `ElementArrangement`.
`optimize_restickify.py`	Optimizes restickify operations inserted by layout propagation.
`insert_restickify.py`	`finalize_layouts`, `insert_restickify`, `insert_post_mutation_restickify`. Settles tile-structure decisions, adds explicit re-tile ops where adjacent ops disagree on layout, and handles restickification for slice-mutation buffers.
`memory_planning.py`	Memory planning for Spyre device buffers.
`deadcode_elimination.py`	`deadcode_elimination` for the pre-scheduling LoopLevelIR.
`work_division.py`	`span_reduction`, `cost_model_matmul_division`, `work_distribution`, `divide_pointwise_op`, `divide_reduction_op`, `apply_splits`. Three-pass work division: span reduction (mandatory), cost-model matmul division (claims a subset of matmuls), and work distribution (covers the rest).
`scratchpad/`	`scratchpad_planning` and the layout-solver framework (`GreedyLayoutSolver`, `FirstFitLayoutSolver`, `BestFitLayoutSolver`). LX scratchpad allocation.
`propagate_hints.py`	`spyre_hint` context manager, the `DimHint` dataclass, and `collect_spyre_hints` / `recover_spyre_hints` for surviving AOT re-tracing.
`propagate_named_dims.py`	`declare_tensor_dim`, `name_tensor_dims`, `propagate_named_dims`, and `assign_dim_hints`. Propagates named-dim metadata through the op graph and lowers `spyre_hint` scopes to per-op `DimHint` lists.
`coarse_tile.py`	`coarse_tile`, `hints_to_coarse_tile_groups`, `insert_tiling_propagation`, `reorder_unhinted_interlopers`, `span_overflow_groups`. Wraps each hint-derived group in nested counted loops and scales per-iteration ranges.
`dedup_constants.py`	`dedup_and_promote_constants`. Deduplicates identical constants in the LoopLevelIR.
`loop_info.py`	`CoarseTileInfo`, `copy_op_metadata`. Per-op metadata stamped by `coarse_tile()` and consumed by the scheduler, kernel codegen, and buffer-propagation pass.
`dtype_ops.py`	`DtypeOpTable`. Lookup table mapping PyTorch dtype pairs to Spyre hardware dtype-conversion operators.
`pass_utils.py`	Shared helpers for the pre-scheduling pipeline, including `splits_by_index_coeff` and `apply_splits_from_index_coeff`. These translate between iteration-variable splits and the index-coefficient-keyed splits stored on `ComputedBuffer.op_it_space_splits`.
`views.py`	`compute_coordinates`, `align_tensors`, `normalize_coordinates`, `Term`, `matching_dim`. Coordinate computation for memory-dep expressions and tensor alignment for fused kernels. Used by `spyre_kernel.py` and `work_division.py`.
`multi_dim_reduction_pass.py`	`decompose_multi_dim_reductions`. Splits a multi-dim reduction into a sequence of single-dim reductions. Not currently registered in any pass pipeline.
`op_spec.py`	`OpSpec`, `TensorArg`, `UnimplementedOp`. The high-level per-operation artifact emitted by `spyre_kernel.py`.
`spyre_kernel.py`	Converts a fused kernel into a list of `OpSpec`s.
`padding.py`	`insert_bmm_padding`. Pads `mm` and `bmm` operands to satisfy hardware alignment.
`fusion.py`	`spyre_fuse_nodes`. Post-fusion scheduler pass.
`wrapper.py`	`SpyrePythonWrapperCodegen`. The host-code generator that produces the Python wrapper around device kernels.
`codegen/`	`superdsc.py` (`SDSCArgs`, `SDSCSpec`, `parse_op_spec`, `compile_op_spec`, `_get_core_to_slice_mapping`, `_get_padded_iteration_space`, `_get_op_dim_labels`), `compute_ops.py` (`generate_sdsc`), `bundle.py` (`generate_bundle`), `unroll.py` (loop unrolling for coarse-tiling LoopSpec trees). Translates `OpSpec` into SuperDSC JSON for the back-end compiler.
`config.py`	Spyre-specific Inductor configuration: `SENCORES`, `LX_PLANNING`, `DXP_LX_FRAC_AVAIL`.
`ir.py`	`FixedTiledLayout`, `SpyreConstantFallback`, `SpyreEmptyFallback`, `SpyreReduction`. Core IR types used throughout the pre-scheduling pipeline.
`choices.py`	Spyre-specific `InductorChoices` overrides (e.g. reduction split heuristics).
`errors.py`	`Unsupported` exception class for ops or configurations not yet handled by the Spyre backend.
`indirect_access.py`	Indirect-access helpers for gather/scatter ops on `OpSpec`.
`span_overflow_hint_analysis.py`	`plan_span_overflow_tile`, `SpanOverflowTilePlan`. Generates coarse-tiling hints for ops whose per-core span exceeds the hardware limit.
`constants.py`	Shared constants: `DEVICE_NAME`, `BATCH_MATMUL_OP`, `TOPK_OPS`.
`logging_utils.py`	`get_inductor_logger` and Spyre Inductor logging setup.

torch.compile(..., dynamic=True) is supported through the static-binary path. Shapes are specialized at compile time and the resulting binary is reused across calls with the same input geometry.