Back-End Compiler (DeepTools)
The back-end compiler is a proprietary component called DeepTools, developed by IBM. It takes the SuperDSC JSON specifications produced by the Torch-Spyre front-end and generates optimized Spyre program binaries.
Responsibilities
The back-end compiler is responsible for:
Dataflow mapping — mapping SuperDSC operations to optimized Spyre dataflows and execution patterns
Core scheduling — determining the precise execution order and timing of operations across cores
Binary generation — producing the executable program binaries loaded onto the Spyre device at runtime
Front-End Artifacts
For every function compiled with torch.compile, the front-end emits two
kinds of artifacts:
| Artifact | Consumer | Purpose |
|---|---|---|
| SuperDSC JSON | DeepTools | Per-kernel operation specification: tensor layouts, work division, OpFunc selection. DeepTools turns each SuperDSC into a device binary that includes both compute and the load/store sequence that stages tiles from LPDDR5 into the LX scratchpad. |
| DCI (Data Conversion Information) | Runtime ( | The |
SuperDSC Format
SuperDSC (Super Design Space Config) is a JSON-based intermediate representation that describes the full tile-level compute graph for all 32 Spyre cores. Each artifact is self-contained: it carries everything the hardware needs to execute one scheduled operation deterministically across every core.
Top-level structure
| Field | Purpose |
|---|---|
| | How the iteration space is divided across cores (for example |
| | Number of work slices per iteration dimension. |
| | Maps each core ID to the slice indices it owns. |
| | Array of |
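To make the work-division idea concrete, here is a minimal sketch of how a per-dimension slice count could be turned into a core-to-slice mapping. The function name `assign_slices`, the example split, and the row-major numbering of cores are all assumptions made for illustration; the actual SuperDSC field names and ordering are not shown here and may differ.

```python
from itertools import product

NUM_CORES = 32  # core count stated in this document


def assign_slices(slices_per_dim):
    """Map each core ID to one slice index per iteration dimension.

    Assumption: cores are numbered in row-major order over the slice grid.
    """
    dims = list(slices_per_dim)
    grid = product(*(range(slices_per_dim[d]) for d in dims))
    return {core_id: dict(zip(dims, idx)) for core_id, idx in enumerate(grid)}


# Hypothetical split: 2 slices along c0, 16 along c1, z0 left whole (2 * 1 * 16 = 32).
core_to_slices = assign_slices({"c0": 2, "z0": 1, "c1": 16})
assert len(core_to_slices) == NUM_CORES
print(core_to_slices[17])  # {'c0': 1, 'z0': 0, 'c1': 1}
```

The product of the slice counts has to equal the number of cores for every core to own exactly one slice combination.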
Each dscs_ entry is a complete description of one compute configuration:
| Field | Purpose |
|---|---|
| | Full iteration-space extents. |
| | Per-core dimension sizes for the steady-state ( |
| | Tiling information per logical role ( |
| | Tensor descriptors. Each entry pairs a tensor argument with its |
| | Allocate nodes, one per tensor, with memory placement (device memory, labeled hbm in the IR, or LX scratchpad), dimension ordering, per-core start addresses via fold mappings, and coordinate information. |
| | One entry per operation, encoding the execution unit ( |
Folding and affine transforms
SuperDSC stays compact through folding. A single parameterized artifact can describe behavior across cores, corelets, rows, and time steps without repeating itself. Fold properties use affine transforms of the form alpha * index + beta to compute per-core coordinates and addresses:
{"Affine": {"alpha_": 64, "beta_": 0}}
The result is that one JSON file describes the behavior of all 32 cores.
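As a rough illustration of how one fold property expands into per-core values, the sketch below evaluates the affine transform above for each of the 32 cores. The helper `eval_affine` is hypothetical, and treating the index as the core ID is an assumption; the same formula would apply to corelet, row, or time-step indices.

```python
import json


def eval_affine(fold, index):
    """Evaluate alpha * index + beta for one fold property."""
    affine = fold["Affine"]
    return affine["alpha_"] * index + affine["beta_"]


fold = json.loads('{"Affine": {"alpha_": 64, "beta_": 0}}')

# Assumption: the index is the core ID, so consecutive cores are 64 units apart.
start_addresses = [eval_affine(fold, core_id) for core_id in range(32)]
print(start_addresses[:4])  # [0, 64, 128, 192]
print(start_addresses[-1])  # 1984
```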
Note
The hbm field name appearing throughout the SuperDSC IR is a legacy label that refers to device memory in general. Spyre’s actual device memory is LPDDR5, not HBM.
Codegen pipeline (front-end to SuperDSC)
Three components in the front-end collaborate to produce a SuperDSC artifact for each scheduled node:
SpyreKernel (spyre_kernel.py) collects the iteration space from the scheduler and builds an RValue AST that represents the computation. Node types include TensorAccess, PointwiseOp, ReductionOp, and Constant. Leaves are tensor reads or constants; internal nodes are operations.
OpSpec (op_spec.py) wraps the kernel’s output in a structured descriptor: the operation name, the iteration space encoded as SymPy symbolic expressions, tensor arguments annotated with device coordinates (tile index and intra-stick offset), plus any auxiliary information.
generate_sdsc() (codegen/compute_ops.py) takes the OpSpec and emits the final JSON IR. Symbolic expressions are resolved to concrete loop bounds, tiling parameters are expanded, and the scheduleTree_ is assembled. The output is written as JSON (for example sdsc_0.json), which DeepTools then consumes to produce the device binary.
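The shape of the RValue AST described above can be sketched with a few dataclasses. This is an illustration only: the node names come from the text, but the fields, the `leaves` helper, and the class layout are assumptions and do not reflect the actual definitions in spyre_kernel.py.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Constant:
    value: float  # leaf: a literal value


@dataclass(frozen=True)
class TensorAccess:
    arg_index: int  # leaf: a read of one tensor argument


@dataclass(frozen=True)
class PointwiseOp:
    op: str
    operands: Tuple["RValue", ...]  # internal node: elementwise operation


@dataclass(frozen=True)
class ReductionOp:
    op: str
    operand: "RValue"  # internal node: reduction over its operand


RValue = Union[Constant, TensorAccess, PointwiseOp, ReductionOp]


def leaves(node):
    """Collect the leaves (tensor reads and constants) in left-to-right order."""
    if isinstance(node, (Constant, TensorAccess)):
        return [node]
    if isinstance(node, PointwiseOp):
        return [leaf for child in node.operands for leaf in leaves(child)]
    return leaves(node.operand)


# A pointwise add of two tensor arguments.
tree = PointwiseOp("add", (TensorAccess(0), TensorAccess(1)))
print(leaves(tree))  # [TensorAccess(arg_index=0), TensorAccess(arg_index=1)]
```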
Example: an add OpSpec
The device_coordinates on each TensorArg are SymPy expressions over
the iteration variables, not plain integer offsets. Here is the
artifact for an add between two tensors that share an iteration space
with three loop variables: c0 of extent 10 with unit stride, z0 of
extent 50 walking the iteration space at stride 25 (the second value in
each iteration_space entry — for example (sympify('50'), 25)), and
c1 of extent 200 with unit stride:
OpSpec(
    op='add',
    is_reduction=False,
    iteration_space={sympify('c0'): (sympify('10'), 1),
                     sympify('z0'): (sympify('50'), 25),
                     sympify('c1'): (sympify('200'), 1)},
    op_info={},
    args=[
        TensorArg(
            is_input=True, arg_index=0, device_dtype=DataFormats.SEN169_FP16,
            device_size=[10, 4, 50, 64],
            device_coordinates=[sympify('c0'), sympify('floor(c1/64)'),
                                sympify('z0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
        TensorArg(
            is_input=True, arg_index=1, device_dtype=DataFormats.SEN169_FP16,
            device_size=[4, 50, 10, 64],
            device_coordinates=[sympify('floor(c1/64)'), sympify('z0'),
                                sympify('c0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
        TensorArg(
            is_input=False, arg_index=2, device_dtype=DataFormats.SEN169_FP16,
            device_size=[4, 50, 10, 64],
            device_coordinates=[sympify('floor(c1/64)'), sympify('z0'),
                                sympify('c0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
    ],
)
A few things are worth pulling out from this. First, the same iteration
variables (c0, z0, c1) thread through every argument, but each
argument resolves them differently because each tensor sits in a
different device shape. Second, the stick dimension shows up as a pair
of expressions (floor(c1/64) and Mod(c1, 64)), one for the tile
index and one for the intra-stick offset. Third, the two input tensors
end up with different dim orders ([c0, floor(c1/64), z0, Mod(c1, 64)]
vs [floor(c1/64), z0, c0, Mod(c1, 64)]), and that is fine: per-argument
layoutDimOrder_ in the SuperDSC labeledDs_ is independent.
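To see how one iteration point resolves to different device coordinates per argument, the sketch below evaluates the two input coordinate lists from the example with plain integer arithmetic standing in for the SymPy expressions (`c1 // 64` for floor(c1/64), `c1 % 64` for Mod(c1, 64)). The helper names and the dense row-major `linear_offset` are assumptions for illustration, and the sketch ignores the stride attached to z0.

```python
import math

STICK = 64  # stick width used by floor(c1/64) and Mod(c1, 64)


def input0_coords(c0, z0, c1):
    # Dim order of the first input: [c0, floor(c1/64), z0, Mod(c1, 64)]
    return [c0, c1 // STICK, z0, c1 % STICK]


def input1_coords(c0, z0, c1):
    # Dim order of the second input: [floor(c1/64), z0, c0, Mod(c1, 64)]
    return [c1 // STICK, z0, c0, c1 % STICK]


def linear_offset(coords, device_size):
    """Row-major element offset; assumes a dense layout with no padding."""
    offset = 0
    for coord, size in zip(coords, device_size):
        assert 0 <= coord < size, "coordinate outside device_size"
        offset = offset * size + coord
    return offset


point = dict(c0=3, z0=7, c1=130)
a = input0_coords(**point)  # [3, 2, 7, 2]
b = input1_coords(**point)  # [2, 7, 3, 2]
off_a = linear_offset(a, [10, 4, 50, 64])  # 45250
off_b = linear_offset(b, [4, 50, 10, 64])  # 68674

# The device dimension of size 4 is the number of 64-wide sticks covering c1's extent of 200.
sticks = math.ceil(200 / STICK)  # 4
```

The same logical element lands at a different offset in each tensor, which is why every argument carries its own coordinate expressions.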
Why JSON
SuperDSC artifacts have to be diffable and inspectable during development, which is why JSON is the wire format. When an op gives wrong results on a particular core layout, opening the artifact in a text editor and reading that core’s address mapping is usually the fastest path to a diagnosis. JSON also slots cleanly into torch.compile’s artifact cache.
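A minimal sketch of the diff workflow, using only the Python standard library: both artifacts are re-serialized with sorted keys so that the diff reflects content changes rather than key order. The `diff_artifacts` helper, the `fold` key, and the file names are hypothetical; real artifacts would be loaded from disk with `json.load`.

```python
import difflib
import json


def diff_artifacts(old, new, old_name="old", new_name="new"):
    """Return a unified diff of two parsed JSON artifacts as a list of lines."""
    def canon(obj):
        # Sorted keys and fixed indentation give a stable, line-oriented form.
        return json.dumps(obj, indent=2, sort_keys=True).splitlines()

    return list(
        difflib.unified_diff(canon(old), canon(new), old_name, new_name, lineterm="")
    )


# Hypothetical fragments that differ only in one affine stride.
old = {"fold": {"Affine": {"alpha_": 64, "beta_": 0}}}
new = {"fold": {"Affine": {"alpha_": 128, "beta_": 0}}}

diff = diff_artifacts(old, new, "sdsc_0.before.json", "sdsc_0.after.json")
for line in diff:
    print(line)
```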
From SuperDSC to KTIR
SuperDSC was designed to get Torch-Spyre running quickly with an IR that closely matches the hardware model. The team is now transitioning to KernelTile IR (KTIR), an MLIR-based representation that generalizes the concepts SuperDSC introduced (compute tiles, scratchpad staging, compile-time core partitioning) into a community specification aimed at any dataflow accelerator. See RFC 0682 - KTIR Spec.
Invocation
The front-end compiler invokes DeepTools programmatically as part of
the torch.compile pipeline. The binary artifacts are cached by
Inductor’s standard compilation cache.
Further Reading
Inductor Front-End — how the front-end generates SuperDSC
Dataflow Architecture — the hardware model that DeepTools targets