Back-End Compiler (DeepTools)

The back-end compiler is a proprietary component called DeepTools, developed by IBM. It takes the SuperDSC JSON specifications produced by the Torch-Spyre front-end and generates optimized Spyre program binaries.

Responsibilities

The back-end compiler is responsible for:

  • Dataflow mapping — mapping SuperDSC operations to optimized Spyre dataflows and execution patterns

  • Core scheduling — determining the precise execution order and timing of operations across cores

  • Binary generation — producing the executable program binaries loaded onto the Spyre device at runtime

Front-End Artifacts

For every torch.compiled function, the front-end emits two kinds of artifacts:

Artifact

Consumer

Purpose

SuperDSC JSON

DeepTools

Per-kernel operation specification: tensor layouts, work division, OpFunc selection. DeepTools turns each SuperDSC into a device binary that includes both compute and the load/store sequence that stages tiles from LPDDR5 into the LX scratchpad.

DCI (Data Conversion Information)

Runtime (copyAsync)

The DataConversionInfo struct, built by generate_dci() in spyre_mem.cpp from a tensor’s SpyreTensorLayout. It carries loop ranges, host and device strides, and dtype info, and drives the host ↔ LPDDR5 DMA transfer for each graph input and output.

SuperDSC Format

SuperDSC (Super Design Space Config) is a JSON-based intermediate representation that describes the full tile-level compute graph for all 32 Spyre cores. Each artifact is self-contained: it carries everything the hardware needs to execute one scheduled operation deterministically across every core.

Top-level structure

Field

Purpose

coreFoldProp_

How the iteration space is divided across cores (for example {"factor_": 2} for a 2-core split).

numWkSlicesPerDim_

Number of work slices per iteration dimension. {"c0": 2, "c1": 1} says dim c0 is split two ways and dim c1 is not split.

coreIdToWkSlice_

Maps each core ID to the slice indices it owns.

dscs_

Array of DesignSpaceConfig entries, one per compute configuration.

Each dscs_ entry is a complete description of one compute configuration:

Field

Purpose

N_

Full iteration-space extents. {"c0_": 4, "c1_": 64} for a 4×64 op.

dataStageParam_

Per-core dimension sizes for the steady-state (ss_) and epilogue (el_) passes. Tells the runtime how to partition data for transfer into scratchpad.

primaryDsInfo_

Tiling information per logical role (INPUT, KERNEL, OUTPUT, KERNEL_IDX): layoutDimOrder_, stickDimOrder_, stickSize_.

labeledDs_

Tensor descriptors. Each entry pairs a tensor argument with its dsType_ (tiling layout class), dataFormat_ (for example SEN169_FP16), and memOrg_ (HBM or LX residency). The layoutDimOrder_ of each entry is independent: two arguments of the same op can pick different dim orders.

scheduleTree_

Allocate nodes, one per tensor, with memory placement (HBM or LX scratchpad), dimension ordering, per-core start addresses via fold mappings, and coordinate information.

computeOp_

One entry per operation, encoding the execution unit (PT or SFP), op name, data format, fidelity, and input/output tensor references.

Folding and affine transforms

SuperDSC stays compact through folding. A single parameterized artifact can describe behavior across cores, corelets, rows, and time steps without repeating itself. Fold properties use affine transforms of the form alpha * index + beta to compute per-core coordinates and addresses:

{"Affine": {"alpha_": 64, "beta_": 0}}

The result is that one JSON file describes the behavior of all 32 cores.

Note

The hbm field name appearing throughout the SuperDSC IR is a legacy label that refers to device memory in general. Spyre’s actual device memory is LPDDR5, not HBM.

Codegen pipeline (front-end to SuperDSC)

Three components in the front-end collaborate to produce a SuperDSC artifact for each scheduled node:

  1. SpyreKernel (spyre_kernel.py) collects the iteration space from the scheduler and builds an RValue AST that represents the computation. Node types include TensorAccess, PointwiseOp, ReductionOp, and Constant. Leaves are tensor reads or constants; internal nodes are operations.

  2. OpSpec (op_spec.py) wraps the kernel’s output in a structured descriptor: the operation name, the iteration space encoded as SymPy symbolic expressions, tensor arguments annotated with device coordinates (tile index and intra-stick offset), plus any auxiliary information.

  3. generate_sdsc() (codegen/compute_ops.py) takes the OpSpec and emits the final JSON IR. Symbolic expressions are resolved to concrete loop bounds, tiling parameters are expanded, and the scheduleTree_ is assembled. The output is written as JSON (for example sdsc_0.json), which DeepTools then consumes to produce the device binary.

Example: an add OpSpec

The device_coordinates on each TensorArg are SymPy expressions over the iteration variables, not plain integer offsets. Here is the artifact for an add between two tensors that share an iteration space with three loop variables: c0 of extent 10 with unit stride, z0 of extent 50 walking the iteration space at stride 25 (the second value in each iteration_space entry — for example (sympify('50'), 25)), and c1 of extent 200 with unit stride:

OpSpec(
    op='add',
    is_reduction=False,
    iteration_space={sympify('c0'): (sympify('10'), 1),
                     sympify('z0'): (sympify('50'), 25),
                     sympify('c1'): (sympify('200'), 1)},
    op_info={},
    args=[
        TensorArg(
            is_input=True, arg_index=0, device_dtype=DataFormats.SEN169_FP16,
            device_size=[10, 4, 50, 64],
            device_coordinates=[sympify('c0'), sympify('floor(c1/64)'),
                                sympify('z0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
        TensorArg(
            is_input=True, arg_index=1, device_dtype=DataFormats.SEN169_FP16,
            device_size=[4, 50, 10, 64],
            device_coordinates=[sympify('floor(c1/64)'), sympify('z0'),
                                sympify('c0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
        TensorArg(
            is_input=False, arg_index=2, device_dtype=DataFormats.SEN169_FP16,
            device_size=[4, 50, 10, 64],
            device_coordinates=[sympify('floor(c1/64)'), sympify('z0'),
                                sympify('c0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
    ],
)

A few things are worth pulling out from this. First, the same iteration variables (c0, z0, c1) thread through every argument, but each argument resolves them differently because each tensor sits in a different device shape. Second, the stick dimension shows up as a pair of expressions (floor(c1/64) and Mod(c1, 64)), one for the tile index and one for the intra-stick offset. Third, the two input tensors end up with different dim orders ([c0, floor(c1/64), z0, Mod(c1, 64)] vs [floor(c1/64), z0, c0, Mod(c1, 64)]), and that is fine: per-argument layoutDimOrder_ in the SuperDSC labeledDs_ is independent.

Why JSON

SuperDSC artifacts have to be diffable and inspectable during development, which is why JSON is the wire format. When an op gives wrong results on a particular core layout, opening the artifact in a text editor and reading that core’s address mapping is usually the fastest path to a diagnosis. JSON also slots cleanly into torch.compile’s artifact cache.

From SuperDSC to KTIR

SuperDSC was designed to get Torch-Spyre running quickly with an IR that closely matches the hardware model. The team is now transitioning to KernelTile IR (KTIR), an MLIR-based representation that generalizes the concepts SuperDSC introduced (compute tiles, scratchpad staging, compile-time core partitioning) into a community specification aimed at any dataflow accelerator. See RFC 0682 - KTIR Spec.

Invocation

The front-end compiler invokes DeepTools programmatically as part of the torch.compile pipeline. The binary artifacts are cached by Inductor’s standard compilation cache.

Further Reading