Back-End Compiler (DeepTools)

The back-end compiler is a proprietary component called DeepTools, developed by IBM. It takes the SuperDSC JSON specifications produced by the Torch-Spyre front-end and generates optimized Spyre program binaries.

Responsibilities

The back-end compiler is responsible for:

Dataflow mapping — mapping SuperDSC operations to optimized Spyre dataflows and execution patterns
Core scheduling — determining the precise execution order and timing of operations across cores
Binary generation — producing the executable program binaries loaded onto the Spyre device at runtime

Front-End Artifacts

For every function compiled with torch.compile, the front-end emits two kinds of artifacts:

Artifact	Consumer	Purpose
SuperDSC JSON	DeepTools	Per-kernel operation specification: tensor layouts, work division, OpFunc selection. DeepTools turns each SuperDSC into a device binary that includes both compute and the load/store sequence that stages tiles from LPDDR5 into the LX scratchpad.
DCI (Data Conversion Information)	Runtime (`copyAsync`)	The `DataConversionInfo` struct, built by `generate_dci()` in `spyre_mem.cpp` from a tensor’s `SpyreTensorLayout`. It carries loop ranges, host and device strides, and dtype info, and drives the host ↔ LPDDR5 DMA transfer for each graph input and output.

SuperDSC Format

SuperDSC (Super Design Space Config) is a JSON-based intermediate representation that describes the full tile-level compute graph for all 32 Spyre cores. Each artifact is self-contained: it carries everything the hardware needs to execute one scheduled operation deterministically across every core.

Top-level structure

Field	Purpose
`coreFoldProp_`	How the iteration space is divided across cores (for example `{"factor_": 2}` for a 2-core split).
`numWkSlicesPerDim_`	Number of work slices per iteration dimension. `{"c0": 2, "c1": 1}` says dim `c0` is split two ways and dim `c1` is not split.
`coreIdToWkSlice_`	Maps each core ID to the slice indices it owns.
`dscs_`	Array of `DesignSpaceConfig` entries, one per compute configuration.
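
To make the relationship between `numWkSlicesPerDim_` and `coreIdToWkSlice_` concrete, here is a minimal Python sketch. It enumerates slice indices in row-major order and assigns one tuple per core. The helper name `enumerate_work_slices`, the row-major ordering, and the shape of the resulting mapping are assumptions made for illustration; the actual assignment and JSON layout emitted by the front-end may differ.

```python
from itertools import product
from math import prod


def enumerate_work_slices(num_slices_per_dim):
    """Give each core ID one slice index per dimension.

    Hypothetical helper: the row-major ordering is an assumption,
    not the documented behavior of the front-end.
    """
    dims = list(num_slices_per_dim)
    ranges = [range(num_slices_per_dim[d]) for d in dims]
    return {
        core_id: dict(zip(dims, indices))
        for core_id, indices in enumerate(product(*ranges))
    }


# Value quoted in the table above: c0 is split two ways, c1 is not split.
num_wk_slices = {"c0": 2, "c1": 1}
core_to_slice = enumerate_work_slices(num_wk_slices)

# The number of cores in use equals the product of the per-dimension splits.
cores_used = prod(num_wk_slices.values())
print(cores_used, core_to_slice)
```

Under these assumptions the two-way split on `c0` occupies two cores, which lines up with the `{"factor_": 2}` example given for `coreFoldProp_`.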

Each dscs_ entry is a complete description of one compute configuration:

Field	Purpose
`N_`	Full iteration-space extents. `{"c0_": 4, "c1_": 64}` for a 4×64 op.
`dataStageParam_`	Per-core dimension sizes for the steady-state (`ss_`) and epilogue (`el_`) passes. Tells the runtime how to partition data for transfer into scratchpad.
`primaryDsInfo_`	Tiling information per logical role (`INPUT`, `KERNEL`, `OUTPUT`, `KERNEL_IDX`): `layoutDimOrder_`, `stickDimOrder_`, `stickSize_`.
`labeledDs_`	Tensor descriptors. Each entry pairs a tensor argument with its `dsType_` (tiling layout class), `dataFormat_` (for example `SEN169_FP16`), and `memOrg_` (HBM or LX residency). The `layoutDimOrder_` of each entry is independent: two arguments of the same op can pick different dim orders.
`scheduleTree_`	Allocate nodes, one per tensor, with memory placement (HBM or LX scratchpad), dimension ordering, per-core start addresses via fold mappings, and coordinate information.
`computeOp_`	One entry per operation, encoding the execution unit (`PT` or `SFP`), op name, data format, fidelity, and input/output tensor references.

Folding and affine transforms

SuperDSC stays compact through folding. A single parameterized artifact can describe behavior across cores, corelets, rows, and time steps without repeating itself. Fold properties use affine transforms of the form alpha * index + beta to compute per-core coordinates and addresses:

{"Affine": {"alpha_": 64, "beta_": 0}}

The result is that one JSON file describes the behavior of all 32 cores.
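
The affine form above can be evaluated in a few lines of Python. This sketch assumes the fold index is the core ID and that the result is a per-core start offset; both are illustrative assumptions, as is the helper name `apply_affine`. Only the `alpha_` and `beta_` field names and the formula come from the text.

```python
def apply_affine(fold, index):
    """Evaluate alpha * index + beta for one fold entry."""
    affine = fold["Affine"]
    return affine["alpha_"] * index + affine["beta_"]


# The fold entry shown above.
fold = {"Affine": {"alpha_": 64, "beta_": 0}}

# Assumption: the index is the core ID, for the 32 cores mentioned in the text.
starts = [apply_affine(fold, core_id) for core_id in range(32)]
print(starts[:4], starts[-1])
```

A single `(alpha_, beta_)` pair stands in for 32 explicit values, which is the compaction that folding provides.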

Note

The hbm field name appearing throughout the SuperDSC IR is a legacy label that refers to device memory in general. Spyre’s actual device memory is LPDDR5, not HBM.

Codegen pipeline (front-end to SuperDSC)

Three components in the front-end collaborate to produce a SuperDSC artifact for each scheduled node:

SpyreKernel (spyre_kernel.py) collects the iteration space from the scheduler and builds an RValue AST that represents the computation. Node types include TensorAccess, PointwiseOp, ReductionOp, and Constant. Leaves are tensor reads or constants; internal nodes are operations.
OpSpec (op_spec.py) wraps the kernel’s output in a structured descriptor: the operation name, the iteration space encoded as SymPy symbolic expressions, tensor arguments annotated with device coordinates (tile index and intra-stick offset), plus any auxiliary information.
generate_sdsc() (codegen/compute_ops.py) takes the OpSpec and emits the final JSON IR. Symbolic expressions are resolved to concrete loop bounds, tiling parameters are expanded, and the scheduleTree_ is assembled. The output is written as JSON (for example sdsc_0.json), which DeepTools then consumes to produce the device binary.

Example: an `add` OpSpec

The device_coordinates on each TensorArg are SymPy expressions over the iteration variables, not plain integer offsets. The OpSpec below describes an add between two tensors that share an iteration space with three loop variables. Each iteration_space entry maps a variable to an (extent, stride) pair: c0 has extent 10 and unit stride, z0 has extent 50 and stride 25 (written (sympify('50'), 25)), and c1 has extent 200 and unit stride.

OpSpec(
    op='add',
    is_reduction=False,
    iteration_space={sympify('c0'): (sympify('10'), 1),
                     sympify('z0'): (sympify('50'), 25),
                     sympify('c1'): (sympify('200'), 1)},
    op_info={},
    args=[
        TensorArg(
            is_input=True, arg_index=0, device_dtype=DataFormats.SEN169_FP16,
            device_size=[10, 4, 50, 64],
            device_coordinates=[sympify('c0'), sympify('floor(c1/64)'),
                                sympify('z0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
        TensorArg(
            is_input=True, arg_index=1, device_dtype=DataFormats.SEN169_FP16,
            device_size=[4, 50, 10, 64],
            device_coordinates=[sympify('floor(c1/64)'), sympify('z0'),
                                sympify('c0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
        TensorArg(
            is_input=False, arg_index=2, device_dtype=DataFormats.SEN169_FP16,
            device_size=[4, 50, 10, 64],
            device_coordinates=[sympify('floor(c1/64)'), sympify('z0'),
                                sympify('c0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
    ],
)

A few things are worth pulling out from this. First, the same iteration variables (c0, z0, c1) thread through every argument, but each argument resolves them differently because each tensor sits in a different device shape. Second, the stick dimension shows up as a pair of expressions (floor(c1/64) and Mod(c1, 64)), one for the tile index and one for the intra-stick offset. Third, the two input tensors end up with different dim orders ([c0, floor(c1/64), z0, Mod(c1, 64)] vs [floor(c1/64), z0, c0, Mod(c1, 64)]), and that is fine: per-argument layoutDimOrder_ in the SuperDSC labeledDs_ is independent.
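
The coordinate expressions can be checked by hand with plain integer arithmetic. The following sketch re-implements the two input layouts from the OpSpec above using `//` and `%` in place of SymPy's `floor` and `Mod`. The function names and the sample iteration point are hypothetical, chosen only for illustration.

```python
import math

STICK = 64  # divisor used in floor(c1/64) and Mod(c1, 64) above


def arg0_coords(c0, z0, c1):
    """Device coordinates for arg_index=0: [c0, tile, z0, offset]."""
    return [c0, c1 // STICK, z0, c1 % STICK]


def arg1_coords(c0, z0, c1):
    """Device coordinates for arg_index=1: [tile, z0, c0, offset]."""
    return [c1 // STICK, z0, c0, c1 % STICK]


# One arbitrary point in the iteration space.
point = {"c0": 3, "z0": 7, "c1": 130}
first = arg0_coords(**point)
second = arg1_coords(**point)

# An extent of 200 along c1 needs ceil(200 / 64) tiles of 64 elements.
tiles = math.ceil(200 / STICK)
print(first, second, tiles)
```

The same iteration point lands at different device coordinates in each tensor, and the tile count of 4 is consistent with the 4 that appears in both `device_size` lists.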

Why JSON

SuperDSC artifacts have to be diffable and inspectable during development, which is why JSON is the wire format. When an op gives wrong results on a particular core layout, opening the artifact in a text editor and reading that core’s address mapping is usually the fastest path to a diagnosis. JSON also slots cleanly into torch.compile’s artifact cache.
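
As an illustration of the diffing workflow, this sketch compares two artifacts with the Python standard library. The two dictionaries are made-up fragments that reuse field names from the tables above; they are not real compiler output, and `diff_artifacts` is a hypothetical helper.

```python
import difflib
import json


def diff_artifacts(old, new):
    """Return a unified diff of two JSON-serializable artifacts."""
    # Sorted keys and fixed indentation keep the diff stable across runs.
    old_text = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_text = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(old_text, new_text, "before", "after", lineterm="")
    )


before = {"coreFoldProp_": {"factor_": 2}, "numWkSlicesPerDim_": {"c0": 2, "c1": 1}}
after = {"coreFoldProp_": {"factor_": 2}, "numWkSlicesPerDim_": {"c0": 1, "c1": 2}}

changes = diff_artifacts(before, after)
print("\n".join(changes))
```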

From SuperDSC to KTIR

SuperDSC was designed to get Torch-Spyre running quickly with an IR that closely matches the hardware model. The team is now transitioning to KernelTile IR (KTIR), an MLIR-based representation that generalizes the concepts SuperDSC introduced (compute tiles, scratchpad staging, compile-time core partitioning) into a community specification aimed at any dataflow accelerator. See RFC 0682 - KTIR Spec.

Invocation

DeepTools runs out of process, as a subprocess. During scheduling, the generated host code calls async_compile.sdsc(...) (execution/async_compile.py), which runs dxp_standalone -d <output_dir> to turn the SuperDSC JSON into a device binary. (The bundle itself is produced earlier, in Python, by generate_bundle(...).) Each kernel gets its own output directory, created with tempfile.mkdtemp under <cache_dir>/inductor-spyre, so the bundles are stored separately from Inductor’s content-addressed Python/Triton cache.
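
The invocation described above can be sketched roughly as follows. This is not the actual implementation of async_compile.sdsc(...). The function names `make_kernel_dir`, `build_command`, and `compile_sdsc` are hypothetical, and only the `-d <output_dir>` argument is taken from the text; the real call may pass further options or handle errors differently.

```python
import os
import subprocess
import tempfile


def make_kernel_dir(cache_dir):
    """Create a fresh per-kernel directory under <cache_dir>/inductor-spyre."""
    root = os.path.join(cache_dir, "inductor-spyre")
    os.makedirs(root, exist_ok=True)
    return tempfile.mkdtemp(dir=root)


def build_command(output_dir):
    """Build the argument list for the back-end compiler."""
    return ["dxp_standalone", "-d", output_dir]


def compile_sdsc(output_dir):
    """Run the compiler as a subprocess; raises CalledProcessError on failure.

    Requires dxp_standalone on PATH, so it is defined here but not called.
    """
    return subprocess.run(
        build_command(output_dir), check=True, capture_output=True, text=True
    )


# Demonstrate directory creation and command construction only.
cache_dir = tempfile.mkdtemp()
kernel_dir = make_kernel_dir(cache_dir)
command = build_command(kernel_dir)
print(command)
```

Because each call to `make_kernel_dir` returns a new directory, two kernels never share an output location even when their SuperDSC content is identical.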