# Back-End Compiler (DeepTools)

The back-end compiler is a proprietary component called **DeepTools**, developed by IBM. It takes the SuperDSC JSON specifications produced by the Torch-Spyre front-end and generates optimized Spyre program binaries.

## Responsibilities

The back-end compiler is responsible for:

- **Dataflow mapping**: mapping SuperDSC operations to optimized Spyre dataflows and execution patterns
- **Core scheduling**: determining the precise execution order and timing of operations across cores
- **Binary generation**: producing the executable program binaries loaded onto the Spyre device at runtime

## Front-End Artifacts

For every `torch.compile`d function, the front-end emits two kinds of artifacts:

| Artifact | Consumer | Purpose |
|----------|----------|---------|
| **SuperDSC JSON** | DeepTools | Per-kernel operation specification: tensor layouts, work division, OpFunc selection. DeepTools turns each SuperDSC into a device binary that includes both compute and the load/store sequence that stages tiles from LPDDR5 into the LX scratchpad. |
| **DCI (Data Conversion Information)** | Runtime (`copyAsync`) | The `DataConversionInfo` struct, built by `generate_dci()` in `spyre_mem.cpp` from a tensor's `SpyreTensorLayout`. It carries loop ranges, host and device strides, and dtype info, and drives the host ↔ LPDDR5 DMA transfer for each graph input and output. |

## SuperDSC Format

SuperDSC (Super Design Space Config) is a JSON-based intermediate representation that describes the full tile-level compute graph for all 32 Spyre cores. Each artifact is self-contained: it carries everything the hardware needs to execute one scheduled operation deterministically across every core.

### Top-level structure

| Field | Purpose |
|---|---|
| `coreFoldProp_` | How the iteration space is divided across cores (for example `{"factor_": 2}` for a 2-core split). |
| `numWkSlicesPerDim_` | Number of work slices per iteration dimension. `{"c0": 2, "c1": 1}` says dim `c0` is split two ways and dim `c1` is not split. |
| `coreIdToWkSlice_` | Maps each core ID to the slice indices it owns. |
| `dscs_` | Array of `DesignSpaceConfig` entries, one per compute configuration. |

Each `dscs_` entry is a complete description of one compute configuration:

| Field | Purpose |
|---|---|
| `N_` | Full iteration-space extents. `{"c0_": 4, "c1_": 64}` for a 4×64 op. |
| `dataStageParam_` | Per-core dimension sizes for the steady-state (`ss_`) and epilogue (`el_`) passes. Tells the runtime how to partition data for transfer into scratchpad. |
| `primaryDsInfo_` | Tiling information per logical role (`INPUT`, `KERNEL`, `OUTPUT`, `KERNEL_IDX`): `layoutDimOrder_`, `stickDimOrder_`, `stickSize_`. |
| `labeledDs_` | Tensor descriptors. Each entry pairs a tensor argument with its `dsType_` (tiling layout class), `dataFormat_` (for example `SEN169_FP16`), and `memOrg_` (HBM or LX residency). The `layoutDimOrder_` of each entry is independent: two arguments of the same op can pick different dim orders. |
| `scheduleTree_` | Allocate nodes, one per tensor, with memory placement (HBM or LX scratchpad), dimension ordering, per-core start addresses via fold mappings, and coordinate information. |
| `computeOp_` | One entry per operation, encoding the execution unit (`PT` or `SFP`), op name, data format, fidelity, and input/output tensor references. |

### Folding and affine transforms

SuperDSC stays compact through *folding*. A single parameterized artifact can describe behavior across cores, corelets, rows, and time steps without repeating itself. Fold properties use affine transforms of the form `alpha * index + beta` to compute per-core coordinates and addresses:

```json
{"Affine": {"alpha_": 64, "beta_": 0}}
```

The result is that one JSON file describes the behavior of all 32 cores.

:::{note}
The `hbm` field name appearing throughout the SuperDSC IR is a legacy label that refers to device memory in general.
Spyre's actual device memory is LPDDR5, not HBM.
:::

### Codegen pipeline (front-end to SuperDSC)

Three components in the front-end collaborate to produce a SuperDSC artifact for each scheduled node:

1. **`SpyreKernel`** ([`spyre_kernel.py`](https://github.com/torch-spyre/torch-spyre/blob/main/torch_spyre/_inductor/spyre_kernel.py)) collects the iteration space from the scheduler and builds an RValue AST that represents the computation. Node types include `TensorAccess`, `PointwiseOp`, `ReductionOp`, and `Constant`. Leaves are tensor reads or constants; internal nodes are operations.

2. **`OpSpec`** ([`op_spec.py`](https://github.com/torch-spyre/torch-spyre/blob/main/torch_spyre/_inductor/op_spec.py)) wraps the kernel's output in a structured descriptor: the operation name, the iteration space encoded as [SymPy](https://www.sympy.org/) symbolic expressions, tensor arguments annotated with device coordinates (tile index and intra-stick offset), plus any auxiliary information.

3. **`generate_sdsc()`** ([`codegen/compute_ops.py`](https://github.com/torch-spyre/torch-spyre/blob/main/torch_spyre/_inductor/codegen/compute_ops.py)) takes the `OpSpec` and emits the final JSON IR. Symbolic expressions are resolved to concrete loop bounds, tiling parameters are expanded, and the `scheduleTree_` is assembled. The output is written as JSON (for example `sdsc_0.json`), which DeepTools then consumes to produce the device binary.

### Example: an `add` OpSpec

The `device_coordinates` on each `TensorArg` are SymPy expressions over the iteration variables, not plain integer offsets.
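Evaluated at a concrete iteration index, such an expression yields an ordinary integer coordinate. The sketch below mimics, in plain Python rather than SymPy, the pair `floor(c1/64)` and `Mod(c1, 64)` that appears in this section's example. The helper names `split_stick_dim` and `num_tiles` are hypothetical and not part of Torch-Spyre, and the stick size of 64 is taken from the expressions in the example rather than from any hardware specification.

```python
STICK_SIZE = 64  # assumed stick width, taken from the 64 in the example's expressions


def split_stick_dim(c1: int, stick_size: int = STICK_SIZE) -> tuple[int, int]:
    """Return (tile index, intra-stick offset) for a logical index c1.

    For non-negative c1 this matches floor(c1/stick_size) and Mod(c1, stick_size).
    """
    return divmod(c1, stick_size)


def num_tiles(extent: int, stick_size: int = STICK_SIZE) -> int:
    """Number of sticks needed to cover `extent` elements (ceiling division)."""
    return -(-extent // stick_size)


# A few logical positions along a dimension of extent 200.
coords = [split_stick_dim(i) for i in (0, 63, 64, 199)]
print(coords)          # [(0, 0), (0, 63), (1, 0), (3, 7)]
print(num_tiles(200))  # 4
```

An extent of 200 needs four sticks of 64, which is consistent with the `4` and `64` entries in the `device_size` lists of the example.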
Here is the artifact for an `add` between two tensors that share an iteration space with three loop variables: `c0` of extent 10 with unit stride, `z0` of extent 50 walking the iteration space at stride 25 (the second value in each `iteration_space` entry, for example `(sympify('50'), 25)`), and `c1` of extent 200 with unit stride:

```python
OpSpec(
    op='add',
    is_reduction=False,
    iteration_space={
        sympify('c0'): (sympify('10'), 1),
        sympify('z0'): (sympify('50'), 25),
        sympify('c1'): (sympify('200'), 1),
    },
    op_info={},
    args=[
        TensorArg(
            is_input=True,
            arg_index=0,
            device_dtype=DataFormats.SEN169_FP16,
            device_size=[10, 4, 50, 64],
            device_coordinates=[sympify('c0'), sympify('floor(c1/64)'), sympify('z0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
        TensorArg(
            is_input=True,
            arg_index=1,
            device_dtype=DataFormats.SEN169_FP16,
            device_size=[4, 50, 10, 64],
            device_coordinates=[sympify('floor(c1/64)'), sympify('z0'), sympify('c0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
        TensorArg(
            is_input=False,
            arg_index=2,
            device_dtype=DataFormats.SEN169_FP16,
            device_size=[4, 50, 10, 64],
            device_coordinates=[sympify('floor(c1/64)'), sympify('z0'), sympify('c0'), sympify('Mod(c1, 64)')],
            allocation={},
        ),
    ],
)
```

A few things are worth pulling out from this. First, the same iteration variables (`c0`, `z0`, `c1`) thread through every argument, but each argument resolves them differently because each tensor sits in a different device shape. Second, the stick dimension shows up as a pair of expressions (`floor(c1/64)` and `Mod(c1, 64)`), one for the tile index and one for the intra-stick offset. Third, the two input tensors end up with different dim orders (`[c0, floor(c1/64), z0, Mod(c1, 64)]` vs `[floor(c1/64), z0, c0, Mod(c1, 64)]`), and that is fine: per-argument `layoutDimOrder_` in the SuperDSC `labeledDs_` is independent.

### Why JSON

SuperDSC artifacts have to be diffable and inspectable during development, which is why JSON is the wire format.
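Because the artifact is plain JSON, standard tooling is enough to read a fold mapping. The following is a minimal sketch, assuming a fold entry shaped like the `Affine` example earlier on this page and treating the index as a core ID. Both the helper name `affine_value` and the choice of four cores are illustrative assumptions; a real artifact nests such entries inside larger structures that are not shown here.

```python
import json

# Hypothetical fragment: only the {"Affine": {"alpha_", "beta_"}} shape
# comes from this page; everything around it is omitted.
fold_json = '{"Affine": {"alpha_": 64, "beta_": 0}}'


def affine_value(fold: dict, index: int) -> int:
    """Evaluate alpha * index + beta for one fold entry."""
    affine = fold["Affine"]
    return affine["alpha_"] * index + affine["beta_"]


fold = json.loads(fold_json)

# Treat the index as a core ID (an assumption) and list the first four values.
starts = {core_id: affine_value(fold, core_id) for core_id in range(4)}
print(starts)  # {0: 0, 1: 64, 2: 128, 3: 192}
```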
When an op gives wrong results on a particular core layout, opening the artifact in a text editor and reading that core's address mapping is usually the fastest path to a diagnosis. JSON also slots cleanly into `torch.compile`'s artifact cache.

### From SuperDSC to KTIR

SuperDSC was designed to get Torch-Spyre running quickly with an IR that closely matches the hardware model. The team is now transitioning to KernelTile IR (KTIR), an MLIR-based representation that generalizes the concepts SuperDSC introduced (compute tiles, scratchpad staging, compile-time core partitioning) into a community specification aimed at any dataflow accelerator. See [RFC 0682 - KTIR Spec](https://github.com/torch-spyre/rfcs/blob/main/0682-KtirSpec/0682-KtirSpecRFC.md).

## Invocation

The front-end compiler invokes DeepTools programmatically as part of the `torch.compile` pipeline. The binary artifacts are cached by Inductor's standard compilation cache.

## Further Reading

- [Inductor Front-End](inductor_frontend.md): how the front-end generates SuperDSC
- [Dataflow Architecture](../architecture/dataflow_architecture.md): the hardware model that DeepTools targets