# KTIR (Kernel Tile IR)

KTIR is an MLIR dialect for tiled, multi-core accelerator kernels. It extends torch-spyre's existing SuperDSC IR into a community specification for any dataflow accelerator with scratchpad memory and compile-time core partitioning. The dialect is the planned successor to SuperDSC in the torch-spyre compilation pipeline.

:::{admonition} Status
:class: note

The specification is published as [RFC 0682](https://github.com/torch-spyre/rfcs/blob/main/0682-KtirSpec/0682-KtirSpecRFC.md) (merged March 2026). Two open-source companion projects implement the dialect today, both Apache-2.0:

- [torch-spyre/ktir-cpu](https://github.com/torch-spyre/ktir-cpu): CPU interpreter and validator. The README describes it as an experimental research prototype that implements a subset of RFC 0682.
- [torch-spyre/ktir-mlir-frontend](https://github.com/torch-spyre/ktir-mlir-frontend): MLIR parser and Python bindings (`mlir_ktdp`).

The torch-spyre production path still goes through SuperDSC. KTIR adoption is incremental: the spec is stable, the reference interpreter is available, and the backend lowering path is in development.
:::

## Role in the compilation pipeline

KTIR sits between the torch-spyre Inductor frontend and the backend compiler:

```text
PyTorch model
    │
    ▼  torch.compile, Spyre Inductor backend
    │
FX graph → ATen IR → LoopLevel IR
    │
    ▼  emit KTIR-shaped kernels
    │
KTIR (this dialect)
    │
    ▼  backend lowering
    │
hardware binaries
```

For the current production path see [Inductor Frontend](inductor_frontend.md) (emits SuperDSC) and [Backend Compiler](backend.md) (consumes SuperDSC).

## Three-step memory access pattern

The defining design choice in KTIR is decoupling memory access into three explicit steps. Each step is a separate op in the KTDP dialect:

| Step | Op | What it does |
|---|---|---|
| 1. Describe layout | `ktdp.construct_memory_view` | Names a memory region with sizes, strides, coordinate set, and memory space |
| 2. Address a tile | `ktdp.construct_access_tile` | Selects which slice of the view this core touches |
| 3. Move data | `ktdp.load`, `ktdp.store` | Transfers data between the tile and a tensor SSA value |

The separation lets the compiler reason about memory layout, work division, and data movement independently.

Spyre's hardware exposes HBM and per-core LX scratchpad as distinct memory spaces, so each `construct_memory_view` carries an explicit `#ktdp.spyre_memory_space` attribute that names the space the view lives in. A `construct_distributed_memory_view` variant covers the case where a tensor is split across many per-core scratchpad slices instead of sitting in a single HBM region.

## Worked example: 1D element-wise add

A 1024-element vector add over 32 cores looks like this in KTDP. Each core picks up a 32-element slice based on its grid coordinate:

```
func.func @add(%A: index, %B: index, %Out: index) attributes {grid = [32]} {
  %c32 = arith.constant 32 : index
  %id = ktdp.get_compute_tile_id : index
  %off = arith.muli %id, %c32 : index

  %A_view = ktdp.construct_memory_view %A, sizes:[1024], strides:[1] {
    coordinate_set = affine_set<(d0): (0 <= d0, d0 <= 1023)>,
    memory_space = #ktdp.spyre_memory_space
  } : memref<1024xf16>

  %A_tile = ktdp.construct_access_tile %A_view[%off] {
    access_tile_set = affine_set<(d0): (0 <= d0, d0 <= 31)>,
    access_tile_order = affine_map<(d0) -> (d0)>
  } -> !ktdp.access_tile<32xindex>

  %a = ktdp.load %A_tile : !ktdp.access_tile<32xindex> -> tensor<32xf16>

  // ... construct B_view and B_tile, then:
  // %s = arith.addf %a, %b : tensor<32xf16>
  // ktdp.store %s, %Out_tile : tensor<32xf16>, !ktdp.access_tile<32xindex>
  return
}
```

The 32 cores execute the same function body in parallel. `get_compute_tile_id` returns each core's grid coordinate, and `construct_access_tile` uses that coordinate to select the per-core slice of the view. Partitioning is fixed at compile time; there is no runtime block dispatcher.
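
To make the three-step pattern and the per-core partitioning concrete, the kernel's behaviour can be imitated in plain NumPy. This is only an illustrative sketch, not the ktir-cpu API: the helpers `construct_memory_view`, `construct_access_tile`, `load`, `store`, `add_kernel`, and `run_grid` are hypothetical names chosen to mirror the KTDP ops, memory spaces are ignored, and the 32 cores are simulated by a sequential loop rather than run in parallel.

```python
import numpy as np

GRID = 32           # number of cores, mirrors `grid = [32]`
N = 1024            # elements per vector
TILE = N // GRID    # 32 elements handled by each core


def construct_memory_view(buffer, size, stride):
    """Step 1 (hypothetical): describe a strided region of a flat buffer.

    Returns a NumPy view, so writes through it reach the buffer.
    """
    return buffer[: (size - 1) * stride + 1 : stride]


def construct_access_tile(offset, tile_size):
    """Step 2 (hypothetical): pick the slice one core touches. No data moves."""
    return slice(offset, offset + tile_size)


def load(view, tile):
    """Step 3a (hypothetical): copy the tile out of the view into a value."""
    return view[tile].copy()


def store(value, view, tile):
    """Step 3b (hypothetical): write a value back into the tile of the view."""
    view[tile] = value


def add_kernel(core_id, a_buf, b_buf, out_buf):
    """Body executed by every core; only `core_id` differs between cores."""
    off = core_id * TILE  # same arithmetic as %off = %id * 32
    a_view = construct_memory_view(a_buf, N, 1)
    b_view = construct_memory_view(b_buf, N, 1)
    out_view = construct_memory_view(out_buf, N, 1)
    tile = construct_access_tile(off, TILE)
    a = load(a_view, tile)
    b = load(b_view, tile)
    store(a + b, out_view, tile)


def run_grid():
    """Run the kernel once per core, sequentially, on random float16 data."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal(N).astype(np.float16)
    b = rng.standard_normal(N).astype(np.float16)
    out = np.zeros(N, dtype=np.float16)
    for core_id in range(GRID):
        add_kernel(core_id, a, b, out)
    return a, b, out
```

Calling `run_grid()` returns the two inputs and the output, and the output should match `a + b` element for element. Because each core's offset is derived purely from its id and a compile-time tile size, the slices are disjoint and cover the whole vector without any runtime scheduling.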
## ktir-cpu reference interpreter

[ktir-cpu](https://github.com/torch-spyre/ktir-cpu) parses KTDP MLIR, executes it with NumPy on a simulated multi-core grid, and produces correctness output plus optional roofline latency estimates. Two parser frontends are available:

- **Regex parser** for rapid iteration without LLVM dependencies.
- **MLIR frontend** through `mlir_ktdp` (from [ktir-mlir-frontend](https://github.com/torch-spyre/ktir-mlir-frontend)) for strict LLVM 22 conformance.

Both feed the same interpreter, so a kernel that runs through one runs through the other.

The interpreter targets RFC 0682 but does not yet implement every KTDP op. Conformance gaps are tracked as `xfail(strict=True)` tests under `tests/test_spec_gaps.py`. Because the markers are strict, an unexpected pass fails the test run, which signals that a gap has been closed and that the marker should be removed so the test becomes a regular one. The full gap analysis is at `docs/gap_analysis.md` in the ktir-cpu repository.

ktir-cpu also supports AI-driven compiler development: a frontend pass can emit candidate kernels, run them through the interpreter, and score them using the correctness output and the latency report. Determinism and CPU-only execution make this feedback loop practical.

## Why an MLIR dialect

The following constraints, drawn from RFC 0682, shape KTIR's design:

- **Tiled, persistent cores.** Spyre cores are persistent and partitioned at compile time. The dialect models this with a fixed `grid` attribute and a per-core access tile. There is no GPU-style thread-block dispatch.
- **Explicit scratchpad.** Per-core LX is small, and the compiler manages its allocation (there is no hardware cache). KTIR describes staged transfers explicitly through the three-step access pattern instead of relying on an implicit cache hierarchy.
- **Cross-stack reuse.** MLIR provides existing dialects (`arith`, `math`, `linalg`, `scf`) for the inner kernel body. KTDP only adds the Spyre-specific access primitives.
- **Multiple frontends.** A community spec lets multiple compilers target the same dialect. The torch-spyre Inductor backend is the primary consumer today.

## See also

- [KTIR Specification (RFC 0682)](https://github.com/torch-spyre/rfcs/blob/main/0682-KtirSpec/0682-KtirSpecRFC.md)
- [torch-spyre/ktir-cpu](https://github.com/torch-spyre/ktir-cpu): CPU interpreter and validator
- [torch-spyre/ktir-mlir-frontend](https://github.com/torch-spyre/ktir-mlir-frontend): MLIR parser and Python bindings
- [Inductor Frontend](inductor_frontend.md): current source of compiled kernels
- [Backend Compiler](backend.md): current target of compiled kernels