KTIR (Kernel Tile IR)

KTIR is an MLIR dialect for tiled, multi-core accelerator kernels. It extends torch-spyre’s existing SuperDSC IR into a community specification for any dataflow accelerator with scratchpad memory and compile-time core partitioning. The dialect is the planned successor to SuperDSC in the torch-spyre compilation pipeline.

Status

The specification is published as RFC 0682 (merged March 2026). Two open-source companion projects implement the dialect today, both Apache-2.0:

torch-spyre/ktir-cpu — CPU interpreter and validator. The README describes it as an experimental research prototype that implements a subset of RFC 0682.
torch-spyre/ktir-mlir-frontend — MLIR parser and Python bindings (mlir_ktdp).

The torch-spyre production path still goes through SuperDSC. KTIR adoption is incremental: the spec is stable, the reference interpreter is up, and the backend lowering path is in development.

Role in the compilation pipeline

KTIR sits between the torch-spyre Inductor front-end and the backend compiler:

PyTorch model
    │
    ▼  torch.compile, Spyre Inductor backend
    │
FX graph → ATen IR → LoopLevel IR
    │
    ▼  emit KTIR-shaped kernels
    │
KTIR  (this dialect)
    │
    ▼  backend lowering
    │
hardware binaries

For the current production path see Inductor Frontend (emits SuperDSC) and Backend Compiler (consumes SuperDSC).

Three-step memory access pattern

The defining design choice in KTIR is decoupling memory access into three explicit steps. Each step is a separate op in the KTDP dialect:

Step	Op	What it does
1. Describe layout	`ktdp.construct_memory_view`	Names a memory region with sizes, strides, coordinate set, memory space
2. Address a tile	`ktdp.construct_access_tile`	Selects which slice of the view this core touches
3. Move data	`ktdp.load`, `ktdp.store`	Transfers between the tile and a tensor SSA value

The separation lets the compiler reason about memory layout, work division, and data movement independently. Spyre’s hardware exposes HBM and per-core LX scratchpad as distinct memory spaces, so each construct_memory_view carries an explicit #ktdp.spyre_memory_space<HBM> or <LX> attribute. A construct_distributed_memory_view variant covers the case where a tensor is split across many per-core scratchpad slices instead of sitting in a single HBM region.

Worked example: 1D element-wise add

A 1024-element vector add over 32 cores looks like this in KTDP. Each core picks up a 32-element slice based on its grid coordinate:

func.func @add(%A: index, %B: index, %Out: index)
    attributes {grid = [32]} {
  %c32 = arith.constant 32 : index
  %id  = ktdp.get_compute_tile_id : index
  %off = arith.muli %id, %c32 : index

  %A_view = ktdp.construct_memory_view %A, sizes:[1024], strides:[1] {
    coordinate_set = affine_set<(d0): (0 <= d0, d0 <= 1023)>,
    memory_space   = #ktdp.spyre_memory_space<HBM>
  } : memref<1024xf16>

  %A_tile = ktdp.construct_access_tile %A_view[%off] {
    access_tile_set   = affine_set<(d0): (0 <= d0, d0 <= 31)>,
    access_tile_order = affine_map<(d0) -> (d0)>
  } -> !ktdp.access_tile<32xindex>

  %a = ktdp.load %A_tile : !ktdp.access_tile<32xindex> -> tensor<32xf16>
  // ... construct B_view and B_tile, then:
  // %s = arith.addf %a, %b : tensor<32xf16>
  // ktdp.store %s, %Out_tile : tensor<32xf16>, !ktdp.access_tile<32xindex>
  return
}

The 32 cores execute the same function body in parallel. get_compute_tile_id returns each core’s grid coordinate, and construct_access_tile uses that coordinate to select the per-core slice of the view. Partitioning is fixed at compile time. There is no runtime block dispatcher.

ktir-cpu reference interpreter

ktir-cpu parses KTDP MLIR, executes it with NumPy on a simulated multi-core grid, and produces correctness output plus optional roofline latency estimates. Two parser frontends are available:

Regex parser for rapid iteration without LLVM dependencies.
MLIR frontend through mlir_ktdp (from ktir-mlir-frontend) for strict LLVM 22 conformance.

Both feed the same interpreter, so a kernel that runs through one runs through the other.

The interpreter targets RFC 0682 but does not yet implement every KTDP op. Conformance gaps are tracked as xfail(strict=True) tests under tests/test_spec_gaps.py. An unexpected pass on one of those tests signals that a gap has been closed and the marker should be promoted to a regular test. The full gap analysis is at docs/gap_analysis.md in the ktir-cpu repository.

ktir-cpu also supports AI-driven compiler development: a frontend pass can emit candidate kernels, run them through the interpreter, and use correctness output and the latency report to score them. Determinism and CPU-only execution make this feedback loop practical.

Why an MLIR dialect

The constraints that shape KTIR’s design, drawn from RFC 0682:

Tiled, persistent cores. Spyre cores are persistent and partitioned at compile time. The dialect models this with a fixed grid attribute and a per-core access tile. There is no GPU-style thread-block dispatch.
Explicit scratchpad. Per-core LX is small, and the compiler manages its allocation (there is no hardware cache). KTIR describes staged transfers explicitly through the three-step access pattern instead of relying on an implicit cache hierarchy.
Cross-stack reuse. MLIR provides existing dialects (arith, math, linalg, scf) for the inner kernel body. KTDP only adds the Spyre-specific access primitives.
Multiple frontends. A community spec lets multiple compilers target the same dialect. The torch-spyre Inductor backend is the primary consumer today.