Key concepts

This page introduces the terms and ideas that the rest of the documentation assumes you have seen at least once: dataflow execution, sticks and tiled tensors, the LX scratchpad, the eager and compiled paths, graph breaks, and the four-layer op coverage strategy. Each section is short on purpose. Cross-references at the end of each section point to the deeper treatment of each topic.

For the full design narrative, see How Torch-Spyre works. For a one-line definition of a specific term, see the glossary.


1. Execution model

A GPU executes thousands of threads in lock-step on different data: the Single Instruction, Multiple Threads (SIMT) model. Spyre instead executes a dataflow graph: each operation fires as soon as its inputs are ready, and the schedule is fixed at compile time. There is no runtime thread dispatcher and no hardware cache, so execution latency is deterministic.

GPU latency profile compared to Spyre's flat, deterministic latency

Illustrative comparison of per-step latency. GPU execution sees jitter from thread scheduling, cache evictions, and dynamic dispatch. The compiler-planned dataflow on Spyre produces a flat latency profile for the same model.

Two operations firing in parallel as soon as their inputs become available

Dataflow firing rule: an operation runs as soon as all of its inputs are available. Two independent branches execute in parallel; a join node waits for both before firing.
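The firing rule can be sketched as a small host-side simulation. This is a toy model, not the Spyre scheduler: the function name run_dataflow and the graph encoding are assumptions made for illustration, and on real hardware the firing order is fixed at compile time rather than discovered at run time.

```python
from collections import deque

def run_dataflow(graph, ops, sources):
    """Fire each node once all of its inputs are available.

    graph:   node -> list of input node names (empty for sources)
    ops:     node -> function taking the input values in order
    sources: node -> initial value for nodes that have no inputs
    Returns (values, firing_order).
    """
    values = dict(sources)
    # Number of inputs each non-source node is still waiting for.
    pending = {n: len(deps) for n, deps in graph.items() if n not in values}
    consumers = {}
    for node, deps in graph.items():
        for dep in deps:
            consumers.setdefault(dep, []).append(node)

    ready = deque(values)   # nodes whose outputs just became available
    order = []
    while ready:
        done = ready.popleft()
        for node in consumers.get(done, []):
            pending[node] -= 1
            if pending[node] == 0:   # every input is present: fire
                values[node] = ops[node](*(values[d] for d in graph[node]))
                order.append(node)
                ready.append(node)
    return values, order

# Two independent branches (a, b) feed a join node.
graph = {"x": [], "a": ["x"], "b": ["x"], "join": ["a", "b"]}
ops = {"a": lambda x: x + 1, "b": lambda x: x * 2, "join": lambda a, b: a + b}
values, order = run_dataflow(graph, ops, {"x": 3})
print(order, values["join"])
```

The join node fires only after both branches have produced a value, which is the behavior the figure describes.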

Decisions that a GPU runtime makes dynamically (which core runs what, when data moves, where it resides) are made ahead of time by the Torch-Spyre compiler. See Dataflow Accelerator Architecture for the full treatment.


2. Hardware

A Spyre card has 32 cores. Each core has 2 corelets that share a 2 MB LX scratchpad. Inside each corelet is an 8×8 PE array (systolic, used for matmul-style compute on the PT unit) and a 1D SFU/SFP vector unit (used for non-linear ops such as GELU and softmax). Cores connect via a bi-directional ring at 128 B per cycle per direction.

A single Spyre core with two corelets sharing the LX scratchpad

One Spyre core. The two corelets each have a PE array and an SFU, sharing the 2 MB LX scratchpad. The card has 32 cores connected by a ring.

The SENCORES setting controls how many cores the compiler targets. It defaults to 32 and can be lowered for debugging through the SENCORES environment variable. The default dtype is torch.float16.
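The figures in this section multiply out as follows. The sketch below is plain arithmetic on the numbers quoted above. The helper name card_summary is hypothetical, "2 MB" is assumed to mean 2 MiB, and reading SENCORES with os.environ only illustrates the idea; it is not how Torch-Spyre itself parses the variable.

```python
import os

CORES_PER_CARD = 32                    # from this page
CORELETS_PER_CORE = 2
PE_ROWS, PE_COLS = 8, 8                # PE array per corelet
LX_BYTES_PER_CORE = 2 * 1024 * 1024    # assumption: 2 MB means 2 MiB

def card_summary(cores=CORES_PER_CARD):
    """Aggregate resources for a given number of targeted cores."""
    corelets = cores * CORELETS_PER_CORE
    return {
        "cores": cores,
        "corelets": corelets,
        "pes": corelets * PE_ROWS * PE_COLS,
        "lx_mib_total": cores * LX_BYTES_PER_CORE // (1024 * 1024),
    }

# Illustrative only: fall back to 32 when SENCORES is not set.
cores = int(os.environ.get("SENCORES", CORES_PER_CARD))
print(card_summary(cores))
```

With all 32 cores targeted this gives 64 corelets, 4096 PEs, and 64 MiB of LX in aggregate. Lowering the core count shrinks all three proportionally.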


3. Memory hierarchy

Spyre has two memory tiers. LPDDR5 is 128 GB of off-chip device memory, equivalent in role to a GPU’s HBM. The LX scratchpad is 2 MB of on-core SRAM. There is no hardware cache. The compiler decides which tensors reside in LX at each point in the computation and emits explicit load/store instructions to move data.

Spyre memory hierarchy showing LPDDR5 device memory and per-core LX scratchpad

Data moves between 128 GB of LPDDR5 and the 2 MB per-core LX scratchpad under explicit compiler control. There is no hardware cache.

Two sizing constraints matter for users:

  • 2 MB LX scratchpad per core. Working sets that exceed this are staged in tiles. See Scratchpad Planning.

  • Per-core addressable device memory limit. This is a separate hardware address-space constraint, not the 2 MB LX size. Work division must keep each core’s footprint under this limit.
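To get a feel for the first constraint, the sketch below estimates a lower bound on how many LX-sized pieces a single fp16 tensor needs on one core. The helper min_stages is hypothetical. It assumes 2 MB means 2 MiB and that the whole scratchpad is available to one tensor, which the real scratchpad planner does not guarantee because LX is shared among inputs, outputs, and intermediates.

```python
import math

LX_BYTES = 2 * 1024 * 1024   # per-core LX scratchpad (assumption: 2 MiB)
FP16_BYTES = 2

def min_stages(shape, lx_budget=LX_BYTES, elem_bytes=FP16_BYTES):
    """Return (tensor size in bytes, lower bound on LX-sized stages)."""
    nbytes = math.prod(shape) * elem_bytes
    return nbytes, math.ceil(nbytes / lx_budget)

print(min_stages((1024, 256)))    # 512 KiB: fits in LX in one piece
print(min_stages((4096, 4096)))   # 32 MiB: needs at least 16 pieces
```

Multi-core work division reduces the per-core share, so the actual tile count depends on both passes.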

Note

The SuperDSC IR has a legacy field named hbm that refers to device memory in general. On Spyre that memory is LPDDR5, not HBM.


4. Sticks and tiled tensors

The unit of memory transfer on Spyre is a stick: 128 B aligned, 64 fp16 elements (BYTES_IN_STICK = 128). A stick matches the granularity of a load between LPDDR5 and LX, so each transfer moves a full stick of contiguous data.

Tensors on Spyre are therefore not stored the way PyTorch describes them. A (1024, 256) fp16 tensor is physically four tiles of 64-element sticks: (4, 1024, 64) on the device. The element at position [i, 63] and the element at [i, 64] are not one stride apart. They sit in different tiles.

A (1024, 256) host tensor reshaped into a (4, 1024, 64) tiled device layout

A (1024, 256) tensor on the host becomes a (4, 1024, 64) tiled structure on the device. The stride breaks at every tile boundary, so the layout cannot be expressed as a single integer stride per dimension.
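The index arithmetic behind this layout can be written out directly. The sketch below maps a host index [i, j] to a device index [tile, i, k] for the (1024, 256) example. The function names are hypothetical, and the linear offset assumes the (4, 1024, 64) device shape is stored row-major, which is an assumption for illustration and not a statement about SpyreTensorLayout.

```python
STICK_ELEMS = 64   # fp16 elements per 128 B stick

def host_to_device_index(i, j, stick=STICK_ELEMS):
    """Map host index [i, j] to device index [tile, i, k]."""
    return (j // stick, i, j % stick)

def device_offset(i, j, rows=1024, stick=STICK_ELEMS):
    """Linear element offset, assuming a row-major (tiles, rows, stick) layout."""
    tile, r, k = host_to_device_index(i, j, stick)
    return (tile * rows + r) * stick + k

print(host_to_device_index(5, 63))   # last element of tile 0 in row 5
print(host_to_device_index(5, 64))   # first element of tile 1 in row 5
print(device_offset(5, 64) - device_offset(5, 63))
```

Under these assumptions the two neighboring host elements land 65473 elements apart on the device, while elements inside one stick stay adjacent. No single per-dimension stride produces both distances.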

PyTorch’s (size, stride) model cannot describe this layout, so Torch-Spyre introduces FixedTiledLayout, a subclass of Inductor’s FixedLayout that carries a SpyreTensorLayout descriptor with the device-side shape and a host-to-device stride map. Two compiler operations manage this layout:

  • Stickification is the transformation from a host-strided layout to a tiled device layout. It runs during layout propagation.

  • Restickification (spyre::restickify) is an explicit re-tile the compiler inserts when two adjacent ops disagree on tile structure.

For the full reference, see Tensor Layouts.


5. Eager vs compiled path

A PyTorch program reaches Spyre on one of two paths. Which path a given line of code takes determines its performance.

Eager path. When you write x.to("spyre") or torch.add(x, y) on Spyre tensors, PyTorch’s dispatcher routes each op to a Torch-Spyre C++ kernel registered against the PrivateUse1 device key. Each op runs immediately. The result is correct but slow: there is no fusion, no shared scratchpad reuse across ops, and many ops fall back to CPU.

PyTorch dispatcher routing a Spyre tensor op to the registered Spyre kernel

The eager path: the PyTorch dispatcher looks up the SPYRE entry in its dispatch table for each op and calls the registered Spyre kernel.
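The per-op lookup can be pictured with a toy dispatch table. This is not PyTorch's dispatcher and not the Torch-Spyre registration code: the ToyDispatcher class and the string device names are assumptions that only illustrate the lookup and the CPU fallback described above.

```python
import math

class ToyDispatcher:
    """Toy (op, device) -> kernel table; not PyTorch's real dispatcher."""

    def __init__(self):
        self.table = {}

    def register(self, op, device, kernel):
        self.table[(op, device)] = kernel

    def call(self, op, device, *args):
        """Return (device that ran the op, result)."""
        kernel = self.table.get((op, device))
        if kernel is not None:
            return device, kernel(*args)
        # No kernel registered for this device: use the CPU implementation.
        return "cpu", self.table[(op, "cpu")](*args)

d = ToyDispatcher()
d.register("add", "cpu", lambda a, b: a + b)
d.register("add", "spyre", lambda a, b: a + b)   # stands in for a C++ kernel
d.register("sin", "cpu", math.sin)

print(d.call("add", "spyre", 1.0, 2.0))   # runs on "spyre"
print(d.call("sin", "spyre", 0.0))        # falls back to "cpu"
```

Each call is resolved and executed on its own, which is why the eager path cannot fuse ops or reuse scratchpad contents between them.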

Compiled path. When you wrap a model with torch.compile(model, backend="spyre"), the FX graph passes through the Torch-Spyre Inductor backend, which runs layout propagation, work division, and scratchpad planning, then emits a SuperDSC artifact that the Deeptools backend turns into a device binary.

import torch
import torch_spyre  # registers the device

model = ...                     # any nn.Module
x = ...                         # an example input tensor, created on the host
model = model.to("spyre")
compiled = torch.compile(model, backend="spyre")
out = compiled(x.to("spyre"))   # this is the fast path

If you only want to test that something runs, the eager path is fine. For performance you must reach the compiled path. All Spyre-specific optimizations (tiled layouts, multi-core work division, LX planning) are implemented there.


6. Graph breaks

Inside a torch.compile-d region, anything Inductor cannot lower forces a graph break: the compiled graph stops, the partial result round-trips to the CPU, the unsupported op runs there, and the data comes back. A single graph break in the hot path removes the performance gains from the surrounding compiled code.

The most common cause is a missing op. Torch-Spyre handles ops in four layers, in priority order:

Four-layer op coverage strategy on Spyre

Op coverage on Spyre. ATen ops are decomposed into native ops or custom ops; custom ops lower to SuperDSC; everything else falls back to the CPU.

  1. Native ops: ATen ops that Deeptools supports directly (pointwise ops, mm, bmm).

  2. Custom ops: Spyre-specific ops registered via torch.library.custom_op (e.g. spyre::rms_norm, spyre::layer_norm, spyre::gelu).

  3. Decompositions: FX rewrites that turn an ATen op into a sequence of native or custom ops (e.g. aten.addmm → matmul + scale + add).

  4. CPU fallback: automatic transfer to the CPU for the long tail (embedding, arange, sin, cos, tril, triu, …). It is transparent, but acceptable only off the hot path.
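As a worked example of the decomposition layer, aten.addmm computes beta * input + alpha * (mat1 @ mat2), which can be rebuilt from a matmul, a scale, and an add. The sketch below uses plain Python lists so the arithmetic is visible. The helper names are hypothetical, and the op sequence Torch-Spyre actually emits may differ.

```python
def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def scale(m, s):
    """Multiply every element by the scalar s."""
    return [[s * v for v in row] for row in m]

def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def addmm_decomposed(inp, mat1, mat2, beta=1.0, alpha=1.0):
    """beta * inp + alpha * (mat1 @ mat2), built from three simpler ops."""
    return add(scale(inp, beta), scale(matmul(mat1, mat2), alpha))

inp = [[1.0, 1.0], [1.0, 1.0]]
m1 = [[1.0, 2.0], [3.0, 4.0]]
m2 = [[5.0, 6.0], [7.0, 8.0]]
print(addmm_decomposed(inp, m1, m2, beta=2.0, alpha=0.5))
```

Each helper corresponds to an op the device can already run, so the composite never needs a kernel of its own.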

When debugging slow models, the first question to ask is whether anything fell through to the CPU fallback. See Supported Operations for the current matrix and Adding Operations to enable a new one.
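That first check can be modeled as a lookup in priority order. The op sets below are taken from the examples on this page ("add" and "mul" stand in for pointwise ops) and are assumptions for illustration only; the authoritative list is the supported operations matrix, not this sketch.

```python
# Illustrative op sets, copied from the examples on this page (not exhaustive).
NATIVE = {"add", "mul", "mm", "bmm"}
CUSTOM = {"rms_norm", "layer_norm", "gelu"}
DECOMPOSED = {"addmm"}

def coverage_layer(op):
    """Return the first layer, in priority order, that handles the op."""
    if op in NATIVE:
        return "native"
    if op in CUSTOM:
        return "custom"
    if op in DECOMPOSED:
        return "decomposition"
    return "cpu_fallback"

def fallback_ops(ops):
    """List the ops in a trace that would leave the device."""
    return [op for op in ops if coverage_layer(op) == "cpu_fallback"]

trace = ["embedding", "rms_norm", "mm", "addmm", "gelu", "sin"]
print(fallback_ops(trace))
```

Any op the function returns is a candidate for a new decomposition or custom op if it sits on the hot path.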


7. Compilation pipeline

The compiled path runs the standard PyTorch pipeline (FX capture, AOTAutograd, Inductor scheduler) and inserts three Spyre-specific passes: layout propagation, work division (span_reduction followed by work_distribution), and scratchpad / LX planning. The output is a SuperDSC JSON artifact that the Deeptools backend (a proprietary compiler) turns into a device binary.

The Torch-Spyre compilation pipeline from torch.compile to device binary

The compilation pipeline. Spyre-specific passes (orange) operate on two IR levels: the FX graph (before Inductor lowering) and the LoopLevel IR (before codegen). Gray boxes are PyTorch-standard.
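To make the work-division pass more concrete, the sketch below splits one tensor dimension into contiguous per-core spans. It is not the span_reduction or work_distribution algorithm: the function split_evenly is hypothetical, and a real pass also has to respect stick boundaries, indivisible reduction dimensions, and per-core memory limits.

```python
def split_evenly(extent, cores=32):
    """Return (start, stop) spans that cover range(extent) across the cores."""
    base, extra = divmod(extent, cores)
    spans, start = [], 0
    for core in range(cores):
        # The first `extra` cores take one additional element each.
        size = base + (1 if core < extra else 0)
        spans.append((start, start + size))
        start += size
    return spans

spans = split_evenly(1024, cores=32)
print(spans[0], spans[-1])          # 32 rows per core
print(split_evenly(10, cores=4))    # uneven case: sizes 3, 3, 2, 2
```

Changing the core count changes every span, which is one reason a lower core count alters work-division decisions and not just throughput.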

Two terms appear throughout the compiler docs:

  • SuperDSC is the current Spyre IR. It is JSON. One artifact per scheduled kernel encodes the per-core schedule, tensor descriptors, and the compute op. Artifacts are cached through the standard torch.compile cache.

  • KTIR is the planned successor, an MLIR-based dialect designed as a community specification for dataflow accelerators. See the KTIR RFC.

For the full pipeline reference, see Compiler Architecture and Inductor Frontend.


8. Dtype defaults and casting

Spyre’s default compute dtype is torch.float16. fp32 inputs are down-cast to fp16 before reaching the device. This is correct for most inference workloads but can surprise users coming from CPU or GPU training paths. Down-cast warnings are emitted by default; set TORCH_SPYRE_DOWNCAST_WARN=0 to suppress them. bfloat16, fp8 variants, and integer dtypes are supported in the runtime but have narrower op coverage on the compiled path.

If your model has a numerically sensitive layer, check the supported operations matrix for that op’s dtype coverage.
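The effect of the down-cast can be previewed on the host without a device. The sketch below rounds Python floats to IEEE 754 half precision with the standard struct module. It shows generic fp16 rounding only; how Spyre handles values beyond the fp16 range is not stated on this page, so the overflow branch reflects struct's behavior, not the device's.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

for value in (0.1, 2049.0, 65504.0):
    print(value, "->", to_fp16(value))

try:
    to_fp16(70000.0)
except OverflowError as err:
    # struct refuses values beyond the largest finite fp16 (65504).
    print("out of fp16 range:", err)
```

Integers above 2048 lose their low bit and anything above 65504 is out of range, so large accumulations and unscaled logits are the usual places to check first.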


9. Running models today: FMS vs stock HuggingFace

Today the production path for LLM inference on Spyre is through IBM’s Foundation Model Stack (FMS), which provides Spyre-aware model implementations. Granite 3.3 8B runs in production this way:

import torch
import torch_spyre  # registers the device
from fms.models import get_model

model = get_model("granite", "3.3-8b-instruct", device_type="spyre")
compiled = torch.compile(model, backend="spyre")

This describes the state of the stack today and will change as the stack matures. Once op coverage broadens, dynamic shapes land, and KV-cache handling stabilizes, the same workloads will run with stock AutoModelForCausalLM.from_pretrained(...).to("spyre") plus torch.compile and no model-side changes. The FMS path will remain supported and will become one of several entry points.

If you are running a model that FMS already supports, use FMS today. If you are prototyping a new architecture, target the compiled path directly and expect to file ops as you go. Check this page or the supported operations matrix for the current state.


10. Hardware constraints checklist

Constraints that show up as compile-time errors or unexpected behavior:

Constraint                   What it means in practice
---------------------------  -------------------------------------------------------------
128-byte alignment           Inner dimensions are padded up to a multiple of 64 fp16 elements (a stick).
No HW scalar immediates      Scalar constants in the FX graph are rewritten to size-1 tensors via spyre::constant.
Indivisible reduction dims   Some reduction dimensions cannot be split across cores; work distribution honors this.
Static shapes                Dynamic shapes are work-in-progress. Shape-polymorphic models may recompile per shape.
Per-core memory limit        Each core’s tensor footprint must fit its addressable device memory range (separate from the 2 MB LX).
fp16 default                 fp32 is down-cast to fp16 with a warning; set TORCH_SPYRE_DOWNCAST_WARN=0 to suppress.
SENCORES=32                  Default core count; lowering it for debugging changes work-division decisions.
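The 128-byte alignment constraint is easy to check before compiling. The sketch below rounds an inner dimension up to a whole number of sticks and reports the padding cost for an fp16 tensor. The helper names are hypothetical; only BYTES_IN_STICK = 128 and the 64-element stick come from this page.

```python
BYTES_IN_STICK = 128
FP16_BYTES = 2
ELEMS_PER_STICK = BYTES_IN_STICK // FP16_BYTES   # 64 fp16 elements

def padded_inner_dim(n, stick=ELEMS_PER_STICK):
    """Round n up to the next multiple of the stick size."""
    return -(-n // stick) * stick

def padding_overhead(shape):
    """Return (padded inner dim, bytes of padding) for an fp16 tensor."""
    *outer, inner = shape
    padded = padded_inner_dim(inner)
    rows = 1
    for dim in outer:
        rows *= dim
    return padded, rows * (padded - inner) * FP16_BYTES

print(padding_overhead((1024, 256)))   # already stick-aligned: no padding
print(padding_overhead((1024, 100)))   # padded to 128 elements per row
```

Choosing inner dimensions that are multiples of 64 avoids the padding entirely. In the second case the padding adds 56 KiB to a tensor that holds 200 KiB of data.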


Where to go next