Glossary
This page is the lookup reference for terms used across Torch-Spyre documentation. For the conceptual primer that introduces these terms in context, see Key concepts.
Other pages can reference any entry here with the MyST term role:
{term}`stickification` renders as a hyperlink to the definition below.
- BYTES_IN_STICK
The 128-byte alignment constant used by the runtime, compiler, and tensor-layout code. One stick at fp16 holds 64 elements. The size matches the natural granularity of LPDDR5 ↔ LX scratchpad transfers on Spyre. See Key concepts §4.
- corelet
One of two execution units inside a Spyre core. Each corelet contains an 8×8 systolic PE array (the PT execution unit) and a 1D SFU vector unit. Both corelets in a core share the same 2 MB LX scratchpad.
- dataflow
An execution model in which operations fire as soon as their inputs are ready, rather than being driven by a program counter. Spyre executes a compile-time-scheduled dataflow graph, which is what gives it deterministic latency. See Dataflow Accelerator Architecture.
- DCI
Data Conversion Information. The `DataConversionInfo` struct (built by `generate_dci()` in `spyre_mem.cpp`) that bundles loop ranges, host and device strides, and dtype info. The runtime feeds it to `copyAsync` to drive a host ↔ LPDDR5 DMA transfer.
- decomposition
An FX graph rewrite that turns one ATen op into a sequence of Spyre-native or custom ops. Example: `aten.addmm` decomposes into `matmul + scale + add`. Decompositions are how Torch-Spyre covers ATen ops that have no single hardware-level equivalent. See Key concepts §6.
- Deeptools
IBM’s proprietary backend compiler that consumes the SuperDSC JSON IR and emits a Spyre device binary. Torch-Spyre is the open-source frontend; Deeptools is the closed backend. See Compiler architecture.
- DMA
Direct Memory Access. On Spyre, the PCIe path that carries tensor data between host memory and the device’s LPDDR5.
- FixedTiledLayout
A Torch-Spyre subclass of Inductor’s `FixedLayout` that augments the PyTorch `(size, stride)` description with a SpyreTensorLayout carrying tiled device-side shape, a host-to-device stride map, and the device dtype. This is the layout abstraction that makes tiled tensors representable inside Inductor. See Tensor Layouts.
- flex runtime
The Spyre device runtime that the C++ `SpyreAllocator` wraps. It owns the underlying device memory and issues kernel launches without exposing raw pointers (an IBM Z security requirement).
- fold
An affine-transform parameterization in SuperDSC (`alpha * index + beta`) that lets one JSON artifact describe the per-core behavior of all 32 cores compactly. Fold properties cover core, corelet, row, and time dimensions.
- graph break
An interruption inside a `torch.compile`-d region where Inductor cannot lower an op, so the partial result round-trips to the CPU, the unsupported op runs there, and the data comes back. A single graph break in the hot path can wipe out the performance gains from surrounding compiled code. See Key concepts §6.
- HBM
In SuperDSC field names (e.g. `memOrg_.hbm`), `hbm` is a legacy label for device memory in general. Spyre’s actual device memory is LPDDR5, not HBM.
- KTIR
KernelTile IR. The MLIR-based dialect designed as the successor to SuperDSC. KTIR generalizes the SuperDSC concepts (compute tiles, scratchpad staging, compile-time core partitioning) into a community specification for any dataflow accelerator. See the KTIR RFC.
- LPDDR5
Spyre’s off-chip device memory. 128 GB on the PCIe card. Equivalent in role to a GPU’s HBM, but with a different memory technology and a different cost/bandwidth profile. The legacy HBM field name in SuperDSC refers to LPDDR5.
- LX planning
The compiler pass that decides which tensors live in the LX scratchpad versus LPDDR5 at each point in the computation. Gated by `config.lx_planning` (env var `LX_PLANNING=1`). See Scratchpad Planning.
- LX scratchpad
The 2 MB SRAM scratchpad on each Spyre core. Compiler-managed — there is no hardware cache. Both corelets in a core share the same scratchpad. See Key concepts §3.
- OpFunc
A Deeptools primitive that implements one hardware operation on Spyre. Native ATen ops map to single OpFuncs; custom ops lower to one or more OpFuncs. SuperDSC `computeOp_` entries reference OpFuncs by name.
- PE array
Processing Element array. An 8×8 systolic array of multiply-accumulate units inside each corelet, used for matrix-style compute through the PT execution unit.
- PrivateUse1
PyTorch’s official extension mechanism for out-of-tree backends. Torch-Spyre uses it to register `"spyre"` as a first-class device name without forking PyTorch. See Runtime.
- PT
The Processing Tensor execution unit on each corelet. Backed by the PE array, it runs matrix-style compute (matmul, fused pointwise epilogues). The other unit on a corelet is the SFU.
- restickify
An explicit re-tile op (`spyre::restickify`) inserted by the `insert_restickify` compiler pass when two adjacent operations disagree on tile structure. Preserves correctness when layout propagation cannot pick one consistent tiling for the producer and consumer. See Inductor frontend.
- SDSC
See SuperDSC. The two terms are interchangeable in code and filenames (e.g. `sdsc_0.json`, `generate_sdsc()`).
- SENCORES
The number of Spyre cores the compiler targets. Default 32 (one full card). Lowering it via the `SENCORES` environment variable is primarily a debugging tool; it changes work-division decisions and can be useful for isolating per-core behavior.
- SFP
See SFU. Used interchangeably in some code paths.
- SFU
Special Function Unit (sometimes Special Function Processor, SFP). The 1D vector unit on each corelet that handles non-linear activations such as GELU, softmax, and other element-wise functions the PT unit does not implement.
- span reduction
The first of two work-division passes (`span_reduction()`). Analyzes the iteration space of each reduction and determines how its span can be split across cores. Followed by work distribution. See Work Division Planning.
- SPMD
Single Program, Multiple Data. Every core runs the same program on its own slice of the data, picked by core ID. Spyre’s execution model is SPMD across the 32 cores.
- SpyreTensorImpl
The C++ subclass of `at::TensorImpl` that carries Spyre-specific layout metadata (a SpyreTensorLayout) alongside the standard PyTorch tensor fields. Registered through the PrivateUse1 hook system.
- SpyreTensorLayout
The descriptor inside a SpyreTensorImpl (and a FixedTiledLayout) that carries the tiled device-side size, a stride map from host dimensions to device dimensions, and the device dtype.
- stick
A 128-byte aligned memory chunk on Spyre, equal to 64 fp16 elements. The unit of LPDDR5 ↔ LX transfer and the basic granularity of the tiled tensor layout. Defined by the BYTES_IN_STICK constant.
- stickification
The transformation from a host-strided PyTorch layout to a tiled Spyre device layout. Run during the `propagate_spyre_tensor_layouts` pass on the LoopLevel IR. After this pass every `ComputedBuffer` carries a FixedTiledLayout. See Inductor frontend.
- SuperDSC
Super Design Space Config. Torch-Spyre’s current JSON-based IR. One artifact per scheduled kernel; encodes the per-core schedule, tensor descriptors, memory addresses, and the compute op. Cached through the standard `torch.compile` artifact system. The successor is KTIR. See Key concepts §7.
- tile
A contiguous sub-tensor assigned to a single core. On Spyre, a tile is built from one or more sticks.
- work distribution
The second of two work-division passes (`work_distribution()`). Assigns the spans identified by span reduction to the 32 cores. Enforces equal stick counts per core (no load imbalance) and per-core addressable device memory limits. See Work Division Planning.
- work slice
The slice of the iteration space assigned to a single core by work distribution. Encoded in SuperDSC as `coreIdToWkSlice_` and `numWkSlicesPerDim_`.
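
Several entries above (stick, fold, SPMD, work distribution, work slice) describe the same arithmetic from different angles, so a small worked example may help tie them together. The sketch below is illustrative only: the helper names (`elements_per_stick`, `sticks_needed`, `work_slice`) are hypothetical and do not come from Torch-Spyre, and it assumes a one-dimensional split in which the fold index is the core ID and the result is a starting stick offset. The real compiler passes and the SuperDSC encoding are more general than this.

```python
# Illustrative sketch only. The helper names below are hypothetical and are
# not part of Torch-Spyre; only the two constants come from the glossary.
BYTES_IN_STICK = 128  # stick size in bytes
NUM_CORES = 32        # default SENCORES value


def elements_per_stick(dtype_bytes: int) -> int:
    """Number of elements of a given width that fit in one stick."""
    return BYTES_IN_STICK // dtype_bytes


def sticks_needed(num_elements: int, dtype_bytes: int) -> int:
    """Sticks required to hold num_elements, rounding up to whole sticks."""
    per_stick = elements_per_stick(dtype_bytes)
    return -(-num_elements // per_stick)  # ceiling division


def work_slice(core_id: int, total_sticks: int, num_cores: int = NUM_CORES):
    """Return the (start, stop) stick range for one core.

    Assumption: the slice start follows the fold form alpha * index + beta,
    with index = core_id. Every core runs this same function on its own ID,
    which is the SPMD pattern.
    """
    if not 0 <= core_id < num_cores:
        raise ValueError("core_id out of range")
    if total_sticks % num_cores != 0:
        # Mirrors the "equal stick counts per core" rule in this toy model.
        raise ValueError("total_sticks must divide evenly across cores")
    alpha = total_sticks // num_cores  # sticks per core
    beta = 0                           # offset of core 0
    start = alpha * core_id + beta
    return start, start + alpha


# A 131072-element fp16 tensor (2 bytes per element) occupies 2048 sticks.
total = sticks_needed(131072, dtype_bytes=2)
first = work_slice(0, total)    # (0, 64)
last = work_slice(31, total)    # (1984, 2048)
```

Because `alpha` and `beta` are the same for every core, a single pair of numbers describes all 32 slices, which is the compactness the fold entry refers to.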