# Glossary

This page is the lookup reference for terms used across Torch-Spyre documentation.
For the conceptual primer that introduces these terms in context, see
[Key concepts](key_concepts.md).

Other pages can reference any entry here with the MyST `term` role:
`` {term}`stickification` `` renders as a hyperlink to the definition below.

:::{glossary}
BYTES_IN_STICK
  The 128-byte alignment constant used by the runtime, compiler, and tensor-layout code.
  One stick at fp16 holds 64 elements. The size matches the natural granularity of
  LPDDR5 ↔ LX scratchpad transfers on Spyre.
  See [Key concepts §4](key_concepts.md#4-sticks-and-tiled-tensors).

corelet
  One of two execution units inside a Spyre core. Each corelet contains an 8×8 systolic
  PE array (the {term}`PT` execution unit) and a 1D {term}`SFU` vector unit.
  Both corelets in a core share the same 2 MB {term}`LX scratchpad`.

dataflow
  An execution model in which operations fire as soon as their inputs are ready, rather
  than being driven by a program counter. Spyre executes a compile-time-scheduled
  dataflow graph, which is what gives it deterministic latency.
  See [Dataflow Accelerator Architecture](../architecture/dataflow_architecture.md).

DCI
  Data Conversion Information. The `DataConversionInfo` struct (built by
  `generate_dci()` in `spyre_mem.cpp`) that bundles loop ranges, host and device
  strides, and dtype info. The runtime feeds it to `copyAsync` to drive a
  host ↔ LPDDR5 DMA transfer.

decomposition
  An FX graph rewrite that turns one ATen op into a sequence of Spyre-native or custom
  ops. Example: `aten.addmm` decomposes into `matmul + scale + add`. Decompositions are
  how Torch-Spyre covers ATen ops that have no single hardware-level equivalent.
  See [Key concepts §6](key_concepts.md#6-graph-breaks).

Deeptools
  IBM's proprietary backend compiler that consumes the {term}`SuperDSC` JSON IR and
  emits a Spyre device binary. Torch-Spyre is the open-source frontend; Deeptools is
  the closed backend.
  See [Compiler architecture](../compiler/architecture.md).

DMA
  Direct Memory Access. On Spyre, the PCIe path that carries tensor data between host
  memory and the device's LPDDR5.

FixedTiledLayout
  A Torch-Spyre subclass of Inductor's `FixedLayout` that augments the PyTorch
  `(size, stride)` description with a {term}`SpyreTensorLayout` carrying tiled
  device-side shape, a host-to-device stride map, and the device dtype. This is the
  layout abstraction that makes tiled tensors representable inside Inductor.
  See [Tensor Layouts](../user_guide/tensors_and_layouts.md).

flex runtime
  The Spyre device runtime that the C++ `SpyreAllocator` wraps. It owns the underlying
  device memory and issues kernel launches without exposing raw pointers (an IBM Z
  security requirement).

fold
  An affine-transform parameterization in {term}`SuperDSC` (`alpha * index + beta`)
  that lets one JSON artifact describe the per-core behavior of all 32 cores compactly.
  Fold properties cover core, corelet, row, and time dimensions.

graph break
  An interruption inside a `torch.compile`-d region where Inductor cannot lower an op,
  so the partial result round-trips to the CPU, the unsupported op runs there, and the
  data comes back. A single graph break in the hot path can wipe out the performance
  gains from surrounding compiled code.
  See [Key concepts §6](key_concepts.md#6-graph-breaks).

HBM
  In SuperDSC field names (e.g. `memOrg_.hbm`), `hbm` is a legacy label for device
  memory in general. Spyre's actual device memory is {term}`LPDDR5`, not HBM.

KTIR
  KernelTile IR. The MLIR-based dialect designed as the successor to {term}`SuperDSC`.
  KTIR generalizes the SuperDSC concepts (compute tiles, scratchpad staging,
  compile-time core partitioning) into a community specification for any dataflow
  accelerator.
  See the [KTIR RFC](https://github.com/torch-spyre/rfcs/blob/main/0682-KtirSpec/0682-KtirSpecRFC.md).

LPDDR5
  Spyre's off-chip device memory. 128 GB on the PCIe card.
  Equivalent in role to a GPU's HBM, but with a different memory technology and a
  different cost/bandwidth profile. The legacy {term}`HBM` field name in SuperDSC
  refers to LPDDR5.

LX planning
  The compiler pass that decides which tensors live in the {term}`LX scratchpad`
  versus {term}`LPDDR5` at each point in the computation. Gated by
  `config.lx_planning` (env var `LX_PLANNING=1`).
  See [Scratchpad Planning](../compiler/scratchpad_planning.md).

LX scratchpad
  The 2 MB SRAM scratchpad on each Spyre core. It is compiler-managed; there is no
  hardware cache. Both corelets in a core share the same scratchpad.
  See [Key concepts §3](key_concepts.md#3-memory-hierarchy).

OpFunc
  A {term}`Deeptools` primitive that implements one hardware operation on Spyre.
  Native ATen ops map to single OpFuncs; custom ops lower to one or more OpFuncs.
  SuperDSC `computeOp_` entries reference OpFuncs by name.

PE array
  Processing Element array. An 8×8 systolic array of multiply-accumulate units inside
  each corelet, used for matrix-style compute through the {term}`PT` execution unit.

PrivateUse1
  PyTorch's official extension mechanism for out-of-tree backends. Torch-Spyre uses it
  to register `"spyre"` as a first-class device name without forking PyTorch.
  See [Runtime](../runtime/index.md).

PT
  The Processing Tensor execution unit on each corelet. Backed by the
  {term}`PE array`, it runs matrix-style compute (matmul, fused pointwise epilogues).
  The other unit on a corelet is the {term}`SFU`.

restickify
  An explicit re-tile op (`spyre::restickify`) inserted by the `insert_restickify`
  compiler pass when two adjacent operations disagree on tile structure. Preserves
  correctness when layout propagation cannot pick one consistent tiling for the
  producer and consumer.
  See [Inductor frontend](../compiler/inductor_frontend.md).

SDSC
  See {term}`SuperDSC`. The two terms are interchangeable in code and filenames
  (e.g. `sdsc_0.json`, `generate_sdsc()`).

SENCORES
  The number of Spyre cores the compiler targets.
  Default 32 (one full card). Lowering it via the `SENCORES` environment variable is
  primarily a debugging tool; it changes work-division decisions and can be useful for
  isolating per-core behavior.

SFP
  See {term}`SFU`. Used interchangeably in some code paths.

SFU
  Special Function Unit (sometimes Special Function Processor, SFP). The 1D vector unit
  on each corelet that handles non-linear activations such as GELU, softmax, and other
  element-wise functions the {term}`PT` unit does not implement.

span reduction
  The first of two work-division passes (`span_reduction()`). Analyzes the iteration
  space of each reduction and determines how its span can be split across cores.
  Followed by {term}`work distribution`.
  See [Work Division Planning](../compiler/work_division_planning.md).

SPMD
  Single Program, Multiple Data. Every core runs the same program on its own slice of
  the data, picked by core ID. Spyre's execution model is SPMD across the 32 cores.

SpyreTensorImpl
  The C++ subclass of `at::TensorImpl` that carries Spyre-specific layout metadata
  (a {term}`SpyreTensorLayout`) alongside the standard PyTorch tensor fields.
  Registered through the {term}`PrivateUse1` hook system.

SpyreTensorLayout
  The descriptor inside a {term}`SpyreTensorImpl` (and a {term}`FixedTiledLayout`)
  that carries the tiled device-side size, a stride map from host dimensions to device
  dimensions, and the device dtype.

stick
  A 128-byte aligned memory chunk on Spyre, equal to 64 fp16 elements. The unit of
  LPDDR5 ↔ LX transfer and the basic granularity of the tiled tensor layout.
  Defined by the {term}`BYTES_IN_STICK` constant.

stickification
  The transformation from a host-strided PyTorch layout to a tiled Spyre device layout.
  Run during the `propagate_spyre_tensor_layouts` pass on the LoopLevel IR. After this
  pass every `ComputedBuffer` carries a {term}`FixedTiledLayout`.
  See [Inductor frontend](../compiler/inductor_frontend.md).

SuperDSC
  Super Design Space Config. Torch-Spyre's current JSON-based IR.
  One artifact per scheduled kernel; encodes the per-core schedule, tensor descriptors,
  memory addresses, and the compute op. Cached through the standard `torch.compile`
  artifact system. The successor is {term}`KTIR`.
  See [Key concepts §7](key_concepts.md#7-compilation-pipeline).

tile
  A contiguous sub-tensor assigned to a single core. On Spyre, a tile is built from one
  or more {term}`stick`s.

work distribution
  The second of two work-division passes (`work_distribution()`). Assigns the spans
  identified by {term}`span reduction` to the 32 cores. Enforces equal stick counts per
  core (no load imbalance) and per-core addressable device memory limits.
  See [Work Division Planning](../compiler/work_division_planning.md).

work slice
  The slice of the iteration space assigned to a single core by
  {term}`work distribution`. Encoded in SuperDSC as `coreIdToWkSlice_` and
  `numWkSlicesPerDim_`.
:::
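
The stick arithmetic behind the {term}`stick` and {term}`BYTES_IN_STICK` entries can be sketched in a few lines of Python. This is an illustrative sketch only, not Torch-Spyre code: the helper names `elements_per_stick` and `sticks_for_row` and the `DTYPE_BYTES` table are hypothetical, and rounding a row up to a whole number of sticks is an assumption about padding rather than documented behavior.

```python
# Stick size in bytes, as stated in the BYTES_IN_STICK glossary entry.
BYTES_IN_STICK = 128

# Element sizes in bytes for a few dtypes (hypothetical, illustrative subset).
DTYPE_BYTES = {"fp16": 2, "fp32": 4, "int8": 1}


def elements_per_stick(dtype: str) -> int:
    """Return how many elements of `dtype` fit in one 128-byte stick."""
    return BYTES_IN_STICK // DTYPE_BYTES[dtype]


def sticks_for_row(num_elements: int, dtype: str) -> int:
    """Return the sticks needed for a row of `num_elements` elements.

    Assumption: a partially filled stick is padded, so we round up.
    """
    per_stick = elements_per_stick(dtype)
    return -(-num_elements // per_stick)  # ceiling division


print(elements_per_stick("fp16"))   # 64, matching the glossary entry
print(sticks_for_row(100, "fp16"))  # 2: one full stick plus one partial stick
```

Only the fp16 figure (64 elements per stick) comes from the glossary; the values for other dtypes follow from dividing 128 bytes by the element size.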