# Key concepts

This page introduces the terms and ideas that the rest of the
documentation assumes you have seen at least once: dataflow execution,
sticks and tiled tensors, the LX scratchpad, the eager and compiled
paths, graph breaks, and the four-layer op coverage strategy. Each
section is short on purpose. Cross-references at the end of each
section point to the deeper treatment of each topic.

For the full design narrative, see
[How Torch-Spyre works](how_torch_spyre_works.md). For a one-line
definition of a specific term, see the [glossary](glossary.md).

---

## 1. Execution model

A GPU follows the Single Instruction, Multiple Threads (SIMT) model:
thousands of threads execute in lock-step on different data. Spyre
executes a **dataflow** graph instead: each operation fires as soon as
its inputs are ready, and the schedule is fixed at compile time. There
is no runtime thread dispatcher and no hardware cache, so execution
latency is **deterministic**.

:::{figure} ../_static/images/how-torch-spyre-works/figA-latency-comparison.svg
:alt: GPU latency profile compared to Spyre's flat, deterministic latency
:width: 680px
:align: center

Illustrative comparison of per-step latency. GPU execution sees jitter
from thread scheduling, cache evictions, and dynamic dispatch. The
compiler-planned dataflow on Spyre produces a flat latency profile for
the same model.
:::

:::{figure} ../_static/images/dataflow-dag.svg
:alt: Two operations firing in parallel as soon as their inputs become available
:width: 680px
:align: center

Dataflow firing rule: an operation runs as soon as all of its inputs
are available. Two independent branches execute in parallel; a join
node waits for both before firing.
:::

On Spyre, every decision that a GPU runtime would make dynamically
(which core runs what, when data moves, where it resides) is made ahead
of time by the Torch-Spyre compiler. See
[Dataflow Accelerator Architecture](../architecture/dataflow_architecture.md)
for the full treatment.
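
The firing rule can be illustrated with a small pure-Python simulation.
This is a conceptual sketch only: the graph, the node names, and the
`fire_in_waves` helper are hypothetical, and on Spyre the schedule is
fixed by the compiler rather than discovered at run time.

```python
def fire_in_waves(inputs_of):
    """Group nodes into waves; a node fires once all of its inputs have fired.

    inputs_of maps each node name to the list of nodes it depends on.
    """
    done = set()
    pending = set(inputs_of)
    waves = []
    while pending:
        # Every node whose inputs are all available fires in the same wave.
        ready = sorted(
            node for node in pending
            if all(dep in done for dep in inputs_of[node])
        )
        if not ready:
            raise ValueError("cycle or missing input in graph")
        waves.append(ready)
        done.update(ready)
        pending.difference_update(ready)
    return waves


# Hypothetical graph: two independent branches and one join node.
graph = {
    "load_x": [],
    "load_y": [],
    "relu": ["load_x"],
    "exp": ["load_y"],
    "add": ["relu", "exp"],
}
waves = fire_in_waves(graph)
print(waves)  # [['load_x', 'load_y'], ['exp', 'relu'], ['add']]
```

The two branches land in the same wave because neither depends on the
other, while `add` waits until both of its inputs have fired.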

---

## 2. Hardware

A Spyre card has 32 cores. Each core has 2 corelets that share a 2 MB
LX scratchpad. Inside each corelet is an 8×8 PE array (systolic, used
for matmul-style compute on the PT unit) and a 1D SFU/SFP vector unit
(used for non-linear ops such as GELU and softmax). Cores connect via
a bi-directional ring at 128 B per cycle per direction.

:::{figure} ../_static/images/how-torch-spyre-works/fig-spyre-core-architecture.svg
:alt: A single Spyre core with two corelets sharing the LX scratchpad
:width: 680px
:align: center

One Spyre core. The two corelets each have a PE array and an SFU,
sharing the 2 MB LX scratchpad. The card has 32 cores connected by a ring.
:::

`SENCORES` controls how many cores the compiler targets. The default is
32; set the `SENCORES` environment variable to a lower value for
debugging. The default dtype is `torch.float16`.
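
The per-card totals follow from the numbers above. The sketch below is
plain arithmetic, not a Torch-Spyre API: `card_totals` and its constant
names are hypothetical, and the `cores` argument only mimics the effect
of lowering `SENCORES`.

```python
CORELETS_PER_CORE = 2    # two corelets per core
PE_ROWS, PE_COLS = 8, 8  # 8x8 PE array per corelet
LX_MB_PER_CORE = 2       # 2 MB LX scratchpad shared by a core's corelets


def card_totals(cores=32):
    """Return aggregate resources for the given number of targeted cores."""
    corelets = cores * CORELETS_PER_CORE
    return {
        "corelets": corelets,
        "processing_elements": corelets * PE_ROWS * PE_COLS,
        "lx_mb": cores * LX_MB_PER_CORE,
    }


print(card_totals())         # full card
print(card_totals(cores=4))  # e.g. a debugging run with fewer cores
```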

---

## 3. Memory hierarchy

Spyre has two memory tiers. LPDDR5 is 128 GB of off-chip device memory,
equivalent in role to a GPU's HBM. The LX scratchpad is 2 MB of on-core
SRAM. There is no hardware cache. The compiler decides which tensors
reside in LX at each point in the computation and emits explicit
load/store instructions to move data.

:::{figure} ../_static/images/how-torch-spyre-works/fig1-memory-hierarchy.svg
:alt: Spyre memory hierarchy showing LPDDR5 device memory and per-core LX scratchpad
:width: 680px
:align: center

Data moves between 128 GB of LPDDR5 and the 2 MB per-core LX scratchpad
under explicit compiler control. There is no hardware cache.
:::

Two sizing constraints matter for users:

- **2 MB LX scratchpad per core.** Working sets that exceed this are
  staged in tiles. See [Scratchpad Planning](../compiler/scratchpad_planning.md).
- **Per-core addressable device memory limit.** This is a separate
  hardware address-space constraint, *not* the 2 MB LX size. Work
  division must keep each core's footprint under this limit.
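
To get a feel for the first constraint, the following sketch estimates
how many LX-sized pieces a single fp16 tensor needs. It is a rough
lower bound under stated assumptions: 2 MB is taken as 2 × 1024 × 1024
bytes, the whole scratchpad is assumed to be available to one tensor,
and `min_lx_chunks` is a hypothetical helper rather than part of the
scratchpad planner.

```python
import math

LX_BYTES = 2 * 1024 * 1024  # assumed binary interpretation of "2 MB"
FP16_BYTES = 2


def min_lx_chunks(shape, itemsize=FP16_BYTES):
    """Return (tensor size in bytes, minimum number of LX-sized chunks)."""
    nbytes = math.prod(shape) * itemsize
    return nbytes, math.ceil(nbytes / LX_BYTES)


for shape in [(1024, 256), (4096, 4096)]:
    nbytes, chunks = min_lx_chunks(shape)
    print(shape, nbytes, chunks)
```

A `(1024, 256)` fp16 tensor fits in one scratchpad, while a
`(4096, 4096)` tensor needs at least 16 pieces. In practice inputs,
outputs, and intermediates compete for the same 2 MB, so real tiles are
smaller.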

:::{note}
The SuperDSC IR has a legacy field named `hbm` that refers to device
memory in general. On Spyre that memory is LPDDR5, not HBM.
:::

---

## 4. Sticks and tiled tensors

The unit of memory transfer on Spyre is a **stick**: a 128-byte,
128-byte-aligned block that holds 64 fp16 elements
(`BYTES_IN_STICK = 128`). A stick matches the granularity of a load
between LPDDR5 and LX, so each transfer moves a full stick of
contiguous data.

Tensors on Spyre are therefore not stored the way PyTorch describes
them. A `(1024, 256)` fp16 tensor is physically four tiles of 64-element
sticks: `(4, 1024, 64)` on the device. The element at position `[i, 63]`
and the element at `[i, 64]` are *not* one stride apart. They sit in
different tiles.

:::{figure} ../_static/images/how-torch-spyre-works/fig3-tensor-layout.svg
:alt: A (1024, 256) host tensor reshaped into a (4, 1024, 64) tiled device layout
:width: 680px
:align: center

A `(1024, 256)` tensor on the host becomes a `(4, 1024, 64)` tiled
structure on the device. The stride breaks at every tile boundary, so
the layout cannot be expressed as a single integer stride per dimension.
:::
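
The index arithmetic behind the `(1024, 256)` example can be written out
directly. This is a minimal sketch that assumes the tiled structure is
stored contiguously in `(tile, row, element)` order. The helper names
are hypothetical and do not correspond to `SpyreTensorLayout` methods.

```python
STICK_ELEMS = 64        # fp16 elements per 128-byte stick
ROWS, COLS = 1024, 256  # host shape from the example above


def host_offset(i, j, cols=COLS):
    """Row-major element offset on the host."""
    return i * cols + j


def device_coords(i, j):
    """Map host index (i, j) to (tile, row, element-within-stick)."""
    tile, within = divmod(j, STICK_ELEMS)
    return tile, i, within


def device_offset(i, j, rows=ROWS):
    """Element offset in the assumed contiguous (tile, row, element) layout."""
    tile, row, within = device_coords(i, j)
    return (tile * rows + row) * STICK_ELEMS + within


for j in (63, 64):
    print(j, host_offset(0, j), device_coords(0, j), device_offset(0, j))
```

Columns 63 and 64 of row 0 are adjacent on the host (offsets 63 and 64)
but land at device offsets 63 and 65536 under this assumption, because
column 64 starts the second tile.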

PyTorch's `(size, stride)` model cannot describe this layout, so
Torch-Spyre introduces `FixedTiledLayout`, a subclass of Inductor's
`FixedLayout` that carries a `SpyreTensorLayout` descriptor with the
device-side shape and a host-to-device stride map. Two compiler
operations manage this layout:

- **Stickification** is the transformation from a host-strided layout to
  a tiled device layout. It runs during layout propagation.
- **Restickification** (`spyre::restickify`) is an explicit re-tile
  the compiler inserts when two adjacent ops disagree on tile
  structure.
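
As an illustration of what stickification does to the data, the sketch
below re-tiles a small nested list and then reverses the operation. It
is a toy model: `stickify` and `unstickify` are hypothetical helpers, a
stick width of 4 is used so the output stays readable, and the real
transformation is a compiler layout decision rather than a Python copy.

```python
def stickify(rows, stick=64):
    """Turn an (R, C) nested list into a (C // stick, R, stick) tiled list."""
    cols = len(rows[0])
    if cols % stick:
        raise ValueError("inner dimension must be a multiple of the stick width")
    return [
        [row[t * stick:(t + 1) * stick] for row in rows]
        for t in range(cols // stick)
    ]


def unstickify(tiles):
    """Inverse of stickify: stitch each row's sticks back together."""
    n_rows = len(tiles[0])
    return [[x for tile in tiles for x in tile[r]] for r in range(n_rows)]


host = [list(range(0, 8)), list(range(8, 16))]  # a (2, 8) tensor
tiles = stickify(host, stick=4)                 # becomes (2, 2, 4)
print(tiles)
print(unstickify(tiles) == host)
```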

For the full reference, see [Tensor Layouts](../user_guide/tensors_and_layouts.md).

---

## 5. Eager vs compiled path

A PyTorch program reaches Spyre on one of two paths. Which path a given
line of code takes determines its performance.

**Eager path.** When you write `x.to("spyre")` or `torch.add(x, y)` on
Spyre tensors, PyTorch's dispatcher routes each op to a Torch-Spyre C++
kernel registered against the `PrivateUse1` device key. Each op runs
immediately. The result is correct but slow: ops are not fused, the
scratchpad is not reused across ops, and many ops fall back to the CPU.

:::{figure} ../_static/images/pytorch-dispatcher.png
:alt: PyTorch dispatcher routing a Spyre tensor op to the registered Spyre kernel
:width: 50%
:align: center

The eager path: the PyTorch dispatcher looks up the `SPYRE` entry in
its dispatch table for each op and calls the registered Spyre kernel.
:::

**Compiled path.** When you wrap a model with
`torch.compile(model, backend="spyre")`, the FX graph passes through
the Torch-Spyre Inductor backend, which runs layout propagation,
work division, and scratchpad planning, then emits a SuperDSC artifact
that the Deeptools backend turns into a device binary.

```python
import torch
import torch_spyre  # registers the device

model = ...                     # any nn.Module
x = ...                         # example input tensor on the host
model = model.to("spyre")
compiled = torch.compile(model, backend="spyre")
out = compiled(x.to("spyre"))   # this is the fast path
```

If you only want to *test that something runs*, the eager path is fine.
For performance you must reach the compiled path. All Spyre-specific
optimizations (tiled layouts, multi-core work division, LX planning)
are implemented there.

---

## 6. Graph breaks

Inside a `torch.compile`-d region, anything Inductor cannot lower forces
a **graph break**: the compiled graph stops, the partial result
round-trips to the CPU, the unsupported op runs there, and the data
comes back. A single graph break in the hot path removes the
performance gains from the surrounding compiled code.

The most common cause is a missing op. Torch-Spyre handles ops in four
layers, in priority order:

:::{figure} ../_static/images/how-torch-spyre-works/fig5-op-layers.svg
:alt: Four-layer op coverage strategy on Spyre
:width: 680px
:align: center

Op coverage on Spyre. ATen ops are decomposed into native ops or custom
ops; custom ops lower to SuperDSC; everything else falls back to the CPU.
:::

1. **Native ops.** ATen ops that Deeptools supports directly
   (pointwise ops, `mm`, `bmm`).
2. **Custom ops.** Spyre-specific ops registered via
   `torch.library.custom_op` (e.g. `spyre::rms_norm`,
   `spyre::layer_norm`, `spyre::gelu`).
3. **Decompositions.** FX rewrites that turn an ATen op into a
   sequence of native or custom ops (e.g. `aten.addmm` →
   `matmul + scale + add`).
4. **CPU fallback.** Automatic transfer to the CPU for the long tail
   (`embedding`, `arange`, `sin`, `cos`, `tril`, `triu`, ...). The
   fallback is transparent, but it is acceptable only off the hot path.
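
The priority order can be modelled as a simple lookup. The sketch below
is illustrative only: the op sets contain just the examples named in
the list (with `add` and `mul` standing in for pointwise ops as an
assumption), and `coverage_layer` and `cpu_fallbacks` are hypothetical
helpers, not Torch-Spyre functions. The authoritative coverage is the
supported-operations matrix.

```python
# Example entries only; real coverage lives in the supported-operations matrix.
NATIVE = {"add", "mul", "mm", "bmm"}  # "add"/"mul" are assumed pointwise examples
CUSTOM = {"spyre::rms_norm", "spyre::layer_norm", "spyre::gelu"}
DECOMPOSED = {"aten.addmm"}


def coverage_layer(op):
    """Return the first layer, in priority order, that handles the op."""
    if op in NATIVE:
        return "native"
    if op in CUSTOM:
        return "custom"
    if op in DECOMPOSED:
        return "decomposition"
    return "cpu_fallback"


def cpu_fallbacks(ops):
    """List the ops that would fall through to the CPU."""
    return [op for op in ops if coverage_layer(op) == "cpu_fallback"]


model_ops = ["embedding", "mm", "spyre::rms_norm", "aten.addmm", "add", "tril"]
print(cpu_fallbacks(model_ops))  # ['embedding', 'tril']
```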

When debugging slow models, the first question to ask is whether
anything fell through to the CPU fallback. See
[Supported Operations](../user_guide/supported_operations.md) for the
current matrix and [Adding Operations](../compiler/adding_operations.md)
to enable a new one.

---

## 7. Compilation pipeline

The compiled path runs the standard PyTorch pipeline (FX capture,
AOTAutograd, Inductor scheduler) and inserts three Spyre-specific
passes: layout propagation, work division (`span_reduction` followed
by `work_distribution`), and scratchpad / LX planning. The output is
a SuperDSC JSON artifact that the Deeptools backend (a proprietary
compiler) turns into a device binary.

:::{figure} ../_static/images/how-torch-spyre-works/fig4-compilation-pipeline.svg
:alt: The Torch-Spyre compilation pipeline from torch.compile to device binary
:width: 680px
:align: center

The compilation pipeline. Spyre-specific passes (orange) operate on
two IR levels: the FX graph (before Inductor lowering) and the
LoopLevel IR (before codegen). Gray boxes are PyTorch-standard.
:::

Two terms appear throughout the compiler docs:

- **SuperDSC** is the current Spyre IR. It is JSON. One artifact per
  scheduled kernel encodes the per-core schedule, tensor descriptors,
  and the compute op. Artifacts are cached through the standard
  `torch.compile` cache.
- **KTIR** is the planned successor, an MLIR-based dialect designed as
  a community specification for dataflow accelerators. See the
  [KTIR RFC](https://github.com/torch-spyre/rfcs/blob/main/0682-KtirSpec/0682-KtirSpecRFC.md).

For the full pipeline reference, see
[Compiler Architecture](../compiler/architecture.md) and
[Inductor Frontend](../compiler/inductor_frontend.md).

---

## 8. Dtype defaults and casting

Spyre's default compute dtype is `torch.float16`. fp32 inputs are
down-cast to fp16 before reaching the device. This is correct for most
inference workloads but can surprise users coming from CPU or GPU
training paths. Down-cast warnings are emitted by default; set
`TORCH_SPYRE_DOWNCAST_WARN=0` to suppress them. bfloat16, fp8 variants,
and integer dtypes are supported in the runtime but have narrower op
coverage on the compiled path.

If your model has a numerically sensitive layer, check the
[supported operations](../user_guide/supported_operations.md) matrix
for that op's dtype coverage.
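
To see what an fp32-to-fp16 down-cast costs in precision, the sketch
below rounds a few values through IEEE 754 half precision using only
the Python standard library. It assumes the device cast uses
round-to-nearest-even like the `struct` module's `e` format does, which
this page does not confirm. `to_fp16` is a hypothetical helper.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


for original in (0.1, 1.0, 2049.0, 65504.0):
    rounded = to_fp16(original)
    error = abs(rounded - original)
    print(f"{original!r:>8} -> {rounded!r} (abs error {error:.3g})")
```

Under this assumption `0.1` becomes `0.0999755859375` and `2049.0`
becomes `2048.0`, because fp16 carries only 11 bits of significand
precision. The largest finite fp16 value is `65504.0`.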

---

## 9. Running models today: FMS vs stock HuggingFace

Today the production path for LLM inference on Spyre is through IBM's
**Foundation Model Stack (FMS)**, which provides Spyre-aware model
implementations. Granite 3.3 8B runs in production this way:

```python
import torch
import torch_spyre  # registers the device
from fms.models import get_model

model = get_model("granite", "3.3-8b-instruct", device_type="spyre")
compiled = torch.compile(model, backend="spyre")
```

This describes the state of the stack today and will change as the
stack matures. Once op coverage broadens, dynamic shapes land, and
KV-cache handling stabilizes, the same workloads will run with stock
`AutoModelForCausalLM.from_pretrained(...).to("spyre")` plus
`torch.compile` and no model-side changes. The FMS path will remain
supported and will become one of several entry points. If you are
running a model that FMS already supports, use FMS today. If you are
prototyping a new architecture, target the compiled path directly and
expect to file ops as you go. Check this page or the
[supported operations](../user_guide/supported_operations.md) matrix
for the current state.

---

## 10. Hardware constraints checklist

Constraints that show up as compile-time errors or unexpected behavior:

| Constraint | What it means in practice |
|---|---|
| **128-byte alignment** | Inner dimensions are padded up to a multiple of 64 fp16 elements (a stick). |
| **No HW scalar immediates** | Scalar constants in the FX graph are rewritten to size-1 tensors via `spyre::constant`. |
| **Indivisible reduction dims** | Some reduction dimensions cannot be split across cores; work distribution honors this. |
| **Static shapes** | Dynamic shapes are work-in-progress. Shape-polymorphic models may recompile per shape. |
| **Per-core memory limit** | Each core's tensor footprint must fit its addressable device memory range (separate from the 2 MB LX). |
| **fp16 default** | fp32 is down-cast to fp16 with a warning; set `TORCH_SPYRE_DOWNCAST_WARN=0` to suppress. |
| **`SENCORES=32`** | Default core count; lowering it for debugging changes work-division decisions. |
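
The alignment row is easy to check by hand. The sketch below rounds an
inner dimension up to a whole number of sticks and reports how much of
the padded size is padding. `padded_inner_dim` is a hypothetical
helper, and the calculation assumes fp16 (64 elements per stick).

```python
STICK_ELEMS = 64  # fp16 elements per 128-byte stick


def padded_inner_dim(n, stick_elems=STICK_ELEMS):
    """Round an inner dimension up to a multiple of the stick width."""
    if n <= 0:
        raise ValueError("dimension must be positive")
    return -(-n // stick_elems) * stick_elems  # ceiling division


for n in (64, 100, 256, 257):
    padded = padded_inner_dim(n)
    print(n, padded, f"{(padded - n) / padded:.0%} padding")
```

An inner dimension of 100 is padded to 128, and 257 is padded to 320,
so shapes just past a multiple of 64 pay the highest relative cost.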

---

## Where to go next

- Run something end to end: [Quickstart](quickstart.md).
- Learn the design story behind these concepts:
  [How Torch-Spyre works](how_torch_spyre_works.md).
- Look up a single term: [Glossary](glossary.md).
- Dig into a specific area:
  [Tensor Layouts](../user_guide/tensors_and_layouts.md),
  [Compiler Architecture](../compiler/architecture.md),
  [Runtime](../runtime/index.md).