# Runtime

The Torch-Spyre runtime layer manages device lifecycle, memory allocation, and kernel execution at inference time. This section covers the device-registration plumbing, the C++ tensor and allocator machinery, eager-mode dispatch, streams, and multi-card support.

## Responsibilities

- **Device registration**: registering `spyre` as a PyTorch device type
- **Tensor memory management**: allocating and freeing device DRAM (DDR) for `SpyreTensorImpl` objects
- **DMA transfers**: moving tensor data between host (CPU) memory and device (DDR) memory via the `to()` / `from_device()` APIs
- **Kernel dispatch**: loading compiled program binaries and orchestrating their execution across Spyre cores

:::{figure} ../_static/images/pytorch-dispatcher.png
:alt: PyTorch Dispatcher routing a Spyre tensor operation through the dispatch table
:width: 45%
:align: center

The PyTorch Dispatcher routes each operation to the correct device implementation. When a `torch.add` call carries Spyre tensors, the Dispatcher looks up `SPYRE` in its dispatch table and calls the registered `spyre__add_Tensor` kernel. Torch-Spyre registers all its eager runtime kernels in this table via `TORCH_LIBRARY_IMPL`.
:::

## Device Registration

Torch-Spyre registers `spyre` as a PyTorch device using the `PrivateUse1` mechanism, the standard PyTorch pathway for out-of-tree accelerators. Registration happens in `_autoload()` in `torch_spyre/__init__.py`:

```python
torch.utils.rename_privateuse1_backend("spyre")
torch._register_device_module("spyre", make_spyre_module())
```

This gives the device a human-readable name (`"spyre"`) without requiring any upstream PyTorch changes. A custom `SpyreGuardImpl` implements `c10::impl::DeviceGuardImplInterface` to handle device management and synchronization.
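The dispatch-table lookup described in the figure caption earlier in this section can be pictured as a dictionary keyed by operator and device key. The sketch below is a toy model in plain Python, not PyTorch's dispatcher and not Torch-Spyre code; `DISPATCH_TABLE`, `register_impl`, `dispatch`, and both kernels are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Tuple

# Toy dispatch table: (operator name, device key) -> kernel.
# All names here are hypothetical; this is not PyTorch's dispatcher.
DISPATCH_TABLE: Dict[Tuple[str, str], Callable] = {}


def register_impl(op: str, device_key: str) -> Callable:
    """Decorator that records a kernel for one (op, device key) pair."""
    def decorator(kernel: Callable) -> Callable:
        DISPATCH_TABLE[(op, device_key)] = kernel
        return kernel
    return decorator


def dispatch(op: str, device_key: str, *args):
    """Look up the kernel for this op and device key, then call it."""
    try:
        kernel = DISPATCH_TABLE[(op, device_key)]
    except KeyError:
        raise NotImplementedError(f"{op} has no kernel for {device_key}") from None
    return kernel(*args)


@register_impl("add.Tensor", "CPU")
def toy_cpu_add(a, b):
    return [x + y for x, y in zip(a, b)]


@register_impl("add.Tensor", "SPYRE")
def toy_spyre_add(a, b):
    # Stand-in for a device kernel; tags the result so the route is visible.
    return ("spyre", [x + y for x, y in zip(a, b)])
```

Calling `dispatch("add.Tensor", "SPYRE", [1, 2], [3, 4])` routes to `toy_spyre_add`, while an operator with no registered entry raises `NotImplementedError`. A missing entry on a real backend plays the same role: the operator has no implementation for that device until a kernel or fallback is registered.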
### Device Enumeration

`torch.spyre.device_count()` is handled by the PrivateUse1 hooks in `csrc/spyre_hooks.cpp`, which look up the visible-device set from a small group of environment variables read in `csrc/spyre_device_enum.cpp`:

| Variable | Effect |
|---|---|
| `AIU_WORLD_SIZE` | Overrides the visible device count. |
| `SPYRE_DEVICES` | Comma-separated list of device indices to expose. |
| `FLEX_DEVICE` | Selects the underlying flex runtime mode (PF or VF). |

The count itself comes from `flex::getNumDevices`.

## Key C++ Components

| File | Responsibility |
|------|---------------|
| `csrc/module.cpp` | pybind11 entry point for the `_C` extension module. Device registration itself happens in `torch_spyre/__init__.py::_autoload()`. |
| `csrc/spyre_tensor_impl.cpp` | `SpyreTensorImpl`, the device tensor backing store. |
| `csrc/spyre_mem.cpp` | Device memory allocation and DMA, including graph-free DMA and FlexAllocator support. |
| `csrc/spyre_allocator.cpp` | `SpyreAllocator`, which bridges PyTorch's `c10::Allocator` to `flex::FlexAllocator`. |
| `csrc/spyre_storage_impl.cpp` | `SpyreStorageImpl`, the storage object backing `SpyreTensorImpl`. |
| `csrc/spyre_views.cpp` | Tensor view and striding support on device, including `_reshape_alias`. |
| `csrc/spyre_guard.cpp` | `SpyreGuardImpl`, device guard and synchronization. |
| `csrc/spyre_stream.cpp` | Stream management for asynchronous execution. |
| `csrc/spyre_hooks.cpp` | `PrivateUse1HooksInterface`, wires PyTorch's PrivateUse1 hooks to Spyre. |
| `csrc/spyre_device_enum.cpp` | Visible-device enumeration. Reads `AIU_WORLD_SIZE`, `SPYRE_DEVICES`, `FLEX_DEVICE`. |
| `csrc/spyre_sendnn_utils.cpp` | Eager-mode helpers, including the `EAGER_MODE` env var. |
| `csrc/logging.cpp` | C++ debug logging, gated on `TORCH_SPYRE_DEBUG`. |
| `csrc/profiler/` | PyTorch Profiler (PrivateUse1) integration. |
| `csrc/attn_utils.cpp` | SDPA dispatch. Routes `scaled_dot_product_attention` to the Spyre backend, with GQA support. |

## Python Entry Point

`torch_spyre/__init__.py` is loaded automatically by PyTorch via the `torch.backends` entry point declared in `pyproject.toml`. This triggers device and backend registration without requiring an explicit import.

:::{figure} ../_static/images/spyre-device-allocator.png
:alt: Spyre device allocator call chain from torch.empty through SpyreAllocator to flex::FlexAllocator::allocate
:width: 40%
:align: center

The Spyre device allocator call chain. A `torch.empty(..., device="spyre")` call flows through `spyre_empty_strided` into `SpyreAllocator::allocate`, which calls `flex::FlexAllocator::allocate(nbytes)` ([`spyre_allocator.cpp:137`](https://github.com/torch-spyre/torch-spyre/blob/main/torch_spyre/csrc/spyre_allocator.cpp#L137)).
:::

## Memory Model

Spyre tensors live in off-chip LPDDR5. Before any kernel runs, the compiler stages the tiles it needs into a much smaller on-core LX scratchpad and the kernel reads from there. The runtime, though, only deals with the LPDDR5 side. Everything below is about how a Spyre tensor in Python turns into a real LPDDR5 allocation, and how that allocation eventually finds its way back to the pool.

:::{figure} ../_static/images/spyre-memory-hierarchy.svg
:alt: Spyre memory hierarchy showing host CPU, LPDDR5 device memory, and LX scratchpad
:width: 75%
:align: center

The two levels of memory the device sees. Full tensors stay in LPDDR5. The compiler emits load/store instructions that stage active tiles into the per-core LX scratchpad just in time for each kernel. The runtime owns the LPDDR5 allocation that backs every Spyre tensor.
:::

For the layout that lets the runtime actually walk one of those tensors, see [Tensors and Layouts](../user_guide/tensors_and_layouts.md). The next two sections cover what the C++ side of that looks like and how the lifetime ends.
### SpyreTensorImpl

A standard PyTorch `(size, stride)` pair cannot describe a tiled device tensor, so Torch-Spyre defines `SpyreTensorImpl` as a subclass of `TensorImpl`. The subclass adds one piece of data, a `SpyreTensorLayout`, that captures everything the runtime needs:

- `device_size`: the tensor's shape on device, including the extra tiling and padding dims.
- `stride_map`: the host stride for each device dim. A `-1` here means the dim is synthetic or fully padded.
- `device_dtype`: the on-device data format, for example `SEN169_FP16`.
- `dma_sizes` and `dma_strides`: a host-shape DMA descriptor used when copying views back to the host. They drive `copyAsync()` in `spyre_stream.cpp`.

Note that the handles returned to Python never carry a raw device pointer. That is a hard requirement on IBM Z.

:::{figure} ../_static/images/spyre-tensor-impl-anatomy.png
:alt: Nested boxes showing at::Tensor wrapping TensorImpl wrapping SpyreTensorImpl wrapping SpyreTensorLayout
:width: 80%
:align: center

What is behind a Spyre tensor, drawn as a stack of layers. Python only ever sees the outermost `at::Tensor` handle. Underneath, `c10::TensorImpl` carries the standard tensor metadata, and the Spyre subclass adds a `SpyreTensorLayout` that holds the device shape, the `stride_map`, the device dtype, and the DMA descriptor.
:::

### SpyreAllocator

`SpyreAllocator` (`csrc/spyre_allocator.cpp`) is a thin bridge between PyTorch's `c10::Allocator` and `flex::FlexAllocator`. Every `allocate(nbytes)` call passes straight through to `flex_alloc->allocate(nbytes)` and returns a `c10::DataPtr` with a `ReportAndDelete` callback wired in as its deleter. When the tensor's storage refcount hits zero, that deleter runs, updates the `DeviceStats` counters, and hands the allocation back to flex. The trigger is PyTorch's own refcount: Python's garbage collector is not in this loop at all.
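The refcount-driven release described above can be modelled in a few lines of plain Python. This is a sketch under stated assumptions, not the C++ implementation: `ToyPool`, `ToyBlock`, `_report_and_delete`, and `demo` are hypothetical names, and the model relies on CPython's reference counting, where `weakref.finalize` callbacks run as soon as the last reference to an object is dropped.

```python
import gc
import weakref


class ToyBlock:
    """Opaque handle: it carries a size but no raw address."""

    def __init__(self, nbytes: int) -> None:
        self.nbytes = nbytes


class ToyPool:
    """Hypothetical stand-in for a device memory pool with usage counters."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.in_use = 0      # bytes currently allocated
        self.num_frees = 0   # how many allocations have been returned

    def allocate(self, nbytes: int) -> ToyBlock:
        if self.in_use + nbytes > self.capacity:
            raise MemoryError("toy pool exhausted")
        self.in_use += nbytes
        block = ToyBlock(nbytes)
        # Install the deleter: it fires when the block's last reference dies.
        weakref.finalize(block, self._report_and_delete, nbytes)
        return block

    def _report_and_delete(self, nbytes: int) -> None:
        self.num_frees += 1    # update the statistics
        self.in_use -= nbytes  # hand the bytes back to the pool


def demo():
    """Drop two references to one block with the cyclic collector disabled."""
    gc.disable()  # only reference counting remains active
    try:
        pool = ToyPool(capacity=1024)
        block = pool.allocate(256)
        alias = block            # second reference to the same block
        del block                # one reference left: nothing is freed yet
        after_first_del = pool.in_use
        del alias                # last reference gone: the deleter runs now
        return after_first_del, pool.in_use, pool.num_frees
    finally:
        gc.enable()
```

In this model the bytes return to the pool only when the final reference is dropped, and they do so even with the cyclic garbage collector switched off, which mirrors the point that the release is driven by reference counts rather than by a collector pass.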
:::{figure} ../_static/images/spyre-tensor-lifetime.png
:alt: Five-step flowchart showing how a Python tensor going out of scope frees a Spyre allocation
:width: 75%
:align: center

What happens between a Python tensor going out of scope and the device allocation returning to the flex pool. The piece that connects the two ends is the `ReportAndDelete` callback that `SpyreAllocator` installs on every `c10::DataPtr` it hands out.
:::

Physical-frame (PF) and virtual-frame (VF) execution are *not* allocator strategies inside `SpyreAllocator`. The mode is picked by the `FLEX_DEVICE` environment variable, which configures the underlying flex runtime (see `csrc/spyre_device_enum.cpp`):

| Mode | Selection | Description |
|------|-----------|-------------|
| PF (Physical Frame) | `FLEX_DEVICE` set to a PF device | Direct hardware execution path. |
| VF (Virtual Frame) | `FLEX_DEVICE` set to a VF device | Virtualized hardware, used in multi-tenant deployments. |

## Eager Operations

Eager kernels reach the Spyre dispatch key from two Python sources.

The first is manual registrations in [`torch_spyre/ops/eager.py`](https://github.com/torch-spyre/torch-spyre/blob/main/torch_spyre/ops/eager.py), which use `torch.library.register_kernel` to wire up ops like `mm`, `silu`, `mish`, `fill_.Scalar`, `normal_`, `uniform_`, `_local_scalar_dense`, and `_copy_from`.

The second is CPU fallbacks in [`torch_spyre/ops/fallbacks.py`](https://github.com/torch-spyre/torch-spyre/blob/main/torch_spyre/ops/fallbacks.py), registered through `@register_fallback` (or the `register_fallback_default` helper for plain pass-throughs). These cover the long tail: `arange`, `embedding`, `cumsum`, `tril`/`triu`, `isin`, `bitwise_xor`/`bitwise_or`, `argmax`, and similar.

Five Inductor decompositions registered through `register_spyre_decomposition` also dispatch eagerly: `rms_norm`, `layer_norm`, `softplus`, `linear`, and `_scaled_dot_product_fused_attention_overrideable`.
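A CPU fallback usually follows a copy-in, compute, copy-out pattern. The sketch below models that pattern in plain Python and is an assumption about the general technique, not a copy of `torch_spyre/ops/fallbacks.py`; `ToyTensor`, `toy_cpu_fallback`, and `toy_cumsum` are hypothetical names and do not reflect the real `register_fallback` signature.

```python
from dataclasses import dataclass
from functools import wraps
from typing import Callable, List


@dataclass
class ToyTensor:
    """Hypothetical stand-in for a tensor: a payload plus a device tag."""
    data: List[float]
    device: str

    def to(self, device: str) -> "ToyTensor":
        # Models a host/device copy by re-tagging a copy of the payload.
        return ToyTensor(list(self.data), device)


def toy_cpu_fallback(cpu_impl: Callable) -> Callable:
    """Wrap a CPU-only implementation so it accepts device tensors."""
    @wraps(cpu_impl)
    def wrapper(*args):
        # Remember whether any input lived on the device.
        devices = {a.device for a in args if isinstance(a, ToyTensor)}
        target = "spyre" if "spyre" in devices else "cpu"
        # 1. Copy tensor arguments to the host.
        host_args = [a.to("cpu") if isinstance(a, ToyTensor) else a for a in args]
        # 2. Run the CPU implementation on host data.
        result = cpu_impl(*host_args)
        # 3. Copy the result back to where the inputs came from.
        return result.to(target)
    return wrapper


@toy_cpu_fallback
def toy_cumsum(t: ToyTensor) -> ToyTensor:
    assert t.device == "cpu"  # the wrapped body only ever sees host data
    total, out = 0.0, []
    for x in t.data:
        total += x
        out.append(total)
    return ToyTensor(out, "cpu")
```

The caller passes and receives tensors tagged `"spyre"` and never sees the round trip. The cost of the pattern is two copies per call, which is why fallbacks suit long-tail ops rather than hot paths.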
C++ kernels can still be registered through the usual `TORCH_LIBRARY_IMPL` block, but most of the public eager surface today comes from the Python sources above.

## Streams

Torch-Spyre supports stream-based asynchronous execution, following the same API pattern as `torch.cuda` streams:

| API | Description |
|-----|-------------|
| `torch.spyre.Stream()` | Create a new Spyre stream |
| `torch.spyre.stream(s)` | Pass-through helper used inside `with` blocks; the current-stream swap is performed by `Stream.__enter__/__exit__` |
| `torch.spyre.current_stream()` | Get the current stream for the device |
| `torch.spyre.default_stream()` | Get the default stream for the device |
| `torch.spyre.synchronize()` | Wait for all operations on all streams to complete |

Streams are implemented in `torch_spyre/streams.py` (Python) and `csrc/spyre_stream.cpp` (C++).

### Stream Pool

Each device keeps a fixed pool of streams (see `csrc/spyre_stream.cpp`). Stream `0` is the default. Streams `1` through `32` form the low-priority pool (`priority == 0`); streams `33` through `64` form the high-priority pool (any non-zero priority). Each pool holds 32 streams per device and allocates round-robin.

Note that `priority` is a binary switch, not a graded scale: every non-zero value selects the same high-priority pool.

## Multi-Card Support

Ensembles of up to 8 Spyre cards deliver up to 1 TB of aggregate device memory.

:::{admonition} Planned
:class: note

Multi-card collective communications (all-reduce, all-gather, reduce-scatter) using the standard PyTorch `ProcessGroup` API are planned but not yet implemented in this repository. The C++ tree currently has no `ProcessGroup` source files.
:::

## TODO

- Document kernel launch sequence and Control Block Stream design
- Document error handling and device reset