# Dataflow Accelerator Architecture

This document provides a reference overview of the dataflow accelerator
model as implemented in the IBM Spyre AI Card. It is intended both as
context for Torch-Spyre developers and as a general reference for the
dataflow accelerator design pattern.

:::{note}
**Key Concepts** — terms used throughout this page:

- **Tile** — a contiguous sub-tensor assigned to a single core.
- **Scratchpad** — fast on-core SRAM. The compiler manages it directly; there is no hardware cache.
- **DMA** — Direct Memory Access. On Spyre this is the PCIe path that carries tensor data between host memory and the device's LPDDR5.
- **Load/store** — the compiler-emitted instructions used by the on-core load/store units to move tiles between LPDDR5 and the LX scratchpad.
- **SPMD** — Single Program, Multiple Data. Every core runs the same program on its own slice of the data, picked by core ID.
- **SuperDSC** — the Spyre-specific kernel descriptor format. A single JSON describes one scheduled kernel operation across all cores.
- **DCI** — Data Conversion Information. The `DataConversionInfo` struct (built by `generate_dci()` in `spyre_mem.cpp`) that bundles loop ranges, host and device strides, and dtype info; the runtime feeds it to `copyAsync` to drive a host ↔ LPDDR5 DMA transfer.
:::
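To make the idea of "loop ranges plus host and device strides" concrete, here is a minimal sketch of a stride-driven copy between two flat buffers. It does not reflect the actual `DataConversionInfo` layout or the `copyAsync` interface; the function `strided_copy` and its parameter names are hypothetical and exist only for illustration.

```python
from itertools import product

def strided_copy(src, dst, loop_ranges, src_strides, dst_strides):
    """Copy elements between flat buffers using per-dimension strides.

    Hypothetical illustration only: loop_ranges gives the extent of each
    loop dimension, and the two stride tuples map a loop index to a flat
    offset on the source and destination side respectively.
    """
    for index in product(*(range(n) for n in loop_ranges)):
        src_off = sum(i * s for i, s in zip(index, src_strides))
        dst_off = sum(i * s for i, s in zip(index, dst_strides))
        dst[dst_off] = src[src_off]
    return dst

# A 2 x 3 row-major "host" tensor copied into a column-major "device" buffer.
host = [0, 1, 2, 3, 4, 5]
device = strided_copy(host, [None] * 6, loop_ranges=(2, 3),
                      src_strides=(3, 1), dst_strides=(1, 2))
# device is now [0, 3, 1, 4, 2, 5]
```

The same loop nest describes both sides of the transfer, so a layout change between host and device only needs a different stride tuple.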

## What is a Dataflow Accelerator?

Traditional von Neumann processors execute instructions sequentially,
fetching data from memory on demand. **Dataflow accelerators** invert
this model: computation is expressed as a directed acyclic graph (DAG)
of operations, and each node fires as soon as all its input operands
are ready. There is no program counter — the data itself drives
execution order. This eliminates most control-flow overhead and enables
deeply pipelined, high-throughput execution
\[[Dennis 1974](#ref-dennis1974), [Veen 1986](#ref-veen1986)\].

Key characteristics:

- Operations are scheduled by data availability, not program counters.
- Tensors are staged in local scratchpad memories close to compute units.
- The compiler is responsible for all data movement and scheduling.

:::{figure} ../_static/images/dataflow-dag.svg
:alt: Dataflow execution graph showing operations firing when their input tensors are ready
:width: 680px
:align: center

Dataflow execution of two independent branches (Linear + LayerNorm) that merge at a Concat node. **Green (solid border) = ready** — all inputs available, fires immediately. **Yellow (dashed border) = waiting** — blocked on upstream output. Both MatMuls fire in parallel because their inputs are independent. No program counter controls execution order — data availability does.
:::

### Dataflow Firing Rule

Execution follows the **dataflow firing rule**: an operation is eligible
to execute as soon as all of its input operands are available
\[[Dennis 1974](#ref-dennis1974)\]. Operands are modeled as **tokens**
that propagate through the graph, activating downstream operations. This
is the fundamental principle that distinguishes dataflow from
control-flow execution. In AI accelerators, dataflow execution operates
at the granularity of operators (or tiles), unlike instruction-level
dataflow in CPUs. Nodes with multiple inputs act as synchronization
points, potentially introducing backpressure that limits throughput.
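The firing rule can be sketched as a small token-counting simulation. This is a generic illustration of the rule, not Spyre's scheduler; the node names and the `run_dataflow` helper are assumptions made for the example.

```python
from collections import deque

def run_dataflow(graph, sources):
    """Fire each node once all of its input tokens have arrived.

    graph maps node -> list of input node names; sources are the nodes
    with no inputs, whose tokens are available at the start.
    """
    # Reverse mapping: which nodes consume each node's output token.
    consumers = {node: [] for node in graph}
    for node, inputs in graph.items():
        for src in inputs:
            consumers[src].append(node)

    pending = {node: len(inputs) for node, inputs in graph.items()}
    ready = deque(sources)           # nodes whose operands are all present
    fired = []

    while ready:
        node = ready.popleft()
        fired.append(node)           # "execute" the node, producing a token
        for consumer in consumers[node]:
            pending[consumer] -= 1   # deliver the token downstream
            if pending[consumer] == 0:
                ready.append(consumer)
    return fired

# Two independent branches that merge at a concat node (hypothetical names).
graph = {
    "x": [], "w1": [], "w2": [],
    "matmul_a": ["x", "w1"],
    "matmul_b": ["x", "w2"],
    "layernorm": ["matmul_b"],
    "concat": ["matmul_a", "layernorm"],
}
order = run_dataflow(graph, sources=["x", "w1", "w2"])
```

No node appears in `order` before all of its inputs, and `concat` fires last because it has to wait for both branches. That wait is the synchronization point described above.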

In the Spyre implementation, the unit that "fires" is a **SuperDSC**
(kernel descriptor). The compiler schedules this statically:
load/store instructions move the necessary **tiles** from LPDDR5 into
the scratchpad, and the SuperDSC kernel runs once all of its inputs are
resident there.

### Static vs Dynamic Dataflow

There are two major variants of dataflow execution
\[[Veen 1986](#ref-veen1986)\]:

- **Static dataflow** — dependencies are fixed at compile time. The
  hardware executes a pre-determined graph with known shapes and
  scheduling. This is simpler and more efficient for regular workloads.

- **Dynamic dataflow** — uses tagged tokens to track dependencies at
  runtime, supporting multiple concurrent instances of the same
  operation. This is more flexible but significantly more complex in
  hardware.

Modern AI accelerators (including Spyre) typically implement a **static
dataflow** model optimized for workloads where operator graphs and
tensor shapes are known ahead of time.
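With a fixed graph, the firing order can be worked out entirely ahead of time. The sketch below uses Python's standard `graphlib` module to group a small graph into "waves" of mutually independent nodes. It illustrates static scheduling in general and is not the Torch-Spyre compiler's algorithm; the graph and the `static_schedule` helper are hypothetical.

```python
from graphlib import TopologicalSorter

def static_schedule(graph):
    """Return a list of waves; nodes within a wave do not depend on each other.

    graph maps node -> set of predecessor nodes.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()                      # raises CycleError if not a DAG
    waves = []
    while sorter.is_active():
        wave = sorted(sorter.get_ready())  # everything runnable right now
        waves.append(wave)
        sorter.done(*wave)                 # mark the wave as complete
    return waves

graph = {
    "matmul_a": {"x", "w1"},
    "matmul_b": {"x", "w2"},
    "layernorm": {"matmul_b"},
    "concat": {"matmul_a", "layernorm"},
}
waves = static_schedule(graph)
# [['w1', 'w2', 'x'], ['matmul_a', 'matmul_b'], ['layernorm'], ['concat']]
```

Because the schedule is a plain data structure produced before execution, no tagged tokens or runtime dependency tracking are needed to run it.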

### Relationship to Out-of-Order Execution

Modern CPUs implement a limited form of dataflow through **out-of-order
(OOO) execution**: instructions execute when their operands are ready,
within a bounded hardware window. Dependencies are tracked dynamically
in hardware, originally via reservation stations and register tagging
\[[Tomasulo 1967](#ref-tomasulo1967)\].

Dataflow accelerators take this principle much further:

| Aspect | OOO CPU | Dataflow Accelerator |
|--------|---------|---------------------|
| Scope | Small instruction window (hundreds) | Entire computation graph |
| Dependency tracking | Dynamic, in hardware | Static, at compile time |
| Parallelism | Limited by window size | Limited by graph structure |
| Overhead | Significant (register renaming, speculation) | Minimal (pre-scheduled) |

This connection helps explain why dataflow architectures can be so
efficient for regular workloads: they trade the generality of dynamic
OOO scheduling for the performance of static, whole-graph optimization.

### Why Dataflow is Effective for Deep Learning

Deep neural networks exhibit properties that are particularly
well-suited to dataflow execution
\[[Chen 2017](#ref-chen2017), [Sze 2017](#ref-sze2017)\]:

- **Regular computation patterns** — operations like GEMM, convolution,
  and element-wise activations are highly predictable
- **High data reuse opportunities** — weights and activations are
  accessed repeatedly across layers and batches
- **Static execution graphs** — during inference (and often training),
  the operator graph is fixed and shapes are known

Dataflow architectures exploit these properties by:

- Maximizing data reuse in local scratchpad, avoiding redundant
  off-chip memory accesses
- Enabling pipeline parallelism across layers and operations
- Pre-planning most data movement at compile time, which keeps
  runtime allocation overhead low (though execution timing of the
  staged transfers still depends on runtime conditions)

**Data movement — not compute — is often the dominant cost** in DNN
execution \[[Sze 2017](#ref-sze2017)\]. In the transformer era,
FlashAttention \[[Dao 2022](#ref-dao2022)\] demonstrated this clearly:
the paper reports wall-clock speedups of up to 3× on GPT-2 training,
obtained by restructuring the attention computation to minimize memory
reads/writes while still computing exact attention.
Dataflow optimization targets this bottleneck directly by keeping active
data close to compute units and minimizing DDR round-trips.
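A back-of-envelope model shows how much tiling can reduce off-chip traffic. The sketch below counts bytes moved for `C = A @ B` when one output tile stays resident in scratchpad while the full K dimension streams through. It is a simplified model under stated assumptions (tile sizes divide the matrix sizes, inputs are not reused across output tiles), not a description of how Spyre kernels are actually tiled.

```python
def ddr_traffic_bytes(m, n, k, tile_m, tile_n, bytes_per_elem=2):
    """Off-chip traffic for C[m, n] = A[m, k] @ B[k, n] with resident output tiles."""
    if m % tile_m or n % tile_n:
        raise ValueError("this sketch assumes tile sizes divide the matrix sizes")
    tiles = (m // tile_m) * (n // tile_n)
    a_loads = tiles * tile_m * k   # each output tile reads a tile_m x k panel of A
    b_loads = tiles * k * tile_n   # and a k x tile_n panel of B
    c_stores = m * n               # every output element is written once
    return (a_loads + b_loads + c_stores) * bytes_per_elem

# fp16 operands (2 bytes each), 1024 x 1024 x 1024 matmul.
no_reuse = ddr_traffic_bytes(1024, 1024, 1024, tile_m=1, tile_n=1)
tiled = ddr_traffic_bytes(1024, 1024, 1024, tile_m=128, tile_n=128)
reduction = no_reuse / tiled   # roughly 120x less traffic with 128 x 128 tiles
```

The arithmetic work is identical in both cases; only the number of bytes crossing the DDR boundary changes.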

:::{note}
**For Torch-Spyre developers:** The dataflow model has direct
implications for how PyTorch operations are lowered. Because all data
movement is explicit and compiler-managed, the backend has to fix the
tiling strategy, the load/store schedule that stages tiles into
scratchpad, and the kernel descriptors (SuperDSC) at compile time —
there is no runtime fallback to a hardware cache. See
[Compiler Architecture](../compiler/architecture.md) and
[Work Division Planning](../compiler/work_division_planning.md) for
details.
:::

## Spyre Architecture Highlights

:::{figure} ../_static/images/telum2-spyre-chip.jpg
:alt: IBM Spyre Accelerator chip (left) and IBM Telum II processor (right)
:width: 680px
:align: center

The IBM Spyre Accelerator chip (left) and IBM Telum II processor (right). *Image credit: [IBM Newsroom](https://newsroom.ibm.com/ai-on-z).*
:::

| Feature | Detail |
|---------|--------|
| Cores | 32 AI accelerator cores |
| Technology | 5 nm |
| Memory per card | Up to 128 GB LPDDR5 |
| Peak performance | >300 TOPS per card |
| Supported data types | int4, int8, fp8, fp16 |
| Power envelope | 75 W per card |
| Host interface | PCIe |
| Max card cluster | 8 cards / 1 TB memory [^cluster] |

[^cluster]: Multiple Spyre cards can be clustered in a single IBM Z I/O drawer, sharing memory across cards for larger model capacity.

:::{figure} https://research-website-prod-cms-uploads.s3.us.cloud-object-storage.appdomain.cloud/IBM_AIU_PCIE_05_d6a1bd0d18.jpg
:alt: IBM Spyre Accelerator PCIe card
:width: 560px
:align: center

The IBM Spyre Accelerator PCIe card (reverse side), showing the physical form factor for IBM Z and Power systems. *Image credit: [IBM Newsroom](https://newsroom.ibm.com/ai-on-z).*
:::

Spyre implements a **hybrid dataflow** architecture: dataflow execution
drives the compute kernels, while control-flow mechanisms handle host
interaction, kernel sequencing, and device orchestration. This is
consistent with the design of most modern accelerators — pure dataflow
machines were largely unsuccessful historically
\[[Veen 1986](#ref-veen1986)\], and practical systems combine the
efficiency of dataflow execution with the flexibility of control flow
for coordination.

## Memory Hierarchy

Spyre exposes two levels of memory visible to the compiler.
Understanding this hierarchy is critical for performance: the LX
scratchpad has significantly lower latency than DDR (the device's
LPDDR5), so minimizing DDR round-trips through effective tiling is the
primary lever for optimizing kernel throughput.

1. **DDR (device DRAM)** — large, off-core storage for full tensors.
2. **LX Scratchpad** — fast, on-core storage for tiles actively being
   processed. The compiler emits load/store instructions to stage
   tiles between DDR and the scratchpad; there is no hardware cache
   to fall back on.

:::{figure} ../_static/images/spyre-memory-hierarchy.svg
:alt: Spyre two-level memory hierarchy showing DDR and LX Scratchpad
:width: 600px
:align: center

Spyre memory hierarchy. Full tensors live in off-chip LPDDR5 — 128 GB physical, of which 16 GB is reserved for ECC, leaving roughly 112 GB usable. Before a kernel runs, the compiler issues load/store instructions to stage the tiles it needs into the on-core LX Scratchpad. The PCIe link between the host and LPDDR5 is the DMA path. Tiling that keeps traffic on-chip is the main lever for performance.
:::

The end-to-end data path is:

```
Host → DDR → LX Scratchpad → Compute Units → LX Scratchpad → DDR → Host
```

The compiler generates two kinds of artifacts to drive this pipeline:

- **Load/store instructions** that move tiles between DDR and the LX
  scratchpad at the right time.
- **SuperDSC** — JSON kernel descriptors that specify the computation
  performed on each tile once it arrives in scratchpad. SuperDSC is
  being superseded by KTIR, a tile-based MLIR intermediate
  representation — see [RFC 0682](../rfcs/index.md).
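The load, compute, store rhythm implied by these two artifacts can be sketched in a few lines. Plain Python lists stand in for DDR and the scratchpad, and `run_tiled` is a hypothetical helper; real load/store instructions and SuperDSC descriptors look nothing like this.

```python
def run_tiled(ddr_input, tile_len, kernel):
    """Apply `kernel` tile by tile: load -> compute -> store."""
    ddr_output = [None] * len(ddr_input)
    for start in range(0, len(ddr_input), tile_len):
        # Load: stage one tile from "DDR" into the "scratchpad".
        scratchpad = ddr_input[start:start + tile_len]
        # Compute: the kernel only ever touches the resident tile.
        scratchpad = [kernel(v) for v in scratchpad]
        # Store: write the finished tile back to "DDR".
        ddr_output[start:start + len(scratchpad)] = scratchpad
    return ddr_output

def relu(v):
    return max(v, 0.0)

result = run_tiled([-2.0, -1.0, 0.5, 3.0, -4.0], tile_len=2, kernel=relu)
# result is [0.0, 0.0, 0.5, 3.0, 0.0]
```

The last tile is shorter than `tile_len`, which is why the store uses the length of the tile actually loaded.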

## Execution Model

Each Spyre core executes a **kernel**: a self-contained computation
on a **tile** (a contiguous sub-tensor region) of data. There is no
global instruction scheduler. Execution is driven by compile-time
planning and data readiness, with only minimal runtime control for
transfer sequencing and resource arbitration. The compiler determines
three things **at compile time**:

1. **Work division** — how to split a tensor operation across cores
   (see [Work Division Planning](../compiler/work_division_planning.md))
2. **Data staging** — when, and with what load/store instructions, to
   move tiles between LPDDR5 and the LX scratchpad
3. **Kernel specification** — the SuperDSC (Spyre kernel descriptor)
   JSON that fully describes the operation: operand shapes, data
   types, tiling parameters, and the sequence of compute instructions
   for the PT array.

Cores execute in SPMD (Single Program, Multiple Data) fashion: every
core follows a common program structure and uses its core ID to select
the tiles it operates on. Different cores may be in different kernel
phases at any given time.
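As a rough illustration of SPMD work selection, the sketch below splits the rows of a tensor across 32 cores and lets each core derive its own slice from its core ID. The even-split policy and the names `core_slice` and `spmd_program` are assumptions for the example; the actual division strategy is described in Work Division Planning.

```python
NUM_CORES = 32  # core count from the feature table on this page

def core_slice(core_id, length, num_cores=NUM_CORES):
    """Half-open [start, stop) range of rows handled by `core_id`.

    Rows are split as evenly as possible; the first `length % num_cores`
    cores take one extra row.
    """
    if not 0 <= core_id < num_cores:
        raise ValueError("core_id out of range")
    base, extra = divmod(length, num_cores)
    start = core_id * base + min(core_id, extra)
    stop = start + base + (1 if core_id < extra else 0)
    return start, stop

def spmd_program(core_id, rows):
    """Same program on every core; only the slice differs."""
    start, stop = core_slice(core_id, len(rows))
    return [sum(row) for row in rows[start:stop]]

slices = [core_slice(core_id, 100) for core_id in range(NUM_CORES)]
```

With 100 rows, the first four cores take four rows each and the remaining cores take three, so the slices tile the whole range with no gaps or overlap.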

:::{figure} ../_static/images/spyre-core-microarchitecture.png
:alt: Spyre core microarchitecture showing PT units, PE, SFP, LX Scratchpad, and device memory
:width: 50%
:align: center

Spyre core microarchitecture. Each Spyre core contains two corelets (Corelet 0 and Corelet 1) that share a single 2 MB LX scratchpad (SRAM). A corelet is built from an 8 × 8 systolic Processing Element (PE) array, used for matrix-style compute on the PT execution unit, plus a 1D Special Function Unit (SFU) vector unit that handles non-linear activations such as GELU and softmax. Compiler-emitted load/store instructions move tiles between the LX scratchpad and off-chip LPDDR5; there is no hardware cache. Cores talk to each other over a bi-directional ring interconnect at 128 B per cycle per direction. The architecture descends from IBM's RaPiD AI accelerator (Venkataramani et al., ISCA 2021, [DOI:10.1109/ISCA52012.2021.00021](https://doi.org/10.1109/ISCA52012.2021.00021)).
:::
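The numbers in the caption support some quick capacity checks. The sketch below tests whether a set of tiles fits in the 2 MB scratchpad and gives a lower bound on ring transfer time at 128 B per cycle. It assumes 2 MB means 2 × 2^20 bytes and ignores hop latency, contention, and any space the compiler reserves for other purposes; the helper names are hypothetical.

```python
import math

SCRATCHPAD_BYTES = 2 * 1024 * 1024  # 2 MB LX scratchpad shared by both corelets (assumed binary MB)
RING_BYTES_PER_CYCLE = 128          # per direction, from the caption

def working_set_bytes(tiles):
    """tiles: iterable of (rows, cols, bytes_per_elem) resident at the same time."""
    return sum(rows * cols * width for rows, cols, width in tiles)

def fits_in_scratchpad(tiles, budget=SCRATCHPAD_BYTES):
    return working_set_bytes(tiles) <= budget

def ring_transfer_cycles(num_bytes, per_cycle=RING_BYTES_PER_CYCLE):
    """Lower bound on cycles to push num_bytes in one ring direction."""
    return math.ceil(num_bytes / per_cycle)

FP16 = 2  # bytes per element
matmul_tiles = [(256, 512, FP16), (512, 256, FP16), (256, 256, FP16)]
small_ok = fits_in_scratchpad(matmul_tiles)             # 640 KB working set
large_ok = fits_in_scratchpad([(1024, 1024, FP16)] * 3)  # 6 MB working set
cycles = ring_transfer_cycles(256 * 256 * FP16)          # one 128 KB output tile
```

Checks like these are the kind of constraint a tiling strategy has to satisfy before a kernel can be scheduled.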

## Comparison with GPU and Other Accelerators

Spyre's dataflow model differs from GPU execution in several
fundamental ways that directly affect how Torch-Spyre lowers PyTorch
operations:

| Aspect | GPU (CUDA) | Spyre Dataflow |
|--------|-----------|----------------|
| Scheduling | Warp-level SIMT | Data-driven, core SPMD |
| Memory model | Shared memory + global | Scratchpad + DDR |
| Data movement | Implicit caching | Explicit, compiler-scheduled load/store between DDR and scratchpad |
| Supported precision | fp32 / bf16 / fp16 / int8 | int4 / int8 / fp8 / fp16 |
| Compiler model | Hybrid (AOT kernels with optional JIT from PTX) | Primarily AOT (scheduling and data movement planned at compile time) |
| Parallelism granularity | Thread blocks [^gpu] | Core tiles [^gpu] |

[^gpu]: GPU thread blocks and Spyre core tiles serve analogous roles (distributing work across parallel units) but differ in how scheduling is performed: thread blocks are dispatched by a hardware scheduler at runtime, while core tile assignment is fixed at compile time in Spyre.

Dataflow accelerators also differ from **systolic arrays** (as used in
Google TPUs \[[Jouppi 2017](#ref-jouppi2017),
[Jouppi 2023](#ref-jouppi2023)\]): systolic arrays move data through a
fixed pipeline of processing elements in a regular, predetermined
pattern, while dataflow architectures schedule execution based on graph
dependencies, allowing more flexible communication patterns at the cost
of more complex compilation \[[Xu 2024](#ref-xu2024)\]. Formal
frameworks such as MAESTRO \[[Kwon 2020](#ref-kwon2020)\] provide
analytical tools for reasoning about the data reuse, bandwidth, and
energy trade-offs across these architectural styles.

### Modern Dataflow-Inspired Accelerators

Spyre is part of a broader trend of dataflow-inspired AI accelerators:

| Accelerator | Organization | Key characteristic |
|-------------|-------------|-------------------|
| IBM Spyre | IBM | Static dataflow with explicit scratchpad management |
| SambaNova RDU | SambaNova Systems | Reconfigurable dataflow architecture \[[Prabhakar 2017](#ref-prabhakar2017)\] |
| Cerebras WSE | Cerebras Systems | Spatial/dataflow execution across wafer-scale fabric |
| Graphcore IPU | Graphcore | Bulk Synchronous Parallel model with explicit data movement |

Each makes different trade-offs in programmability, scalability, and
hardware complexity, but all share the principle of optimizing data
movement as the primary lever for performance and energy efficiency.
For a comprehensive survey of hardware accelerators in the LLM era, see
\[[Kachris 2025](#ref-kachris2025)\].

## Limitations and Challenges

While dataflow architectures offer strong parallelism and efficiency
for regular workloads, they face several challenges:

- **Compiler complexity** — extracting, scheduling, and mapping
  dataflow graphs to hardware requires sophisticated compilation
  passes. The compiler must handle tiling, data layout, memory
  allocation, and multi-core scheduling — all at compile time.

- **Irregular workloads** — dynamic control flow (e.g., variable-length
  sequences, conditional branching) and variable tensor shapes reduce
  the effectiveness of static dataflow scheduling. Operations that
  cannot be pre-planned may require fallback mechanisms (e.g., host
  execution or alternative kernels).

- **Placement and scheduling** — efficiently mapping large operator
  graphs onto a fixed number of cores with limited scratchpad memory
  remains a hard optimization problem, especially as model sizes grow.

- **Synchronization and backpressure** — operations that require
  multiple inputs (e.g., concatenation, residual additions) act as
  synchronization points where all upstream paths must complete before
  execution can proceed. The Torch-Spyre compiler mitigates this
  through tiling strategies that balance work across cores and minimize
  idle time at synchronization boundaries (see
  [Work Division Planning](../compiler/work_division_planning.md)).

- **Data movement overhead** — dataflow cuts down on redundant
  movement, but every tile that has to come from DDR still costs
  latency on the way into and out of the scratchpad. At scale, the
  communication network can become the bottleneck, especially for
  operations with poor data locality.

- **Limited general-purpose adoption** — dataflow architectures have
  been most successful in domain-specific applications (AI inference,
  signal processing) where computation patterns are predictable. They
  are not well-suited for general-purpose workloads with irregular
  memory access patterns \[[Veen 1986](#ref-veen1986)\].
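The cost of the synchronization points mentioned above can be estimated with a simple model. The sketch below assumes equal-cost work items and a barrier at which every core waits for the slowest one. It is a simplification for intuition, not a model of Spyre's actual scheduling; `sync_idle_fraction` is a hypothetical helper.

```python
import math

def sync_idle_fraction(work_items, num_cores):
    """Fraction of core-time spent idle at a barrier.

    Assumes `work_items` equal-cost items are split across cores as evenly
    as possible and that all cores wait for the most heavily loaded one.
    """
    if work_items <= 0 or num_cores <= 0:
        raise ValueError("work_items and num_cores must be positive")
    per_core_max = math.ceil(work_items / num_cores)
    total_slots = per_core_max * num_cores   # core-time until the barrier clears
    return 1 - work_items / total_slots

balanced = sync_idle_fraction(64, 32)    # 2 items on every core
unbalanced = sync_idle_fraction(33, 32)  # one core gets 2 items, the rest get 1
```

A single extra work item leaves almost half of the available core-time idle in this model, which is why tiling that divides evenly across cores matters at synchronization boundaries.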

These challenges are active areas of research in the Torch-Spyre
project. The [Compiler Architecture](../compiler/architecture.md)
documentation describes the specific strategies used to address tiling,
scheduling, and fallback handling.

## References

(ref-dennis1974)=
- **\[Dennis 1974\]** J. B. Dennis and D. P. Misunas, "A Preliminary Architecture for a Basic Data-Flow Processor," *Proc. 2nd Annual Symposium on Computer Architecture (ISCA)*, 1974. [DOI:10.1145/641675.642111](https://doi.org/10.1145/641675.642111)

(ref-veen1986)=
- **\[Veen 1986\]** A. H. Veen, "Dataflow Machine Architecture," *ACM Computing Surveys*, vol. 18, no. 4, pp. 365–396, 1986. [DOI:10.1145/27633.28055](https://doi.org/10.1145/27633.28055)

(ref-tomasulo1967)=
- **\[Tomasulo 1967\]** R. M. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," *IBM Journal of Research and Development*, vol. 11, no. 1, pp. 25–33, 1967. [DOI:10.1147/rd.111.0025](https://doi.org/10.1147/rd.111.0025)

(ref-chen2017)=
- **\[Chen 2017\]** Y.-H. Chen *et al.*, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," *IEEE Journal of Solid-State Circuits*, vol. 52, no. 1, pp. 127–138, 2017. [DOI:10.1109/JSSC.2016.2616357](https://doi.org/10.1109/JSSC.2016.2616357)

(ref-sze2017)=
- **\[Sze 2017\]** V. Sze *et al.*, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," *Proceedings of the IEEE*, vol. 105, no. 12, pp. 2295–2329, 2017. [DOI:10.1109/JPROC.2017.2761740](https://doi.org/10.1109/JPROC.2017.2761740)

(ref-jouppi2017)=
- **\[Jouppi 2017\]** N. P. Jouppi *et al.*, "In-Datacenter Performance Analysis of a Tensor Processing Unit," *Proc. 44th Annual International Symposium on Computer Architecture (ISCA)*, 2017. [DOI:10.1145/3079856.3080246](https://doi.org/10.1145/3079856.3080246)

(ref-prabhakar2017)=
- **\[Prabhakar 2017\]** R. Prabhakar *et al.*, "Plasticine: A Reconfigurable Architecture for Parallel Patterns," *Proc. 44th Annual International Symposium on Computer Architecture (ISCA)*, 2017. [DOI:10.1145/3079856.3080256](https://doi.org/10.1145/3079856.3080256)

(ref-kwon2020)=
- **\[Kwon 2020\]** H. Kwon *et al.*, "MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings," *IEEE Micro*, vol. 40, no. 3, pp. 20–29, 2020. [DOI:10.1109/MM.2020.2985963](https://doi.org/10.1109/MM.2020.2985963)

(ref-dao2022)=
- **\[Dao 2022\]** T. Dao *et al.*, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," *Advances in Neural Information Processing Systems (NeurIPS)*, 2022. [arXiv:2205.14135](https://arxiv.org/abs/2205.14135)

(ref-jouppi2023)=
- **\[Jouppi 2023\]** N. P. Jouppi *et al.*, "TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings," *Proc. 50th Annual International Symposium on Computer Architecture (ISCA)*, 2023. [DOI:10.1145/3579371.3589350](https://doi.org/10.1145/3579371.3589350)

(ref-xu2024)=
- **\[Xu 2024\]** R. Xu *et al.*, "A Survey of Design and Optimization for Systolic Array-based DNN Accelerators," *ACM Computing Surveys*, vol. 56, no. 1, 2024. [DOI:10.1145/3604802](https://doi.org/10.1145/3604802)

(ref-kachris2025)=
- **\[Kachris 2025\]** C. Kachris, "A Survey on Hardware Accelerators for Large Language Models," *Applied Sciences*, vol. 15, no. 2, art. 586, 2025. [DOI:10.3390/app15020586](https://doi.org/10.3390/app15020586)

## Further Reading

- [IBM Spyre Accelerator Overview](spyre_accelerator.md)
- [Compiler Architecture](../compiler/architecture.md)
- [Tensor Layouts](../user_guide/tensors_and_layouts.md)