# Scratchpad (LX) optimization

Where LX scratchpad planning sits in torch-spyre today, and what we are
working on next.

**Quick navigation:**

- [Hardware context](#hardware-context)
- [Why scratchpad planning matters](#why-scratchpad-planning-matters)
- [Assumptions](#assumptions)
- [Pipeline position](#pipeline-position)
- [Optimizations on softmax](#optimizations-on-softmax)
- [Implementation](#implementation)
- [Solvers](#solvers)
- [Co-optimization with work-distribution](#co-optimization-with-work-distribution)
- [Current limitations](#current-limitations)
- [Target patterns](#target-patterns)
- [Future work](#future-work)

## Hardware context

Each Spyre core has a 2 MB on-core scratchpad (LX) alongside shared HBM.
LX reads are much cheaper than HBM and have no cross-core contention, so
the planner aims to keep reused tensors on-core and let HBM traffic happen
only at the graph boundary.

:::{figure} ../_static/images/lx/memory-hierarchy.svg
:alt: Spyre memory hierarchy. Large slow HBM shared by 32 cores, each with a 2 MB LX scratchpad.
:width: 480px
:align: center

HBM is plentiful but slow and shared. LX is small but fast and core-local.
The compiler picks which buffers live where.
:::

| Parameter | Value | Config |
|---|---|---|
| Total LX per core | 2 MB | fixed |
| Backend-reserved fraction | 20% | `DXP_LX_FRAC_AVAIL` |
| Usable LX per core | ~1.6 MB | `int((2<<20) * (1 - frac_avail))` |
| Alignment | 128-byte (stick) | implicit |
| Cores | 1 to 32 | `SENCORES` |
| Per-core HBM span limit | 256 MB | hardware, separate from LX |
| Inter-core data ring | yes | not yet used by compiler |
| Inter-core reduce-sum ring | yes | not yet used by compiler |

## Why scratchpad planning matters

Spyre is often memory-bound: compute cores stall waiting on HBM. Every byte
the compiler can keep on LX between producer and consumer is a byte the
runtime never has to fetch.

Take a single-core softmax over a `(512, 1024)` fp16 tensor, 1 MB of input.
The lowered op sequence is `max → sub → exp → sum → div`. Total HBM traffic
depends on which intermediates land on LX:

| Stage | What changes | HBM read+write | Speedup vs baseline |
|---|---|---|---|
| 1. baseline (HBM only) | every intermediate goes through HBM | 8MN + 4N | 1.0x |
| 2. pin reduction outputs to LX | `max` and `sum` outputs stay on-core | 8MN | ~1.0x (reductions are tiny) |
| 3. + in-place ops on LX | `exp`, `sub` reuse their input's address | 3MN | ~2.7x |
| 4. + clone the input to LX | one pass over HBM, everything else stays | 2MN | ~4.0x |

The ideal memory time after stage 4 is roughly 25% of baseline. End-to-end
measurements on this softmax kernel show the median runtime drop from
32.5 µs to 23.7 µs, a 27% reduction. The gap between the ideal and the
measured result is fixed per-bundle overhead.

The four stages map onto code under `torch_spyre/_inductor/scratchpad/`:
LX-eligible op outputs (stage 2), in-place reuse (stage 3), and
`CloneInputNodesPass` (stage 4).

## Assumptions

### LX state survives kernel boundaries

The planner assumes LX state persists across SuperDSC bundle boundaries. It
operates on the flat operations list before fusion and has no awareness of
where bundle boundaries will fall, so allocation decisions can span
multiple bundles.

There is a correctness gap under VF multi-tenancy: the runtime may wipe LX
on context switch at any bundle boundary. Once SpyreCode with symbolic
addresses is available, fusion will not be limited by the number of
tensors used by the bundle, and bundle boundaries should only land at
FallbackKernels, which are visible to the planner.

### Working sets are already right-sized

Tile size selection (BLOCK_M, BLOCK_N, BLOCK_K, etc.) to fit operands
within ~1.6 MB is a pre-Inductor concern, the same class of problem GPU
autotuners solve. Spad opt begins after tiling. Given operations whose
working sets are feasible, the planner decides which buffers to pin to
LX, at what addresses, and for how long. Tiling determines whether data
*can* fit; spad opt determines whether it *does* fit.

### No eviction from LX

Buffers placed on LX stay until end-of-life. There is no mechanism to
move a buffer to HBM and reload it later. This is deliberate. Eviction
only wins when a buffer is read many times on LX, goes dormant, then is
read many times again, which is rare in practice. Pre-Inductor tiling
already keeps per-op working sets small. The remaining problem (which
buffers to keep on LX when accumulated live buffers exceed capacity) is
better solved by smarter placement and spill decisions at allocation
time than by runtime eviction with its graph mutation complexity and
extra HBM round-trips.

## Pipeline position

Scratchpad planning runs as the last pass in `CustomPreSchedulingPasses`:

```
1. deadcode_elimination
2. propagate_spyre_tensor_layouts       # assign FixedTiledLayout
3. optimize_restickify_locations
4. finalize_layouts
5. insert_restickify
6. span_reduction                       # work division pass 1 (mandatory)
7. work_distribution                    # work division pass 2 (optional)
8. if config.lx_planning:
       scratchpad_planning              # ← THIS PASS
```

Two ordering constraints fix this slot:

- **Work division must run first.** Spad opt needs `op_it_space_splits`
  to compute per-core buffer sizes. Work division also decides whether
  adjacent ops have compatible core splits; incompatible splits trigger
  `core_div_mismatch` and disqualify shared buffers from LX (see
  [Current limitations](#current-limitations)).
- **Stickification must run first.** All buffers need `FixedTiledLayout`
  for device-memory size computation.

The pass is gated on `LX_PLANNING=1`. Disabled by default; experimental.

## Optimizations on softmax

The softmax example (`max → sub → exp → sum → div` over a `(512, 1024)`
input) is the easiest way to see what the planner does as each
optimization is added.

:::{figure} ../_static/images/lx/softmax-stages.svg
:alt: Four stages of LX optimization on softmax. Baseline, pin reductions, in-place, clone-input.
:width: 720px
:align: center

Each stage corresponds to one capability the planner gained. Boxes
coloured red touch HBM, green stays on LX, yellow is in-place reuse, and
blue is a clone inserted by the planner.
:::

**Stage 1, baseline.** No LX. Every op reads and writes HBM. For
`(M, N) = (512, 1024)` with a reduction along axis 0, total HBM I/O is
`8MN + 4N` bytes (eight full passes over the matrix plus four passes over
the reduction vector).

**Stage 2, pin reduction outputs to LX.** `max` and `sum` produce small
vectors (`1 × N`) that the next op reads immediately. Routing these
through LX instead of HBM costs almost no LX budget but eliminates
the `4N` term. On large `M × N` shapes this is a tiny relative win, but
it sets up the next two optimizations, which are large.

**Stage 3, in-place ops.** When a buffer is on LX and its last reader is
itself dying-after-this-op, the output of the next op can reuse the same
LX address. `exp` and `sub` are flagged
[OP_GOOD_FOR_LX_INPLACE](https://github.com/torch-spyre/torch-spyre/blob/main/torch_spyre/_inductor/scratchpad/utils.py)
and therefore in-placeable. After stage 3, the only HBM access left is
the graph input and graph output, for `3MN` bytes total, a 62% reduction.

**Stage 4, clone the input to LX.** The graph input is read by several
ops. Without a clone each reader would re-fetch from HBM.
`CloneInputNodesPass` detects multi-use inputs that fit in LX and inserts
a `clone` op at the front of the graph. The clone reads HBM once and
writes LX; every subsequent op reads from LX. After stage 4 total HBM is
`2MN`, the input read plus the output write, which is the theoretical
minimum for this graph.

Numbers from a 1000-iteration measurement (after 200 warm-up runs):

| Variant | dim | M×N | cores | LX | clone | in-place | median (µs) |
|---|---|---|---|---|---|---|---|
| baseline | 0 | 512×1024 | 1 | off | n/a | off | **32.51** |
| stage 2 | 0 | 512×1024 | 1 | on | n/a | off | 27.66 |
| stage 3 | 0 | 512×1024 | 1 | on | n/a | exp,sub | 23.93 |
| stage 4 | 0 | 512×1024 | 1 | on | yes | exp,sub | **23.67** |
| 4-core | 0 | 512×1024 | 4 | on | yes | exp,sub | 32.17 |

The 4-core run is *slower* on this small shape because communication and
work-distribution overhead dominate. Multi-core LX wins on larger
tensors, see below.

### Multi-core LX

A `(1024, 2048)` fp16 tensor is 4 MB, bigger than any single core's LX.
Splitting the rows over four cores gives each core a `(256, 2048)` slice
(~1 MB per core) that fits.

:::{figure} ../_static/images/lx/multicore-tiling.svg
:alt: Splitting a 4 MB tensor across four cores so each per-core slice fits in LX
:width: 580px
:align: center

For tensors larger than 2 MB, the same shape that overflows single-core
LX fits comfortably once it has been split across cores by work
distribution.
:::

Multi-core LX is not free. Adjacent ops can request different splits (one
sliced by rows, the next by columns), in which case the shared buffer is
stuck on HBM. That mismatch is what motivates co-optimization (below).

## Implementation

### Architecture

Scratchpad planning has three layers with separate concerns:

:::{figure} ../_static/images/lx/allocator-architecture.svg
:alt: ScratchpadAllocator delegates to a pluggable MemoryPlanSolver and runs optional graph passes around it
:width: 480px
:align: center

`DefaultAllocator` runs pre-passes (clone insertion), gathers
`LifetimeBoundBuffer`s, hands them to a pluggable solver, then writes the
chosen LX addresses onto buffer layouts. `StrategyBCoOptimizingAllocator`
extends this flow with a split-search step before the solver runs.
:::

The relevant code lives under `torch_spyre/_inductor/scratchpad/`:

| File | Responsibility |
|---|---|
| `passes.py` | `ScratchpadOptimizationPass` ABC, `CloneInputNodesPass` |
| `plan_solver.py` | `MemoryPlanSolver` ABC, `LifetimeBoundBuffer`, `GreedyLayoutSolver` |
| `firstfit_bestfit_solver.py` | `FirstFitLayoutSolver`, `BestFitLayoutSolver` |
| `allocator.py` | `ScratchpadAllocator` ABC, `DefaultAllocator`, `StrategyBCoOptimizingAllocator` |
| `utils.py` | liveness, in-place candidates, op eligibility lists |

### Entry point

```python
scratchpad_planning(graph, allocator=DefaultAllocator())
```

`DefaultAllocator` runs the following pipeline:

1. **Pre-passes.** `CloneInputNodesPass` walks graph inputs and inserts a
   `clone` for any HBM input that is read more than once *and* fits on
   LX. The clone output becomes a fresh LX-eligible buffer.
2. **Buffer analysis.** `_generate_buffers` produces a list of
   `LifetimeBoundBuffer(name, size, start_time, end_time, in_place_parents)`
   for every op that survives the eligibility filter (graph i/o is
   excluded; so are buffers whose users have incompatible core splits).
3. **Layout planning.** The solver assigns an `address` to each buffer
   it can fit; the rest get `address=None` and stay on HBM.
4. **Push allocation.** Successful placements are written to
   `layout.allocation["lx"] = addr` on each buffer's `FixedTiledLayout`.
5. **Post-passes.** Currently empty. Reserved for solver-driven graph
   mutations (output cloning, op re-ordering).

### Codegen integration

Once `layout.allocation["lx"]` is set:

- `spyre_kernel.py` removes LX-allocated buffers from kernel args
  (core-local, no HBM backing needed).
- `codegen/compute_ops.py` writes `component_` as `"lx"`, `memOrg_` as
  LX only, and `startAddressCoreCorelet_` as the baked-in LX address
  (the same address per core on their respective scratchpads).

## Solvers

`config.layout_solver` (`"greedy" | "firstfit" | "bestfit"`) picks the
solver.

### GreedyLayoutSolver (default)

Walks transition points in chronological order. At each point it
deallocates expired buffers, then for each newly-live buffer:

1. If a declared in-place parent is alive at the previous time step and
   the child fits in the parent's slot, reuse the parent's address.
2. Otherwise find a free block. Try address 0, then above the high-water
   mark, then gaps between live allocations.

It is simple, easy to reason about, and in-place reuse is automatic.
Decisions are local, though. Placing buffer A at address 0 can block a
later large buffer C that would have benefited from a low address.

### FirstFitLayoutSolver and BestFitLayoutSolver

Both solvers see *all* buffers up front, sort them topologically with
ties broken by ascending lifetime, and place them shortest-life-first
into the free address space.

For each buffer, free gaps during its lifetime are computed by
subtracting the address intervals of every overlapping placed buffer.
In-place parent addresses are kept as candidate gaps so the child can
land on top of them.

The two solvers differ only in the gap-selection policy:

- `FirstFitLayoutSolver` picks the first gap large enough.
- `BestFitLayoutSolver` picks the gap that leaves the smallest remainder
  after placement.

Both naturally avoid the "buffer at address 0 blocks everything else"
failure mode of the greedy solver. They are not yet selected by default.
Once a deeptools dependency clears, first-fit is the expected default.

## Co-optimization with work-distribution

Work division optimizes each op independently for parallelism. Adjacent
ops sharing a buffer can get different splits (different shapes mean
different optimal decompositions), which triggers `core_div_mismatch`
and disqualifies the shared buffer from LX even when it would have fit.

`StrategyBCoOptimizingAllocator` (gated by
`config.co_optimizing_lx_planning`, env var `CO_OPTIMIZING_LX_PLANNING=1`)
treats split choices and LX placement jointly:

:::{figure} ../_static/images/lx/co-optimization.svg
:alt: Co-optimization searches over alternative split assignments, scoring each by HBM bytes left unpinned
:width: 700px
:align: center

The co-optimizer enumerates split variants per op, scores each
combination by counting HBM bytes the solver could not pin, and commits
the winning assignment back before the standard allocator flow.
:::

The current POC (`v1`) only considers dim-flipping. When an op's seed
split has a single output-dim split factor, it generates variants that
move that factor onto each compatible alternative output dim.
Reduction-axis splits and multi-dim splits are skipped for now. The
search is bounded by `DEFAULT_VARIANT_CAP = 6` per op and uses DFS over
the cross-product with no early-stop pruning, so the search stays
compatible with future non-greedy solvers that can reach interior states.

The leaf-scoring function is intentionally cheap and solver-agnostic. It
runs the full `_generate_buffers + plan_layout` pass on the candidate
splits and counts the HBM bytes of every buffer the solver could not pin.

## Current limitations

### Greedy single-pass, no lookahead (default solver)

The greedy solver processes ops in topological order making irrevocable
placement decisions without considering future ops. First-fit and
best-fit mitigate this by sorting all buffers up front before placing.

### No defragmentation

`find_free_block` can locate holes between allocations but cannot
compact the address space. Allocate/deallocate cycles fragment LX.

### Co-optimization is a POC

`StrategyBCoOptimizingAllocator` implements the joint
work-division + LX planning idea, but the current variant generator only
flips a single output-dim split. Productionisation needs richer variant
generation (multi-dim splits, fewer-cores, reduction-axis), a
performance model that balances compute throughput against memory
traffic so we do not trade away compute parallelism for trivial LX wins,
and coverage when the `coarse_tiling` pass also drives split decisions.

### No cross-core ring utilization

The hardware has a data ring (core-to-core LX reads/writes) and a
reduce-sum ring (cross-core sum reduction, useful for matmul K-splits).
The compiler does not yet generate code that uses either ring. The
`core_div_mismatch` hard wall exists because without ring transfers, a
buffer split N ways in one op cannot be read by M cores in the next
(with M ≠ N). Ring support could remove this wall by redistributing
data across cores without going through HBM (the ring is always faster
than HBM). Enabling it requires compiler and codegen support to emit
ring transfer instructions in the SuperDSC schedule.

## Target patterns

The test suite `test_scratchpad_patterns.py` encodes patterns the greedy
allocator cannot handle (`@expectedFailure`). Each documents a class of
problem to be solved:

| Pattern | Problem | What's needed |
|---|---|---|
| Simple fragmentation | Greedy places A at addr 0, blocking later large allocation C | Placement aware of future deallocations |
| Staircase (up/down) | Increasing or decreasing buffer sizes overflow LX under greedy append | Lookahead and placement-order optimization |
| GQ attention | Large/small buffer lifecycle alternation (Q_K, scores vs. max, denominators) | Size-aware packing exploiting lifecycle patterns |
| MoE MLP | Many buffers of varying sizes and lifetimes, shared hidden state | Stack-like placement with complex lifetime management |

Best-fit and first-fit pass several patterns the greedy solver fails.
The remaining `@expectedFailure` cases motivate the items in
[Future work](#future-work).

## Future work

The items below are not in-tree. They sit on top of the
`MemoryPlanSolver` and `ScratchpadOptimizationPass` interfaces so they
can be plugged in without disturbing the rest of the planner.

### Non-greedy solvers

Two non-greedy solver families are being prototyped on top of the same
`MemoryPlanSolver` interface:

- **Simulated Annealing** (Imanishi-Xu) uses a first-fit or best-fit
  allocation as the initial guess, then perturbs the order to escape
  local minima.
- **Integer Linear Programming** via OR-Tools formulates placement as a
  2D bin-packing constraint and lets a general-purpose solver search
  exhaustively for graphs small enough to be tractable.

### Richer co-optimization

The current dim-flipping variant generator is a starting point. Planned
extensions:

- multi-dim splits, fewer-cores variants, reduction-axis splits;
- a performance model that balances compute throughput against memory
  traffic so we do not trade away parallelism for trivial LX wins;
- joint operation with the `coarse_tiling` pass when that pass also
  drives split decisions.

### Solver-driven graph mutations

`ScratchpadOptimizationPass` plug-ins run before or after the solver.
Candidates under evaluation:

- **Buffer evictions.** Move a buffer from LX to HBM and bring it back
  later. This is the counterpart of the "no eviction" assumption above
  and is only worthwhile when liveness shows it pays off.
- **Operation re-ordering.** Re-order independent ops to extend or
  shorten lifetimes for better packing.
- **Output node cloning.** Promote a producer to LX and clone to HBM
  only when an HBM-resident copy is required (draft PR
  [#2028](https://github.com/torch-spyre/torch-spyre/pull/2028)).
- **Driving cloning from the solver.** `CloneInputNodesPass` currently
  runs as a pre-pass with a heuristic. The longer-term plan is for the
  solver to decide which clones pay off based on the global layout.

### Cross-core ring transfers

Remove the `core_div_mismatch` hard wall by emitting data-ring or
reduce-sum-ring transfers in the SuperDSC schedule, so a buffer split N
ways in one op can feed a different M-way split in the next without
going through HBM. Requires compiler and codegen support.

### Non-terminal kernel hints

Extend the runtime to support a non-terminal kernel annotation. A
bundle marked non-terminal guarantees no context switch before the next
bundle, preserving LX state across the boundary. The compiler emits the
annotation based on cross-bundle LX liveness.

This buys real time on tightly coupled op sequences (for example,
softmax decomposed across bundles due to the 6-tensor limit). It needs
runtime scheduler support and compiler liveness tracking across bundle
boundaries.

## Testing

Three suites cover the planner:

- `tests/inductor/test_scratchpad_solver.py`: solver-level unit tests.
  Buffers are constructed directly as `LifetimeBoundBuffer` lists and
  fed to each solver.
- `tests/inductor/test_scratchpad_use.py`: end-to-end op-level checks
  that LX is actually used for representative graphs.
- `tests/inductor/test_scratchpad_patterns.py`: the `@expectedFailure`
  patterns above. Promoting one to passing is the typical signal that a
  new solver or pass is doing useful work.
- `tests/inductor/test_inductor_ops_lx_planning.py`: runs the full
  Inductor op suite under `LX_PLANNING=1` to catch regressions.

The auto-generated coverage suite tracked in the
[scratchpad workshop](https://github.com/torch-spyre/torch-spyre/issues/2062)
fans out per-op tests by composing each supported op with simple
reduction or pointwise tails. It surfaces scratchpad-planning bugs that
no hand-written test would catch.

## Related documents

- [`work_division_planning.md`](work_division_planning.md) describes how
  work distribution decides per-op core splits that scratchpad planning
  then consumes.
- [`coarse_tiling_loops.md`](coarse_tiling_loops.md) describes coarse
  tiling, which reduces working sets so adjacent ops can fit on LX in
  the first place.