Scratchpad (LX) optimization

Where LX scratchpad planning sits in torch-spyre today, and what we are working on next.

Quick navigation:

Hardware context

Each Spyre core has a 2 MB on-core scratchpad (LX) alongside shared HBM. LX reads are much cheaper than HBM and have no cross-core contention, so the planner aims to keep reused tensors on-core and let HBM traffic happen only at the graph boundary.

Spyre memory hierarchy. Large slow HBM shared by 32 cores, each with a 2 MB LX scratchpad.

HBM is plentiful but slow and shared. LX is small but fast and core-local. The compiler picks which buffers live where.

Parameter

Value

Config

Total LX per core

2 MB

fixed

Backend-reserved fraction

20%

DXP_LX_FRAC_AVAIL

Usable LX per core

~1.6 MB

int((2<<20) * (1 - frac_avail))

Alignment

128-byte (stick)

implicit

Cores

1 to 32

SENCORES

Per-core HBM span limit

256 MB

hardware, separate from LX

Inter-core data ring

yes

not yet used by compiler

Inter-core reduce-sum ring

yes

not yet used by compiler

Why scratchpad planning matters

Spyre is often memory-bound: compute cores stall waiting on HBM. Every byte the compiler can keep on LX between producer and consumer is a byte the runtime never has to fetch.

Take a single-core softmax over a (512, 1024) fp16 tensor, 1 MB of input. The lowered op sequence is max sub exp sum div. Total HBM traffic depends on which intermediates land on LX:

Stage

What changes

HBM read+write

Speedup vs baseline

1. baseline (HBM only)

every intermediate goes through HBM

8MN + 4N

1.0x

2. pin reduction outputs to LX

max and sum outputs stay on-core

8MN

~1.0x (reductions are tiny)

3. + in-place ops on LX

exp, sub reuse their input’s address

3MN

~2.7x

4. + clone the input to LX

one pass over HBM, everything else stays

2MN

~4.0x

The ideal memory time after stage 4 is roughly 25% of baseline. End-to-end measurements on this softmax kernel show the median runtime drop from 32.5 µs to 23.7 µs, a 27% reduction. The gap between the ideal and the measured result is fixed per-bundle overhead.

The four stages map onto code under torch_spyre/_inductor/scratchpad/: LX-eligible op outputs (stage 2), in-place reuse (stage 3), and CloneInputNodesPass (stage 4).

Assumptions

LX state survives kernel boundaries

The planner assumes LX state persists across SuperDSC bundle boundaries. It operates on the flat operations list before fusion and has no awareness of where bundle boundaries will fall, so allocation decisions can span multiple bundles.

There is a correctness gap under VF multi-tenancy: the runtime may wipe LX on context switch at any bundle boundary. Once SpyreCode with symbolic addresses is available, fusion will not be limited by the number of tensors used by the bundle, and bundle boundaries should only land at FallbackKernels, which are visible to the planner.

Working sets are already right-sized

Tile size selection (BLOCK_M, BLOCK_N, BLOCK_K, etc.) to fit operands within ~1.6 MB is a pre-Inductor concern, the same class of problem GPU autotuners solve. Spad opt begins after tiling. Given operations whose working sets are feasible, the planner decides which buffers to pin to LX, at what addresses, and for how long. Tiling determines whether data can fit; spad opt determines whether it does fit.

No eviction from LX

Buffers placed on LX stay until end-of-life. There is no mechanism to move a buffer to HBM and reload it later. This is deliberate. Eviction only wins when a buffer is read many times on LX, goes dormant, then is read many times again, which is rare in practice. Pre-Inductor tiling already keeps per-op working sets small. The remaining problem (which buffers to keep on LX when accumulated live buffers exceed capacity) is better solved by smarter placement and spill decisions at allocation time than by runtime eviction with its graph mutation complexity and extra HBM round-trips.

Pipeline position

Scratchpad planning runs as the last pass in CustomPreSchedulingPasses:

1. deadcode_elimination
2. propagate_spyre_tensor_layouts       # assign FixedTiledLayout
3. optimize_restickify_locations
4. finalize_layouts
5. insert_restickify
6. span_reduction                       # work division pass 1 (mandatory)
7. work_distribution                    # work division pass 2 (optional)
8. if config.lx_planning:
       scratchpad_planning              # ← THIS PASS

Two ordering constraints fix this slot:

  • Work division must run first. Spad opt needs op_it_space_splits to compute per-core buffer sizes. Work division also decides whether adjacent ops have compatible core splits; incompatible splits trigger core_div_mismatch and disqualify shared buffers from LX (see Current limitations).

  • Stickification must run first. All buffers need FixedTiledLayout for device-memory size computation.

The pass is gated on LX_PLANNING=1. Disabled by default; experimental.

Optimizations on softmax

The softmax example (max sub exp sum div over a (512, 1024) input) is the easiest way to see what the planner does as each optimization is added.

Four stages of LX optimization on softmax. Baseline, pin reductions, in-place, clone-input.

Each stage corresponds to one capability the planner gained. Boxes coloured red touch HBM, green stays on LX, yellow is in-place reuse, and blue is a clone inserted by the planner.

Stage 1, baseline. No LX. Every op reads and writes HBM. For (M, N) = (512, 1024) with a reduction along axis 0, total HBM I/O is 8MN + 4N bytes (eight full passes over the matrix plus four passes over the reduction vector).

Stage 2, pin reduction outputs to LX. max and sum produce small vectors (1 × N) that the next op reads immediately. Routing these through LX instead of HBM costs almost no LX budget but eliminates the 4N term. On large M × N shapes this is a tiny relative win, but it sets up the next two optimizations, which are large.

Stage 3, in-place ops. When a buffer is on LX and its last reader is itself dying-after-this-op, the output of the next op can reuse the same LX address. exp and sub are flagged OP_GOOD_FOR_LX_INPLACE and therefore in-placeable. After stage 3, the only HBM access left is the graph input and graph output, for 3MN bytes total, a 62% reduction.

Stage 4, clone the input to LX. The graph input is read by several ops. Without a clone each reader would re-fetch from HBM. CloneInputNodesPass detects multi-use inputs that fit in LX and inserts a clone op at the front of the graph. The clone reads HBM once and writes LX; every subsequent op reads from LX. After stage 4 total HBM is 2MN, the input read plus the output write, which is the theoretical minimum for this graph.

Numbers from a 1000-iteration measurement (after 200 warm-up runs):

Variant

dim

M×N

cores

LX

clone

in-place

median (µs)

baseline

0

512×1024

1

off

n/a

off

32.51

stage 2

0

512×1024

1

on

n/a

off

27.66

stage 3

0

512×1024

1

on

n/a

exp,sub

23.93

stage 4

0

512×1024

1

on

yes

exp,sub

23.67

4-core

0

512×1024

4

on

yes

exp,sub

32.17

The 4-core run is slower on this small shape because communication and work-distribution overhead dominate. Multi-core LX wins on larger tensors, see below.

Multi-core LX

A (1024, 2048) fp16 tensor is 4 MB, bigger than any single core’s LX. Splitting the rows over four cores gives each core a (256, 2048) slice (~1 MB per core) that fits.

Splitting a 4 MB tensor across four cores so each per-core slice fits in LX

For tensors larger than 2 MB, the same shape that overflows single-core LX fits comfortably once it has been split across cores by work distribution.

Multi-core LX is not free. Adjacent ops can request different splits (one sliced by rows, the next by columns), in which case the shared buffer is stuck on HBM. That mismatch is what motivates co-optimization (below).

Implementation

Architecture

Scratchpad planning has three layers with separate concerns:

ScratchpadAllocator delegates to a pluggable MemoryPlanSolver and runs optional graph passes around it

DefaultAllocator runs pre-passes (clone insertion), gathers LifetimeBoundBuffers, hands them to a pluggable solver, then writes the chosen LX addresses onto buffer layouts. StrategyBCoOptimizingAllocator extends this flow with a split-search step before the solver runs.

The relevant code lives under torch_spyre/_inductor/scratchpad/:

File

Responsibility

passes.py

ScratchpadOptimizationPass ABC, CloneInputNodesPass

plan_solver.py

MemoryPlanSolver ABC, LifetimeBoundBuffer, GreedyLayoutSolver

firstfit_bestfit_solver.py

FirstFitLayoutSolver, BestFitLayoutSolver

allocator.py

ScratchpadAllocator ABC, DefaultAllocator, StrategyBCoOptimizingAllocator

utils.py

liveness, in-place candidates, op eligibility lists

Entry point

scratchpad_planning(graph, allocator=DefaultAllocator())

DefaultAllocator runs the following pipeline:

  1. Pre-passes. CloneInputNodesPass walks graph inputs and inserts a clone for any HBM input that is read more than once and fits on LX. The clone output becomes a fresh LX-eligible buffer.

  2. Buffer analysis. _generate_buffers produces a list of LifetimeBoundBuffer(name, size, start_time, end_time, in_place_parents) for every op that survives the eligibility filter (graph i/o is excluded; so are buffers whose users have incompatible core splits).

  3. Layout planning. The solver assigns an address to each buffer it can fit; the rest get address=None and stay on HBM.

  4. Push allocation. Successful placements are written to layout.allocation["lx"] = addr on each buffer’s FixedTiledLayout.

  5. Post-passes. Currently empty. Reserved for solver-driven graph mutations (output cloning, op re-ordering).

Codegen integration

Once layout.allocation["lx"] is set:

  • spyre_kernel.py removes LX-allocated buffers from kernel args (core-local, no HBM backing needed).

  • codegen/compute_ops.py writes component_ as "lx", memOrg_ as LX only, and startAddressCoreCorelet_ as the baked-in LX address (the same address per core on their respective scratchpads).

Solvers

config.layout_solver ("greedy" | "firstfit" | "bestfit") picks the solver.

GreedyLayoutSolver (default)

Walks transition points in chronological order. At each point it deallocates expired buffers, then for each newly-live buffer:

  1. If a declared in-place parent is alive at the previous time step and the child fits in the parent’s slot, reuse the parent’s address.

  2. Otherwise find a free block. Try address 0, then above the high-water mark, then gaps between live allocations.

It is simple, easy to reason about, and in-place reuse is automatic. Decisions are local, though. Placing buffer A at address 0 can block a later large buffer C that would have benefited from a low address.

FirstFitLayoutSolver and BestFitLayoutSolver

Both solvers see all buffers up front, sort them topologically with ties broken by ascending lifetime, and place them shortest-life-first into the free address space.

For each buffer, free gaps during its lifetime are computed by subtracting the address intervals of every overlapping placed buffer. In-place parent addresses are kept as candidate gaps so the child can land on top of them.

The two solvers differ only in the gap-selection policy:

  • FirstFitLayoutSolver picks the first gap large enough.

  • BestFitLayoutSolver picks the gap that leaves the smallest remainder after placement.

Both naturally avoid the “buffer at address 0 blocks everything else” failure mode of the greedy solver. They are not yet selected by default. Once a deeptools dependency clears, first-fit is the expected default.

Co-optimization with work-distribution

Work division optimizes each op independently for parallelism. Adjacent ops sharing a buffer can get different splits (different shapes mean different optimal decompositions), which triggers core_div_mismatch and disqualifies the shared buffer from LX even when it would have fit.

StrategyBCoOptimizingAllocator (gated by config.co_optimizing_lx_planning, env var CO_OPTIMIZING_LX_PLANNING=1) treats split choices and LX placement jointly:

Co-optimization searches over alternative split assignments, scoring each by HBM bytes left unpinned

The co-optimizer enumerates split variants per op, scores each combination by counting HBM bytes the solver could not pin, and commits the winning assignment back before the standard allocator flow.

The current POC (v1) only considers dim-flipping. When an op’s seed split has a single output-dim split factor, it generates variants that move that factor onto each compatible alternative output dim. Reduction-axis splits and multi-dim splits are skipped for now. The search is bounded by DEFAULT_VARIANT_CAP = 6 per op and uses DFS over the cross-product with no early-stop pruning, so the search stays compatible with future non-greedy solvers that can reach interior states.

The leaf-scoring function is intentionally cheap and solver-agnostic. It runs the full _generate_buffers + plan_layout pass on the candidate splits and counts the HBM bytes of every buffer the solver could not pin.

Current limitations

Greedy single-pass, no lookahead (default solver)

The greedy solver processes ops in topological order making irrevocable placement decisions without considering future ops. First-fit and best-fit mitigate this by sorting all buffers up front before placing.

No defragmentation

find_free_block can locate holes between allocations but cannot compact the address space. Allocate/deallocate cycles fragment LX.

Co-optimization is a POC

StrategyBCoOptimizingAllocator implements the joint work-division + LX planning idea, but the current variant generator only flips a single output-dim split. Productionisation needs richer variant generation (multi-dim splits, fewer-cores, reduction-axis), a performance model that balances compute throughput against memory traffic so we do not trade away compute parallelism for trivial LX wins, and coverage when the coarse_tiling pass also drives split decisions.

No cross-core ring utilization

The hardware has a data ring (core-to-core LX reads/writes) and a reduce-sum ring (cross-core sum reduction, useful for matmul K-splits). The compiler does not yet generate code that uses either ring. The core_div_mismatch hard wall exists because without ring transfers, a buffer split N ways in one op cannot be read by M cores in the next (with M ≠ N). Ring support could remove this wall by redistributing data across cores without going through HBM (the ring is always faster than HBM). Enabling it requires compiler and codegen support to emit ring transfer instructions in the SuperDSC schedule.

Target patterns

The test suite test_scratchpad_patterns.py encodes patterns the greedy allocator cannot handle (@expectedFailure). Each documents a class of problem to be solved:

Pattern

Problem

What’s needed

Simple fragmentation

Greedy places A at addr 0, blocking later large allocation C

Placement aware of future deallocations

Staircase (up/down)

Increasing or decreasing buffer sizes overflow LX under greedy append

Lookahead and placement-order optimization

GQ attention

Large/small buffer lifecycle alternation (Q_K, scores vs. max, denominators)

Size-aware packing exploiting lifecycle patterns

MoE MLP

Many buffers of varying sizes and lifetimes, shared hidden state

Stack-like placement with complex lifetime management

Best-fit and first-fit pass several patterns the greedy solver fails. The remaining @expectedFailure cases motivate the items in Future work.

Future work

The items below are not in-tree. They sit on top of the MemoryPlanSolver and ScratchpadOptimizationPass interfaces so they can be plugged in without disturbing the rest of the planner.

Non-greedy solvers

Two non-greedy solver families are being prototyped on top of the same MemoryPlanSolver interface:

  • Simulated Annealing (Imanishi-Xu) uses a first-fit or best-fit allocation as the initial guess, then perturbs the order to escape local minima.

  • Integer Linear Programming via OR-Tools formulates placement as a 2D bin-packing constraint and lets a general-purpose solver search exhaustively for graphs small enough to be tractable.

Richer co-optimization

The current dim-flipping variant generator is a starting point. Planned extensions:

  • multi-dim splits, fewer-cores variants, reduction-axis splits;

  • a performance model that balances compute throughput against memory traffic so we do not trade away parallelism for trivial LX wins;

  • joint operation with the coarse_tiling pass when that pass also drives split decisions.

Solver-driven graph mutations

ScratchpadOptimizationPass plug-ins run before or after the solver. Candidates under evaluation:

  • Buffer evictions. Move a buffer from LX to HBM and bring it back later. This is the counterpart of the “no eviction” assumption above and is only worthwhile when liveness shows it pays off.

  • Operation re-ordering. Re-order independent ops to extend or shorten lifetimes for better packing.

  • Output node cloning. Promote a producer to LX and clone to HBM only when an HBM-resident copy is required (draft PR #2028).

  • Driving cloning from the solver. CloneInputNodesPass currently runs as a pre-pass with a heuristic. The longer-term plan is for the solver to decide which clones pay off based on the global layout.

Cross-core ring transfers

Remove the core_div_mismatch hard wall by emitting data-ring or reduce-sum-ring transfers in the SuperDSC schedule, so a buffer split N ways in one op can feed a different M-way split in the next without going through HBM. Requires compiler and codegen support.

Non-terminal kernel hints

Extend the runtime to support a non-terminal kernel annotation. A bundle marked non-terminal guarantees no context switch before the next bundle, preserving LX state across the boundary. The compiler emits the annotation based on cross-bundle LX liveness.

This buys real time on tightly coupled op sequences (for example, softmax decomposed across bundles due to the 6-tensor limit). It needs runtime scheduler support and compiler liveness tracking across bundle boundaries.

Testing

Three suites cover the planner:

  • tests/inductor/test_scratchpad_solver.py: solver-level unit tests. Buffers are constructed directly as LifetimeBoundBuffer lists and fed to each solver.

  • tests/inductor/test_scratchpad_use.py: end-to-end op-level checks that LX is actually used for representative graphs.

  • tests/inductor/test_scratchpad_patterns.py: the @expectedFailure patterns above. Promoting one to passing is the typical signal that a new solver or pass is doing useful work.

  • tests/inductor/test_inductor_ops_lx_planning.py: runs the full Inductor op suite under LX_PLANNING=1 to catch regressions.

The auto-generated coverage suite tracked in the scratchpad workshop fans out per-op tests by composing each supported op with simple reduction or pointwise tails. It surfaces scratchpad-planning bugs that no hand-written test would catch.