Scratchpad (LX) optimization
Where LX scratchpad planning sits in torch-spyre today, and what we are working on next.
Quick navigation:
Hardware context
Each Spyre core has a 2 MB on-core scratchpad (LX) alongside shared HBM. LX reads are much cheaper than HBM and have no cross-core contention, so the planner aims to keep reused tensors on-core and let HBM traffic happen only at the graph boundary.
HBM is plentiful but slow and shared. LX is small but fast and core-local. The compiler picks which buffers live where.
Parameter |
Value |
Config |
|---|---|---|
Total LX per core |
2 MB |
fixed |
Backend-reserved fraction |
20% |
|
Usable LX per core |
~1.6 MB |
|
Alignment |
128-byte (stick) |
implicit |
Cores |
1 to 32 |
|
Per-core HBM span limit |
256 MB |
hardware, separate from LX |
Inter-core data ring |
yes |
not yet used by compiler |
Inter-core reduce-sum ring |
yes |
not yet used by compiler |
Why scratchpad planning matters
Spyre is often memory-bound: compute cores stall waiting on HBM. Every byte the compiler can keep on LX between producer and consumer is a byte the runtime never has to fetch.
Take a single-core softmax over a (512, 1024) fp16 tensor, 1 MB of input.
The lowered op sequence is max → sub → exp → sum → div. Total HBM traffic
depends on which intermediates land on LX:
Stage |
What changes |
HBM read+write |
Speedup vs baseline |
|---|---|---|---|
1. baseline (HBM only) |
every intermediate goes through HBM |
8MN + 4N |
1.0x |
2. pin reduction outputs to LX |
|
8MN |
~1.0x (reductions are tiny) |
3. + in-place ops on LX |
|
3MN |
~2.7x |
4. + clone the input to LX |
one pass over HBM, everything else stays |
2MN |
~4.0x |
The ideal memory time after stage 4 is roughly 25% of baseline. End-to-end measurements on this softmax kernel show the median runtime drop from 32.5 µs to 23.7 µs, a 27% reduction. The gap between the ideal and the measured result is fixed per-bundle overhead.
The four stages map onto code under torch_spyre/_inductor/scratchpad/:
LX-eligible op outputs (stage 2), in-place reuse (stage 3), and
CloneInputNodesPass (stage 4).
Assumptions
LX state survives kernel boundaries
The planner assumes LX state persists across SuperDSC bundle boundaries. It operates on the flat operations list before fusion and has no awareness of where bundle boundaries will fall, so allocation decisions can span multiple bundles.
There is a correctness gap under VF multi-tenancy: the runtime may wipe LX on context switch at any bundle boundary. Once SpyreCode with symbolic addresses is available, fusion will not be limited by the number of tensors used by the bundle, and bundle boundaries should only land at FallbackKernels, which are visible to the planner.
Working sets are already right-sized
Tile size selection (BLOCK_M, BLOCK_N, BLOCK_K, etc.) to fit operands within ~1.6 MB is a pre-Inductor concern, the same class of problem GPU autotuners solve. Spad opt begins after tiling. Given operations whose working sets are feasible, the planner decides which buffers to pin to LX, at what addresses, and for how long. Tiling determines whether data can fit; spad opt determines whether it does fit.
No eviction from LX
Buffers placed on LX stay until end-of-life. There is no mechanism to move a buffer to HBM and reload it later. This is deliberate. Eviction only wins when a buffer is read many times on LX, goes dormant, then is read many times again, which is rare in practice. Pre-Inductor tiling already keeps per-op working sets small. The remaining problem (which buffers to keep on LX when accumulated live buffers exceed capacity) is better solved by smarter placement and spill decisions at allocation time than by runtime eviction with its graph mutation complexity and extra HBM round-trips.
Pipeline position
Scratchpad planning runs as the last pass in CustomPreSchedulingPasses:
1. deadcode_elimination
2. propagate_spyre_tensor_layouts # assign FixedTiledLayout
3. optimize_restickify_locations
4. finalize_layouts
5. insert_restickify
6. span_reduction # work division pass 1 (mandatory)
7. work_distribution # work division pass 2 (optional)
8. if config.lx_planning:
scratchpad_planning # ← THIS PASS
Two ordering constraints fix this slot:
Work division must run first. Spad opt needs
op_it_space_splitsto compute per-core buffer sizes. Work division also decides whether adjacent ops have compatible core splits; incompatible splits triggercore_div_mismatchand disqualify shared buffers from LX (see Current limitations).Stickification must run first. All buffers need
FixedTiledLayoutfor device-memory size computation.
The pass is gated on LX_PLANNING=1. Disabled by default; experimental.
Optimizations on softmax
The softmax example (max → sub → exp → sum → div over a (512, 1024)
input) is the easiest way to see what the planner does as each
optimization is added.
Each stage corresponds to one capability the planner gained. Boxes coloured red touch HBM, green stays on LX, yellow is in-place reuse, and blue is a clone inserted by the planner.
Stage 1, baseline. No LX. Every op reads and writes HBM. For
(M, N) = (512, 1024) with a reduction along axis 0, total HBM I/O is
8MN + 4N bytes (eight full passes over the matrix plus four passes over
the reduction vector).
Stage 2, pin reduction outputs to LX. max and sum produce small
vectors (1 × N) that the next op reads immediately. Routing these
through LX instead of HBM costs almost no LX budget but eliminates
the 4N term. On large M × N shapes this is a tiny relative win, but
it sets up the next two optimizations, which are large.
Stage 3, in-place ops. When a buffer is on LX and its last reader is
itself dying-after-this-op, the output of the next op can reuse the same
LX address. exp and sub are flagged
OP_GOOD_FOR_LX_INPLACE
and therefore in-placeable. After stage 3, the only HBM access left is
the graph input and graph output, for 3MN bytes total, a 62% reduction.
Stage 4, clone the input to LX. The graph input is read by several
ops. Without a clone each reader would re-fetch from HBM.
CloneInputNodesPass detects multi-use inputs that fit in LX and inserts
a clone op at the front of the graph. The clone reads HBM once and
writes LX; every subsequent op reads from LX. After stage 4 total HBM is
2MN, the input read plus the output write, which is the theoretical
minimum for this graph.
Numbers from a 1000-iteration measurement (after 200 warm-up runs):
Variant |
dim |
M×N |
cores |
LX |
clone |
in-place |
median (µs) |
|---|---|---|---|---|---|---|---|
baseline |
0 |
512×1024 |
1 |
off |
n/a |
off |
32.51 |
stage 2 |
0 |
512×1024 |
1 |
on |
n/a |
off |
27.66 |
stage 3 |
0 |
512×1024 |
1 |
on |
n/a |
exp,sub |
23.93 |
stage 4 |
0 |
512×1024 |
1 |
on |
yes |
exp,sub |
23.67 |
4-core |
0 |
512×1024 |
4 |
on |
yes |
exp,sub |
32.17 |
The 4-core run is slower on this small shape because communication and work-distribution overhead dominate. Multi-core LX wins on larger tensors, see below.
Multi-core LX
A (1024, 2048) fp16 tensor is 4 MB, bigger than any single core’s LX.
Splitting the rows over four cores gives each core a (256, 2048) slice
(~1 MB per core) that fits.
For tensors larger than 2 MB, the same shape that overflows single-core LX fits comfortably once it has been split across cores by work distribution.
Multi-core LX is not free. Adjacent ops can request different splits (one sliced by rows, the next by columns), in which case the shared buffer is stuck on HBM. That mismatch is what motivates co-optimization (below).
Implementation
Architecture
Scratchpad planning has three layers with separate concerns:
DefaultAllocator runs pre-passes (clone insertion), gathers
LifetimeBoundBuffers, hands them to a pluggable solver, then writes the
chosen LX addresses onto buffer layouts. StrategyBCoOptimizingAllocator
extends this flow with a split-search step before the solver runs.
The relevant code lives under torch_spyre/_inductor/scratchpad/:
File |
Responsibility |
|---|---|
|
|
|
|
|
|
|
|
|
liveness, in-place candidates, op eligibility lists |
Entry point
scratchpad_planning(graph, allocator=DefaultAllocator())
DefaultAllocator runs the following pipeline:
Pre-passes.
CloneInputNodesPasswalks graph inputs and inserts aclonefor any HBM input that is read more than once and fits on LX. The clone output becomes a fresh LX-eligible buffer.Buffer analysis.
_generate_buffersproduces a list ofLifetimeBoundBuffer(name, size, start_time, end_time, in_place_parents)for every op that survives the eligibility filter (graph i/o is excluded; so are buffers whose users have incompatible core splits).Layout planning. The solver assigns an
addressto each buffer it can fit; the rest getaddress=Noneand stay on HBM.Push allocation. Successful placements are written to
layout.allocation["lx"] = addron each buffer’sFixedTiledLayout.Post-passes. Currently empty. Reserved for solver-driven graph mutations (output cloning, op re-ordering).
Codegen integration
Once layout.allocation["lx"] is set:
spyre_kernel.pyremoves LX-allocated buffers from kernel args (core-local, no HBM backing needed).codegen/compute_ops.pywritescomponent_as"lx",memOrg_as LX only, andstartAddressCoreCorelet_as the baked-in LX address (the same address per core on their respective scratchpads).
Solvers
config.layout_solver ("greedy" | "firstfit" | "bestfit") picks the
solver.
GreedyLayoutSolver (default)
Walks transition points in chronological order. At each point it deallocates expired buffers, then for each newly-live buffer:
If a declared in-place parent is alive at the previous time step and the child fits in the parent’s slot, reuse the parent’s address.
Otherwise find a free block. Try address 0, then above the high-water mark, then gaps between live allocations.
It is simple, easy to reason about, and in-place reuse is automatic. Decisions are local, though. Placing buffer A at address 0 can block a later large buffer C that would have benefited from a low address.
FirstFitLayoutSolver and BestFitLayoutSolver
Both solvers see all buffers up front, sort them topologically with ties broken by ascending lifetime, and place them shortest-life-first into the free address space.
For each buffer, free gaps during its lifetime are computed by subtracting the address intervals of every overlapping placed buffer. In-place parent addresses are kept as candidate gaps so the child can land on top of them.
The two solvers differ only in the gap-selection policy:
FirstFitLayoutSolverpicks the first gap large enough.BestFitLayoutSolverpicks the gap that leaves the smallest remainder after placement.
Both naturally avoid the “buffer at address 0 blocks everything else” failure mode of the greedy solver. They are not yet selected by default. Once a deeptools dependency clears, first-fit is the expected default.
Co-optimization with work-distribution
Work division optimizes each op independently for parallelism. Adjacent
ops sharing a buffer can get different splits (different shapes mean
different optimal decompositions), which triggers core_div_mismatch
and disqualifies the shared buffer from LX even when it would have fit.
StrategyBCoOptimizingAllocator (gated by
config.co_optimizing_lx_planning, env var CO_OPTIMIZING_LX_PLANNING=1)
treats split choices and LX placement jointly:
The co-optimizer enumerates split variants per op, scores each combination by counting HBM bytes the solver could not pin, and commits the winning assignment back before the standard allocator flow.
The current POC (v1) only considers dim-flipping. When an op’s seed
split has a single output-dim split factor, it generates variants that
move that factor onto each compatible alternative output dim.
Reduction-axis splits and multi-dim splits are skipped for now. The
search is bounded by DEFAULT_VARIANT_CAP = 6 per op and uses DFS over
the cross-product with no early-stop pruning, so the search stays
compatible with future non-greedy solvers that can reach interior states.
The leaf-scoring function is intentionally cheap and solver-agnostic. It
runs the full _generate_buffers + plan_layout pass on the candidate
splits and counts the HBM bytes of every buffer the solver could not pin.
Current limitations
Greedy single-pass, no lookahead (default solver)
The greedy solver processes ops in topological order making irrevocable placement decisions without considering future ops. First-fit and best-fit mitigate this by sorting all buffers up front before placing.
No defragmentation
find_free_block can locate holes between allocations but cannot
compact the address space. Allocate/deallocate cycles fragment LX.
Co-optimization is a POC
StrategyBCoOptimizingAllocator implements the joint
work-division + LX planning idea, but the current variant generator only
flips a single output-dim split. Productionisation needs richer variant
generation (multi-dim splits, fewer-cores, reduction-axis), a
performance model that balances compute throughput against memory
traffic so we do not trade away compute parallelism for trivial LX wins,
and coverage when the coarse_tiling pass also drives split decisions.
No cross-core ring utilization
The hardware has a data ring (core-to-core LX reads/writes) and a
reduce-sum ring (cross-core sum reduction, useful for matmul K-splits).
The compiler does not yet generate code that uses either ring. The
core_div_mismatch hard wall exists because without ring transfers, a
buffer split N ways in one op cannot be read by M cores in the next
(with M ≠ N). Ring support could remove this wall by redistributing
data across cores without going through HBM (the ring is always faster
than HBM). Enabling it requires compiler and codegen support to emit
ring transfer instructions in the SuperDSC schedule.
Target patterns
The test suite test_scratchpad_patterns.py encodes patterns the greedy
allocator cannot handle (@expectedFailure). Each documents a class of
problem to be solved:
Pattern |
Problem |
What’s needed |
|---|---|---|
Simple fragmentation |
Greedy places A at addr 0, blocking later large allocation C |
Placement aware of future deallocations |
Staircase (up/down) |
Increasing or decreasing buffer sizes overflow LX under greedy append |
Lookahead and placement-order optimization |
GQ attention |
Large/small buffer lifecycle alternation (Q_K, scores vs. max, denominators) |
Size-aware packing exploiting lifecycle patterns |
MoE MLP |
Many buffers of varying sizes and lifetimes, shared hidden state |
Stack-like placement with complex lifetime management |
Best-fit and first-fit pass several patterns the greedy solver fails.
The remaining @expectedFailure cases motivate the items in
Future work.
Future work
The items below are not in-tree. They sit on top of the
MemoryPlanSolver and ScratchpadOptimizationPass interfaces so they
can be plugged in without disturbing the rest of the planner.
Non-greedy solvers
Two non-greedy solver families are being prototyped on top of the same
MemoryPlanSolver interface:
Simulated Annealing (Imanishi-Xu) uses a first-fit or best-fit allocation as the initial guess, then perturbs the order to escape local minima.
Integer Linear Programming via OR-Tools formulates placement as a 2D bin-packing constraint and lets a general-purpose solver search exhaustively for graphs small enough to be tractable.
Richer co-optimization
The current dim-flipping variant generator is a starting point. Planned extensions:
multi-dim splits, fewer-cores variants, reduction-axis splits;
a performance model that balances compute throughput against memory traffic so we do not trade away parallelism for trivial LX wins;
joint operation with the
coarse_tilingpass when that pass also drives split decisions.
Solver-driven graph mutations
ScratchpadOptimizationPass plug-ins run before or after the solver.
Candidates under evaluation:
Buffer evictions. Move a buffer from LX to HBM and bring it back later. This is the counterpart of the “no eviction” assumption above and is only worthwhile when liveness shows it pays off.
Operation re-ordering. Re-order independent ops to extend or shorten lifetimes for better packing.
Output node cloning. Promote a producer to LX and clone to HBM only when an HBM-resident copy is required (draft PR #2028).
Driving cloning from the solver.
CloneInputNodesPasscurrently runs as a pre-pass with a heuristic. The longer-term plan is for the solver to decide which clones pay off based on the global layout.
Cross-core ring transfers
Remove the core_div_mismatch hard wall by emitting data-ring or
reduce-sum-ring transfers in the SuperDSC schedule, so a buffer split N
ways in one op can feed a different M-way split in the next without
going through HBM. Requires compiler and codegen support.
Non-terminal kernel hints
Extend the runtime to support a non-terminal kernel annotation. A bundle marked non-terminal guarantees no context switch before the next bundle, preserving LX state across the boundary. The compiler emits the annotation based on cross-bundle LX liveness.
This buys real time on tightly coupled op sequences (for example, softmax decomposed across bundles due to the 6-tensor limit). It needs runtime scheduler support and compiler liveness tracking across bundle boundaries.
Testing
Three suites cover the planner:
tests/inductor/test_scratchpad_solver.py: solver-level unit tests. Buffers are constructed directly asLifetimeBoundBufferlists and fed to each solver.tests/inductor/test_scratchpad_use.py: end-to-end op-level checks that LX is actually used for representative graphs.tests/inductor/test_scratchpad_patterns.py: the@expectedFailurepatterns above. Promoting one to passing is the typical signal that a new solver or pass is doing useful work.tests/inductor/test_inductor_ops_lx_planning.py: runs the full Inductor op suite underLX_PLANNING=1to catch regressions.
The auto-generated coverage suite tracked in the scratchpad workshop fans out per-op tests by composing each supported op with simple reduction or pointwise tails. It surfaces scratchpad-planning bugs that no hand-written test would catch.