Performance Analysis Methodology
Stack: torch-spyre (new, Inductor-based).
Stub
This page is a scaffold. Methodology examples — bottleneck classification, kernel drill-down, category breakdowns, multi-rank analysis — will land here as real new-stack traces become available and are validated against RFC 0601 tooling. Contributions welcome.
The high-value pattern today is capturing a time-bounded
torch.profiler trace alongside aiu-smi telemetry and reading them
together.
1. Bound the measured region
Use torch.profiler’s schedule + record_function to avoid
measuring compile/warmup cost and to make iterations easy to select
in the viewer:
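from torch.profiler import profile, ProfilerActivity, schedule, record_function

# `model` and `inputs` are placeholders for your own workload.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.PrivateUse1],
    # Skip 1 step, warm up for 2, record 5, then stop after one cycle.
    schedule=schedule(wait=1, warmup=2, active=5, repeat=1),
) as prof:
    for step in range(10):
        # Label each iteration so it is easy to find in the viewer.
        with record_function(f"iteration_{step}"):
            output = model(inputs)
        # Advance the schedule once per iteration.
        prof.step()

prof.export_chrome_trace("spyre_trace.json")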
See the upstream PyTorch profiler documentation
for the full schedule / record_function API.
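To see which loop iterations a given schedule records, the following is a minimal pure-Python model of the wait / warmup / active cycle as the upstream documentation describes it. It is a simplified sketch, not the torch.profiler implementation (it ignores skip_first, for example), and phase_for_step is a hypothetical helper name used only for illustration.

```python
def phase_for_step(step, wait, warmup, active, repeat):
    """Return the profiler phase for a 0-based step (simplified model)."""
    cycle = wait + warmup + active
    # With repeat > 0, profiling stops after `repeat` full cycles.
    if repeat > 0 and step >= cycle * repeat:
        return "idle"
    pos = step % cycle
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"
    return "active"


# Same settings as the example schedule, over a 10-step loop.
phases = [phase_for_step(s, wait=1, warmup=2, active=5, repeat=1) for s in range(10)]
recorded = [s for s, p in enumerate(phases) if p == "active"]
print(recorded)  # steps whose iteration_N region should appear in the trace
```

Under this model, steps 3 through 7 are the recorded ones, so the loop needs at least 8 steps to cover the full active window.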
2. Pair the trace with aiu-smi
Run aiu-smi in a second shell during the profiling window (see
Device monitoring). Both tools record wall-clock
timestamps, so you can line up a region of the trace with the
corresponding sample lines.
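The alignment step can be sketched as follows. This is an illustration under stated assumptions, not part of any tooling: it assumes the exported trace is a JSON object with a traceEvents list whose complete events ("ph": "X") carry ts and dur in microseconds on the same wall clock as the telemetry, which you should verify for your PyTorch build. It makes no assumption about the aiu-smi output format; samples is a hypothetical, already-parsed list of (timestamp in seconds, text) pairs, and region_window and samples_in_window are hypothetical helper names.

```python
import json


def region_window(trace, name):
    """Return (start_s, end_s) of the first complete event called `name`."""
    for ev in trace["traceEvents"]:
        # "X" marks a complete event carrying both ts and dur (microseconds).
        if ev.get("name") == name and ev.get("ph") == "X":
            start = ev["ts"] / 1e6
            return start, start + ev["dur"] / 1e6
    raise KeyError(name)


def samples_in_window(samples, start_s, end_s):
    """Keep the (timestamp_s, text) samples that fall inside the window."""
    return [(t, text) for t, text in samples if start_s <= t <= end_s]


# Tiny synthetic stand-ins; a real run would load spyre_trace.json with
# json.load and parse the telemetry lines into (timestamp_s, text) pairs.
trace = json.loads(
    '{"traceEvents": [{"name": "iteration_3", "ph": "X", "ts": 2000000, "dur": 500000}]}'
)
samples = [(1.9, "sample a"), (2.1, "sample b"), (2.4, "sample c"), (2.6, "sample d")]

start_s, end_s = region_window(trace, "iteration_3")
matched = samples_in_window(samples, start_s, end_s)
print(matched)
```

Selecting by the record_function label keeps the comparison tied to a single iteration rather than to the whole profiling window.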
Which aiu-smi columns to look at depends on the question you’re
asking — consult aiu-smi --help for the current column set. Note
that on the current new-stack build rsvmem and pt_act are not
captured correctly.
For post-processing the captured trace (additional statistics, trace
enrichment), see aiu-trace-analyzer
(public repository).
3. Filing a performance report
When opening an issue, include:
[ ] Minimal reproducer script and iteration count
[ ] PyTorch version and torch-spyre commit SHA
[ ] aiu-smi output covering at least one full active iteration
[ ] spyre_trace.json or the TensorBoard log directory
[ ] Summary table printed by prof.key_averages().table(...)
[ ] What you expected vs. what you saw (latency or throughput)
[ ] For a performance regression, cite the previous metric — the numeric value, the build date or commit SHA it was measured on, and the workload type — so the regression window is unambiguous.
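For the regression item, one way to state the previous and current metric unambiguously is a single line with both values and the signed percentage change. The helper below is a hypothetical sketch (regression_summary is not part of torch-spyre or any existing tool), and the numbers are illustrative, not measurements.

```python
def regression_summary(previous, current, unit, higher_is_better):
    """Describe the change from `previous` to `current` as a signed percentage."""
    change_pct = (current - previous) / previous * 100.0
    # For latency a rise is worse; for throughput a drop is worse.
    worse = change_pct < 0 if higher_is_better else change_pct > 0
    verdict = "regression" if worse else "no regression"
    return f"{previous:g} {unit} -> {current:g} {unit} ({change_pct:+.1f}%, {verdict})"


# Illustrative latency numbers only.
line = regression_summary(12.0, 15.0, "ms/iter", higher_is_better=False)
print(line)
```

Pair the resulting line with the build date or commit SHA of each measurement and the workload type, as the checklist asks.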
See also
PyTorch Profiler — generating traces
Device monitoring — aiu-smi telemetry
Trace analysis — viewer mechanics
RFC 0601 — planned toolkit