Performance Analysis Methodology

Stack: torch-spyre (new, Inductor-based).

Stub

This page is a scaffold. Methodology examples — bottleneck classification, kernel drill-down, category breakdowns, multi-rank analysis — will land here as real new-stack traces become available and are validated against RFC 0601 tooling. Contributions welcome.

The high-value pattern today is capturing a time-bounded torch.profiler trace alongside aiu-smi telemetry and reading them together.

1. Bound the measured region

Use torch.profiler’s schedule + record_function to avoid measuring compile/warmup cost and to make iterations easy to select in the viewer:

from torch.profiler import profile, ProfilerActivity, schedule, record_function

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.PrivateUse1],
    schedule=schedule(wait=1, warmup=2, active=5, repeat=1),
) as prof:
    for step in range(10):
        with record_function(f"iteration_{step}"):
            output = model(inputs)
        prof.step()

prof.export_chrome_trace("spyre_trace.json")

See the upstream PyTorch profiler documentation for the full schedule / record_function API.

2. Pair the trace with aiu-smi

Run aiu-smi in a second shell during the profiling window (see Device monitoring). Both timestamps are wall-clock, so you can line up a region of the trace with the corresponding sample lines.

Which aiu-smi columns to look at depends on the question you’re asking — consult aiu-smi --help for the current column set. Note that on the current new-stack build rsvmem and pt_act are not captured correctly.

For post-processing the captured trace (additional statistics, trace enrichment), see aiu-trace-analyzer (public repository).

3. Filing a performance report

When opening an issue, include:

  • [ ] Minimal reproducer script and iteration count

  • [ ] PyTorch version and torch-spyre commit SHA

  • [ ] aiu-smi output covering at least one full active iteration

  • [ ] spyre_trace.json or the TensorBoard log directory

  • [ ] Summary table printed by prof.key_averages().table(...)

  • [ ] What you expected vs. what you saw (latency or throughput)

  • [ ] For a performance regression, cite the previous metric — the numeric value, the build date or commit SHA it was measured on, and the workload type — so the regression window is unambiguous.

See also