# Performance Analysis Methodology

**Stack:** torch-spyre (new, Inductor-based).

:::{admonition} Stub
:class: warning

This page is a scaffold. Methodology examples — bottleneck
classification, kernel drill-down, category breakdowns, multi-rank
analysis — will land here as real new-stack traces become available
and are validated against [RFC 0601][rfc-0601] tooling. Contributions
welcome.
:::

The high-value pattern today is capturing a time-bounded
`torch.profiler` trace alongside `aiu-smi` telemetry and reading them
together.

## 1. Bound the measured region

Use `torch.profiler`'s `schedule` + `record_function` to avoid
measuring compile/warmup cost and to make iterations easy to select
in the viewer:

```python
from torch.profiler import profile, ProfilerActivity, schedule, record_function

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.PrivateUse1],
    schedule=schedule(wait=1, warmup=2, active=5, repeat=1),
) as prof:
    for step in range(10):
        with record_function(f"iteration_{step}"):
            output = model(inputs)
        prof.step()

prof.export_chrome_trace("spyre_trace.json")
```

See the upstream [PyTorch profiler documentation][torch-profiler-docs]
for the full `schedule` / `record_function` API.

## 2. Pair the trace with `aiu-smi`

Run `aiu-smi` in a second shell during the profiling window (see
[Device monitoring](device_monitoring.md)). Both timestamps are
wall-clock, so you can line up a region of the trace with the
corresponding sample lines.

Which `aiu-smi` columns to look at depends on the question you're
asking — consult `aiu-smi --help` for the current column set. Note
that on the current new-stack build `rsvmem` and `pt_act` are not
captured correctly.

For post-processing the captured trace (additional statistics, trace
enrichment), see [`aiu-trace-analyzer`](trace_analysis.md#aiu-trace-analyzer)
([public repository][ata]).

## 3. Filing a performance report

When opening an issue, include:

- [ ] Minimal reproducer script and iteration count
- [ ] PyTorch version and torch-spyre commit SHA
- [ ] `aiu-smi` output covering at least one full active iteration
- [ ] `spyre_trace.json` or the TensorBoard log directory
- [ ] Summary table printed by `prof.key_averages().table(...)`
- [ ] What you expected vs. what you saw (latency or throughput)
- [ ] **For a performance regression**, cite the previous metric — the
  numeric value, the build date or commit SHA it was measured on, and
  the workload type — so the regression window is unambiguous.

## See also

- [PyTorch Profiler](pytorch_profiler.md) — generating traces
- [Device monitoring](device_monitoring.md) — `aiu-smi` telemetry
- [Trace analysis](trace_analysis.md) — viewer mechanics
- [RFC 0601][rfc-0601] — planned toolkit

[rfc-0601]: https://github.com/torch-spyre/rfcs/blob/main/0601-SpyreProfilingToolkit/0601-SpyreProfilingToolkitRFC.md
[torch-profiler-docs]: https://pytorch.org/docs/stable/profiler.html
[ata]: https://github.com/IBM/aiu-trace-analyzer