PyTorch Profiler on Spyre

Stack: torch-spyre (new, Inductor-based).

torch.profiler.profile is the entry point for per-op timing on Spyre. Two modes are available:

CPU-only — no extra install; measures host-side Python and torch.compile activity.
CPU + PrivateUse1 — measures CPU and Spyre-side kernel activity; requires the kineto-spyre PyTorch wheel.

CPU-only (no extra install)

import torch
from torch.profiler import profile, ProfilerActivity

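# `model` is an existing torch.nn.Module; `x_spyre` is an input tensor already on the Spyre device.
compiled = torch.compile(model, backend="spyre")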

with profile(activities=[ProfilerActivity.CPU]) as prof:
    output = compiled(x_spyre)

print(prof.key_averages().table(sort_by="cpu_time_total"))

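This captures host-side wall-clock time for every ATen call and every Dynamo and Inductor stage. Because torch.compile compiles lazily, the first profiled call also includes compilation time.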

CPU + PrivateUse1

Install a matching kineto-spyre wheel for your PyTorch version (check the releases page for the current combination). Example URL for PyTorch 2.10.0:

uv pip install --no-deps --force-reinstall \
  https://github.com/IBM/kineto-spyre/releases/download/torch-2.10.0.aiu.kineto.1.1.1/torch-2.10.0+aiu.kineto.1.1.1-cp312-cp312-linux_x86_64.whl
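After installing, it can help to confirm that the environment actually picked up the kineto-spyre build rather than a stock PyTorch wheel. The following is a minimal sketch, assuming the installed package reports the same local version tag that appears in the wheel filename above (for example 2.10.0+aiu.kineto.1.1.1). The helper names is_kineto_spyre_build and installed_torch_version are hypothetical, and other releases may use a different tag.

```python
from importlib.metadata import PackageNotFoundError, version


def is_kineto_spyre_build(torch_version: str) -> bool:
    """Return True if the local version segment (after '+') contains both 'aiu' and 'kineto'.

    Assumption: kineto-spyre wheels tag themselves like '2.10.0+aiu.kineto.1.1.1',
    as in the example wheel filename. This is not an official check.
    """
    _, _, local = torch_version.partition("+")
    parts = local.split(".")
    return "aiu" in parts and "kineto" in parts


def installed_torch_version():
    """Return the installed torch version string, or None if torch is not installed."""
    try:
        return version("torch")
    except PackageNotFoundError:
        return None


ver = installed_torch_version()
if ver is None:
    print("torch is not installed in this environment")
else:
    print(f"torch {ver}: kineto-spyre build = {is_kineto_spyre_build(ver)}")
```

Reading the version through importlib.metadata avoids importing torch itself, so the check stays fast and works even if the install is partially broken.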

Then profile with ProfilerActivity.PrivateUse1:

import torch
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.PrivateUse1],
    record_shapes=True,
    profile_memory=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./logs/mymodel"),
) as prof:
    # `compiled` and `x_spyre` are defined as in the CPU-only example.
    compiled_result = compiled(x_spyre).cpu()

Print aggregates

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10).replace("CUDA", "AIU"))
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10).replace("CUDA", "AIU"))

The .replace("CUDA", "AIU") call is a cosmetic workaround: the profiler’s internal column category is still named after CUDA, and native renaming is on the roadmap.

Export a trace for viewers

prof.export_chrome_trace("spyre_trace.json")

See Trace analysis for viewing.
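The exported file uses the Chrome trace event format, so it can also be inspected with a few lines of standard-library Python when a viewer is not at hand. The sketch below sums the duration of complete events (phase "X") per event name. The function name summarize_trace is hypothetical, and the sketch assumes durations are in microseconds, as the trace event format specifies for the "dur" field.

```python
import json
from collections import defaultdict


def summarize_trace(path, top=10):
    """Return up to `top` (name, total_duration_us) pairs, largest total first."""
    with open(path) as f:
        trace = json.load(f)

    # A Chrome trace is either a bare list of events or an object
    # holding the list under the "traceEvents" key.
    events = trace["traceEvents"] if isinstance(trace, dict) else trace

    totals = defaultdict(float)
    for ev in events:
        # Phase "X" marks a complete event, which carries its duration in "dur".
        if ev.get("ph") == "X":
            totals[ev.get("name", "<unnamed>")] += ev.get("dur", 0)

    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

For example, summarize_trace("spyre_trace.json", top=5) would list the five event names with the largest summed duration. Nested events overlap in time, so totals for different names should not be added together.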

Advanced features

Full reference lives in the upstream PyTorch profiler documentation:

record_function — annotate named spans
schedule — skip warmup, sample a bounded window
on_trace_ready — stream to TensorBoard-compatible JSON
with_stack — include file and line for Python ops
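As a rough illustration of how record_function, schedule, and on_trace_ready fit together, the sketch below profiles a small stand-in model on CPU only. The Linear model, the input shape, and the on_ready callback are placeholders chosen for this example, not part of torch-spyre. On Spyre you would presumably substitute the compiled model and a device tensor, and add ProfilerActivity.PrivateUse1 to the activities.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function, schedule

# Stand-in model and input on CPU (placeholders for illustration only).
model = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)

summaries = []  # one set of event names per completed recording window


def on_ready(prof):
    # Called each time an active window finishes; collect the recorded event names.
    summaries.append({evt.key for evt in prof.key_averages()})


with profile(
    activities=[ProfilerActivity.CPU],
    # Skip 1 step, warm up for 1 step, record 2 steps, and run this cycle once.
    schedule=schedule(wait=1, warmup=1, active=2, repeat=1),
    on_trace_ready=on_ready,
) as prof:
    for _ in range(5):
        # Group everything inside this block under one named span in the trace.
        with record_function("forward_pass"):
            model(x)
        prof.step()  # tell the profiler that one iteration has finished

print("windows recorded:", len(summaries))
print("forward_pass captured:", any("forward_pass" in s for s in summaries))
```

The schedule keeps compilation and warmup iterations out of the recorded window, which matters for compiled models where the first call is much slower than steady state.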

Known issues (from torch-spyre-docs)

Multi-AIU communication profiling is not supported yet.