Profiling

Stack: torch-spyre (new, Inductor-based).

Scope: performance — why is it slow? For correctness questions (why is the result wrong?) see Debugging.

Torch-Spyre provides tooling to measure the performance of PyTorch workloads running on the Spyre accelerator. The full design of the planned toolkit is in RFC 0601 — Spyre Profiling Toolkit.

The in-tree torch_spyre.profiler package is currently a scaffold — torch_spyre.profiler.is_available() returns False, and there is no public API yet. Profiling today goes through torch.profiler plus the external integrations described on this page (kineto-spyre, aiu-smi, aiu-trace-analyzer); the in-tree API will be populated as RFC 0601 lands.
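
Until the in-tree API lands, a script can probe for it and fall back to torch.profiler. The sketch below is one way to write that probe. It relies only on torch_spyre.profiler.is_available() as described above; the helper name spyre_profiler_available is hypothetical, and the function also treats a missing torch_spyre install as "not available".

```python
import importlib


def spyre_profiler_available():
    """Return True only if torch_spyre.profiler imports and reports itself available."""
    try:
        profiler = importlib.import_module("torch_spyre.profiler")
    except ImportError:
        # torch or torch_spyre is not installed in this environment.
        return False
    # Per this page, is_available() currently returns False (scaffold only).
    return bool(profiler.is_available())


if spyre_profiler_available():
    print("in-tree Spyre profiler is available")
else:
    print("falling back to torch.profiler and the external integrations")
```

Guarding on the probe rather than on a version number means the same script should pick up the in-tree API once it is populated, without further changes.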

What can be profiled today

Capability	Status	Where
Compiler pipeline logs	Available	Environment variables
CPU-side timing with `torch.profiler`	Available	PyTorch Profiler
Device telemetry (power, temperature, bandwidth)	Available — PF and VF mode (IBM-internal distribution; public release tracked in #1335)	Device monitoring
Device-side kernel timing via `ProfilerActivity.PrivateUse1`	Preview (requires `kineto-spyre` wheel)	PyTorch Profiler
Trace post-processing (aiu-trace-analyzer)	Available, known gaps	Trace analysis
`torch.spyre.memory.memory_allocated()` / `max_memory_allocated()`	Available — delegates to `torch.accelerator.memory` (PR #770)	Quick example
Kineto bridge (`SpyreActivityProfiler`)	In progress — in-tree Kineto integration for `ProfilerActivity.PrivateUse1` device-side events (PR #1856)	upstream Kineto integration
Scratchpad utilization metrics	Planned	RFC 0601
IR-instrumentation-based fine-grained profiler	Planned	RFC 0601

Memory API quick example

torch.spyre.memory re-exports torch.accelerator.memory, so the same memory-query calls used on CUDA apply to Spyre. The example below allocates a tensor, frees it, and reads the current and peak allocations:

import torch

# Reset the peak counter so max_memory_allocated() reports the peak from this point on.
torch.spyre.memory.reset_peak_memory_stats()

# Allocate on the device; memory_allocated() reflects the new total.
x = torch.rand((64, 64), dtype=torch.float16, device="spyre")
print(torch.spyre.memory.memory_allocated())     # bytes currently allocated

# Free the tensor. memory_allocated() drops back, but the peak persists.
del x
print(torch.spyre.memory.memory_allocated())     # current allocation
print(torch.spyre.memory.max_memory_allocated()) # peak since reset

The module also exposes reset_accumulated_memory_stats() and memory_stats().
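
As a sanity check on the numbers printed above, the payload of the 64 x 64 float16 tensor can be worked out by hand: 64 * 64 elements at 2 bytes each is 8192 bytes. The sketch below does that arithmetic and formats byte counts for reading. The helpers tensor_nbytes and format_bytes are hypothetical and not part of torch.spyre.memory, and the value memory_allocated() reports may be larger than the payload if the allocator rounds allocations up; this page does not say whether Spyre's allocator does.

```python
def tensor_nbytes(shape, bytes_per_element):
    """Payload size of a dense tensor: element count times element width."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def format_bytes(num_bytes):
    """Render a byte count with binary units (B, KiB, MiB, GiB)."""
    value = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        # Stop at the first unit that keeps the value below 1024, or at GiB.
        if value < 1024 or unit == "GiB":
            return f"{value:.1f} {unit}"
        value /= 1024


payload = tensor_nbytes((64, 64), 2)  # float16 is 2 bytes per element
print(payload, format_bytes(payload))  # 8192 8.0 KiB
```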

Toolkit layers

Layer	Tool	Granularity
Application / PyTorch	`torch.profiler` + kineto-spyre	Kernel-level
Compiler frontend	Inductor logging	Pass-level
Compiler backend	IR instrumentation (planned)	Intra-kernel
Runtime	`libaiupti` kernel + memory events	Kernel + memory
Device / HW	`aiu-smi`	Device-level telemetry
Post-processing	aiu-trace-analyzer	Derived metrics
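
To make "derived metrics" in the post-processing layer concrete, the sketch below totals event durations per name from a Chrome-format trace, the JSON layout that torch.profiler writes with export_chrome_trace(). It is a standard-library illustration only: it is not aiu-trace-analyzer and does not reproduce its output, the function name total_duration_by_name is hypothetical, and the sample events are invented.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace_text):
    """Sum the 'dur' field (microseconds) of complete events, grouped by name."""
    trace = json.loads(trace_text)
    # A Chrome trace is either a bare event list or an object with "traceEvents".
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X":  # "X" marks a complete event with a duration
            totals[event["name"]] += event.get("dur", 0)
    # Longest-running names first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample; the event names and timings are invented for illustration.
SAMPLE_TRACE = json.dumps({"traceEvents": [
    {"ph": "X", "name": "aten::mm", "ts": 0, "dur": 120},
    {"ph": "X", "name": "aten::add", "ts": 130, "dur": 15},
    {"ph": "X", "name": "aten::mm", "ts": 150, "dur": 80},
    {"ph": "M", "name": "process_name", "args": {"name": "python"}},
]})

for name, micros in total_duration_by_name(SAMPLE_TRACE):
    print(f"{name}: {micros:.0f} us")
```

Metadata events (such as the "M" entry in the sample) carry no duration and are skipped, so only timed work contributes to the totals.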

Profiling topics

Environment variables — logging, device enumeration, runtime/driver variables used by aiu-smi and aiu-trace-analyzer
PyTorch Profiler — torch.profiler usage, CPU today, device-side preview
Device monitoring — aiu-smi setup
Trace analysis — Chrome / Perfetto / TensorBoard viewing and aiu-trace-analyzer post-processing
Performance analysis methodology — bounding a region and pairing traces with telemetry
Toolkit usage matrix — which tool for which metric
End-to-end example — profiling a Granite model on Spyre, gluing all four tools into one workflow