End-to-End Example: Profiling a Granite Model on Spyre

Stack: torch-spyre (new, Inductor-based).

This page shows how to capture a torch.profiler trace of a Granite-class model running on Spyre, paired with aiu-smi device telemetry. It uses today’s tooling: torch.profiler + kineto-spyre + aiu-smi + aiu-trace-analyzer.

The Granite end-to-end path on Spyre today goes through the Foundation Model Stack and aiu-fms-testing-utilsnot HuggingFace AutoModelForCausalLM directly. Spyre support for both currently exists on the eager_spyre branch of each repo, so install them from source off that branch rather than from PyPI.

Where this page is going

The script below wires torch.profiler around fms.get_model(...) explicitly. Once RFC 0601 lands, the in-tree torch_spyre.profiler API will replace this glue and the script will shrink. This page will be revised to the in-tree API once it ships.

What you need

Piece

Source

Sample install (verify against the upstream README)

foundation-model-stack (fms)

github.com/foundation-model-stack/foundation-model-stack (eager_spyre branch)

git clone -b eager_spyre <repo>.git && uv pip install -e ./foundation-model-stack

aiu-fms-testing-utils

github.com/foundation-model-stack/aiu-fms-testing-utils (eager_spyre branch)

git clone -b eager_spyre <repo>.git && uv pip install -e ./aiu-fms-testing-utils

kineto-spyre

github.com/IBM/kineto-spyre

uv pip install --no-deps <release-wheel-url-matching-your-pytorch> (see releases page)

aiu-trace-analyzer (optional)

github.com/IBM/aiu-trace-analyzer

pip install aiu-trace-analyzer

Granite checkpoint

huggingface.co/ibm-granite/granite-3.3-8b-instruct

huggingface-cli download ibm-granite/granite-3.3-8b-instruct --local-dir /tmp/models/granite-3.3-8b-instruct

The sample commands above are starting points; each upstream README is the source of truth and may require additional steps (extras, source installs, branch selection) depending on your platform and PyTorch build.

Setup

# Install fms and aiu-fms-testing-utils from the eager_spyre branch
# (the branch that registers the Spyre device backend today).
git clone -b eager_spyre https://github.com/foundation-model-stack/foundation-model-stack.git
git clone -b eager_spyre https://github.com/foundation-model-stack/aiu-fms-testing-utils.git
uv pip install -e ./foundation-model-stack
uv pip install -e ./aiu-fms-testing-utils

# Cache HuggingFace artifacts and download the Granite checkpoint.
export HF_HOME=/tmp/models/hf_cache
huggingface-cli download ibm-granite/granite-3.3-8b-instruct \
  --local-dir /tmp/models/granite-3.3-8b-instruct

# Example kineto-spyre wheel for PyTorch 2.10.0 + Python 3.12 on x86_64
# Linux. Pick the wheel that matches your stack from the releases page.
uv pip install --no-deps --force-reinstall \
  https://github.com/IBM/kineto-spyre/releases/download/torch-2.10.0.aiu.kineto.1.1.1/torch-2.10.0+aiu.kineto.1.1.1-cp312-cp312-linux_x86_64.whl

Useful environment variables (see Environment variables for the full list):

export PYTHONUNBUFFERED=1
export SENCORES=32                 # full accelerator (1–32; default 32)
# Inductor visibility — uncomment when investigating compile-time issues:
# export TORCH_LOGS=ir_post_fusion,output_code,graph,aot_graphs,post_grad_graphs
# export TORCH_LOGS_FORMAT=short
# export TORCH_SPYRE_DEBUG=1

The script

Save as profile_granite.py.

import time
from statistics import mean, median

import torch
from torch.profiler import ProfilerActivity, profile
from fms.models import get_model
from transformers import AutoTokenizer

DEVICE = torch.device("spyre")
DTYPE = torch.float16  # Spyre's default dtype
MODEL_PATH = "/tmp/models/granite-3.3-8b-instruct"

# 1. Load Granite via FMS.
model = get_model(
    architecture="hf_pretrained",
    model_path=MODEL_PATH,
    device_type="spyre",
    data_type=DTYPE,
    unfuse_weights=True,
).eval().to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# 2. Compile through the Spyre Inductor backend.
model.compile()

# 3. Build a prefill input. Replace torch.randint with a real-prompt
#    encoding (tokenizer(..., padding="max_length", max_length=512))
#    for non-toy runs.
inputs = torch.randint(
    0, tokenizer.vocab_size, (1, 512), dtype=torch.int64,
).to(DEVICE)

# 4. Warm up — first call triggers torch.compile + kernel codegen.
with torch.no_grad():
    model(inputs)

# 5. Profile a steady-state forward pass.
N_RUNS = 5
wall_clock_ms = []
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.PrivateUse1],
    record_shapes=True,
    profile_memory=True,
    on_trace_ready=torch.profiler.tensorboard_trace_handler("logs/granite"),
) as prof:
    for _ in range(N_RUNS):
        t0 = time.perf_counter()
        with torch.no_grad():
            model(inputs)
        wall_clock_ms.append((time.perf_counter() - t0) * 1000)
        prof.step()

# 6. Two timing signals: wall-clock (what the user feels) and
#    profiler-derived CPU time (host-side overhead). The gap between
#    them ≈ device-side work.
cpu_per_run_ms = sum(e.self_cpu_time_total for e in prof.events()) / 1000 / N_RUNS

print(prof.key_averages().table(sort_by="device_time_total", row_limit=10))
print(f"wall-clock ms: mean={mean(wall_clock_ms):.3f} median={median(wall_clock_ms):.3f}")
print(f"profiler-derived CPU ms (per run): {cpu_per_run_ms:.3f}")

Three patterns to call out:

  • Run iteration 1 outside the timed loop. First call carries the torch.compile cost; including it skews the average.

  • Two orthogonal timing signals. Wall-clock from time.perf_counter() is what the user feels; profiler-derived CPU time (sum(e.self_cpu_time_total)) is host-side overhead. The difference between them ≈ device-side work.

  • tensorboard_trace_handler(log_dir) over export_chrome_trace. Per-step JSON files isolate compile-time warmup from steady-state runs and open in TensorBoard and Chrome / Perfetto.

See PyTorch Profiler.

Inspect the trace

The logs/granite/ directory will contain one JSON per profiler step. Open in any of:

  • chrome://tracing — built into Chromium / Chrome.

  • Perfetto UI — drag-and-drop the file.

  • TensorBoard — tensorboard --logdir=logs/granite.

Then post-process with aiu-trace-analyzer to extract derived metrics (kernel durations, gap analysis, idle bubbles). See Trace analysis.

Run with telemetry alongside

aiu-smi requires the senlib config file environment variable to be set before it can talk to the device. Set it (and any other device-discovery env vars your environment requires) in the same shell before launching aiu-smi.

In one terminal:

export SENLIB_DEVEL_CONFIG_FILE=/path/to/senlib_config.json
aiu-smi dmon | tee /tmp/aiu-smi.log

In another:

python profile_granite.py

aiu-smi dmon samples once a second and streams power, temperature, PT-array utilization, device-memory and PCIe bandwidth. Pair its timestamps with the trace timeline to attribute idle gaps to either host-side work or device-side stalls. See Device monitoring.

What to look for

For a Granite-class transformer the typical signals are:

Symptom

Likely cause

Where to dig

First iteration much slower than the rest

torch.compile warmup

Expected — discard the first iteration.

Wall-clock ≫ profiler CPU

Device-side work dominates (good for compute-bound layers like MLP / large matmul)

Cross-check with aiu-smi PT-array util.

Wall-clock ≈ profiler CPU

Host-side bottleneck — Python or Dynamo overhead

TORCH_LOGS="+inductor"

Per-layer kernel gaps

Tile staging between LPDDR5 and LX scratchpad

Performance analysis methodology

Low PT-array utilization in aiu-smi

Work-division inefficiency, stick-alignment padding

Compiler work division

Long Inductor pass times in stderr

Compile-time regression

Inductor debug artifacts

Idle bubbles between consecutive kernels

Reconfiguration latency or DMA stalls

aiu-trace-analyzer gap analysis

See also