# End-to-End Example: Profiling a Granite Model on Spyre **Stack:** torch-spyre (new, Inductor-based). This page shows how to capture a `torch.profiler` trace of a Granite-class model running on Spyre, paired with `aiu-smi` device telemetry. It uses today's tooling: `torch.profiler` + [`kineto-spyre`][kineto-spyre] + [`aiu-smi`](device_monitoring.md) + [`aiu-trace-analyzer`](trace_analysis.md). The Granite end-to-end path on Spyre today goes through the [Foundation Model Stack][fms] and [`aiu-fms-testing-utils`][aiu-fms] — **not** HuggingFace `AutoModelForCausalLM` directly. Spyre support for both currently exists on the `eager_spyre` branch of each repo, so install them from source off that branch rather than from PyPI. :::{admonition} Where this page is going :class: note The script below wires `torch.profiler` around `fms.get_model(...)` explicitly. Once [RFC 0601][rfc-0601] lands, the in-tree `torch_spyre.profiler` API will replace this glue and the script will shrink. This page will be revised to the in-tree API once it ships. ::: ## What you need | Piece | Source | Sample install (verify against the upstream README) | |---|---|---| | `foundation-model-stack` (`fms`) | [github.com/foundation-model-stack/foundation-model-stack][fms] (`eager_spyre` branch) | `git clone -b eager_spyre .git && uv pip install -e ./foundation-model-stack` | | `aiu-fms-testing-utils` | [github.com/foundation-model-stack/aiu-fms-testing-utils][aiu-fms] (`eager_spyre` branch) | `git clone -b eager_spyre .git && uv pip install -e ./aiu-fms-testing-utils` | | `kineto-spyre` | [github.com/IBM/kineto-spyre][kineto-spyre] | `uv pip install --no-deps ` (see [releases page][kineto-spyre-releases]) | | `aiu-trace-analyzer` (optional) | [github.com/IBM/aiu-trace-analyzer][ata] | `pip install aiu-trace-analyzer` | | Granite checkpoint | [huggingface.co/ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct) | `huggingface-cli download ibm-granite/granite-3.3-8b-instruct --local-dir /tmp/models/granite-3.3-8b-instruct` | The sample commands above are starting points; each upstream README is the source of truth and may require additional steps (extras, source installs, branch selection) depending on your platform and PyTorch build. ## Setup ```bash # Install fms and aiu-fms-testing-utils from the eager_spyre branch # (the branch that registers the Spyre device backend today). git clone -b eager_spyre https://github.com/foundation-model-stack/foundation-model-stack.git git clone -b eager_spyre https://github.com/foundation-model-stack/aiu-fms-testing-utils.git uv pip install -e ./foundation-model-stack uv pip install -e ./aiu-fms-testing-utils # Cache HuggingFace artifacts and download the Granite checkpoint. export HF_HOME=/tmp/models/hf_cache huggingface-cli download ibm-granite/granite-3.3-8b-instruct \ --local-dir /tmp/models/granite-3.3-8b-instruct # Example kineto-spyre wheel for PyTorch 2.10.0 + Python 3.12 on x86_64 # Linux. Pick the wheel that matches your stack from the releases page. uv pip install --no-deps --force-reinstall \ https://github.com/IBM/kineto-spyre/releases/download/torch-2.10.0.aiu.kineto.1.1.1/torch-2.10.0+aiu.kineto.1.1.1-cp312-cp312-linux_x86_64.whl ``` Useful environment variables (see [Environment variables](environment_variables.md) for the full list): ```bash export PYTHONUNBUFFERED=1 export SENCORES=32 # full accelerator (1–32; default 32) # Inductor visibility — uncomment when investigating compile-time issues: # export TORCH_LOGS=ir_post_fusion,output_code,graph,aot_graphs,post_grad_graphs # export TORCH_LOGS_FORMAT=short # export TORCH_SPYRE_DEBUG=1 ``` ## The script Save as `profile_granite.py`. ```python import time from statistics import mean, median import torch from torch.profiler import ProfilerActivity, profile from fms.models import get_model from transformers import AutoTokenizer DEVICE = torch.device("spyre") DTYPE = torch.float16 # Spyre's default dtype MODEL_PATH = "/tmp/models/granite-3.3-8b-instruct" # 1. Load Granite via FMS. model = get_model( architecture="hf_pretrained", model_path=MODEL_PATH, device_type="spyre", data_type=DTYPE, unfuse_weights=True, ).eval().to(DEVICE) tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH) # 2. Compile through the Spyre Inductor backend. model.compile() # 3. Build a prefill input. Replace torch.randint with a real-prompt # encoding (tokenizer(..., padding="max_length", max_length=512)) # for non-toy runs. inputs = torch.randint( 0, tokenizer.vocab_size, (1, 512), dtype=torch.int64, ).to(DEVICE) # 4. Warm up — first call triggers torch.compile + kernel codegen. with torch.no_grad(): model(inputs) # 5. Profile a steady-state forward pass. N_RUNS = 5 wall_clock_ms = [] with profile( activities=[ProfilerActivity.CPU, ProfilerActivity.PrivateUse1], record_shapes=True, profile_memory=True, on_trace_ready=torch.profiler.tensorboard_trace_handler("logs/granite"), ) as prof: for _ in range(N_RUNS): t0 = time.perf_counter() with torch.no_grad(): model(inputs) wall_clock_ms.append((time.perf_counter() - t0) * 1000) prof.step() # 6. Two timing signals: wall-clock (what the user feels) and # profiler-derived CPU time (host-side overhead). The gap between # them ≈ device-side work. cpu_per_run_ms = sum(e.self_cpu_time_total for e in prof.events()) / 1000 / N_RUNS print(prof.key_averages().table(sort_by="device_time_total", row_limit=10)) print(f"wall-clock ms: mean={mean(wall_clock_ms):.3f} median={median(wall_clock_ms):.3f}") print(f"profiler-derived CPU ms (per run): {cpu_per_run_ms:.3f}") ``` Three patterns to call out: - **Run iteration 1 outside the timed loop.** First call carries the `torch.compile` cost; including it skews the average. - **Two orthogonal timing signals.** Wall-clock from `time.perf_counter()` is what the user feels; profiler-derived CPU time (`sum(e.self_cpu_time_total)`) is host-side overhead. The difference between them ≈ device-side work. - **`tensorboard_trace_handler(log_dir)` over `export_chrome_trace`.** Per-step JSON files isolate compile-time warmup from steady-state runs and open in TensorBoard *and* Chrome / Perfetto. See [PyTorch Profiler](pytorch_profiler.md). ## Inspect the trace The `logs/granite/` directory will contain one JSON per profiler step. Open in any of: - `chrome://tracing` — built into Chromium / Chrome. - [Perfetto UI](https://ui.perfetto.dev/) — drag-and-drop the file. - TensorBoard — `tensorboard --logdir=logs/granite`. Then post-process with `aiu-trace-analyzer` to extract derived metrics (kernel durations, gap analysis, idle bubbles). See [Trace analysis](trace_analysis.md). ## Run with telemetry alongside `aiu-smi` requires the senlib config file environment variable to be set before it can talk to the device. Set it (and any other device-discovery env vars your environment requires) in the same shell before launching `aiu-smi`. In one terminal: ```bash export SENLIB_DEVEL_CONFIG_FILE=/path/to/senlib_config.json aiu-smi dmon | tee /tmp/aiu-smi.log ``` In another: ```bash python profile_granite.py ``` `aiu-smi dmon` samples once a second and streams power, temperature, PT-array utilization, device-memory and PCIe bandwidth. Pair its timestamps with the trace timeline to attribute idle gaps to either host-side work or device-side stalls. See [Device monitoring](device_monitoring.md). ## What to look for For a Granite-class transformer the typical signals are: | Symptom | Likely cause | Where to dig | |---|---|---| | First iteration much slower than the rest | `torch.compile` warmup | Expected — discard the first iteration. | | Wall-clock ≫ profiler CPU | Device-side work dominates (good for compute-bound layers like MLP / large matmul) | Cross-check with `aiu-smi` PT-array util. | | Wall-clock ≈ profiler CPU | Host-side bottleneck — Python or Dynamo overhead | `TORCH_LOGS="+inductor"` | | Per-layer kernel gaps | Tile staging between LPDDR5 and LX scratchpad | [Performance analysis methodology](performance_analysis_methodology.md) | | Low PT-array utilization in `aiu-smi` | Work-division inefficiency, stick-alignment padding | [Compiler work division](../../compiler/work_division_planning.md) | | Long Inductor pass times in stderr | Compile-time regression | [Inductor debug artifacts](../debugging/inductor_artifacts.md) | | Idle bubbles between consecutive kernels | Reconfiguration latency or DMA stalls | `aiu-trace-analyzer` gap analysis | ## See also - [PyTorch Profiler](pytorch_profiler.md) — `torch.profiler` reference - [Device monitoring](device_monitoring.md) — `aiu-smi` setup - [Trace analysis](trace_analysis.md) — viewers and `aiu-trace-analyzer` - [Performance analysis methodology](performance_analysis_methodology.md) — bounding a region and pairing traces with telemetry - [Environment variables](environment_variables.md) — full list of logging and runtime flags - [Running Models](../running_models.md) — `torch.compile` basics - [RFC 0601][rfc-0601] — full profiling toolkit design [fms]: https://github.com/foundation-model-stack/foundation-model-stack [aiu-fms]: https://github.com/foundation-model-stack/aiu-fms-testing-utils [kineto-spyre]: https://github.com/IBM/kineto-spyre [kineto-spyre-releases]: https://github.com/IBM/kineto-spyre/releases [ata]: https://github.com/IBM/aiu-trace-analyzer [rfc-0601]: https://github.com/torch-spyre/rfcs/blob/main/0601-SpyreProfilingToolkit/0601-SpyreProfilingToolkitRFC.md