Torch-Spyre


User Guide

This section covers everything you need to run PyTorch models on the Spyre device: supported operations, tensor layouts, profiling, debugging, and worked examples.

  • Running Models on Spyre
    • Using torch.compile
    • Supported Operations
    • Configuration
    • Examples
    • Troubleshooting
  • Supported Operations
    • Operations Table
    • Unsupported Operations
  • Tensor Layouts
    • Conceptual Overview
    • PyTorch Tensor Layouts
    • Motivation for Spyre Tensor Layouts
    • Spyre Tensor Layouts
    • DMA Encoding
    • Access Patterns
    • Default Layouts and Controlling Layouts
    • Layout Compatibility
    • Generating DCIs and SuperDSCs
    • Future Extensions
  • Profiling
    • What can be profiled today
    • Toolkit layers
    • Contents
    • See also
  • Debugging
    • Overview
    • Step 1 — Create a Minimal Reproducer
    • Step 2 — Enable Debug Environment Variables
    • Step 3 — Examine Compiler Artifacts
    • Step 4 — Bisect Frontend vs. Backend with sendnn
    • Checklist for Filing a Bug Report
    • Quick Reference
  • Examples
    • Available Examples
    • Running an Example
    • Writing Your Own Example
    • See Also

© Copyright 2025, Torch-Spyre Core Team.
