Triton Release Notes
Table of Contents
- Dialect & Frontend
- Backend & Compiler
- AMD/HIP Backend
- NVIDIA Backend
- Gluon & Layout Improvements
- Kernels & Benchmarks
- Testing & CI
- Build & Infrastructure
- Documentation
- Breaking Changes
- Performance Improvements
- Notable Bug Fixes
- Contributors
Dialect & Frontend
New Features
- Warp Specialization Enhancements (#8005): Made warp specialization require at least 4 warps with proper error messaging to prevent compiler crashes
- Ragged TMA Support (#7792, #7783): Added support for write-only and general ragged TMAs with automatic bounds checking using higher-dimensional TMA descriptors
- Device Assert Mask Support (#7905): Added a mask parameter to tl.device_assert for easier debugging with masked operations (see the sketch after this list)
- Padding Option for TMA Loads (#7993): Added support for a padding option (including NaN) in TMA descriptor creation and fallback paths
- Implicit Downcast in TMA Descriptor Store (#6236): Fixed missing implicit downcast when storing blocks through TMA descriptors
- Mutations Disallowed (#7762): Disabled all mutations to address semantic issues in the language
- Specialized Recursion (#7468): Enabled functions to recurse on specialized versions of themselves
- Constexpr Function Cache Invalidation (#7802): Reworked constexpr_function to support cache invalidation and capability checks
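A minimal sketch of the new mask argument on tl.device_assert mentioned above; the keyword name comes from the PR summary, and the exact signature should be treated as illustrative rather than authoritative:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def checked_copy(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # Only in-bounds lanes are checked; masked-off lanes no longer trip
    # the assert (the mask keyword is the feature described in #7905).
    tl.device_assert(x >= 0, "negative input", mask=mask)
    tl.store(y_ptr + offs, x, mask=mask)

# Device asserts are only emitted in debug mode (e.g. TRITON_DEBUG=1).
x = torch.arange(100, device="cuda", dtype=torch.float32)
y = torch.empty_like(x)
checked_copy[(1,)](x, y, x.numel(), BLOCK=128)
```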
Bug Fixes
- Floating Point Argument Passing (#7439): Fixed floating point argument passing for tl.float16 and other FP types
- Non-Associative Reduce Rematerialization (#7272): Avoided rematerialization for non-associative reduce operations to prevent data consistency issues
- PDL Issue Fix (#7379): Fixed PDL-related issues in the frontend
- Constexpr in Tuples (#7442): Improved handling of constexpr in tuples, fixing type mismatches and in-place mutations
- Loop Carry Detection (#7200): Improved detection of loop carries when @builtin or @core.extern functions modify their arguments
- Liveouts in Conditionals (#7318): Fixed detection of liveouts in conditional blocks
Improvements
- MLIR Verifier After Parsing (#7999): Run MLIR verifier after parsing to catch errors early
- Better Error for num_cta > 1 on sm < 90 (#7812): Improved error messaging for unsupported configurations
- Extern Elementwise Type Handling (#7930): Fixed mismatched type handling for core.extern_elementwise
- Libdevice Exposure in Gluon (#7890): Exposed libdevice functions with improved layout propagation (see the sketch after this list)
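For context on the libdevice item above: these math wrappers have long been importable in standard Triton kernels from triton.language.extra.libdevice, and #7890 exposes the same functions in Gluon. A minimal standard-Triton sketch (not the new Gluon path):

```python
import torch
import triton
import triton.language as tl
from triton.language.extra import libdevice

@triton.jit
def fast_exp2(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # libdevice calls map to the vendor math library (e.g. __nv_exp2f on NVIDIA).
    y = libdevice.exp2(x)
    tl.store(y_ptr + offs, y, mask=mask)

x = torch.randn(1000, device="cuda")
y = torch.empty_like(x)
fast_exp2[(triton.cdiv(x.numel(), 256),)](x, y, x.numel(), BLOCK=256)
```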
Backend & Compiler
LLVM Updates
- LLVM Bump (#7881): Updated to llvm/llvm-project@bc773632355b with multiple API changes including:
- Switched Constant{Int|Float}Op type and value order
- Provided triple for TargetLibraryInfoImpl
- Fixed atomic sync scope for NVIDIA
- Updated MLIR lib names and ops
Code Generation
- Generic Swizzling for convert_layout (#6982, #7565): Implemented generalized swizzling algorithm for convert_layout that:
- Finds optimal shared memory layout maximizing read/write vectorization
- Minimizes bank conflicts
- Supports ldmatrix/stmatrix and transpose versions
- Uses columns and diagonals for better performance
- Warp-Local Layout Conversion (#7558): Improved warp-local layout conversion algorithm using shuffles with:
- Better handling of broadcasting in layouts
- Fewer select and shuffle instructions
- Register packing for sub-32-bit data types
- Byte Permutes in Intra-Warp Conversion (#7809): Used byte permute instructions for better performance in layout conversions
- Tmem Alloc Hoisting (#7568): Hoisted tmem alloc outside of if statements to reduce register pressure
- CP.Async Lowering Improvements (#7314): Moved cp.async to better lowering sequence reusing previous optimizations
Optimizations
- Simpler Codegen for Linear Layouts (#7201): Simplified code generation for linear layouts
- Vectorization Fixes (#7845): Fixed vectorization for PaddedSharedEncoding with non-default order
- XOR Trick Refactoring (#7397): Refactored XOR trick into helper function for better code reuse
- Shared Memory Offset Fixes (#7949): Fixed various issues with smem base offsets
- Min/Max Redux Optimization for Blackwell (#7465): Implemented new redux.sync optimization
Bug Fixes
- Atomic RMW Broadcasting (#7460): Fixed atomic rmw ops to broadcast results when necessary
- TMA Load with Multiple Users (#7398): Fixed lowering of TMA load when users have differing encodings
- Subview Padding (#7404): Fixed subview padding for PaddedSharedEncoding
- Memdesc Subview Fixes (#7480, #7515): Properly handled memdesc_subview with slicing and offsets
- FP16 to FP32 Conversion (#7585): Fixed fp16 to fp32 conversion issues
- Barrier Synchronization (#7993): Added bar.sync before deallocating tmem to prevent race conditions
AMD/HIP Backend
New Features
- GFX950 (MI350) Support: Added comprehensive support for AMD's latest architecture including:
- ChainedDot Schedule (#7601, #7638): Added new scheduling variant for loops with 2 chained dots
- Ping-Pong Transformation (#7638, #7458): Added ping-pong support for:
- Chained dot schedules
- Async load with num_stages=3
- MXFP types
- Buffer Atomic CAS (#7292): Added support for buffer atomic compare-and-swap
- FP64 MFMA Support (#7461): Added support for fp64 dot operations using MFMA intrinsics
Layout & Memory Optimizations
- General Swizzling Support (#7482, #7606): Enabled ConvertLayoutOp general swizzling
- Padded vs Swizzled Allocation (#7328, #7750): Introduced specialized allocation pass with proper layout selection strategy
- Improved LDS Usage (#7750, #7813): Optimized LDS usage by:
- Preferring swizzle layouts when LDS limits allow
- Using single LDS for both transposed and non-transposed access
- Better layout selection in optimize-lds-usage pass
- TilesPerWarp Parameter (#7283): Added tilesPerWarp parameter to MFMA layout for contiguous tile computation
- Extract Slice Rewrite (#7128): Refactored extract_slice to support:
- Arbitrary tensor ranks
- Relaxed layout constraints
- CTA tile boundary alignment
Code Generation Improvements
- PermlaneSwap Pattern (#7825, #7861): Added general permlane_swap pattern for ConvertLayoutOp
- Register Broadcast (#7407): Added support for register broadcast in slice/concat ops
- Shared Memory Ops for FP4 (#7626): Added support for M/N packed FP4 with transposition
- Direct-to-LDS Loads (#7829): Refactored lowering via common lowerLdSt path
- Local Load/Store Lowering (#7355): Enabled common code path for local_load/store operations
FP8 & Numeric Support
- FP8 Variant Support:
- Dot Scaled Support: Enabled on gfx11 (#7954) and gfx12 (#7644) with emulation via decomposition
- True16 Handling: Disabled on gfx11 due to test failures (#7953)
Stream Pipeliner Enhancements
- Refactoring (#7526, #7556): Refactored to use more common pipeliner functionality
- Async Wait Handling (#7577): Restricted merging async_wait when pipelining with num_stages=3
- Mask Operation Support (#7620): Added ttg.mask handling in stream pipeliner
Build & Driver
- LLD Library API (#7548): Replaced shell-out to lld with direct library API calls
- hipGetProcAddress (#7350): Switched to using hipGetProcAddress for querying HIP symbols
- Driver Version Check (#7501): Added runtime driver version check with descriptive errors
- AOT Compilation (#7007): Added HIP AOT compilation support to compile.py tool
Bug Fixes
- Pointer Canonicalizer (#7242): Fixed attribute propagation when ranks don't match
- Global Atomic Optimization (#7496): Optimized global atomic operations following memory model semantics
- FP32/FP16 to OCP FP8 (#7382): Fixed conversion for subnormal numbers
- Async Copy Vectorization (#7250): Fixed async load pipeline for less than 32-bit loads
- OptimizeLDSUtility Crash (#7434): Fixed nullptr crash in createTmpLayout
- Memrealtime on GFX11/12 (#7357): Added proper support using s_sendmsg_rtn_b64
NVIDIA Backend
Hopper/Blackwell Features
- Warp Specialization:
- Enabled for persistent matmul and FA (#7642, #7623)
- Assigned final try_wait to partition (#7757)
- Tightened user critical section with accumulator (#7509)
- Fixed rematerialization bug in partitioner (#7427)
- Optimized partitioning by hoisting above broadcasts (#7692)
- Enabled 1 buffer for SSA partition dependencies (#7686)
- Control flow support in TMEM allocation (#7698)
- WGMMA Support in Gluon (#7300, #7313): Added Hopper WGMMA with async wait support
- Aref Operations (#7479, #7561, #7645): Updated aref ops and lower_aref pass with:
- Multi-consumer support
- Stage/cluster attribute passing
- TMA load aref insertion
- Control flow handling
- Partition Loops Rewrite (#7415): Reimplemented supporting general control flow using mutual recursion
Blackwell-Specific
- TMEM Support:
- Tcgen05.commit Op (#7335): Added separate commit op for better persistent kernel support
- Subtile QK TMEM Load (#7655): Improved non-causal fp8 performance by 40-50 TFLOPS
MMA Improvements
- FP64 SIMT FMA (#7310): Added fp64 simt fma support and fp64 mma for SM80/SM90
- FP8 MMAv2 (#7409): Don't promote fp8 MMAv2 dot inputs for sm120 (~1.9x speedup)
- Min Dot Sizes Update (#7411): Reduced minimum dot sizes (e.g., the minimum N was lowered from 16)
- NVVM Op Migration (#7420, #7471, #7512): Replaced inline assembly with NVVM ops for:
- WGMMAFenceOp, WGMMACommitGroupOp, ClusterWaitOp
- ClusterCTAIdOp conversion
- Better optimization opportunities
Other Enhancements
- Bar.warp.sync for 1 Warp (#7336): Emit more efficient bar.warp.sync for single-warp barriers
- L2 Cache Hints (#7219): Limited L2 cache hints to sm >= 80 (see the sketch after this list)
- Cublas.gemm Exposure (#7656): Exposed cublas.gemm for performance testing
- PTX Workarounds:
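The L2 cache hints in the item above presumably refer to the eviction_policy hints accepted by tl.load and tl.store; a minimal sketch, with the sm >= 80 restriction from #7219 noted in the comment:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def streaming_copy(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # evict_first hints that this data should not linger in L2; on targets
    # older than sm_80 the hint is now simply dropped rather than emitted.
    x = tl.load(x_ptr + offs, mask=mask, eviction_policy="evict_first")
    tl.store(y_ptr + offs, x, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.empty_like(x)
streaming_copy[(triton.cdiv(x.numel(), 1024),)](x, y, x.numel(), BLOCK=1024)
```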
Gluon & Layout Improvements
Gluon Language Features
- AutoLayout (#7447, #7466): Added AutoLayout for backward layout inference with:
- Custom layout inference interface
- Propagation through operations
- Conflict detection and error reporting
- assert_trivial flag for performance validation
- Docstrings (#7323): Added comprehensive documentation for public Gluon API
- API Improvements:
Layout System
- Padded Shared Layout (#7212): Added new shared memory layout for padding
- Slice Encoding for SplitOp (#7247): Improved slice encoding inference
- LinearLayout Improvements:
- NVIDIA Shared Layout Improvements:
- DotOperandLayout Exposure (#7730): Exposed for WGMMA with LHS in registers
Tutorial & Examples
- Attention Kernels (#7009, #7298, #7488): Implemented complete attention for d64 and d128 with:
- Persistent kernel support
- Causal masking optimization
- FADD2 for row_sum computation (D64)
- FFMA2 for QK scale
- Turnstile over exp2 to control MFU access
- Subtiling optimizations
- 100-120 TFLOPS improvement for D64
- Tutorials Added (#7657): Comprehensive set covering basic to advanced optimized techniques
Kernels & Benchmarks
MXFP Support
- Naming Fixes (#7870): Changed dequantize to quantize, matched arg names
- FP32 Support (#7672): Added quant/dequant from/to fp32
- Transposed Weight Support (#7795): Handle both transposed and non-transposed mxfp weights
- Act In/Out Matmul (#7598): Added mxfp act input and output support
- Blackwell Value Padding (#7958): Fixed mxfp value padding for Blackwell
- Test Coverage (#7591): Added missing mxfp4 tests
- MXFP_BLOCK_SIZE Constant (#7567): Added constant for better readability
- Empty Tensor Handling (#7579): Fixed handling of empty tensors in downcast_to_mxfp
Matmul Optimizations
- Significant Cleanup (#7882): Major refactoring of matmul_ogs.py with:
- More efficient post-processing
- Better intelligibility
- Improved documentation
- Heuristics Improvements (#7664, #7632): Tweaked block sizes and heuristics
- Block Size Improvements (#7897): Better block sizes for batched matmul_ogs with small m/n/k
- Swizzling Fixes (#7582, #7587): Fixed swizzling numerics and contiguity
- Host TMA Usage (#7182): Increased use of host TMA for X, W, Mx scales
- Bias Subtiling (#7232): Added then reverted bias subtiling changes due to regression
- Zero Elements Support (#7808): Added support for inputs with 0 elements
- Index Casting (#7794): Cast index to int64 to avoid overflow
MoE & Multi-GPU
- Routing Improvements (#7369): 30% performance improvement through:
- Liberal kernel fusion (7 launches → 4 launches)
- Specific kernel optimizations
- FP32 logits: 22.8us → 18.0us
- FP16 logits: 17.3us → 12.2us
- BitMatrix for Routing (#7789): Used bitmatrix for distributed routing
- Simple Multi-GPU MoE (#7352): Initialized baseline implementation
Benchmarks & Tests
- Launch Overhead (#7849): Added microbenchmark to track dispatch overhead (see the sketch after this list)
- Total Time Computation (#7752): Added total kernel time computation
- Roofline Fixes (#7670): Fixed roofline plots for compute-bound kernels
- MLP Fixes (#7926): Fixed bench_mlp.py for various issues
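A rough sketch of the kind of dispatch-overhead microbenchmark the launch-overhead item refers to (the actual benchmark in the repo may differ): time repeated launches of a kernel that does no work, so host-side walltime per call approximates dispatch cost.

```python
import time
import torch
import triton
import triton.language as tl

@triton.jit
def empty_kernel(x_ptr):
    pass  # no work: measured time is dominated by dispatch/launch overhead

x = torch.empty(1, device="cuda")
empty_kernel[(1,)](x)  # first call compiles and caches the kernel

n = 1000
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(n):
    empty_kernel[(1,)](x)
torch.cuda.synchronize()
print(f"avg per-launch overhead: {(time.perf_counter() - start) / n * 1e6:.1f} us")
```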
Testing & CI
Test Infrastructure
- Fresh Knobs Usage (#7687): Use the fresh_knobs fixture in tests that touch triton.knobs (see the sketch after this list)
- Environment Variable Restore (#7807): Fixed monkey patching for proper cleanup
- Test Cleanup (#7801): Various cleanups on test_core.py including:
- Changed kernel launch to warmup
- Moved tuple tests to test_tuple.py
- Improved error handling
- Reduce Test Overflow Fix (#7470): Limited integer range to avoid overflow
- Input Generation Consolidation (#7477): Consolidated input generation for reduce tests
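A sketch of the pattern #7687 asks for, assuming the fresh_knobs pytest fixture from Triton's test suite and the knobs.cache.dir knob (both names taken on trust from the notes and test code); the point is that knob mutations stay scoped to the test:

```python
import triton

def test_uses_custom_cache_dir(fresh_knobs, tmp_path):
    # fresh_knobs is assumed to snapshot triton.knobs and restore it after
    # the test, so this mutation cannot leak into other tests.
    triton.knobs.cache.dir = str(tmp_path)
    assert triton.knobs.cache.dir == str(tmp_path)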
AMD Testing
- GFX950 CI (#7189): Enabled CI for GFX950
- GPU Isolation (#7650): Added env-file for better GPU isolation in CI
- Passing Tests (#7363, #7365, #7236, #7183): Enabled many passing tests for AMD GFX942 and GFX12
- Test Skipping: Properly skipped flaky or unsupported tests (globaltimer, True16, etc.)
Lit Tests
- Subfolder Test Fixes (#7966): Fixed lit tests failing when run via ninja check-triton-lit-tests-
- LLD Configuration (#7992): Added llc to lit tool configuration
- Test Updates: Updated numerous lit tests for new features and bug fixes
NVIDIA Testing
- GB200 Error Handling (#7537): Continue running CI when GB200 errors out
- Warp Specialization Tests (#7623, #7642): Enabled WS tests for Hopper
Build & Infrastructure
Build System
- Out-of-Tree Build (#7347, #7871): Enabled complete out-of-tree build with TRITON_BUILD_DIR
- Compile Commands Symlink (#7305, #7341): Symlink compile_commands.json to root for better IDE support
- Elapsed Time for MacOS (#7559): Added elapsed time logging to MacOS builds
- LLD for MacOS (#7559): Enabled LLD for macOS build to reduce time
- Clang Warning Fixes (#7868): Fixed warnings to build triton with clang
- Default LLVM Build Type (#7872): Changed the default LLVM build from debug to release
Dependencies & Environment
- Custom LLVM Build (#7279, #6709): Mentioned make dev-install-llvm in the README
- Python 3.14 Wheels (#7695): Added Python 3.14 wheel build support
- Setuptools Removal (#7983): Removed setuptools requirement from setup.py
- New CUDA Versions (#7384): Automatically handle newer CUDA versions
- LLVM System Suffix (#7430): Added TRITON_LLVM_SYSTEM_SUFFIX for user-specified prebuilt LLVM
Runtime & Compilation
- Async Compile Mode (#7306): Added AsyncCompileMode to build multiple kernels in parallel
- Thread-Safe Allocator (#7685): Made set_allocator thread-safe using ContextVar (see the sketch after this list)
- AsyncCompileMode Thread Safety (#7701): Made AsyncCompileMode thread safe
- Kernel Caching:
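A minimal sketch of registering an allocator with triton.set_allocator, the API made thread-safe in #7685; the callback shape follows Triton's tensor-descriptor tutorials and is otherwise an assumption:

```python
import torch
import triton

def alloc_fn(size: int, alignment: int, stream):
    # Called by Triton when a kernel needs scratch memory (e.g. for
    # device-side TMA descriptors); return a buffer of at least `size` bytes.
    return torch.empty(size, device="cuda", dtype=torch.int8)

# Backed by a ContextVar as of #7685, so threads that install different
# allocators no longer race on shared global state.
triton.set_allocator(alloc_fn)
```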
Driver & Backend
- Host Compiler Flags (#7659): Allow backend-provided runtime host compiler flags
- NVIDIA Driver Improvements (#7769): Slightly improved NVIDIA driver backend with string caching
- HIP Driver Updates:
Documentation
New Documentation
- Community Meetup Notes:
- Moderators Guide (#7787): Updated with YouTube info and event creation
- Installation Instructions (#7572): Updated install instructions in docs
- README Improvements (#7368): Improved readability and fixed minor issues
- Running Meetups (#7103): Added documentation for running Triton Community Meetups
Code Quality
- CODEOWNERS Updates:
- NFC Refactorings: Multiple no-functional-change refactorings for better code organization
- Unused Code Removal (#7703): Killed unused functions throughout codebase
Breaking Changes
API Changes
- Mutations Disallowed (#7762): All mutations are now disabled in the language
- Min Dot Sizes (#7411, #7451): Relaxed minimum dot size requirements (may affect autotuning)
- Constexpr Handling: Changed how constexprs are included in cache keys (#7348)
- Environment Variables:
- Hook System Changes:
Removed Features
- Python 3.9 Support (#8222, #8287): Removed Python 3.9 support and cleaned up related code
- Nightly Installation: Removed from documentation
- Local Prefetch Schedule (#7395): Retired AMD local prefetch schedule hint variant
Deprecations
- Warp Size: Removed hardcoded warp size assumptions (#7253)
- GetShapePerCTA: Moving toward elimination in AMD backend (#7740)
Performance Improvements
Measured Improvements
- Attention Kernels: Up to 785 TFLOPS for D64, 1230 TFLOPS for D128
- MoE Routing: 30% faster (17.3us → 12.2us for fp16)
- FP8 on Blackwell: ~1.9x speedup for large matmuls
- Launch Overhead: Reduced by various optimizations (DriverConfig cleanup, etc.)
- Compile Time: ~20% savings by skipping link_extern_libs when unnecessary (#7570)
Optimization Techniques
- Register Pressure: Better management through tmem alloc hoisting
- Vectorization: Improved through generic swizzling and layout optimizations
- Bank Conflicts: Minimized through optimized shared memory layouts
- Instruction Scheduling: Better code generation for linear layouts
Notable Bug Fixes
Correctness Issues
- Non-Associative Reduce (#7272): Fixed rematerialization causing incorrect results
- Atomic Operations (#7460): Fixed broadcasting for atomic_cas and rmw operations
- Memory Model: Multiple fixes for proper fence insertion and synchronization
- FP8 Conversions: Fixed numerous rounding and clamping issues
- TMA Operations: Fixed various edge cases in TMA load/store
Crash Fixes
- Warp Specialization: Fixed iterator invalidation and use-after-free issues
- AMD OptimizeLDS: Fixed nullptr crash
- Memory Leaks: Fixed in TritonNvidiaGPU InterleaveTMem.cpp (#7924)
- Nullptr Access: Fixed in AMD pingpong ChainedDot (#7694)
Regression Fixes
- Block Size Logic Revert (#7971): Reverted block size logic changes that caused fp8 matmul issues
- Byte Permutes Revert (#7899): Reverted due to functional regression, then relanded with fix (#7933)
- Diagonal Iteration Partial Revert (#7245): Addressed internal regressions
Contributors
This release includes contributions from engineers at:
- OpenAI
- Meta
- AMD
- NVIDIA
- Intel
- And many individual contributors
Special thanks to all contributors who submitted bug reports, feature requests, and code improvements!