Highlights
Gluon Framework Comprehensive Enhancement
The Gluon framework has received major enhancements across all areas, including new APIs, tensor memory management, layout operations, and synchronization primitives. Key additions include static_assert functionality, TensorDescriptor kernel arguments, async TMA operations, tensor memory implementation, thread synchronization barriers, and comprehensive tensor operations like split/join/reshape and reductions. (#7172, #7168, #7165, #7160, #7152, #7151, #7149, #7145, #7142, #7122, #7121, #7120, #7115, #7114, #7106, #7102, #7099, #7097, #7091, #7089, #7080, #7061, #7057, #7022, #7020, #7009, #7006, #7004, #7001, #6998, #6997, #6994, #6992, #6989, #6985, #6971, #6950)
Hardware Support Expansion
- AMD GFX950 Architecture Support - Comprehensive support for GFX950 including WMMA operations, performance optimizations, and architecture-specific features (#7175, #7171, #7127, #6744, #6594)
- Blackwell Enhanced TMEM Support - Improved tensor memory operations with better register usage and performance optimizations (#7160, #7079, #6817)
- Hopper WGMMA Improvements - Enhanced matrix multiplication with subtiling and prefetching optimizations (#7136, #6130)
Performance Optimizations
- Automatic Warp Specialization - Introduced automatic warp specialization optimization for enhanced kernel performance on NVIDIA GPUs (#6289, #6246, #6217)
- MMAv5 Pipelining - Re-enabled and improved MMAv5 pipelining with better performance and scheduling (#6732, #6613, #6256)
- TMA Operations Enhancement - Improved Tensor Memory Accelerator (TMA) operations with better layout support and reduced register pressure (#6725, #6238, #6580)
New Features
Language and Frontend
- Aggregate Type Support - Added @tl.aggregate decorator for autogenerating Triton types from Python classes (#6970)
- JITFunction Constexpr Support - Enhanced constexpr support for function lists and improved JIT functionality (#6988, #6963, #7105)
- Enhanced Boolean Operations - Improved handling of boolean operators and scalars with chained operations (#6769)
- Bitonic Top-k and Sorting - Added support for bitonic top-k operations and improved sort implementations (#6461, #6486)
- Masked Histograms - Added support for masked histogram operations (#6695)
- Syntactic Sugar Additions - Added .item() as syntactic sugar for .reshape([]) (#6873)
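For readers unfamiliar with the bitonic sorting mentioned in the list above, the following is a minimal pure-Python sketch of a bitonic sorting network. It is not Triton's implementation: the function names `bitonic_sort` and `top_k` are hypothetical, the input length is assumed to be a power of two, and top-k is obtained naively by sorting and slicing rather than by a dedicated bitonic top-k network.

```python
def bitonic_sort(values):
    """Sort a power-of-two-length sequence in ascending order.

    Illustrative only: every compare-exchange depends just on the indices,
    which is what makes the network easy to run in parallel on a GPU.
    """
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(values)
    k = 2  # size of the bitonic sequences being merged in this stage
    while k <= n:
        j = k // 2  # distance between compared elements
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Blocks alternate direction so that neighbouring
                    # blocks together form a bitonic sequence.
                    ascending_block = (i & k) == 0
                    if ascending_block:
                        out_of_order = a[i] > a[partner]
                    else:
                        out_of_order = a[i] < a[partner]
                    if out_of_order:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def top_k(values, k):
    """Naive top-k: sort ascending, reverse, keep the k largest."""
    return bitonic_sort(values)[::-1][:k]
```

Calling `top_k([3, 7, 4, 8, 6, 2, 1, 5], 3)` returns the three largest elements in descending order. A real GPU kernel would run the inner loop over `i` in parallel, one lane per element.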
Backend and Compilation
- Generic Swizzling Implementation - Implemented generic swizzling algorithm for convert_layout lowering (#6982)
- Enhanced Register Allocation - Improved dynamic register reallocation for warp specialization (#6877, #6694, #6407)
- TMA Reduce Operations - Added TMA reduce operations for descriptor-based reducing stores (#6580)
- Improved Subtiling - Enhanced subtiling code generation for tensor memory loading (#6415)
- BF16 Atomic Operations - Added support for BF16 atomic add operations (#6519)
- Stmatrix Support - Added comprehensive stmatrix support including transpose operations (#6910, #6899)
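As background for the swizzling item in the list above, here is a small simulation of why swizzled shared-memory layouts matter. It shows the classic XOR swizzle, not the generic algorithm from #6982. The bank count, the 32x32 tile of 4-byte words, and the helper names `bank` and `max_conflict` are all assumptions made for illustration.

```python
from collections import Counter

NUM_BANKS = 32  # assumed: 32 shared-memory banks, one 4-byte word wide each


def bank(row, col, swizzle):
    """Bank hit by element (row, col) of a 32x32 tile stored row-major."""
    # XOR swizzle: permute the columns of each row by the row index.
    physical_col = col ^ (row % NUM_BANKS) if swizzle else col
    word_offset = row * NUM_BANKS + physical_col
    return word_offset % NUM_BANKS


def max_conflict(col, swizzle):
    """Worst-case number of threads hitting one bank when 32 threads
    read a single column (thread t reads element (t, col))."""
    counts = Counter(bank(t, col, swizzle) for t in range(NUM_BANKS))
    return max(counts.values())
```

Without swizzling, every thread reading a column lands in the same bank, so `max_conflict(0, False)` is 32. With the XOR permutation each thread lands in a different bank and `max_conflict(0, True)` is 1, while row accesses stay conflict-free.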
Hardware-Specific Features
- AMD AsyncCopy Optimizations - Enhanced AsyncCopy support in StreamPipeliner with improved memory operations (#6270, #6639, #6382)
- AMD Buffer Operations - Comprehensive improvements to buffer operations with better vectorization and alignment (#6126, #6145, #6329)
- AMD Ping-pong Scheduler - Enhanced ping-pong scheduler for better memory operation handling (#6254, #6301, #6198)
- NVIDIA PDL Support - Enabled Programmatic Dependent Launch for overlapping kernel execution (#6394)
- AMD HIP AOT Support - Added HIP Ahead-of-Time compilation support (#7007)
Improvements
Performance
- Routing Kernel Optimizations - Multiple performance improvements achieving up to 5% runtime reduction (#6866, #6546, #7040)
- Matrix Multiplication Enhancements - Enhanced persistent TMA matmul with epilogue subtiling and metadata alignment (#6724, #6882, #7123)
- SwiGLU Optimizations - Improved SwiGLU kernel performance and fused activation functions (#6797, #6553)
- Attention Kernel Fixes - Fixed and optimized attention tutorials with better performance metrics (#7037, #6839)
Developer Experience
- Enhanced CI/CD - Improved continuous integration with better caching and timeout handling (#6815, #6816, #6582)
- Testing Infrastructure - Enhanced test coverage and organization (#7109, #6867)
- Documentation Updates - Improved documentation for installation and new features (#7103, #6778, #6235)
- Build System Improvements - Better CMake support and dependency management (#6330, #6903)
Code Quality
- Type System Enhancements - Improved type checking with mypy integration (#6596, #6704)
- Layout System Improvements - Better layout handling with LinearLayout-based implementations (#6252, #6169, #6170)
- Code Organization - Extensive refactoring and cleanup for better maintainability (#6500, #6285)
Bug Fixes
Critical Fixes
- AST Parsing Regression - Fixed parsing failures for float("inf") and float("-inf") expressions (#6344)
- Memory Allocation Issues - Fixed tensor memory allocation boundary collisions and use-after-free errors (#6318, #6433)
- TMA Layout Consistency - Fixed layout assignment from rank-reducing loads (#6362)
- Dot Operation Fixes - Fixed bug where passing None as accumulator caused errors (#7130)
- Version Detection - Fixed version detection when using source tarballs (#7164, #6381)
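To make the AST parsing item above concrete: in Python source, `float("inf")` is a function call on a string literal rather than a numeric literal, so a frontend that walks the AST has to handle it explicitly. The sketch below uses only the standard `ast` module to fold such calls into constants. It is an illustration of the kind of expression involved, not Triton's actual fix, and the names `FoldFloatLiterals` and `fold` are hypothetical.

```python
import ast
import math


class FoldFloatLiterals(ast.NodeTransformer):
    """Replace float("<text>") calls on string literals with constants."""

    def visit_Call(self, node):
        self.generic_visit(node)  # fold nested calls first
        is_float_call = isinstance(node.func, ast.Name) and node.func.id == "float"
        if (
            is_float_call
            and len(node.args) == 1
            and not node.keywords
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            try:
                value = float(node.args[0].value)
            except ValueError:
                return node  # leave unparsable strings for runtime to report
            return ast.copy_location(ast.Constant(value=value), node)
        return node


def fold(source):
    """Parse an expression, fold float("...") calls, and evaluate it."""
    tree = ast.parse(source, mode="eval")
    tree = ast.fix_missing_locations(FoldFloatLiterals().visit(tree))
    return eval(compile(tree, "<folded>", "eval"))


print(fold('float("inf")'), fold('float("-inf")'))
```

Note that `float("-inf")` carries the sign inside the string, so it is a single call node, unlike `-float("inf")`, which is a unary minus applied to a call.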
Hardware-Specific Fixes
- AMD Range Analysis - Improved range analysis for persistent kernels and loop bounds (#6390, #6133)
- AMD Buffer Operations - Fixed vector size computation and alignment issues (#6114, #6126)
- AMD Atomic Operations - Fixed f16/bf16 buffer atomic operations (#6090, #6139)
- NVIDIA Register Pressure - Fixed register allocation issues in warp specialization (#6403)
- NVIDIA TMEM Operations - Fixed various tensor memory access issues (#6888)
Stability Improvements
- Test Reliability - Resolved intermittent test failures across various components (#6861, #6889)
- Memory Usage - Fixed memory leaks and reduced peak memory consumption (#6796)
- Error Handling - Improved error messages and crash prevention (#6865)
Deprecations and Breaking Changes
Breaking Changes
- Cumsum Type Promotion - Upcast boolean inputs in cumsum to uint32_t for correct results (#6927)
- Experimental API Cleanup - Removed outdated experimental descriptor APIs (#6488)
- Python Support - Dropped Python 3.8 support, minimum version now 3.9 (#6649)
- Tensor Descriptor APIs - Removed experimental prefix from tensor descriptor operations (#6194)
- Register Spilling Performance Regression - Bad interaction between new LLVM changes and PTXAS optimizations can cause increased register spilling in some kernels (#7138)
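The cumsum change in the list above is easier to see with numbers. The sketch below assumes that the incorrect results came from accumulating in a 1-bit type, where partial sums wrap around; the notes only say the upcast is needed "for correct results", so treat that mechanism as an assumption. The helper `cumsum_wrapping` is hypothetical and written in plain Python.

```python
from itertools import accumulate


def cumsum_wrapping(values, bits):
    """Inclusive prefix sum with every partial sum truncated to `bits` bits."""
    mask = (1 << bits) - 1
    return list(accumulate(map(int, values), lambda acc, x: (acc + x) & mask))


flags = [True, True, False, True]

# Accumulating in a 1-bit type: sums wrap modulo 2, so counts are lost.
print(cumsum_wrapping(flags, bits=1))

# Accumulating in a 32-bit unsigned type: a running count of True values.
print(cumsum_wrapping(flags, bits=32))
```

Code that relied on the result of a boolean cumsum staying boolean may need an explicit cast after this change.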
Deprecations
- FP8 Format Warnings - Enhanced warnings for deprecated FP8 formats (#6931)
- Configuration Module - Renamed config.py to knobs.py to avoid confusion (#6641)
Performance
Benchmark Results
- Matrix Multiplication - Up to 15% speedup in dense 8k x 8k x 8k operations (#6804)
- Attention Kernels - Achieved 700+ TFLOPS on DHEAD=64, 960-1080 TFLOPS on DHEAD=128 (#6660)
- Routing Operations - 5% runtime reduction with optimized kernels (#6866)
- MoE Kernels - Up to 30% performance boost with optimized TMA layouts (#7123)
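The figures above mix two kinds of metric: "speedup" and "runtime reduction" are not interchangeable, and TFLOPS depends on how FLOPs are counted. The helpers below are a hedged sketch of the usual arithmetic. They assume the common convention of 2·M·N·K FLOPs for a dense matmul and read "15% speedup" as a 1.15x throughput ratio; neither convention is stated in the notes, and the function names are hypothetical.

```python
def matmul_tflops(m, n, k, seconds):
    """Throughput of an (m x k) @ (k x n) matmul, counting 2*m*n*k FLOPs."""
    return 2.0 * m * n * k / seconds / 1e12


def speedup_from_runtime_reduction(reduction):
    """A runtime reduction of r (e.g. 0.05) corresponds to a 1/(1-r) speedup."""
    return 1.0 / (1.0 - reduction)


def runtime_reduction_from_speedup(speedup):
    """A speedup of s (e.g. 1.15) shortens the runtime by 1 - 1/s."""
    return 1.0 - 1.0 / speedup


# An 8k x 8k x 8k matmul performs about 1.1e12 FLOPs, so a hypothetical
# 1-second run would correspond to roughly 1.1 TFLOPS.
print(matmul_tflops(8192, 8192, 8192, seconds=1.0))
print(speedup_from_runtime_reduction(0.05))
print(runtime_reduction_from_speedup(1.15))
```

Under these assumptions, a 5% runtime reduction is a speedup of about 1.05x, while a 1.15x speedup shortens the runtime by about 13%.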
Memory Optimizations
- Register Usage - Reduced register pressure in various operations (#6817)
- Shared Memory - Improved shared memory utilization with better swizzling (#6982)
- Cache Efficiency - Enhanced cache utilization with L2 cache hints (#6278)
Documentation
New Guides
- Community Meetups - Added documentation for running Triton Community Meetups (#7103)
- Installation Instructions - Updated with better memory management guidance (#6235)
- Hardware Support - Updated PyTorch installation for Blackwell support (#6778)
API Documentation
- Tensor Descriptors - Comprehensive documentation for tensor descriptor APIs (#6911, #7028)
- Cache Modifiers - Updated tl.load documentation with correct cache modifier usage (#6214)
- Scan Operations - Enhanced docstrings with appropriate parameters (#6946)
Developers
Build System
- LLVM Integration - Multiple LLVM version bumps with latest upstream changes (#7138, #7129, #6754, #6361)
- CMake Updates - Improved build configuration and parallel building support (#6830, #6953)
- Dependency Management - Better handling of external dependencies (#7078)
Testing Infrastructure
- Lit Tests - Enhanced lit test coverage and organization (#6855, #6661)
- Benchmarking - Enhanced benchmarking infrastructure with roofline analysis (#6703)
- CI/CD Improvements - Better hardware support and workflow organization (#6582)
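For context on the roofline analysis mentioned in the benchmarking item above, the roofline model bounds attainable throughput by the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below is generic and does not reflect the benchmarking code from #6703. The function names are hypothetical, and the intensity estimate assumes each matrix is moved to or from memory exactly once.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_per_s, flops_per_byte):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    return min(peak_tflops, bandwidth_tb_per_s * flops_per_byte)


def matmul_intensity(m, n, k, bytes_per_element):
    """FLOPs per byte for a matmul, assuming A and B are read once
    and C is written once (an idealized lower bound on traffic)."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical device: 1000 TFLOPS peak, 3 TB/s memory bandwidth.
intensity = matmul_intensity(8192, 8192, 8192, bytes_per_element=2)
print(intensity, attainable_tflops(1000.0, 3.0, intensity))
```

A kernel whose intensity puts it below the compute roof is memory-bound, so optimizations that reduce bytes moved help more than ones that reduce instruction count.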