Triton 3.4.0 Release (triton 3.4.0 on PyPI)

Highlights

Gluon Framework Comprehensive Enhancement

The Gluon framework gained new APIs, tensor memory management, layout operations, and synchronization primitives. Key additions include static_assert, TensorDescriptor kernel arguments, async TMA operations, a tensor memory implementation, thread synchronization barriers, and tensor operations such as split, join, reshape, and reductions. (#7172, #7168, #7165, #7160, #7152, #7151, #7149, #7145, #7142, #7122, #7121, #7120, #7115, #7114, #7106, #7102, #7099, #7097, #7091, #7089, #7080, #7061, #7057, #7022, #7020, #7009, #7006, #7004, #7001, #6998, #6997, #6994, #6992, #6989, #6985, #6971, #6950)

Hardware Support Expansion

AMD GFX950 Architecture Support - Comprehensive support for GFX950 including WMMA operations, performance optimizations, and architecture-specific features (#7175, #7171, #7127, #6744, #6594)
Blackwell Enhanced TMEM Support - Improved tensor memory operations with better register usage and performance optimizations (#7160, #7079, #6817)
Hopper WGMMA Improvements - Enhanced matrix multiplication with subtiling and prefetching optimizations (#7136, #6130)

Performance Optimizations

Automatic Warp Specialization - Introduced automatic warp specialization optimization for enhanced kernel performance on NVIDIA GPUs (#6289, #6246, #6217)
MMAv5 Pipelining - Re-enabled and improved MMAv5 pipelining with better performance and scheduling (#6732, #6613, #6256)
TMA Operations Enhancement - Improved Tensor Memory Accelerator (TMA) operations with better layout support and reduced register pressure (#6725, #6238, #6580)

New Features

Language and Frontend

Aggregate Type Support - Added @tl.aggregate decorator for autogenerating Triton types from Python classes (#6970)
JITFunction Constexpr Support - Enhanced constexpr support for function lists and improved JIT functionality (#6988, #6963, #7105)
Enhanced Boolean Operations - Improved handling of boolean operators and scalars with chained operations (#6769)
Bitonic Top-k and Sorting - Added support for bitonic top-k operations and improved sort implementations (#6461, #6486)
Masked Histograms - Added support for masked histogram operations (#6695)
Syntactic Sugar Additions - Added .item() as syntactic sugar for .reshape([]) (#6873)
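
For readers unfamiliar with the technique behind the bitonic top-k entry above, here is a minimal pure-Python sketch of a bitonic sorting network. It is not Triton's implementation; the names `bitonic_sort` and `top_k` are hypothetical, and the sketch assumes the input length is a power of two. The inner loop over `i` is written sequentially here, but every compare-exchange within one `stride` step is independent, which is what makes the network a good fit for GPUs.

```python
def bitonic_sort(values, descending=False):
    """Sort a list whose length is a power of two with a bitonic network."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(values)
    size = 2
    while size <= n:            # length of the runs being merged
        stride = size // 2
        while stride >= 1:      # compare-exchange distance within a merge
            for i in range(n):
                partner = i ^ stride
                if partner > i:
                    # Neighbouring blocks sort in opposite directions so that
                    # each merged pair of runs forms a bitonic sequence.
                    ascending = (i & size) == 0
                    if descending:
                        ascending = not ascending
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            stride //= 2
        size *= 2
    return data


def top_k(values, k):
    """Hypothetical helper: the k largest values via a descending sort."""
    return bitonic_sort(values, descending=True)[:k]
```

Sorting everything and slicing is the simplest way to express top-k; a dedicated top-k network can discard the lower half after each merge, which this sketch does not attempt.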

Backend and Compilation

Generic Swizzling Implementation - Implemented generic swizzling algorithm for convert_layout lowering (#6982)
Enhanced Register Allocation - Improved dynamic register reallocation for warp specialization (#6877, #6694, #6407)
TMA Reduce Operations - Added TMA reduce operations for descriptor-based reducing stores (#6580)
Improved Subtiling - Enhanced subtiling code generation for tensor memory loading (#6415)
BF16 Atomic Operations - Added support for BF16 atomic add operations (#6519)
Stmatrix Support - Added comprehensive stmatrix support including transpose operations (#6910, #6899)
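
To give some intuition for why swizzling matters in layout conversion, the sketch below models a classic XOR swizzle on a shared-memory tile. This is a simpler, well-known scheme and not the generic algorithm from #6982. The bank count, tile size, and function names are assumptions chosen for illustration.

```python
NUM_BANKS = 32  # assumed: 32 banks, each one 4-byte word wide
TILE = 32       # assumed: 32 x 32 tile of 4-byte elements, stored row-major


def bank(row, col, swizzle):
    """Bank that holds tile element (row, col)."""
    stored_col = col ^ row if swizzle else col  # XOR swizzle permutes each row
    return (row * TILE + stored_col) % NUM_BANKS


def max_conflict(accesses, swizzle):
    """Largest number of simultaneous accesses that land in one bank."""
    counts = [0] * NUM_BANKS
    for row, col in accesses:
        counts[bank(row, col, swizzle)] += 1
    return max(counts)


# 32 threads each read one element of column 0, or one element of row 0.
column_read = [(row, 0) for row in range(TILE)]
row_read = [(0, col) for col in range(TILE)]
```

In this model, `max_conflict(column_read, swizzle=False)` reports every thread hitting the same bank, while the swizzled layout spreads the same column read across all banks and leaves row reads conflict-free.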

Hardware-Specific Features

AMD AsyncCopy Optimizations - Enhanced AsyncCopy support in StreamPipeliner with improved memory operations (#6270, #6639, #6382)
AMD Buffer Operations - Comprehensive improvements to buffer operations with better vectorization and alignment (#6126, #6145, #6329)
AMD Ping-pong Scheduler - Enhanced ping-pong scheduler for better memory operation handling (#6254, #6301, #6198)
NVIDIA PDL Support - Enabled Programmatic Dependent Launch for overlapping kernel execution (#6394)
AMD HIP AOT Support - Added HIP Ahead-of-Time compilation support (#7007)

Improvements

Performance

Routing Kernel Optimizations - Multiple performance improvements achieving up to 5% runtime reduction (#6866, #6546, #7040)
Matrix Multiplication Enhancements - Enhanced persistent TMA matmul with epilogue subtiling and metadata alignment (#6724, #6882, #7123)
SwiGLU Optimizations - Improved SwiGLU kernel performance and fused activation functions (#6797, #6553)
Attention Kernel Fixes - Fixed and optimized attention tutorials with better performance metrics (#7037, #6839)

Developer Experience

Enhanced CI/CD - Improved continuous integration with better caching and timeout handling (#6815, #6816, #6582)
Testing Infrastructure - Enhanced test coverage and organization (#7109, #6867)
Documentation Updates - Improved documentation for installation and new features (#7103, #6778, #6235)
Build System Improvements - Better CMake support and dependency management (#6330, #6903)

Code Quality

Type System Enhancements - Improved type checking with mypy integration (#6596, #6704)
Layout System Improvements - Better layout handling with LinearLayout-based implementations (#6252, #6169, #6170)
Code Organization - Extensive refactoring and cleanup for better maintainability (#6500, #6285)

Bug Fixes

Critical Fixes

AST Parsing Regression - Fixed parsing failures for float("inf") and float("-inf") expressions (#6344)
Memory Allocation Issues - Fixed tensor memory allocation boundary collisions and use-after-free errors (#6318, #6433)
TMA Layout Consistency - Fixed layout assignment from rank-reducing loads (#6362)
Dot Operation Fixes - Fixed bug where passing None as accumulator caused errors (#7130)
Version Detection - Fixed version detection when using source tarballs (#7164, #6381)
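
As background for the AST parsing fix above: to an AST-based frontend, `float("inf")` is not a literal but a call node with a string argument, so it needs explicit handling. The sketch below uses only Python's standard `ast` module to recognize such a call and fold it to a constant. It is an illustration, not Triton's code generator, and `fold_float_call` is a hypothetical name.

```python
import ast
import math


def fold_float_call(expr_source):
    """Return the float for a call like float("inf"), or None otherwise."""
    node = ast.parse(expr_source, mode="eval").body
    if (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "float"
        and len(node.args) == 1
        and not node.keywords
        and isinstance(node.args[0], ast.Constant)
        and isinstance(node.args[0].value, str)
    ):
        # The sign is part of the string, so "-inf" needs no special case.
        return float(node.args[0].value)
    return None
```

Expressions that are not a single `float("...")` call, such as `x + 1`, fall through and return `None`.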

Hardware-Specific Fixes

AMD Range Analysis - Improved range analysis for persistent kernels and loop bounds (#6390, #6133)
AMD Buffer Operations - Fixed vector size computation and alignment issues (#6114, #6126)
AMD Atomic Operations - Fixed f16/bf16 buffer atomic operations (#6090, #6139)
NVIDIA Register Pressure - Fixed register allocation issues in warp specialization (#6403)
NVIDIA TMEM Operations - Fixed various tensor memory access issues (#6888)

Stability Improvements

Test Reliability - Resolved intermittent test failures across various components (#6861, #6889)
Memory Usage - Fixed memory leaks and reduced peak memory consumption (#6796)
Error Handling - Improved error messages and crash prevention (#6865)

Deprecations and Breaking Changes

Breaking Changes

Cumsum Type Promotion - Upcast boolean inputs in cumsum to uint32_t for correct results (#6927)
Experimental API Cleanup - Removed outdated experimental descriptor APIs (#6488)
Python Support - Dropped Python 3.8 support, minimum version now 3.9 (#6649)
Tensor Descriptor APIs - Removed experimental prefix from tensor descriptor operations (#6194)
Register Spilling Performance Regression - A bad interaction between new LLVM changes and PTXAS optimizations can increase register spilling in some kernels (#7138)
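
The cumsum change above is easier to see with a small model. The sketch below compares a running sum kept in a 1-bit accumulator with one kept in a 32-bit accumulator. Treating the old behavior as 1-bit wraparound is an assumption made for illustration, and `cumsum_with_width` is a hypothetical helper rather than a Triton API.

```python
def cumsum_with_width(values, bits):
    """Running sum with the accumulator truncated to `bits` bits."""
    limit_mask = (1 << bits) - 1
    total = 0
    out = []
    for v in values:
        total = (total + int(v)) & limit_mask  # wrap on overflow
        out.append(total)
    return out


flags = [True, True, False, True]
narrow = cumsum_with_width(flags, 1)   # 1-bit accumulator loses the count
wide = cumsum_with_width(flags, 32)    # 32-bit accumulator keeps it
```

Here `narrow` is `[1, 0, 0, 1]` and `wide` is `[1, 2, 2, 3]`. Code that relied on a boolean result type from cumsum will now see a wider integer type instead.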

Deprecations

FP8 Format Warnings - Enhanced warnings for deprecated FP8 formats (#6931)
Configuration Module - Renamed config.py to knobs.py to avoid confusion (#6641)

Performance

Benchmark Results

Matrix Multiplication - Up to 15% speedup in dense 8k x 8k x 8k operations (#6804)
Attention Kernels - Achieved 700+ TFLOPS on DHEAD=64, 960-1080 TFLOPS on DHEAD=128 (#6660)
Routing Operations - 5% runtime reduction with optimized kernels (#6866)
MoE Kernels - Up to 30% performance boost with optimized TMA layouts (#7123)
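
To put the matrix multiplication and attention figures above in context, throughput for a dense matmul is conventionally computed as 2 * M * N * K floating-point operations divided by runtime. The sketch below works through that arithmetic. It assumes "8k" means 8192, and the 2 ms timing is a made-up value, not a measured result.

```python
def matmul_tflops(m, n, k, seconds):
    """Throughput in TFLOPS, counting one multiply and one add per term."""
    flops = 2 * m * n * k
    return flops / seconds / 1e12


def time_after_speedup(seconds, speedup_percent):
    """Runtime after a speedup, where 15 means 1.15x faster."""
    return seconds / (1 + speedup_percent / 100)


baseline_s = 2e-3  # hypothetical baseline runtime for illustration only
before = matmul_tflops(8192, 8192, 8192, baseline_s)
after = matmul_tflops(8192, 8192, 8192, time_after_speedup(baseline_s, 15))
```

Because the operation count is fixed, a 15% speedup translates directly into 15% higher TFLOPS, so `after / before` is 1.15 regardless of the baseline chosen.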

Memory Optimizations

Register Usage - Reduced register pressure in various operations (#6817)
Shared Memory - Improved shared memory utilization with better swizzling (#6982)
Cache Efficiency - Enhanced cache utilization with L2 cache hints (#6278)

Documentation

New Guides

Community Meetups - Added documentation for running Triton Community Meetups (#7103)
Installation Instructions - Updated with better memory management guidance (#6235)
Hardware Support - Updated PyTorch installation for Blackwell support (#6778)

API Documentation

Tensor Descriptors - Comprehensive documentation for tensor descriptor APIs (#6911, #7028)
Cache Modifiers - Updated tl.load documentation with correct cache modifier usage (#6214)
Scan Operations - Enhanced docstrings with appropriate parameters (#6946)

Developers

Build System

LLVM Integration - Multiple LLVM version bumps with latest upstream changes (#7138, #7129, #6754, #6361)
CMake Updates - Improved build configuration and parallel building support (#6830, #6953)
Dependency Management - Better handling of external dependencies (#7078)

Testing Infrastructure

Lit Tests - Enhanced lit test coverage and organization (#6855, #6661)
Benchmarking - Enhanced benchmarking infrastructure with roofline analysis (#6703)
CI/CD Improvements - Better hardware support and workflow organization (#6582)

Code Organization

Module Structure - Better organization of modules and passes (#6500)
Type System - Enhanced type checking and inference (#6285, #6231)
Error Handling - Improved error messages and debugging support throughout the codebase
