Table of Contents
- Dialect & Frontend
- Backend & Compiler
- AMD/HIP Backend
- NVIDIA Backend
- Gluon & Layout Improvements
- Kernels & Benchmarks
- Proton Profiling
- Testing & CI
- Build & Infrastructure
- Documentation
- Breaking Changes
- Contributors
Dialect & Frontend
New Features
- `tl.squeeze`/`tl.unsqueeze`: Added `tl.squeeze` and `tl.unsqueeze` operations to the standard library (#8924); see the sketch after this list
- Scaled BMM: Added support for scaled batched matmul in the frontend (#9000)
- FP8 Constants: Frontend can now create FP8 constants directly (#8882)
- Returning Constexpr from JIT: Functions can return `constexpr` values from JIT-compiled code (#8785)
- `get_int_attr` for Out-of-Tree Walks: Added `get_int_attr` to `Operation` to support out-of-tree IR walks (#8892)
- Optional Device Arg to `preload`: Added an optional device argument to `preload`, plus guardrails for cross-target preload (#8951, #8952, #9234)
- `tl.cat(can_reorder=False)`: Added a non-reordering variant of `tl.cat` with broadcast support (#9312, #9163)
- Round f32→tf32 in Descriptor: Added an option to round f32 to tf32 inside tensor descriptors (#9295)
- Plugin Hooks & Out-of-Tree Dialects: Added support for out-of-tree TTIR/TTGIR passes and Triton Dialect Plugins, with example documentation (#8401, #8523, #8815)
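A minimal sketch of the new frontend ops in action (the kernel, the pointer layout, and the NumPy-style axis arguments for `tl.squeeze`/`tl.unsqueeze` are assumptions based on #8924 and #9312, not code from those PRs):

```python
import triton
import triton.language as tl

@triton.jit
def concat_rows(x_ptr, y_ptr, out_ptr, N: tl.constexpr):
    x = tl.load(x_ptr + tl.arange(0, N))   # shape [N]
    y = tl.load(y_ptr + tl.arange(0, N))   # shape [N]
    xy = tl.cat(x, y, can_reorder=False)   # shape [2*N]; element order preserved
    row = tl.unsqueeze(xy, 0)              # shape [1, 2*N]
    flat = tl.squeeze(row, 0)              # back to [2*N]
    tl.store(out_ptr + tl.arange(0, 2 * N), flat)
```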
Bug Fixes
- `desc.shape` for FP4 Padded: Fixed `desc.shape` values for fp4-padded tensor descriptors (#9012)
- Setting Attr on Constexpr Argument: Fixed setting attributes on constexpr arguments (#9053)
- Named Tuples in Constexpr Functions: Preserved named tuples through `constexpr_function`s (#8876)
- `must_use_result` for Methods: Fixed the `must_use_result` check for methods (#8902)
- `_semantic` Default to None: Defaulted the `_semantic` parameter to None (#8909)
- `make_tensor_descriptor` Error Typo: Fixed a typo in the `make_tensor_descriptor` error message (#8912)
- `tl.cat` Determinism: Made `tl.cat` deterministic via permute+reshape+join, then reverted (#9312, #8854, #8878)
- Deprecation Warning for `make_block_ptr`: Emitted a deprecation warning when `make_block_ptr` is used (#9667); see the migration sketch after this list
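Given the `make_block_ptr` deprecation above, a minimal migration sketch using tensor descriptors (tile sizes and the row-major strides are assumptions; on NVIDIA, device-side descriptors additionally require a host-side allocator registered via `triton.set_allocator`):

```python
import triton
import triton.language as tl

@triton.jit
def copy_tile(in_ptr, out_ptr, M, N,
              BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Tensor descriptors replace block pointers for tiled loads/stores.
    in_desc = tl.make_tensor_descriptor(
        in_ptr, shape=[M, N], strides=[N, 1],  # row-major (assumed)
        block_shape=[BLOCK_M, BLOCK_N])
    out_desc = tl.make_tensor_descriptor(
        out_ptr, shape=[M, N], strides=[N, 1],
        block_shape=[BLOCK_M, BLOCK_N])
    tile = in_desc.load([pid_m * BLOCK_M, pid_n * BLOCK_N])
    out_desc.store([pid_m * BLOCK_M, pid_n * BLOCK_N], tile)
```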
Improvements
- Frontend Performance: Pre-computed `inspect.signature` for builtins, lazily computed tuple type names, avoided `find_paths_if` and `inspect.getclosurevars`, and removed outdated `catch_warnings` blocks — all to reduce JIT overhead (#8843, #8844, #8846, #8845, #8881)
- Revert Deep Copy on Scope Entry: Removed the deep copy when entering a new scope (#8832)
- Default 32-bit Dot Precision Change: Briefly changed default 32-bit dot precision to TF32x3, then reverted (#9080, #9090)
- Tutorial Updates (#8565, #8853, #8982)
- Interpreter Cleanups: Typing and efficiency cleanups in the interpreter (#9072)
Backend & Compiler
LLVM Updates
- LLVM Bumps: Multiple LLVM uprevs through the cycle, with one bump reverted on the release branch for stability (#8766, #8840, #8919, #8987, #9264, #9333, #9431, #9942)
- llvm-head Merge: Merged changes from llvm-head (#8842)
- Infinite Rewrite Loop in Latest LLVM: Fixed an infinite rewrite loop introduced by a newer LLVM revision (#9249)
2CTA / Multicast / TMA
- 2CTA Mode End-to-End: Gluon multi-CTA + 2CTA support, M=64 2CTA mode, removed unnecessary synchronization in 2CTA MMA, and proper TMEM deallocation timing (#8684, #8874, #8922, #8986)
- TMA + Multicast: Backend support for TMA with multicast (#9005)
- `tcgen05.mma` + Multicast: Added multicast support for `tcgen05.mma` (#9071)
- TMA Index Translation: Moved TMA index translation from the mid-end to lowering (#9082)
- `tcgen05.mma` Verifier & Errors: Throw a clear error instead of miscompiling very large `tcgen05.mma` along N (#8915)
- MMAv5 Illegal Instruction Fix: Fixed an illegal instruction in MMAv5 lowering (#8910)
Warp Specialization
- Nested Loops: Nested-loop support in warp specialization (#8687)
- Partition Scheduling: Improved partition scheduling pass; correct stage/cluster annotations for block-arg producers (#7312, #8883)
- WS Lowering Hardening: Variable-naming fix in `LowerAref`, per-partition `asyncOp` storage, explicit captures to `WarpSpecializePartitionsOp`, and skipping `InsertTmemAref` when WS isn't used (#8978, #9007, #9023, #9133, #9212)
- `RegionBranchInterface`: Made `WarpSpecializePartitionsOp` implement `RegionBranchInterface` (#8799)
- Mixed TMA / non-TMA Loads: Fixed AutoWS when mixing TMA and non-TMA loads (#9111)
- `aref.get` Filtering: `aref.get` creation now filters out results not in the scheduled loop (#9114)
- Multibuffering Acc Logic: Improved the multibuffering accumulator logic in WS (#8950)
Code Generation & Analysis
- `tt.scan` Layout Fixes: Fixed `tt.scan` with broadcasted layouts and additional scan layout issues (#9185, #9189)
- Reduce/Scan Verifier: Verify reduce/scan op axis values (#9061)
- Pipelined Loops Skip Asserts/Prints: Loops containing `assert` or `print` are no longer pipelined (#9055)
- Async Op Semantics: Added explicit semantics for async ops (#8966)
- WGMMA Wait Delay: Delay `wgmma wait(0)` until the first use of the accumulator (#9021, #9179)
- WGMMA Register Pipelining: Added missing waits in WGMMA RHS register pipelining (#8964, #8970, #8997)
- WGMMA RS Split Limit: Limit RS-dot splitting to two splits (#9152)
- Layout Hoisting Fix: Fixed handling of conflicting layouts when hoisting convert into conditionals (#9083)
- Rematerialization Cost: Consider rematerialization cost when hoisting over `ext`; improved robustness of `ext` slice rematerialization (#9194, #9019)
- AxisInfo Improvements: Enhanced divisibility handling in `AxisInfo` for add/sub; reland of unvisited-operand handling (#9297, #8758)
- Layout Picker for Small `async_cp`: Pick better layouts for small `async_cp` (#9183)
- Skip Conversion-Backward-Slice Cycle: Skip values with existing conversions in `getConvertBackwardSlice` (#8291)
- Membar Improvements: Consider `memdesc_slice` in Membar; extended membar to third-party ops via traits; AMD-aware `membarFilter` (#8755, #8798, #9265)
- Reduce Op Lowering: Improvements to `ReduceOp` lowering, later reverted on the release branch (#9192, #9214)
- Clamp on Scalars: Support clamp optimization on scalars (#8796)
- `kReg` smem Padding: Separated the additive `kReg` shared-memory padding contribution (#9286)
- `tcgen05.mma` + Multicast and Generalized Encodings: Continued generalization of TMEM and shared-memory layouts (#9071)
- `SwizzledSharedLayout`, uniform hint on `ttg.warp_id`, and CGAEncoding rename (#9286, #9073, #8850, #9040, #9125)
- Pipelining Barrier Location: Fixed barrier placement in loop lowering for MMA ops with non-pipelined operands (#8732)
- Properly Async wgmma Loop Detection: Fixed `dotCanBeProperlyAsync` when wgmma is not yielded by the loop, and an associated infinite loop (#9274, #9282)
- `FuncOpToLLVM` Refactors: Moved `handleArgPtrDatatype` to `Utility.h`; support for LLVM struct/array types in `DITypeAttr` (#9120, #9124)
- Cache Robustness: Handle a corrupted on-disk cache (#8923)
- Async Sentinel: Added a sentinel when async-compiling (#9251)
- `JITFunction` in `preload`: Support `JITFunction` in `preload` (#8794)
CONSAN (Concurrency Sanitizer) & Debug
- Buffer-region analysis, aliasing support, false-positive deadlock fix, overflow-check disable, compile-time optimization, reduced coverage configurations, TMEM allocation handling, and removal of TMEM size verification (#8837, #8939, #9046, #8940, #9240, #9294, #8787, #8782)
- Debug Info: Fixed missing kernel arguments in LLVM debug info; fixed address-sanitizer stack-use-after-scope (#9002, #9088)
AMD/HIP Backend
3.7 is heavy on gfx1250 (RDNA4) maturation, warp specialization on AMD, Tensor Data Movement (TDM), and a new warp-pipeline path.
Warp Specialization & Warp Pipelining on AMD
- Warp-Pipeline Support: New AMD warp-pipeline path with Gluon and LLVM lowering (#8586, #8975, #8980)
- Warp Specialization on gfx1250 (#8947, #8968)
- Warp-Pipeline Fixes: Priority hints and Gluon fixes for the new pipeline (#9301)
- `ttg.warp_id` and AMD Conversions (#8659)
gfx1250 / RDNA4 Maturation
- Mixed-Precision Scaled Dot: Enabled mixed-precision (scaled) dot in Triton on gfx1250 (#8938)
- 4-Warp / 8-Warp MXFP GEMM: 4-warp scheduling and 8-warp pingpong + MXGEMM refactor (#9031, #9356)
- Persistent WS f16 GEMM: Persistent variant and persistent subtiled variant for WS f16 GEMM (#8990, #9052)
- F16 GEMM Examples Updates: Updated MXFP FA example and f16 GEMM examples (#9326, #8972)
- Buffer Atomics for RDNA4: Enabled buffer atomics on RDNA4 (#8778)
- `v_permlane16_swap`: Enabled for `convert_layout` and `reduceOp` on GFX1250 (#8724)
- Extended FP Conversion: Including RTZ rounding fixes for GFX1250 (#8821, #8965)
- libdevice for ROCm 7.1: Updated libdevice bitcode files (#8807)
- Cluster Loads / Multi-CTA: Multi-CTA GEMM example for gfx1250, multi-CTA support for `AMDWmmaEncodingAttr`, and scalar-pointer cluster-load avoidance (#9342, #9340, #9129)
- Gluon `AMDWMMALayout` Rank Consistency (#9127)
- WMMA Database Additions: Added `i8xi8xi32` v3, the missing `f64.16x16x4.f64`, and the clamp operand on the WMMA int intrinsic (#9267, #9271, #9291, #9359)
- Wavefront Scheduling: Fixed waitcnt for gfx1250 (#8835)
- Gluon Stream-K: 4- and 8-warp stream-k Gluon kernels for gfx1250 (#9370)
- Roll-up Updates: Bundled small gfx1250 fixes (#9365)
Tensor Data Movement (TDM)
- Multi-CTA & Multicast for TDM (#8790)
- Host-Side TDM Descriptor: 1D-5D support on gfx1250 (#8977)
- TDM L2 Prefetch: Backend and Gluon exposure (#9086, #9148)
- TDM Predicate: Use TDM predicate in f16 GEMM variants (#9054)
- TDM Async Wait: Support TDM `AsyncWait` in `UpdateAsyncWaitCount` (#9352)
- TDM Padding in Store: Support padding when the interval equals the inner dimension (#9360)
- TDM Async Scatter/Gather: Tensor async scatter/gather support and fixed OOB handling (#9299, #9313, #9371)
- TDM Shape Adjustment: Account for CGA offset in TDM shape adjustment (#9341)
- 4D+ TDM Bug Fix: Fixed TDM behavior when `dim > 2` (#8994)
- Some TDM Features Enabled (#9283)
Async Copy / LDS
- AsyncCopy Default On: Enabled `AsyncCopy` by default for gfx950 and gfx1250 — later reverted on release/3.7.x (#9445, #9087)
- Async Copy Block-Dim Duplication: Allow block-dim duplication for async global-to-local loads (#8788)
- Direct-to-LDS Refactors: Fixed shared-order selection on GFX9, refactored coalescing checks, contiguity hints, vector-size fixes for padded encodings (#9028, #9041, #9048, #9089, #9149)
- `v_perm` for `convert_layout` (#9014)
- Padded Layout Heuristic: Relaxed heuristics for smaller block sizes (#9074)
Reorder / Pipelining Cleanup
- `ReorderInstructions`: Removed the `sinkSecondLoad`, `sinkDotConversion`, and `moveUpTranspose` optimizations (#9119, #9139, #9204, #9229)
- Replaced `ReorderInstructions` with `MoveUpPrologueLoads` (#9328)
- `UpdateAsyncWaitCount`: Support single-block `execute` regions (#9126)
- `OptimizeLDSUsage` Removal (#8282)
libdevice / Layouts / Misc
- `finite`/`isfinited`, `rint`, `clampf` via `v_med3`: libdevice and codegen additions (#9097, #9166, #9256); see the sketch after this list
- `BlockPingpong` Improvements: Debug messages and a dot-dominates-predecessors fix (#8804, #9027)
- `kWidth` Mandatory for WMMA v3 (#8783)
- `copysign` Replacement: Replaced the LLVM `copysign` intrinsic (#8789)
- WMMA Layout CTA Fields: Generalized (#8946)
- TDM with `CanonicalizePointers`: Support `MakeTensorDescOp` in `CanonicalizePointers` (#9228)
- `PartitionedSharedEncodingAttr`: Introduced and reverted (#9314, #9367)
- `scf.if` Combining: Added a `PrepareIfCombining` pass (#9253)
- Fine-Grained Cluster Barrier: New AMD cluster barrier exposed to Gluon (#9206)
- `SinkLayoutConversions` Pass (#9168)
- MIR Swap: Option to swap MIR; `addOccurrence` for proper LLVM-option disabling; `ScopedNoAliasAAWrapperPass` in the MIR swap pipeline (#8711, #9311, #9309)
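A small sketch of calling the new libdevice entry points from a kernel (the import path follows the `triton.language.extra.libdevice` shim; whether `rint` routes to the new ROCm 7.1 bitcode on AMD is an assumption):

```python
import triton
import triton.language as tl
from triton.language.extra import libdevice

@triton.jit
def round_kernel(x_ptr, out_ptr, N: tl.constexpr):
    x = tl.load(x_ptr + tl.arange(0, N))
    # Round-to-nearest-even via the libdevice bitcode.
    tl.store(out_ptr + tl.arange(0, N), libdevice.rint(x))
```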
AMD Bug Fixes (selected)
- `atomic_cas` Fixes: Wrong struct index in the atomic-CAS pattern, ignored sem/scope, and atomic-CAS for non-int types (#8867, #9042, #9116)
- Atomic-RMW Mask Vectorization: Fixed the wrong vectorization width for masked atomic-RMW (#9142)
- Broadcasted Registers in Compilation: Fixed a compilation crash (#8828)
- `uniformSum` Crash: Fixed a null `uniformSum` in `CanonicalizePointers` (#8991)
- Cooperative Groups Support: Added a driver check (#8935)
- FP8/BF8 WMMA Selection on release/3.7.x: Fixed mixed FP8 promotion / instruction selection (#9567, #9581)
- True16 on gfx11: Disabled True16 for assembler on gfx11 (#9447, #9476)
- `RangeAnalysis` `tripCount`: Fixed the trip-count calculation (#9383, #9944)
- Padded-Layout Async Copy OOM: Fixed an OOM in pipelining with padded async copy on GFX950 (#9442, #9945)
- `BlockPingpong` for non-MFMA dot (#9618, #9948)
- `CanonicalizePointers` Different Bases (#9541, #9950)
- Backend cherry-picks onto the release branch (#9487, #9502, #9673, #9675)
- FP4 Matmul Tests Skipping: Skip tests packed along M/N for gfx1250 (#9176)
NVIDIA Backend
Blackwell & Newer SMs
- `tcgen05` MMA on sm110 (Jetson Thor) (#9160)
- `tcgen05.ld.red` on sm103: Implemented in Gluon (#9151)
- X Scale Swizzling for Blackwell + Batched Matmul (#8863)
- Block-Scaled Matmul Baselining: mxfp8/nvfp4 block-scaled cuBLAS baselines (#9044)
- ptxas for Blackwell: Repeated ptxas-version uprev/revert; final state on release/3.7.x cherry-picks the GB300/Spark/THOR-required commits (#8941, #9011, #9016, #8983, #9363, #9621)
- NVMMA Variadic CUDA Launcher: Variadic-argument pre-compiled CUDA launcher (#6788)
- `NVIDIA::canSkipBarSync`: Resurrected (#9246)
TMA
- TMA im2col Mode: End-to-end im2col TMA support — `AsyncTMACopyGlobalToLocalOp`, tensor-descriptor support, a fix for `tma load`, and driver support (#9202, #9225, #9303, #9305)
- TMA Encoding Verification: Verify encodings on TMA ops (#8886)
- TMA Descriptor Mitigation: Mitigation against potential TMA descriptor creation errors (#9235)
Hopper / WS
- `tt.split`/`tt.join` in WS Data Partition: Hopper WS support for `tt.split`/`tt.join` (#456, #9147)
- mx8 `w_scale` Mask: Fixed the Hopper mask (#8974)
- Small-Batch Hopper: Bench fixes for small batches on Hopper (#8877)
- SM89 ptxas Workaround Reverted: Removed the older workaround for the SM89 ptxas bug now that it is unnecessary (#9756)
Gluon & Layout Improvements
New Features
- Local Scatter/Gather: Added local scatter/gather support to Gluon (#8480)
- `get_view()`: Added `get_view()` for Gluon layouts (#9270)
- Finer Cluster Fences: Exposed finer-grained cluster fences (#9076)
- Multi-CTA Refactor of `PaddedSharedLayout`s (#9336)
- "Illegal Instruction" Sanitize Mode: Tightened TMA op verifiers and added an "illegal instruction" sanitize mode (#9112)
- Verifier Improvements: Tightened Gluon dialect verifiers and moved checks into C++ (#8981, #9018, #9033)
- TensorMemory in `to_linear_layout`: Allow TensorMemory layouts in `to_linear_layout` for printing (#8682)
- More Blackwell Tutorials (#8982)
Layouts & Shared Encodings
- LinearEncoding Tightening: Tightened LinearEncoding checks (#9215)
- `SharedLinearEncoding`: Continued lowering generalization (carried over from 3.6, with backend updates)
Kernels & Benchmarks
Persistent Matmul
- Persistent Matmul Heuristics: Fixed and refined heuristics (#8791, #8813)
- Hopper HBM Swizzling: Persistent matmul now supports Hopper HBM swizzling (#8917)
- Hopper FP4 Swizzled, num_warps=4 (#9029)
- Don't Flatten Mixed-Precision Hopper Persistent Matmul (#9279)
- High-Occupancy Persistent Matmul: Re-enabled (#9248)
- 4-Warp Persistent Kernel: Re-enabled after fixes (#9331)
- Strided Layout Handling for Persistent: Fixed strided-layout handling when setting `requires_persistent` (#9198)
- Mxfp Non-Persistent Strided Layout: Allow non-persistent mx matmul with a strided layout (#8808)
Triton Kernels Refactor
- Matrix-Multiplication Refactor: Major refactor of triton_kernels matmul (#8765)
- Tensor/Layout/Distributed Refactor: Reland of the tensor/layout/distributed refactor; small follow-ups (#9134, #9140, #9186, #9187, #9213)
- Closure-Based Output Mapping: For peer shards (#8999)
- Distributed Tests: Distributed routing kernels test fix (#9258)
- Device Descriptor Allocator: Keep a pool to fix descriptor allocator behavior (#9259)
- Reduce Kernel: Unfuse FMA for numeric stability, unpadded batch handling, global scale (#9320, #9332, #9372)
- `Tensor.clone`: Briefly added `clone` for `triton_kernels.tensor.Tensor`, then reverted (#9178, #9208)
MXFP / Scaled-Dot Kernels
- Force mxfp4→bf16 Conversion via `mul.bf16x2` (#8967)
- Hopper mxfp4 Swizzled, num_warps=4 (#9029)
- swiglu Optimizations: Instruction savings, then a partial revert; later use of `ex2.approx.ftz` for swiglu (#8801, #8905, #9164)
- matmul Output mxfp Format Fixes (#8865)
- Symmetric Memory in Bench: Release symmetric memory between runs (#8900)
- `distributed.py`/`bench_utils.py`: Extracted common code from `bench_mlp.py` and `distributed.py` (#8866)
- `num_stages` Adjustment: For bf16/fp16 × mxfp (#8773)
Other
- X Scale Swizzling for Ragged (#8897)
- `reduce_forward` Metadata: Improved performance (#9068)
- TF32 Rounding in MoE (#9296)
- `p_matmul` Asserts & Fixes (#9376)
- Distributed `symm_mem_pool` by Argument (#9092, #9155)
Proton Profiling
Highlights
- Hardware Trace on Blackwell: Enabled low-overhead hardware trace (#9307)
- Significant `deactivate`/`get_data` Overhead Reduction: Especially for CUDA-graph profiling; exposed `get_data_msgpack` (#9030)
- Periodic Dumping: Periodic profile dumping; metadata profiling with periodic flushing (#9150, #9236)
- Multi-Device Metric Profiling: Fixed metric buffer deadlock and added multi-device support (#8943)
- Capture-on-Error: Capture traces even when code exits with an error (#8955)
- Vector Metrics: New vector metric type (#9329)
API & Internals
- `get_data` API: Export profile data directly in Python (#8928); see the sketch after this list
- `clear_data` API: Remove pre-deactivation data (#8971)
- `finalize` Cleanup: Clean up the context source after teardown (#9069)
- Fewer Locks: Further reduced unnecessary locks (#9257)
- Runtime/Metric Correlation: Simplified to reduce overhead (#9132)
- Selective Kernel Metadata: Allow Proton to record metadata for selective kernels (#9158)
- Metric Type Restrictions: Restrict frontend metric types (#8858)
- Init/Final Timestamps: Added to Chrome trace (#8870)
- `GlobalScratchAllocOp` Deprecation: Deprecated Proton's own op in favor of TritonGPU's, with a custom backend (#8976)
- Drop Invalid-Time Kernels (#8961)
- Ignore Metric-Kernel Timing (#9058)
- Documented Experimental APIs (#9056)
- HW Trace Default Fix: Fixed the default value of `TRITON_ENABLE_HW_TRACE` in `CuptiProfiler` (#9324)
- Tensor Descriptor & 2-CTA Tests (#9070)
- AMD Proton Test Fixes (#8763)
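A hedged sketch of the new in-memory export path (the signature and return shape of `get_data` are assumptions based on #8928; `start`/`deactivate`/`finalize` are the existing Proton session API):

```python
import triton.profiler as proton

session = proton.start("matmul_profile")  # begin a profiling session
# ... launch the Triton kernels under measurement ...
proton.deactivate(session)                # stop collecting for this session
data = proton.get_data(session)           # assumed: returns profile data as Python objects
proton.finalize()                         # tear down and flush remaining sessions
```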
Testing & CI
- Gluon TMA + MMA Hopper/Blackwell Test (#8873)
- AMD Shadow CI: New AMD runner setup, then reverted (#9032, #9049)
- `fresh_knobs` Default Behavior (#9184)
- `tl.dot` BF16xN Nondeterminism (#8818)
- Disable Stack Traces in Performance Remarks (#8884)
- Pin pandas < 3.0 (#9273)
- Fix pytorch Deprecation Warning in CI (#8857)
- Reduce Wheel Size, Pin `DOCKER_API_VERSION` (release/3.7.x) (#10244)
- Increase Release Wheel Timeout (#10250)
- Skip Tests for RDNA / gfx1250: Various AMD test skips and enables (#9210, #9176, #9177, #9232, #9343, #9095)
- Triton's `assert_close`: Propagate `err_msg` to numpy (#9170)
- Float8 × MX Tolerance (#9316, #9338)
- NumPy 2.4 Compatibility: Explicit numpy-array-to-scalar conversion (#9172)
- `test_line_info_ir_source` Flake Fix (#9161)
Build & Infrastructure
- `CMAKE_LIBRARY_OUTPUT_DIRECTORY`: Fixed the build with an empty output directory (#8810)
- `llvm_update_compile_flags` Removal (#9167)
- `LLVM_BUILD_SHARED_LIBS` Canonicalization (#8933)
- actions/checkout v5 → v6 (#8826)
- Version Bumps: 3.5.0 → 3.6.0; 3.6.0 → 3.7.0 (#8836, #9885, #9888)
- `TRITON_EXT_ENABLED` for Wheels (#9935, #9959)
- `nvidia-toolchain-version.json` Update
- `TRITON_DEFAULT_BACKEND`: Control `driver.active` via this env var (#9144); see the sketch after this list
- `TRITON_PTXAS_BLACKWELL_PATH`: Allow overriding the `ptxas-blackwell` binary (#8945)
- Release to PyPI (#10251)
- `topk` in Plugin Example: Increment the index in the plugin example (#9315)
- HIP Support in `link.py` (#9084)
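A sketch of the new backend-selection knob (the accepted values are an assumption; the notes only state that the variable controls `driver.active`):

```python
import os

# Assumed values mirror the in-tree backend names ("nvidia", "amd").
# Set before Triton first resolves its active driver.
os.environ["TRITON_DEFAULT_BACKEND"] = "amd"

import triton

# driver.active is now chosen from the env var instead of autodetection.
print(triton.runtime.driver.active)
```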
Documentation
- Divisibility Reset Logic: Clarified for contiguous dimensions in `AxisInfo` (#9266)
- `topk` Operation: Added to the language documentation (#9345)
- Plugin Example README: Added a second pass-plugin README example (#8815)
- Conference Materials: Updated README (#9009)
- Community Meetup Notes: Added 2026-01-06 meetup notes (#9288)
- `warp_specialize` Docs: Updated `gl.warp_specialize` docs (#8553)
- `LinearLayout` Output Matrix Comment: Doc fix (#9243)
Breaking Changes
- `triton_kernels` matmul refactor (BC-breaking): The matrix-multiplication refactor introduces a backwards-incompatible API surface; downstream users of `triton_kernels.matmul_*` should review call sites (#8765)
- `tcgen05.cp` Lowering Generalization & `tcgen05.mma` Encoding Acceptance: Continued from 3.6, with new verifier behavior and stricter encoding checks
- Proton `GlobalScratchAllocOp` Deprecated: Replaced with TritonGPU's `GlobalScratchAllocOp` plus a custom backend; out-of-tree consumers must migrate (#8976)
- `make_block_ptr` Deprecated: A deprecation warning is now emitted; users should migrate to tensor descriptors (#9667)
- Default 32-bit Dot Precision Reverted: Default 32-bit dot precision was briefly TF32x3 — the default in 3.7 remains as in 3.6. Note the new "round f32→tf32 in descriptor" option (#9080, #9090, #9295)
- AsyncCopy Default for gfx950 / gfx1250: Was enabled by default and then reverted on the release branch. Users must opt in explicitly in 3.7 (#9087, #9445)
- SM89 ptxas Workaround Reverted: The ptxas workaround introduced earlier is removed on release/3.7.x (#9756, #7067)
Contributors
This release includes contributions from engineers at:
- Meta
- AMD
- NVIDIA
- OpenAI
- Intel
- And many individual contributors
Special thanks to all contributors who submitted bug reports, feature requests, and code improvements!