PyTorch 2.11.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
- Security
Highlights
- Added Support for Differentiable Collectives for Distributed Training
- FlexAttention now has a FlashAttention-4 backend on Hopper and Blackwell GPUs
- MPS (Apple Silicon) Comprehensive Operator Expansion
- Added RNN/LSTM GPU Export Support
- Added XPU Graph Support
For more details about these highlighted features, see the release blog post. Below are the full release notes for this release.
Backwards Incompatible Changes
Release Engineering
Volta (SM 7.0) GPU support removed from CUDA 12.8 and 12.9 binary builds (#172598)
Starting with PyTorch 2.11, the CUDA 12.8 and 12.9 pre-built binaries no longer include support for Volta GPUs (compute capability 7.0, e.g. V100). This change was necessary to enable updating to cuDNN 9.15.1, which is incompatible with Volta.
Users with Volta GPUs who need CUDA 12.8+ should use the CUDA 12.6 builds, which continue to include Volta support. Alternatively, build PyTorch from source with Volta included in TORCH_CUDA_ARCH_LIST.
Version 2.10:
# CUDA 12.8 builds supported Volta (SM 7.0)
pip install torch --index-url https://download.pytorch.org/whl/cu128
# Works on V100
Version 2.11:
# CUDA 12.8 builds no longer support Volta
# For V100 users, use CUDA 12.6 builds instead:
pip install torch --index-url https://download.pytorch.org/whl/cu126
PyPI wheels now ship with CUDA 13.0 instead of CUDA 12.x (#172663, announcement)
Starting with PyTorch 2.11, pip install torch on PyPI installs CUDA 13.0 wheels by default for both Linux x86_64 and Linux aarch64. Previously, PyPI wheels shipped with CUDA 12.x and only Linux x86_64 CUDA wheels were available on PyPI. Users whose systems have only CUDA 12.x drivers installed may encounter errors when running pip install torch without specifying an index URL.
Additionally, CUDA 13.0 only supports Turing (SM 7.5) and newer GPU architectures on Linux x86_64. Maxwell and Pascal GPUs are no longer supported under CUDA 13.0. Users with these older GPUs should use the CUDA 12.6 builds instead.
CUDA 12.6 and 12.8 binaries remain available via download.pytorch.org.
Version 2.10:
# PyPI wheel used CUDA 12.x
pip install torch
Version 2.11:
# PyPI wheel now uses CUDA 13.0
pip install torch
# To get CUDA 12.8 wheels instead:
pip install torch --index-url https://download.pytorch.org/whl/cu128
# To get CUDA 12.6 wheels (includes Maxwell/Pascal/Volta support):
pip install torch --index-url https://download.pytorch.org/whl/cu126
Python Frontend
torch.hub.list(), torch.hub.load(), and torch.hub.help() now default the trust_repo parameter to "check" instead of None. The trust_repo=None option has been removed. (#174101)
Previously, passing trust_repo=None (or relying on the default) would silently download and run code from untrusted repositories with only a warning. Now, the default "check" behavior will prompt the user for explicit confirmation before running code from repositories not on the trusted list.
Users who were explicitly passing trust_repo=None must update their code. Users who were already passing trust_repo=True, trust_repo=False, or trust_repo="check" are not affected.
Version 2.10:
# Default trust_repo=None — downloads with a warning
torch.hub.load("user/repo", "model")
# Explicit None — same behavior
torch.hub.load("user/repo", "model", trust_repo=None)
Version 2.11:
# Default trust_repo="check" — prompts for confirmation if repo is not trusted
torch.hub.load("user/repo", "model")
# To skip the prompt, explicitly trust the repo
torch.hub.load("user/repo", "model", trust_repo=True)
torch.nn
Add sliding window support to varlen_attn via window_size, making optional arguments keyword-only (#172238)
The signature of torch.nn.attention.varlen_attn has changed: a * (keyword-only separator) has been inserted before the optional arguments. Previously, optional arguments like is_causal, return_aux, and scale could be passed positionally; they must now be passed as keyword arguments. A new window_size keyword argument has also been added.
# Before (2.10)
output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, True, None, 1.0)
# After (2.11) — pass as keyword argument
output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, window_size=(-1, 0), return_aux=None, scale=1.0)
Remove is_causal flag from varlen_attn (#172245)
The is_causal parameter has been removed from torch.nn.attention.varlen_attn. Causal attention is now expressed through the window_size parameter: use window_size=(-1, 0) for causal masking, or window_size=(W, 0) for causal attention with a sliding window of size W. The default window_size=(-1, -1) corresponds to full (non-causal) attention.
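For intuition, the mask implied by window_size=(left, right) can be sketched in plain Python. This is a hypothetical helper, not a PyTorch API, and whether the window bound is inclusive is an assumption of this sketch:

```python
# Hypothetical sketch of the mask implied by window_size=(left, right).
# left == -1 means an unbounded left window; right == 0 means causal
# (no attention to future positions). Not a PyTorch API.
def allowed(i: int, j: int, left: int, right: int) -> bool:
    if right >= 0 and j - i > right:  # j is too far in the future
        return False
    if left >= 0 and i - j > left:    # j is too far in the past
        return False
    return True

# window_size=(-1, 0): full causal mask
causal = [[allowed(i, j, -1, 0) for j in range(4)] for i in range(4)]
# window_size=(1, 0): causal with a sliding window of size 1
sliding = [[allowed(i, j, 1, 0) for j in range(4)] for i in range(4)]
```

With window_size=(-1, -1), neither bound applies and every position may attend to every other, matching the documented full (non-causal) default.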
# Before (2.10)
output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, is_causal=True)
# After (2.11) — use window_size instead
output = varlen_attn(query, key, value, cu_seq_q, cu_seq_k, max_q, max_k, window_size=(-1, 0))
Distributed
DebugInfoWriter now honors $XDG_CACHE_HOME for its cache directory in C++ code, consistent with the Python side. Previously it always used ~/.cache/torch. (#168232)
This avoids issues where $HOME is not set or not writable. Users who relied on ~/.cache/torch being used regardless of $XDG_CACHE_HOME may see debug info written to a different location.
Version 2.10:
# C++ DebugInfoWriter always wrote to ~/.cache/torch
Version 2.11:
# C++ DebugInfoWriter now respects $XDG_CACHE_HOME/torch (same as Python code)
# Falls back to ~/.cache/torch if $XDG_CACHE_HOME is not set
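The new lookup order can be sketched in Python (illustrative only; the actual change is in C++, and the "/tmp" default for a missing $HOME is an assumption of this sketch):

```python
import posixpath

def torch_cache_dir(env: dict) -> str:
    # Sketch of the resolution order: prefer $XDG_CACHE_HOME, otherwise
    # fall back to $HOME/.cache, then append the "torch" subdirectory.
    base = env.get("XDG_CACHE_HOME") or posixpath.join(env.get("HOME", "/tmp"), ".cache")
    return posixpath.join(base, "torch")

print(torch_cache_dir({"XDG_CACHE_HOME": "/var/cache/me"}))  # /var/cache/me/torch
print(torch_cache_dir({"HOME": "/home/me"}))                 # /home/me/.cache/torch
```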
DeviceMesh now stores a process group registry (_pg_registry) directly, enabling torch.compile to trace through get_group(). (#172272)
This may break code that skips init_process_group, loads a saved DTensor (constructing a DeviceMesh with no PGs), and later creates PGs separately — during torch.compile runtime the PG lookup will fail. Users should ensure process groups are initialized before constructing the DeviceMesh.
Version 2.10:
# PGs resolved via global _resolve_process_group at runtime
mesh = DeviceMesh(...)  # PGs could be created later
Version 2.11:
# PGs now stored on DeviceMesh._pg_registry; must exist at mesh creation
dist.init_process_group(...) # Must be called before creating mesh
mesh = DeviceMesh(...)
Distributed (DTensor)
DTensor.to_local() backward now converts Partial placements to Replicate by default when grad_placements is not provided. (#173454)
Previously, calling to_local() on a Partial DTensor would preserve the Partial placement in the backward gradient, which could produce incorrect gradients when combined with from_local(). Now, the backward pass automatically maps Partial forward placements to Replicate gradient placements, matching the behavior of from_local().
Users who relied on the previous behavior (where to_local() backward preserved Partial gradients) may see different gradient values. To ensure correctness, explicitly pass grad_placements to to_local().
Version 2.10:
# Partial placement preserved in backward — could produce incorrect gradients
local_tensor = partial_dtensor.to_local()
Version 2.11:
# Partial → Replicate in backward by default (correct behavior)
local_tensor = partial_dtensor.to_local()
# Or explicitly specify grad_placements for full control:
local_tensor = partial_dtensor.to_local(grad_placements=[Replicate()])
_PhiloxState.seed and _PhiloxState.offset now return torch.Tensor instead of int (#173876)
The DTensor RNG internal _PhiloxState class changed its seed and offset properties to return tensors instead of Python ints, and the setters now expect tensors. This makes the RNG state compatible with PT2 tracing (the previous .item() calls were not fake-tensor friendly).
Code that directly reads _PhiloxState.seed or _PhiloxState.offset and treats them as ints will break. Call .item() to get the int value. When setting, wrap the value in a tensor.
Version 2.10:
from torch.distributed.tensor._random import _PhiloxState
philox = _PhiloxState(state)
seed: int = philox.seed # returned int
philox.offset = 42  # accepted int
Version 2.11:
from torch.distributed.tensor._random import _PhiloxState
philox = _PhiloxState(state)
seed: int = philox.seed.item() # now returns Tensor; call .item() for int
philox.offset = torch.tensor([42], dtype=torch.int64)  # must pass Tensor
ROCm
caffe2 support is fully removed from ROCm PyTorch's hipify preprocessing. This is known as "hipify v2" behavior. (#174087, #174300, #174388, #174499, #175098)
hipify v1 background
When caffe2 and PyTorch were separate projects, their ROCm support strategies differed. For caffe2, all files and classes were renamed following the pattern of CUDA to HIP, Cuda to Hip, cuda to hip, and so on. PyTorch did not rename classes, but would create new files following the same renaming pattern (e.g., aten/src/ATen/cuda/CUDABlas.h to aten/src/ATen/hip/HIPBlas.h). As a consequence, caffe2 had a distinct device backend named "HIP" (renamed from "CUDA"), while ROCm PyTorch masquerades as the "cuda" device (torch.empty(1, device="cuda")). Once the caffe2 and PyTorch projects were merged, this caused a mismatch: caffe2 expected to use a "HIP" device while PyTorch expected a "cuda" device. To alleviate this mismatch, "Masquerading" classes were created under aten/src/ATen/hip/impl.
- HIPAllocatorMasqueradingAsCUDA.h
- HIPCachingAllocatorMasqueradingAsCUDA.h
- HIPGuardImplMasqueradingAsCUDA.h
- HIPStreamMasqueradingAsCUDA.h
These classes were often transparently utilized during ROCm PyTorch's hipify preprocessing of source files. All files under c10/ and caffe2/ were hipified using the caffe2 renaming behavior, while all other "PyTorch" files used the other strategy. The Masquerading classes would replace their CUDA counterpart during hipify preprocessing. For example, c10/cuda/CUDAStream.h's CUDAStream would be replaced by aten/src/ATen/hip/impl/HIPStreamMasqueradingAsCUDA.h's HIPStreamMasqueradingAsCUDA. These Masquerading classes call the underlying caffe2 code and create "HIP" devices, and the device would be reset to "cuda" by the Masquerading classes.
hipify v2 new behavior
Hipify v2 (#174087, #174300, #174388, #174499, #175098) makes the following changes:
- "Masquerading" classes are deprecated; they have been reworked into thin shells around the existing classes for backward compatibility.
- "CUDA" classes are no longer renamed to "HIP"; only CUDA Runtime APIs are renamed. Files are still renamed out of place.
- caffe2 work-arounds for the HIP-versus-CUDA device mismatch have been removed.
Great care has been taken to make this change backward compatible. Although PyTorch itself builds cleanly under hipify v2, downstream PyTorch extension projects that explicitly included Masquerading headers or called Masquerading APIs could be affected, resulting in failed builds. For example, before backward compatibility was restored, the xformers project failed to build with the hipify v2 changes; a PR demonstrates the workarounds that were initially necessary, though such changes are no longer needed now that the BC-breaking behavior has been mitigated.
torch.export
torch.export.export_for_training has been removed (#171714)
export_for_training was previously available as a separate API for exporting models while preserving training semantics. This function has been removed. Users should use torch.export.export instead, which returns the same graph as the previous export_for_training.
ONNX
Remove the fallback option from torch.onnx.export (#173189)
The fallback parameter has been removed from torch.onnx.export(). Previously, when fallback=True, the exporter would automatically fall back to the legacy TorchScript-based exporter if the dynamo exporter failed. This fallback was removed because it was overly complicated, required different inputs, produced different models, and hid errors from the new exporter.
Migration: Remove fallback=True (or fallback=False) from your torch.onnx.export() calls. If you need fallback behavior, implement it explicitly in your own code by catching exceptions and calling the legacy exporter separately.
# Before
torch.onnx.export(model, args, "model.onnx", dynamo=True, fallback=True)
# After
torch.onnx.export(model, args, "model.onnx", dynamo=True)
Remove overload matching logic from the ONNX dispatcher (#165083)
The custom_translation_table parameter in torch.onnx.export() no longer accepts a list of functions for each torch op. Previously, users could pass a list of overloaded ONNX functions (e.g., one for float tensors, another for bool tensors), and the dispatcher would automatically select the correct overload based on input types. This complex type-matching logic has been removed because torchlib no longer uses overloads for the same opset version.
The type of custom_translation_table changed from dict[Callable, Callable | Sequence[Callable]] to dict[Callable, Callable]. Passing a Sequence as a value now raises a TypeError.
Migration: Provide a single function per operator instead of a list of overloads. If you need type-dependent behavior, handle it inside the single function.
# Before
custom_translation_table = {
torch.ops.aten.logical_and.default: [custom_impl_float, custom_impl_bool],
}
# After
custom_translation_table = {
torch.ops.aten.logical_and.default: custom_impl,
}
Quantization
The PT2E quantization flow (torch.ao.quantization.pt2e and torch.ao.quantization.quantizer) has been removed from PyTorch and migrated to torchao. (#169151)
The following modules and classes have been removed:
- torch.ao.quantization.pt2e (including DuplicateDQPass, PortNodeMetaForQDQ, export utils, graph utils, numeric debugger, lowering utilities)
- torch.ao.quantization.quantizer (including ComposableQuantizer, EmbeddingQuantizer, X86InductorQuantizer, XPUInductorQuantizer, XNNPACKQuantizer, QuantizationSpec, QuantizationAnnotation, QuantizationConfig, etc.)
Users relying on the PT2E quantization flow should migrate to the torchao package, which now hosts these APIs.
Version 2.10:
from torch.ao.quantization.pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer
Version 2.11:
# Install torchao: pip install torchao
from torchao.quantization.pt2e import prepare_pt2e, convert_pt2e
from torchao.quantization.pt2e.quantizer.x86_inductor_quantizer import X86InductorQuantizer
Deprecations
Linear Algebra
- The MAGMA backend for linear algebra operations is now deprecated and will be removed in a future release. Setting torch.backends.cuda.preferred_linalg_library("magma") or retrieving a previously-set MAGMA preference will now issue a deprecation warning. cuSOLVER remains the default backend. (#172823) If you see any errors when using cuSOLVER that did not occur with MAGMA, please file an issue on GitHub. To silence the warning, stop explicitly selecting the MAGMA backend:
Version 2.10:
# No warning
torch.backends.cuda.preferred_linalg_library("magma")
Version 2.11:
# Issues a deprecation warning — remove this call to use the default cuSOLVER backend
torch.backends.cuda.preferred_linalg_library("magma")
- torch.linalg.svd no longer dispatches to MAGMA. The MAGMA backend is deprecated and cuSOLVER is now used unconditionally, providing significant speedups (2x–400x depending on matrix size and batch dimensions). (#172824) Previously, setting torch.backends.cuda.preferred_linalg_library("magma") would route SVD through MAGMA. This setting is now ignored for SVD, and cuSOLVER is always used.
Version 2.10:
torch.backends.cuda.preferred_linalg_library("magma")
U, S, Vh = torch.linalg.svd(x)  # Uses MAGMA
Version 2.11:
# MAGMA preference is ignored; cuSOLVER is always used
U, S, Vh = torch.linalg.svd(x)  # Uses cuSOLVER
- torch.linalg.solve_triangular and torch.triangular_solve no longer dispatch to MAGMA on CUDA. cuBLAS is now used unconditionally, providing speedups of 2x–24x for most matrix sizes (small matrices may see minor regressions of ~0.6x). (#174109)
Version 2.10:
torch.backends.cuda.preferred_linalg_library("magma")
torch.linalg.solve_triangular(A, B, upper=False)  # Uses MAGMA
Version 2.11:
# MAGMA preference is ignored; cuBLAS is always used
torch.linalg.solve_triangular(A, B, upper=False)  # Uses cuBLAS
- torch.linalg.lstsq no longer dispatches to MAGMA. cuSOLVER/cuBLAS are now used unconditionally, providing speedups of 1.7x–620x depending on matrix size and batch dimensions. (#174779)
Version 2.10:
torch.backends.cuda.preferred_linalg_library("magma")
result = torch.linalg.lstsq(A, B)  # Uses MAGMA
Version 2.11:
# MAGMA preference is ignored; cuSOLVER/cuBLAS is always used
result = torch.linalg.lstsq(A, B)  # Uses cuSOLVER/cuBLAS
Distributed
torch.distributed.symmetric_memory.enable_symm_mem_for_group is deprecated. The store can be retrieved directly via ProcessGroup.getStore() in C++, making this call unnecessary. (#172163)
Version 2.10:
from torch.distributed.symmetric_memory import enable_symm_mem_for_group
enable_symm_mem_for_group(group)
Version 2.11:
# No longer needed — store is accessed directly from the ProcessGroup
New features
Python Frontend
- Added native_handle property to torch.Stream, providing a unified way to retrieve the backend-specific opaque stream handle (e.g., cudaStream_t for CUDA, sycl::queue* for XPU). This is useful for passing stream handles to third-party libraries such as Triton. (#171040)
stream = torch.accelerator.current_stream()
handle = stream.native_handle  # backend-specific stream handle
Autograd
- Add Function.clear_saved_tensors_on_access class attribute to automatically free saved tensors after they are accessed (#173833)
torch.nn
- Add mechanism to restore default flash attn impl after activate_flash_attention_impl (#169866)
- Add scale for softmax to varlen attn (#171199)
Distributed
- Add start_method option to torch.distributed.debug.start_debug_server to select the multiprocessing start method (fork, spawn, or forkserver), enabling CUDA-safe server startup (#173196)
- Add support for periodic dumping in torch.distributed.debug (#174808)
- Non-functional collectives (e.g. torch.distributed.all_gather) now automatically work with FakeTensorMode — meta implementations are registered at import torch time (#162119)
- Implement NCCL 2.29 one-sided APIs for symmetric memory (#172425)
- Bind SymmetricMemory as a torch class for use in op definitions (#174019)
- Enable torchcomms _BackendWrapper shim layer in c10d (#174202)
- Expose SymmetricMemory window API (#170740)
CUDA
- Make (pinned) host memory allocations work with memory pools. (#167507)
- Make large segment size configurable for allocation performance tuning (esp. re: Expandable Segments). (#172056)
MPS
- Async error reporting from GPU operations (#170002, #170050)
import torch
x = torch.rand(10, 1, 10, device='mps')
y = x[:, [1]]
torch.mps.synchronize()  # will raise index out of bounds error
- Added support for Metal 4 (#172229, #172230)
ROCm
- Expose device properties clock_rate, memory_clock_rate, memory_bus_width, memory_per_block, shared_memory_per_block. (#170572)
- Support for device-side assertions via TORCH_USE_HIP_DSA. (#172679)
- Attention operator support on gfx1151/1152/1153 via AOTriton 0.11.2b update (#174105)
- Enable scaled group mm on gfx950. (#173737)
- Enable group gemm on gfx90a. (#169356)
- Enable MIOpen backend for CTC Loss. (#170749)
- Add hipsparseSpSV and hipsparseSpSM support for triangular solve. (#171097)
- Support for PyTorch's StaticCudaLauncher, which provides static compilation and launching of Triton kernels. (#166492)
XPU
- Introduce XPUGraph, a runtime optimization feature designed to reduce kernel host overhead on XPU devices; see the design and usage documentation for details. (#166285, #174041, #174351, #174059, #174046, #166843)
torch.compile
Dynamo
- torch.compile now supports tracing through contextlib.ExitStack and contextlib.suppress context managers, allowing code that uses these patterns to be compiled without graph breaks (#146506, #147990)
- Added torch._dynamo.config.ignore_logging_functions config to skip arbitrary logging callables during tracing without causing graph breaks. Add functions to this set to have Dynamo treat them as no-ops during compilation (#168913)
- Added TORCH_DYNAMO_AUTOMATIC_DYNAMIC_SHAPES=0 environment variable to globally disable automatic dynamic shapes without modifying Python code (#172334)
- Added TORCH_COMPILE_OVERRIDE_BACKENDS environment variable for per-graph backend override, enabling binary search to find problematic compiled graphs. Supports filter syntax like ">10:eager" or "0-5:aot_eager;6-10:inductor" (#172411)
- Added initial support for torch._dynamo.decorators.leaf_function, which allows annotating functions as leaf operations that Dynamo and AOTAutograd will not trace into (#170471)
- Added support for tracing backward hooks on intermediate tensors, fixing cases where register_hook on non-leaf tensors would fail under torch.compile (#172126)
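As an illustration of the backend-override filter syntax only, a hypothetical parser is sketched below. The semantics are inferred from the two documented examples (">10:eager" sends graphs with index greater than 10 to eager; "0-5:aot_eager;6-10:inductor" maps index ranges to backends); PyTorch's real parsing may differ:

```python
# Hypothetical parser for the TORCH_COMPILE_OVERRIDE_BACKENDS filter
# syntax. Semantics are assumed from the documented examples; not
# PyTorch's actual implementation.
def backend_for(graph_index: int, spec: str):
    for clause in spec.split(";"):
        cond, backend = clause.split(":")
        if cond.startswith(">"):
            if graph_index > int(cond[1:]):  # e.g. ">10"
                return backend
        elif "-" in cond:
            lo, hi = map(int, cond.split("-"))  # e.g. "0-5"
            if lo <= graph_index <= hi:
                return backend
        elif graph_index == int(cond):  # single index
            return backend
    return None  # no override for this graph

print(backend_for(12, ">10:eager"))                   # eager
print(backend_for(7, "0-5:aot_eager;6-10:inductor"))  # inductor
```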
Inductor
- FlexAttention supports deterministic mode, wired through both Flex and Flash backends (#173126)
- Added range-based autotuning for custom ops, enabling selection of optimal implementations based on runtime tensor dimension values with per-range benchmarking and automatic torch.cond dispatch generation (#167617)
- FlexAttention: Added support for low precision K/V inputs in compiled mode. Keys and Values can now be in lower precision than Queries for memory efficiency (#171761)
- Added native ldexp lowering with libdevice.ldexp (CUDA) and std::ldexp (CPU) codegen (#171721)
- Inductor now supports pin_memory for torch.empty (#172578)
- Exposed triton_meta to TritonTemplate maybe_append_choice API for custom template development (#174292)
- Added Async Pipelined Autotuning for max-autotune-gemm, which overlaps autotuning with lowering/scheduling in a subprocess to reduce compilation overhead (#170407)
- FlexFlash: Added BlockSparse backward pass, dynamic shapes, and backward score-mod support (#170397, #170611, #171465)
- Added FP8 (BlockWise128x128, BlockWise1x128) scaling support in Inductor Triton templates (#170748)
- Autochunker: Added gradient accumulation support and ability to override number of chunks (#171359, #171477)
- Added NVGEMM backend for GEMM operations using NVIDIA's native matmul library, with support for BMM, grouped GEMM, scaled MM, dynamic shapes (#171205, #171362, #172280, #172283, #172378, #172388, #172391, #172402, #172417, #172525, #172582, #172607, #174827)
torch.export
- Add nested tensor serialization support for torch.export (#174720)
- RNN modules (LSTM, GRU, etc.) can now be exported on GPUs (#169710)
- Add patch to enable tracing LSTM with dynamic shapes (#168095)
ONNX
- Added ExportableModule wrapper for ONNX export (#170810)
- Added InputObserver to infer dynamic shapes for export (#172838)
InputObserverto infer dynamic shapes for export (#172838) - Add a parameter to force the first dimension to be dynamic in InputObserver.infer_dynamic_shapes (#173533)
- Implement while_loop (#162645)
- Add invoke_subgraph HOP export support (#174283)
- Expose ONNXProgram.rename_axes for renaming dims (#172032)
- Support custom empty tensor shapes in InputObserver for multimodal LLM export (#174964)
Foreach
- Added torch.linalg._powsum and torch._foreach_powsum as fused kernels that compute sum(abs(x)**ord) (equivalent to vector_norm without the root extraction) (#172685)
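The quantity these kernels compute relates to the vector norm as sum(abs(x)**p) == vector_norm(x, p)**p. A plain-Python illustration of that identity (not the new fused APIs themselves):

```python
# Illustrates the identity behind _powsum: sum(abs(x)**p) is the p-norm
# raised to the p-th power, i.e. the norm without the final root.
def powsum(xs, p):
    return sum(abs(x) ** p for x in xs)

def vector_norm(xs, p):
    return powsum(xs, p) ** (1.0 / p)

xs = [3.0, -4.0]
print(powsum(xs, 2))       # 25.0
print(vector_norm(xs, 2))  # 5.0
```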
Improvements
Release Engineering
- Upgrade to ROCm 7.2 with new Docker images, magma tarball, and binary builds (#173096, #173106, #173187, #174234)
- Add an option to install CUDA when the required CUDA/cuDNN versions on the Windows AMI do not match (#177273)
Python Frontend
- torch.load now produces clearer error messages when encountering miniz errors from PyTorchStreamReader, explicitly indicating that the checkpoint file is likely corrupt (#170244)
- torch.load(map_location='meta') no longer reads storage data from the filesystem, improving performance when loading checkpoints onto the meta device (#170619)
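A minimal sketch of the meta-device load path (assuming torch is installed; the in-memory buffer stands in for a checkpoint file):

```python
import io
import torch

buf = io.BytesIO()
torch.save(torch.randn(3, 3), buf)
buf.seek(0)
# Loads tensor metadata only; storage bytes are not materialized.
t = torch.load(buf, map_location="meta")
print(t.device, t.shape)
```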
Composability
- Add check_out_variant and to_out_variant utilities for custom operator out variant validation. check_out_variant verifies that a custom op's out variant is compatible with Inductor's out_variant pass, and to_out_variant converts an OpOverload to its out variant. (#174473)
torch.nn
- Add remove_duplicate parameter to nn.Module.modules() (#174383)
- Add support for low precision K/V inputs to nn.attention.flex_attention (#171744)
C++ Frontend
- Added support for Float8_e8m0fnu and Float4_e2m1fn_x2 dtypes to the stable ABI (#173669)
- Added torch::stable::Tensor::layout() (#174735)
Distributed
- Set thread name for Gloo internal loop for easier debugging (#169979)
- Make context_parallel_shard more general (#170200)
- Polish NCCL symmetric memory code (#170582)
- Add MemPool support for NCCL symmetric memory backend (#171727)
- Extend symmetric memory barrier to both LSA and GIN (#172701)
- Implement get_offset for symmetric memory (#172044)
- ProcessGroupNCCL: workaround for reduce_scatter with world_size=1 (#170922)
- Add XCCL backend support for ProcessGroupWrapper (#171920)
- Lazy import pdb only when user calls breakpoint() in torch.distributed (#171818)
- Remove MB < PP check for GPipe pipeline schedule (#171462)
- Pass DDP bucket cap size list for finer-grained control (#169026)
- Enable ProcessGroup round-trip through JIT via CapsuleType (#172794)
- Don't repeatedly log environment variables (#170399)
- Set NCCL group desc before creating comm so it propagates (#171159)
- ProcessGroupNCCL: use lowest rank as split color (#173687)
DTensor
- Add OpSchema.args_meta, kwargs_meta helpers (#170358)
- Support misc sym ops (#172268)
- DTensor Ops: Add linearity support for neg operation (#172563)
- Add SymInt support for DTensor mesh coordinate computation in PT2 (#169552)
- Enable single-dim strategy for addmm and baddbmm (#172387)
- Support uneven _StridedShard redistribution (#172266)
- Update TP api to support single-dim strategies (#173567)
- Initial support for decomps + sharding prop (#171652)
- Add shard prop cache logging (#173775)
- Optimize redistribute comms using flattened meshes (#174630)
CPU
- Added support for FP16 half-precision GEMM via OpenBLAS on CPU, enabling faster FP16 inference (#169042)
CUDA
- Remove _scaled_mm layout check on Blackwells (#170693)
- Add uint16, uint32, uint64 support to JIT CUDA kernels (#174303)
- Remove fallback paths for pinned memory allocation during CUDA graph capture (#170710)
- Improve numerics of UpSample kernel by using accscalar_t for interpolation accumulators (#170661)
- Reinstate error message details in CUDA_KERNEL_ASSERT_VERBOSE call in IndexKernelUtils.cu (#170913)
- Switch order of blocked reduce in reduction_template.cuh (#173425)
cuDNN
- Upgrade cuDNN to 9.15.1 for CUDA 13 builds (#169412)
- Upgrade CUDA 13.0 wheels to cuDNN 9.17.1 (#173216)
- Enhance cuDNN tensor shape checks in sdp_utils.cpp to support Blackwell GPUs (#172621)
MPS
- Improved support for distributions operations (#172187, #172675, #173287)
- Enabled index_fill backward pass (#174238)
- Extended baddbmm and addbmm to integer and complex types (#170895)
- Improved error messages for distributed ops on MPS (#173954)
- Added MPS support for torch.special.erfcx (scaled complementary error function) (#172910)
ROCm
- addmm behavior now takes into account the preferred BLAS backend instead of forcing hipblaslt. (#174350)
- Enable hipBLASLt on gfx1103. (#172180)
Sparse Frontend
- torch.view_as_real and torch.view_as_complex now support sparse tensors (#164964)
- Sparse tensor invariants check warning is now raised only once when the check is disabled, instead of on every operation (#171695)
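For reference, the dense behavior that now extends to sparse tensors (a dense example is shown here since it runs on any build; assumes torch is installed):

```python
import torch

z = torch.tensor([1.0 + 2.0j, 3.0 - 4.0j])
r = torch.view_as_real(z)        # shape (2, 2): last dim holds (real, imag)
print(r.shape)
back = torch.view_as_complex(r)  # round-trips to the complex view
print(torch.equal(back, z))
```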
XPU
- Add torch.xpu._dump_snapshot API (#170186)
- Add torch.xpu._record_memory_history API (#169559)
- Add torch.xpu.memory_snapshot (#169442)
- Add local_mem_size to XPU device properties (#172314)
- Support torch.accelerator.get_device_capability on XPU (#170747)
- Enable Triton online softmax kernels on XPU (#163251)
- Support woq_int8 Inductor pattern on Intel GPU (#163615)
- Add XPU ATen GEMM overloads with output dtype (#170523)
- Support aot_inductor.emit_multi_arch_kernel on XPU (#171432)
- Improve Inductor UT coverage for XPU (#171280, #166376, #169181, #166504)
- Enable Triton mm template decompose_k choice for XPU (#170541)
- Support AOTInductor standalone compile API for XPU (#171450)
Profiler
- The memory visualizer now has a checkbox to toggle showing the trace, useful for large traces that take a long time to load (#174717)
- The memory profiler now exposes a new skip_actions flag to filter out specific events (#168183)
- The profiler now exposes a post_process_timeout_s field to prevent post processing from blocking further execution (#173957)
torch.compile
Dynamo
- Suppressed repeated "triton not found" messages during import — previously 12 identical warnings were printed (#172614)
- fullgraph=True now recursively disables dynamo on compiled code to prevent unintentional re-invocation of torch.compile (#173080)
- Miscellaneous smaller tracing support additions:
- Add args print support to hop print (#170880)
- Don't register einops ops with allow_in_graph (#173611)
Inductor
- Improved heuristics for reduction kernels (#170931)
- CUDAGraph partitioning now supports cudagraph-unsafe symints (#173159)
- MixOrderReduction: Added low precision reduction support, non-strict mode, and avoid recompile (#169978, #171941, #174947)
- Triton compilation timeout is now configurable and defaults to 5 minutes (lowered from previous default) (#172674)
- User stack traces are now reported when a LoweringException occurs, making debugging easier (#171846)
- Added B300 (Blackwell) support: GPU architecture 120a for .ptx to .fatbin compilation and cpp codegen (#174162, #172263)
- Autotune process pool now inherits tf32 options from the parent process (#174742)
- Epilogues can now be statically analyzed for fusion decisions (#170001)
- Added cvt_e8m0_rceil prim with PTX lowering for SM100+ GPUs (#172497)
- Basic comm buffer reuse for Symmetric Memory (#171909)
- Added launch_cooperative_grid flag for cooperative reduction kernels (#167800)
- Updated CUTLASS codegen to support torch.float8_e5m2, enabling mixed FP8 (e4m3fn x e5m2) matrix multiplication (#171167)
- Improved mkldnn convolution layout propagation in Inductor (#169260)
- Optimal Epilogue fusion overlapping with Async Pipelined Autotuning (#171011)
- FlexAttention improvements: Enabled SM90 blocksparse backwards, updated configuration for Thor and DGX Spark hardware, and enabled TMA path by default on Intel GPU (#171685, #173898, #172316)
- Added support for torchcomms lowering in inductor IR (#171634)
- Allow int8 layout dtype for cpp gemm template on CPU (#169161)
- Improved batch matmul codegen (#172678)
- Improved error message in standalone_compile when there are no aot_autograd artifacts (#174086)
- Removed unnecessary synchronize before launcher creation (#169432)
- Removed implicit float64 upcast in Triton codegen, improving performance and reducing unnecessary precision casting (#172143)
- Added torch.compile compatibility to FP8 SDPA using FlashAttention3, including meta registration and inductor lowering fallback for the new scaled_dot_product_flash_attention.low_p overload (#172622)
- Replace record_function with _RecordFunctionFast in CompiledFxGraph for reduced profiling overhead (#163976)
- Relaxed restriction on triton template mutated_inputs, allowing more flexible template usage (#170721)
- Added combo_kernels_pointwise_only config option to exclude reduction kernels from combo kernel fusion (#174894)
- Add a fusion region utility for grouping inductor fusible nodes for aten estimation (#170559)
- Pallas backend: Added support for pooling with strided indexing, masked operations, random, FloorDiv, flattened indexing, welford fallback, ModularIndexing, transpose, im2col gather pattern detection, element-wise pairing, sympy min/max, FMA, automatic padding to WARPGROUP_SIZE, atomic_add store mode, TMA for OOB masking on Mosaic GPU, jax/cuda stream sync, better iter var tracking, and interleaved rope (#170014, #170145, #170221, #170222, #170232, #170595, #170616, #170627, #170738, #170741, #171449, #171475, #171518, #171539, #171567, #172306, #173840, #174249, #174797)
- Add per-graph inductor config override for debugging/bisecting (#174228)
torch.fx
- `torch.fx.symbolic_trace` now supports tracing `HigherOrderOperator`s that do not take callable arguments (#173839)
- Rename `hint_int` to `size_hint`; support `size_hint` in user code (#171944)
- Add metadata hook for all nodes created in runtime_assert pass (#173970)
- Add `_disable_torch_fn_metadata_mode` option to `make_fx` and `aot_export_joint_with_descriptors` (#172087)
- Add nested value-type opaque object support (#169845)
torch.export
- `from_node` provenance information is now preserved when serializing exported programs (#171726)
- Bitwise shift operations are now supported in the export serializer (#167913)
- Improve leak detection in non-strict export mode (#172597)
Quantization
- Use `expm1` for computing quantized ELU, improving numerical stability (#173968)
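The `expm1` change matters because ELU's negative branch is `alpha * (exp(x) - 1)`, which cancels catastrophically for `x` near zero. A minimal pure-Python illustration of the effect (not the actual quantized kernel):

```python
import math

ALPHA = 1.0

def elu_naive(x):
    # exp(x) - 1 suffers catastrophic cancellation for x near 0:
    # exp(x) rounds near 1, and the subtraction discards the low bits.
    return x if x > 0 else ALPHA * (math.exp(x) - 1.0)

def elu_expm1(x):
    # expm1 evaluates exp(x) - 1 directly, accurate even for tiny x.
    return x if x > 0 else ALPHA * math.expm1(x)

x = -1e-12
print(elu_naive(x))   # typically off around the 4th-5th significant digit
print(elu_expm1(x))   # accurate to full double precision
```

The same reasoning applies to any formula of the form `exp(x) - 1` (or its inverse, `log1p` for `log(1 + x)`).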
ONNX
- Implement torch.sym_sum and torch.sym_ite (#170263)
- Raise an error if there are duplicated input/output names (#173077)
- Refactor optimize and version conversion logic (#173185)
Optimizer
- Optimizer graph capture check now supports XPU devices in addition to CUDA (#172759)
DevX
- The `spin lint` command now supports pass-through arguments to lintrunner, including `--take`, `--skip`, and `--tee-json` flags, giving developers more control over which linters run (#169373)
Ahead-Of-Time Inductor (AOTI)
- Better error message for mixed device tensors (#173982)
- Support mixed-device constants (#169504)
- Change `cpp_kernel_name` to public API to match AOTI shim gen; add `mm_type_out` to AOTI fallback kernel (#174489)
Bug fixes
Release Engineering
- Fixed macOS wheel metadata where setuptools misinterpreted the platform version string, producing incorrect wheel tags for macOS arm64 builds (#173541)
- Fixed incorrect wheel naming (#173945)
- Fixed macOS arm64 libtorch release upload failure (#175100)
- Fix pep517 release handling (#175635)
Python Frontend
- Fixed a bug where `torch.load` with `FakeTensorMode` or `skip_data` context would compute incorrect storage sizes (#170618)
- Fixed PrivateUse1 backend aliasing during deserialization so custom backends are correctly recognized when loading checkpoints (#165456)
- Fixed `torch.ops.aten.index.Tensor` to properly raise an `IndexError` when called with an empty indices list, instead of producing undefined behavior (#174009)
Autograd
- Fixed absolute tolerance scaling for complex backpropagation in `torch.autograd.gradcheck` when `fast_mode=True` (#166386)
Complex Frontend
- Fixed `torch.view_as_complex()` not working on the memory layout produced by `.contiguous()` after `.transpose()` (#169780)
Composability
- Fix `torch.bucketize` crash during `torch.export` when `test_elements` is a scalar (#170751)
- Fix `MaxUnpool` crash when input tensors are small (#169359)
Dataloader
- Fix DataLoader to respect overridden `__getitem__` in Subset subclasses (#163961)
Nested Tensor (NJT)
- Fix NestedTensor min/max operations for integer dtypes (#167685)
torch.nn
- Fix Illegal Memory Access in FlashAttention 2 by syncing with upstream (#174114)
- Propagate `max_q` and `max_k` for `varlen_attn` (#173681)
C++ Frontend
- Fixed bug in stable ABI shim to handle special types (`SymInt`, `MemoryFormat`, `ScalarType`, `Layout`) correctly (#174734)
Distributed
- Add half precision binding for MPI backend (#170074)
- Fix incorrect boolean logic in `std::string::find` usage in c10d (#170057)
- Fix `_set_pg_timeout` not working for Gloo backend (#167052)
- Fix DeviceMesh corner case for coalesce in cute layout and mesh slicing (#169454)
- Fix Context Parallel `flex_input_fn` argument unwrapping issue (#170201)
- Fix FSDP `_unshard()` passing `Stream` instead of `Event` (#170525)
- Ensure threadblock size >= world size in CUDA symmetric memory barrier (#170785)
- Fix `ProcessGroupGloo` CUDA tensor stream handling with futures (#170812)
- Fix env variable to retrieve HCA list for NVSHMEM (#170891)
- Fix FSDP `split_with_sizes_copy()` missing `dim` argument (#169173)
- Fix cross-thread work registry lookup in `wait_tensor` (#171614)
- Fix `fully_shard` arg typehint inconsistency (#171574)
- Fix Flight Recorder default buffer size inconsistency (#172843)
- Remove mixed dtype rejection for `clip_grad_norm` to align with documentation (FSDP) (#173641)
- Fix all-reduce strides in compiled code (#171616)
- Fix `ProcessGroupWrapper` missing method forwarding (#173599)
- Fix syntax for suppression comments (#167088)
Distributed Checkpoint (DCP)
- Fix TypedStorage deprecation warning in distributed checkpoint (#170759)
- Fix typo in variable name from 'statetful_sd' to 'stateful_sd' (#171292)
DTensor
- Preserve `Partial(max/min)` reduce op type on `torch.max`/`torch.min` output DTensors (#170203)
- Prevent pointwise operations between `Partial` DTensors with different reduce ops (#170209)
- Fix OpInfo.schema type and add asserts (#170790)
- Fix _StridedShard(sf=) bug in single dim strategy (#171942)
- Fix incorrect Tensor Meta Population (#172304)
- Fix single-dim symint handling and `_create_expanded_strategy` (#172421)
- Fix single-dim in-place op expansion (#172477)
- Fix single-dim output_meta validation (#172293)
- Fix redistribute cost crashing on non-participating ranks (#172478)
- Fix t() sharding strategy for 1D tensors (#173964)
- Fix unsupported op error (#170889)
- Make DTensor honor single-dim RuntimeSchemaInfo (#174312)
- Fix device_mesh extraction from kwargs (#173489)
- Fix StridedShard usage conflict with shard order (#174831)
- Fix bucketize with Partial inputs (#173937)
- Fix embedding_dense_backward cache key missing num_weights (#174727)
FullyShardedDataParallel2 (FSDP2)
- Fix mixed DTensor error with nested FSDP and activation checkpoint (#171779)
CUDA
- Don't false-positive on record_stream and account for 0-element tensors in CUDA stream sanitizer (#172562)
- Fix failures due to launch bounds for `ctc_loss_gpu_template` on SM12+ (#172447)
- Fix the `torch.Stream` context manager reentrance (#176568)
cuDNN
- Disable a cuDNN Convolution engine preemptively (#171747)
MPS
- Fixed non-contiguous grid sampler on MPS (#171619)
- Fixed large reductions when compiling for MPS (#171479)
- Fixed MPS Inductor `tanh` implementation (#172406)
- Fixed complex to real power scalar on MPS (#174147)
- Fixed masked op logic in MPS Inductor (#170134)
- Fixed `orgqr` race condition on MPS (#174143)
- Fixed 2-pass SDPA memory corruption by forcing float accumulators, resolving nondeterministic/corrupt results with bf16/fp16 and GQA when seq_len > 1023 (#174945)
- Fix half-precision type mismatches in Metal shader codegen (#176436)
ROCm
- Sliding window attention NaN issue is fixed by AOTriton 0.11.2b. (#173204, #174105)
- Increased the `event_name` attribute size in autograd's profiler_util.py to avoid truncation of long HIP event names. (#174366)
- Cholesky operator via MAGMA was missing a sync operation. (#172112)
- Updated patched libdrm in bundled release wheels to avoid missing amdgpu.ids warning and properly return AMDGPU marketing names. (#174811)
- Fix fake_quantize undefined behavior with inf. (#171777)
- Fix deterministic scan kernel edge case. (#170763)
- Use torch's caching allocator for CK workspaces, for better memory behavior and hipGraph capture. (#172311)
- Fixes grouped gemm 2d2d for edge cases with uninitialized data. (#174314)
Sparse Frontend
- Fixed `torch.sparse.spdiags` crashing with zero-dimension shapes (#174052)
torch.func
- Fixed `GradTrackingTensor.tolist()` not working on MPS (macOS) device tensors when used inside `torch.func.grad` or other function transforms (#171317)
XPU
- Fix T5 model SDPA pattern matcher on XPU (#171774)
- Fix `torch.xpu.memory_allocated`/`torch.xpu.memory_reserved` reporting incorrect memory sizes (#171453)
torch.compile
Dynamo
- Fixed memory leaks: cleared weakrefs from memos/guards after compilation (#165367), cleared weak references from `FakeTensorMode` after compile (#171209), fixed CUDA memory usage for CPU-only compile (#163841)
- Fixed overguarding on `OrderedSet`, `set`, and `frozenset` with activation checkpointing (#169535, #170291)
- Fixed `MATCH_MAPPING`, `MATCH_KEYS`, and `MATCH_SEQUENCE` opcodes for Python pattern matching (match/case) support (#173085, #173086, #173087)
- Handle List/Dict comprehension graph breaks for Python 3.12+, including nested comprehensions (#173558, #174413)
- Fixed support for self-referential lists and dicts (#173672, #174498)
- Fixed `share_memory_` compile failure (#171162)
- Fixed `defaultdict` default factory and union functionality (#168028)
- Fixed property setter on `MutableMapping` subclasses (#173184)
- Various tensor subclass fixes: subclass handling (#170871), sequence conversion (#172103), metadata propagation for in-place ops (#167583)
- Correctly pass `is_inference` in the cudagraphs `torch.compile` backend (#174713)
AOTDispatcher
- Fixed effect token handling when a graph contains subgraphs without tokens (#173226)
Inductor
- Avoid TMA assert and honor descriptor min block sizes in mix-order (#170949)
- Prevent fusing atomic scatter ops with dependent reads (#172301)
- Combo kernel fixes: missing store masks, `persistent_reduction` heuristics, `FusedMixOrderReductions` grouping, and scalar store broadcast shape mismatch (#168939, #169509, #169721, #172658)
- Fix `torch.cond` stride mismatch when subgraph outputs are padded (#169963)
- Fix `view_copy` decomposition to handle dtype size changes (#171442)
- Fix output shape mismatch for `aten.isin` with scalar inputs (#171272)
- Fix segfault with DeviceContext and python scalar (#172660)
- Fix WeakDeps for clone ops in scheduler (#173703)
- Fix strides for `aten.uniform` (#170794)
- Handle negative scaling factors in `upsample_nearest` (#171151)
- FlexAttention fixes: non-power-of-2 head dimensions in decode, score mod captured grad dtype, GQA/MQA, force last dim contiguous, deterministic semantics, and TMA availability fallback (#164931, #170842, #171271, #172690, #174251, #174427)
- Fix compatibility with new TF32 API by removing legacy `allow_tf32` usage from inductor internals (#173731)
- Fix out of shared memory issue with too high `num_stages` (#170071)
- Fix cooperative reduction deadlocks due to too many CTAs (#170162)
- Fix `select_scatter` dtype assertion error (#171311)
- Fix bug in `aten.pow` lowering (#170960)
- Make `silu` output match eager mode in inductor (#171723)
- Fix constant folding crash on 0-d complex tensors (#172181)
- Fix `bucketize` crash in compile mode (#171595)
- Fix `OverflowError` when truncating infinite values (#166636)
- Fix non-contiguous valid reshape handling for dynamic shapes (#173062)
- Fix strides for `narrow_copy` decomposition (#173043)
- Fix race condition in FxGraphCache when reading temp files (#172144)
- Fix dynamic shapes infinity check (#173426)
- Fix DDE in inductor/scheduler fusion (#173412)
- Fix clone removal for conjugated tensors in `remove_noop_ops` (#172160)
- Fix edge case with `next_power_of_2` (#174330)
- Fix conditions to apply TMA (#174480)
- Fix CUTLASS import error (#174924)
- Fix persistent reduction codegen (#174928)
- Fix lowering fallback for `_pdist_forward` and `_pdist_backward` (#170959)
- Fix coalesce analysis to work with symints (#161819)
- Fix `libdevice.pow` type mismatch in Triton codegen (#173685)
- Fix missed symbol propagation in cudagraphs partition (#174729)
- Fix dtype issue in `exclusive_scan_decoupled_lookback_64` (#171153)
- Skip attention fusion on fp32 when TF32 is not enabled (#172951)
- Fix `RuntimeError` in `FxGraphCachePickler.dumps()` for unpickleable pybind11 objects (#173577)
- Fix loop split optimization when nodes have mismatched `var_ranges` (#173607)
- Fix error parameter handling (#170952)
- Guard against `n_spills` being None before threshold comparison in Triton heuristics (#169940)
- Autotuning fixes: TMA workspace pickling in async-pipelined mode, synthetic offsets for grouped MM, and GEMM inputs that are views of the same buffer (#173089, #171316, #171363)
- Fix reduction hint for inner-dimension reductions (#167472)
- Fix crash when Triton compile fails in `identify_mutated_tensors()` (#170808)
- Handle unbacked symbols in `should_use_persistent_reduction` (#172534)
- Pallas backend: Multiple bug fixes including store for scalar values, index_put, scatter, cat, mean reduction codegen, negative index handling, iteration variable ordering, batch norm broadcasting, scalar store shape mismatch, square buffer transpose detection, strided tensor handling, and rope (#170026, #170027, #170028, #170464, #170593, #170594, #170653, #171215, #171476, #171504, #171571, #171581, #171612, #174791)
- Fix `use_compute_types` parameter name mismatch in CPP backend `to_dtype()` that caused `TypeError` when enabling `emulate_precision_casts=True` on CPU (#168125)
- Fix Invalid Memory Access in TMA mm kernel by limiting GEMM to valid MNK dimensions (#171871)
- Guard against zero timing in combo kernel benchmarking to prevent division by zero (#172918)
- XPU: Fix SDPA pattern matcher for T5 models (#171774)
- Fix extern kernel kwargs with IR nodes not being materialized in FXIR backend, causing errors with ops like `torch.segment_reduce` (#172981)
- Fix `torch.isin` compile shape for scalar `test_elements` (#172531)
- Fix kernel benchmark runner (#169976)
torch.fx
- Fix `torch.fx.symbolic_trace` `to_folder` with `torch.nn.Sequential` modules (#169279)
- Fix `Node.type` pickling in `torch.fx` (#169172)
- Raise ValueError for invalid fusion strategy and add test (#171573)
- Fix typos (#171042)
- Fix input mutation handling for subclasses (DTensor copy_) (#170467)
torch.export
- Fix graph signature mutation when unlifting exported programs (#170461)
- Fix tensor name inconsistency when round-tripping through `torch.export.save` and `torch.export.load` (#171954)
- Fix handling of incomplete tensors (cuDNN packed format) in `torch.export` (#172805)
Quantization
- Fix incorrect dilation check in quantized `MaxPool3d` that could produce negative values (#171790)
- Fix incorrect values displayed in quantization error messages and log strings (#171868)
- Add validation for `out_channels` in quantized `ConvTranspose` modules (#171628)
- Fix crash when creating quantized tensors with empty dimensions (#163487)
- Fix context cache for FP8 quantized linear operations where weight scales could become invalid (#172553)
ONNX
- Handle complex initializers (#170231)
- Fix export of torch.cdist with dynamic axes (#172758)
- Fix InputObserver.infer_arguments with empty caches (#174205)
Foreach
- Fixed `_foreach_copy_` producing incorrect results when destination tensors have mixed dtypes (#173531)
- Fixed `_foreach_max` returning incorrect results when tensors contain negative values, by initializing output with `lowest()` instead of zeros (#173241)
- `_foreach_norm` now properly raises an error when computing infinity norm on empty tensors, instead of returning undefined results (#173238)
- Fixed `_foreach_norm` to support symbolic tensors with dynamic shapes (#174026)
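The `_foreach_max` fix above is the classic max-reduction initialization bug: seeding the accumulator with zero silently clamps all-negative inputs. A minimal pure-Python sketch of the failure mode (not the actual CUDA kernel):

```python
def reduce_max_buggy(values):
    # Accumulator starts at 0: any all-negative input is clamped to 0.
    acc = 0.0
    for v in values:
        acc = max(acc, v)
    return acc

def reduce_max_fixed(values):
    # Start at the lowest representable value (cf. numeric_limits lowest()),
    # so the first element always replaces the initial accumulator.
    acc = float("-inf")
    for v in values:
        acc = max(acc, v)
    return acc

data = [-3.0, -7.5, -0.25]
print(reduce_max_buggy(data))  # 0.0 -- wrong
print(reduce_max_fixed(data))  # -0.25
```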
Ahead-Of-Time Inductor (AOTI)
- Fix import error in aoti load (#173751)
- Fix mixed-device zero-size constant indexing error (#172748)
Performance
Python Frontend
- Added `__slots__` to pytree `TreeSpec` dataclasses, reducing memory usage and improving attribute access speed (#172172)
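For readers unfamiliar with the optimization: `__slots__` removes the per-instance `__dict__`, so each object stores its fields in fixed offsets instead of a hash table. A generic illustration with a hypothetical spec class (not the real `TreeSpec`):

```python
from dataclasses import dataclass

@dataclass
class PlainSpec:  # hypothetical stand-in, not the real TreeSpec
    node_type: object
    children: tuple

@dataclass
class SlottedSpec:
    # Declaring __slots__ suppresses the per-instance __dict__.
    __slots__ = ("node_type", "children")
    node_type: object
    children: tuple

plain = PlainSpec(list, ())
slotted = SlottedSpec(list, ())
# The slotted instance carries no __dict__, saving memory per spec
# and making attribute lookup a fixed-offset access.
print(hasattr(plain, "__dict__"), hasattr(slotted, "__dict__"))
```

On Python 3.10+, `@dataclass(slots=True)` achieves the same effect more concisely.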
Composability
- Multiple optimizations to symbolic shape reasoning, including faster symbol sorting, reduced redundant hint computations, and optimized construction of relational expressions (#174615, #174655, #174664, #174652, #174665, #174662)
Distributed
- Sort mempool registrations via allocation-time counter for CUDA mempools (#167662)
- Improve `_get_param_to_fqns` from O(N^2) to O(N) in FSDP (#174675)
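A generic sketch of this kind of quadratic-to-linear rewrite, using hypothetical helper names rather than the actual FSDP code: instead of rescanning all named parameters for every parameter, build one identity-keyed index up front.

```python
# Hypothetical sketch: mapping parameter objects to their fully
# qualified names (FQNs).

def param_to_fqns_quadratic(named_params, params):
    # O(N^2): inner scan over all named parameters, once per parameter.
    return [[name for name, p in named_params if p is q] for q in params]

def param_to_fqns_linear(named_params, params):
    # O(N): one pass builds an index keyed by object identity,
    # then each lookup is O(1).
    index = {}
    for name, p in named_params:
        index.setdefault(id(p), []).append(name)
    return [index.get(id(q), []) for q in params]

named = [("layer1.weight", object()), ("layer1.bias", object())]
params = [p for _, p in named]
print(param_to_fqns_linear(named, params))
```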
CUDA
- Speed up `grouped_mm` a bit (#170802)
- Add fast memory snapshot option which skips traces (#173949)
- Allow all 10.x compute capabilities to use vec8 kernel for higher realized memory bandwidth (#174362)
MPS
- Migrated `atan2` to native MPS Metal kernel (#173405)
- Migrated `pow_tensor_scalar` and `reciprocal` to Metal shaders (#170077)
- Reimplemented Cauchy distribution with native Metal kernel (#174062)
- Rewrote `log_normal` and `geometric` distributions as Metal shaders (#174189)
- Migrated `grid_sampler_2d` to Metal (#174343)
ROCm
- Revert MIOpen channels last support back to opt-in using the environment variables PYTORCH_MIOPEN_SUGGEST_NHWC=1 and PYTORCH_MIOPEN_SUGGEST_NHWC_BATCHNORM=1. (#170780)
- New fx pass to reduce atomic contention. (#168073)
- TopK performance improvements; single-block warp-level compaction (#171940), warp merge sort (#170029).
- Enable fastSpecializedAtomicAdd for gfx950, improving performance of index-based operators like embedding bag, sampling, and scatter/gather. (#170330)
- Optimize radix select by caching data on shared memory. (#172517)
- Optimize reduction operator launch configuration for better performance. (#173576)
- Improvements to inductor reduction kernel heuristics for MI350. (#170931)
XPU
- Optimize `int_mm` performance on Intel GPU when `mat2` is non-contiguous (#169555)
- Enable static Triton kernel launcher for XPU backend to reduce model compilation time (#169938)
torch.compile
Dynamo
- Various compile time improvements: caching for `inspect.signature`, `var_getattr`, attr source construction, and higher-order ops; fast paths for `bind_args`, `GET_ITER` on tuples, and `tree_map` on namedtuples; lazy variable tracker optimizations (#170100, #169959, #173582, #174437, #174438, #174141, #174020, #174130, #174901, #174598)
Inductor
- FlexAttention performance: Improved flex decoding by avoiding unbalanced splitting, support multi BLOCK_M for flex-decoding, elide score-mod when not needed, and avoid materializing lse grad (#167935, #170343, #172380, #173481)
- Compile-time improvements: optimized reorder/sink passes, removed expensive `dynamo_timed` calls, used `_has_symbolic_sizes_strides` more pervasively, overlapped template fusion with async_compile, and optimized Triton template heuristics (#169370, #169661, #169662, #170444, #170565, #172249, #174408)
- Apply Gumbel-max trick for sampling (#171143)
- Improved combo kernels design, add per-sub-kernel block dimensions for combo kernels and flattened grid dispatch (#171671, #172527)
- Rewrote `avg_pool2d` as a reduction for improved performance (#167228)
- Optimized tensor broadcasting (#170772)
- Reduced autotune search space for pointwise Triton kernels (#169508)
- Added BLOCK_K=64 and 128 autotuning configs for Triton conv, improving performance for larger channel sizes (#174752)
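The Gumbel-max trick mentioned in the list above samples from a categorical distribution with an argmax: perturb each logit with independent Gumbel(0, 1) noise and take the index of the maximum, which is distributed as softmax of the logits. A plain-Python sketch (Inductor generates fused Triton code rather than a loop like this):

```python
import math
import random

def gumbel_max_sample(logits, rng):
    # Gumbel(0, 1) noise: g = -log(-log(u)), u ~ Uniform(0, 1).
    # argmax(logit_i + g_i) is distributed as softmax(logits).
    perturbed = [l - math.log(-math.log(rng.random())) for l in logits]
    return max(range(len(logits)), key=perturbed.__getitem__)

# Empirical check: frequencies should approximate
# softmax([0, 1, 2]) ~ (0.09, 0.24, 0.67)
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(20000):
    counts[gumbel_max_sample([0.0, 1.0, 2.0], rng)] += 1
print(counts)
```

The trick avoids computing the normalizing constant of the softmax, which is why it fuses well with pointwise kernels.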
torch.fx
- Improve node index lookup performance in FX graphs by using an index lookup map (#173385)
Quantization
- Use oneDNN primitive for FP8 quantized convolution on x86, replacing the previous reference kernel (#172551)
Optimizer
- Use fused multiply-add (FMA) instructions for fused Adam and AdamW on CUDA, improving numerical accuracy and performance (#173224)
- Improve performance of fused Adam, AdamW and SGD by not using double compute (#173227)
Foreach
- Enabled fast path for `_foreach_addcmul` and `_foreach_addcdiv` when `tensor1` is a 0D (scalar) tensor (#172731)
Documentation
Release Engineering
- Fixed stable C++ docs rendering (#171957)
- Updated pytorch_sphinx_theme2 version to 0.4.3 (#174806)
- Updated pytorch_sphinx_theme2 version to 0.4.6 (#177562)
Python Frontend
- Clarified `torch.unique` behavior when using the `dim` parameter (#171608)
- Clarified `torch.as_tensor` signature to document that `dtype` and `device` are keyword-only arguments (#173073)
- Fixed `torch.tensordot` documentation (#173893)
- Added dedicated docstring for `torch.Tensor.permute` to clarify it accepts variadic arguments unlike `torch.permute` (#170689)
Autograd
- Improve `torch.autograd.set_multithreading_enabled` docs (#170204)
Nested Tensor (NJT)
- Add warning about inactive development to NJT docs (#172645)
C++ Frontend
- Added C++ docs for the `torch::stable` namespace (#170912)
Sparse Frontend
- Fixed incorrect gradient support documentation for `torch.sparse.mm` and `torch.sparse.addmm` (#174039)
XPU
- Update XPU Get Started guide with new client GPU and formatting (#169810)
- Document previous version of Torch XPU installation (#174453)
- Update previous version 2.10 installation in XPU Get Started guide (#176141)
torch.fx
- Add documentation for previously undocumented functions (#170581)
- Remove outdated jit files (#173015)
Quantization
- Update `tensor_attributes.rst` with additional `torch.float4_e2m1fn_x2` dtype documentation (#170448)
ONNX
- Change warning to debug log for missing annotations (#172247)
Optimizer
- Improved EMA (Exponential Moving Average) equation documentation in optimizer docs (#172423)
Security
Python Frontend
- Fixed a ZipSlip directory traversal vulnerability in `torch.hub` that could allow malicious zip files to extract files outside the target directory. `torch.hub` now validates all extracted paths and raises a `ValueError` if an archive attempts to write outside the expected folder. (#171754)
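For context, ZipSlip exploits archive member names like `../../evil` so that extraction writes outside the destination directory. A generic sketch of the validation approach (`safe_extract` is a hypothetical helper, not `torch.hub`'s actual code):

```python
import io
import os
import tempfile
import zipfile

def safe_extract(zf: zipfile.ZipFile, dest: str) -> None:
    """Extract only if every member resolves inside dest (ZipSlip guard)."""
    dest_root = os.path.realpath(dest)
    for member in zf.namelist():
        target = os.path.realpath(os.path.join(dest_root, member))
        # commonpath equals dest_root only when target stays inside it.
        if os.path.commonpath([dest_root, target]) != dest_root:
            raise ValueError(f"Archive member escapes target dir: {member!r}")
    zf.extractall(dest_root)

# Demo: a malicious archive whose member name climbs out of the target dir.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("../evil.txt", "payload")

with tempfile.TemporaryDirectory() as d:
    try:
        safe_extract(zipfile.ZipFile(io.BytesIO(buf.getvalue())), d)
        blocked = False
    except ValueError:
        blocked = True
print("traversal blocked:", blocked)
```

Resolving with `realpath` before comparing also defeats escapes via symlinked intermediate directories.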
Developers
Release Engineering
- Clarified needs-repro guidelines and edge cases for issue triage (#174028)
- Updated tlparse instructions in PT2 bug report template (#172392)
- Remove +ptx from CUDA 13.0 builds (#175567)
- Update inductor CI jobs to CUDA 13.0 (#175826)
- Upgrade ROCm CI to 7.2 (#173188)
- Switch vLLM test and benchmark workflows to CUDA 13.0 (#175393)
- Windows override AMI pre-installed cudnn (#177027)
- Unpin cuda-bindings dependencies (#176042)
- Stop using G3 runners (#175938)
Autograd
- Fix grad_outputs type hints and parameter descriptions for a few autograd APIs (#164838)
Distributed
- Fix `USE_NCCL=0` build failure in `nccl_dev_cap.hpp` (#171694)
DTensor
- Add DTensor performance benchmarks for collectives, `from_local`/`to_local`, and backward passes (#171576)
- Add DTensor benchmarks for miscellaneous dispatch paths (#171847)
CPU
- Fixed compilation failure with GCC 14.2.0 on ARM SVE targets (e.g., `-march=armv8-a+sve+bf16` on Debian 13) by broadening the GCC version workaround for SVE compilation (#174647)
CUDA
- Cleanup `at::numeric_limits` (#171111)
- Move EventPool::Event to c10 (#158220)
- Reuse CUDAEventPool in CUDA caching host allocator (#168345)
- Move CUDAEvent to c10 (#158219)
cuDNN
- Upgrade cuDNN to 9.19 for 12.8 and 13.0 wheels (#174310, #175547)
MPS
- Migrated `_local_scalar_dense_mps` to DispatchV2 (#172967)
ROCm
- Fix unused-result warning in UniqueCub.cu (#174203)
- Use rocm_sdk preloaded libraries for hiprtc and amdhip64. (#169855)
- Fix torch.utils.cpp_extension build folder accelerator detection ROCm (#170784)
- Unify hipBLASLt architecture lists into common hook methods. (#172791)
XPU
- Switch Intel Triton compiled kernel format from `spv` to `zebin` (#167972)
torch.compile
Inductor
- Improved cudagraph logging: updated partition format, added tree logging, and `[compile_id]` prefix style (#170217, #170218, #174110)
- Added Proton profiling support for inductor (#171192)
- Added support for dumping input tensors for Triton kernels generated with Inductor backend (#171575)
- Inductor compiler bisector: added run mode for debugging and cudagraph application (#170717, #172461)
- Autotune configurability: number of choices displayed is now settable via environment variable, and rep options are easily overridable (#171429, #171629)
- Added template codegen/emission override hooks for external template buffers (Helion integration) (#174148)
- `fx_graph_runnable` improvements: fixed single quotes in envvars, missing symints, nested triton kernels, and global constexpr support (#173704, #174035, #174038, #174533)
- Added perfetto trace events to graph pass application with envvar to disable (#172460)
torch.fx
- `GraphModule.print_readable()` improvements: new `additional_meta` argument for displaying additional node metadata (#173734), long annotations are now truncated for readability (#173119), and fixed trailing whitespace with inner graphs (#172644)
- `GraphPickler` improvements: support for custom ignored metadata field keys (#172587), a `debug_dumps` method for debugging pickle failures (#173675), respecting `__getstate__` for `GraphModule` serialization (#173810), and automatic fallback to dill if available (#173801)
- Stack traces on `invoke_subgraph` nodes now point to the original model code for easier debugging (#170927)
- Add process group support to `fx_graph_runnable` (#173932)
- Strict type coverage in dispatch and partial subclass (#171808)
torch.export
- Add strobelight profiling support for the export process (#174606)
Mobile
- Fix spin lint on arm (#172965)