- Highlights
- Tracked Regressions
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug fixes
- Performance
- Documentation
- Developers
Highlights
We are excited to announce the release of PyTorch® 2.6 (release notes)! This release features multiple improvements for PT2: torch.compile can now be used with Python 3.13; a new performance-related knob, torch.compiler.set_stance; and several AOTInductor enhancements. Besides the PT2 improvements, another highlight is FP16 support on X86 CPUs.
NOTE: Starting with this release we will no longer publish packages on Conda; please see [Announcement] Deprecating PyTorch’s official Anaconda channel for the details.
For this release the experimental Linux binaries shipped with CUDA 12.6.3 (as well as the Linux Aarch64, Linux ROCm 6.2.4, and Linux XPU binaries) are built with CXX11_ABI=1 and use the Manylinux 2.28 build platform. If you build PyTorch extensions with custom C++ or CUDA code, please update those builds to use CXX11_ABI=1 as well and report any issues you encounter. For the next PyTorch 2.7 release we plan to switch all Linux builds to Manylinux 2.28 and CXX11_ABI=1; please see [RFC] PyTorch next wheel build platform: manylinux-2.28 for the details and discussion.
Also in this release, as an important security improvement, we have changed the default value of the weights_only parameter of torch.load. This is a backward compatibility-breaking change; please see this forum post for more details.
This release is composed of 3892 commits from 520 contributors since PyTorch 2.5. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve PyTorch. More information about how to get started with the PyTorch 2-series can be found at our Getting Started page.
| Beta | Prototype |
| --- | --- |
| torch.compiler.set_stance | Improved PyTorch user experience on Intel GPUs |
| torch.library.triton_op | FlexAttention support on X86 CPU for LLMs |
| torch.compile support for Python 3.13 | Dim.AUTO |
| New packaging APIs for AOTInductor | CUTLASS and CK GEMM/CONV Backends for AOTInductor |
| AOTInductor: minifier | |
| AOTInductor: ABI-compatible mode code generation | |
| FP16 support for X86 CPUs | |
*To see a full list of public feature submissions click here.
BETA FEATURES
[Beta] torch.compiler.set_stance
This feature enables the user to specify different behaviors (“stances”) that torch.compile can take between different invocations of compiled functions. One of the stances, for example, is “eager_on_recompile”, which instructs PyTorch to run the code eagerly when a recompile is necessary, reusing cached compiled code when possible.
For more information please refer to the set_stance documentation and the Dynamic Compilation Control with torch.compiler.set_stance tutorial.
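As a quick illustration, here is a minimal sketch using the context-manager form of set_stance (stance names and exact behavior are best confirmed against the documentation above):

```python
import torch

@torch.compile
def fn(x):
    return x * x + 1

fn(torch.randn(4))  # first call triggers compilation

# While this stance is active, calls that would require a recompile
# (e.g. a new input shape) run eagerly instead of compiling again,
# but inputs matching already-compiled code still use the cached version.
with torch.compiler.set_stance("eager_on_recompile"):
    fn(torch.randn(4))   # reuses the cached compiled code
    fn(torch.randn(8))   # would recompile, so it runs eagerly instead
```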
[Beta] torch.library.triton_op
torch.library.triton_op offers a standard way of creating custom operators that are backed by user-defined Triton kernels. When users turn user-defined Triton kernels into custom operators, torch.library.triton_op allows torch.compile to peek into the implementation, enabling torch.compile to optimize the Triton kernel inside it.
For more information please refer to the triton_op documentation and the Using User-Defined Triton Kernels with torch.compile tutorial.
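A condensed sketch, loosely following the tutorial linked above. It assumes Triton is installed, a GPU is available, and that the wrap_triton helper (used in the tutorial to make the kernel launch visible to torch.compile) is present in your build:

```python
import torch
import triton
from triton import language as tl
from torch.library import triton_op, wrap_triton

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Expose the Triton kernel as a custom operator that torch.compile can see into.
@triton_op("mylib::add", mutates_args={})
def mylib_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    wrap_triton(add_kernel)[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024)
    return out

@torch.compile
def f(x, y):
    return mylib_add(x, y)
```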
[Beta] torch.compile support for Python 3.13
torch.compile previously only supported Python up to version 3.12. Users can now optimize models with torch.compile in Python 3.13.
[Beta] New packaging APIs for AOTInductor
A new package format, “PT2 archive”, has been introduced. The package is essentially a zipfile of all the files that AOTInductor needs, and allows users to send everything required to other environments. There is also functionality to package multiple models into one artifact, and to store additional metadata inside the package.
For more details please see the updated torch.export AOTInductor Tutorial for Python runtime.
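For illustration, a minimal sketch of the packaging flow (API names follow the 2.6 tutorial; exact signatures may differ slightly):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x @ x)

ep = torch.export.export(M(), (torch.randn(8, 8),))

# Compile ahead of time and bundle the .so, .cubin files, serialized extern
# kernels and metadata into a single "PT2 archive".
pkg = torch._inductor.aoti_compile_and_package(ep, package_path="model.pt2")

# Later (possibly in another environment with a Python runtime), load and run it.
loaded = torch._inductor.aoti_load_package(pkg)
out = loaded(torch.randn(8, 8))
```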
[Beta] AOTInductor: minifier
If a user encounters an error while using AOTInductor APIs, AOTInductor Minifier allows creation of a minimal nn.Module that reproduces the error.
For more information please see the AOTInductor Minifier documentation.
[Beta] AOTInductor: ABI-compatible mode code generation
AOTInductor-generated model code has a dependency on PyTorch C++ libraries. Since PyTorch evolves quickly, it is important to make sure previously AOTInductor-compiled models can continue to run on newer PyTorch versions, i.e. that AOTInductor is backward compatible.
In order to guarantee application binary interface (ABI) backward compatibility, we have carefully defined a set of stable C interfaces in libtorch and made sure AOTInductor generates code that only refers to this specific set of APIs and nothing else in libtorch. We will keep the set of C APIs stable across PyTorch versions and thus provide backward compatibility guarantees for AOTInductor-compiled models.
[Beta] FP16 support for X86 CPUs (both eager and Inductor modes)
The Float16 datatype is commonly used for reduced memory usage and faster computation in AI inference and training. CPUs like the recently launched Intel® Xeon® 6 with P-Cores support Float16 with the native AMX accelerator. Float16 support on X86 CPUs was introduced in PyTorch 2.5 as a prototype feature; it has now been further improved for both eager mode and torch.compile + Inductor mode, making it a Beta-level feature with both functionality and performance verified across a broad scope of workloads.
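A small sketch of using float16 on an x86 CPU in both modes (assuming a CPU build of PyTorch; CPU autocast with torch.float16 is used here for convenience):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()).eval()
x = torch.randn(32, 256)

# Eager-mode float16 inference.
with torch.no_grad(), torch.autocast("cpu", dtype=torch.float16):
    y_eager = model(x)

# The same workload through torch.compile + Inductor.
compiled = torch.compile(model)
with torch.no_grad(), torch.autocast("cpu", dtype=torch.float16):
    y_compiled = compiled(x)
```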
PROTOTYPE FEATURES
[Prototype] Improved PyTorch user experience on Intel GPUs
PyTorch user experience on Intel GPUs is further improved with simplified installation steps, Windows release binary distribution and expanded coverage of supported GPU models, including the latest Intel® Arc™ B-Series discrete graphics. Application developers and researchers seeking to fine-tune, run inference and develop with PyTorch models on Intel® Core™ Ultra AI PCs and Intel® Arc™ discrete graphics can now directly install PyTorch with binary releases for Windows, Linux and Windows Subsystem for Linux 2.
- Simplified Intel GPU software stack setup to enable one-click installation of the torch-xpu PIP wheels to run deep learning workloads out of the box, eliminating the complexity of installing and activating Intel GPU development software bundles.
- Windows binary releases for torch core, torchvision and torchaudio have been made available for Intel GPUs, and the supported GPU models have been expanded from Intel® Core™ Ultra Processors with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics and Intel® Arc™ A-Series Graphics to the latest GPU hardware Intel® Arc™ B-Series graphics.
- Further enhanced coverage of Aten operators on Intel GPUs with SYCL* kernels for smooth eager mode execution, as well as bug fixes and performance optimizations for torch.compile on Intel GPUs.
For more information regarding Intel GPU support, please refer to Getting Started Guide.
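Once installed, the xpu device behaves like other accelerator backends; a minimal sketch (assuming a PyTorch build with XPU support and a supported Intel GPU):

```python
import torch

if torch.xpu.is_available():
    device = torch.device("xpu")
    model = torch.nn.Linear(128, 64).to(device)
    x = torch.randn(32, 128, device=device)
    with torch.no_grad():
        y = model(x)
    print(y.device)  # xpu:0
```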
[Prototype] FlexAttention support on X86 CPU for LLMs
FlexAttention was initially introduced in PyTorch 2.5 to provide optimized implementations of attention variants with a flexible API. In PyTorch 2.6, X86 CPU support for FlexAttention was added through the TorchInductor CPP backend. This new feature leverages and extends the current CPP template abilities to support broad attention variants (e.g., PagedAttention, which is critical for LLM inference) based on the existing FlexAttention API, and brings optimized performance on x86 CPUs. With this feature, it is easy to use the FlexAttention API to compose attention solutions on CPU platforms and achieve good performance.
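A minimal sketch of driving the FlexAttention API on CPU tensors through torch.compile (shapes and the score_mod are illustrative only):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def relative_bias(score, batch, head, q_idx, kv_idx):
    # Example score modification: add a simple relative-position bias.
    return score + (q_idx - kv_idx)

q = torch.randn(1, 4, 128, 64)  # CPU tensors: (batch, heads, seq, head_dim)
k = torch.randn(1, 4, 128, 64)
v = torch.randn(1, 4, 128, 64)

# Compiling flex_attention lowers it through the Inductor CPP backend on x86.
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, score_mod=relative_bias)
```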
[Prototype] Dim.AUTO
Dim.AUTO allows usage of automatic dynamic shapes with torch.export. Users can export with Dim.AUTO and “discover” the dynamic behavior of their models, with min/max ranges, relations between dimensions, and static/dynamic behavior being automatically inferred.
This is a more user-friendly experience compared to the existing named-Dims approach for specifying dynamic shapes, which requires the user to fully understand the dynamic behavior of their models at export time. Dim.AUTO allows users to write generic code that isn’t model-dependent, increasing ease-of-use for exporting with dynamic shapes.
Please see torch.export tutorial for more information.
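For illustration, a small sketch that lets export infer the dynamic behavior (the module, argument names and shapes are only an example):

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y[:, :1]

example_inputs = (torch.randn(4, 8), torch.randn(4, 8))

# Mark dims as AUTO and let export discover which ones are truly dynamic,
# their min/max ranges, and any relations between dimensions.
dynamic_shapes = {
    "x": {0: Dim.AUTO, 1: Dim.AUTO},
    "y": {0: Dim.AUTO, 1: Dim.AUTO},
}
ep = export(M(), example_inputs, dynamic_shapes=dynamic_shapes)
print(ep)
```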
[Prototype] CUTLASS and CK GEMM/CONV Backends for AOTInductor
The CUTLASS and CK backends add kernel choices for GEMM autotuning in Inductor. This is now also available in AOTInductor, which can run in C++ runtime environments. Major improvements to the two backends include faster compile times, achieved by eliminating redundant kernel binary compilations, and dynamic shapes support.
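As a sketch, one way to opt into these backends is through Inductor's max-autotune GEMM backend list (the config name follows torch._inductor.config; which backends are actually usable depends on your hardware and how PyTorch was built):

```python
import torch

def matmul(a, b):
    return a @ b

# Ask max-autotune to also consider CUTLASS (NVIDIA) or CK (AMD) GEMM kernels
# alongside the default ATen/Triton choices. Illustrative only.
compiled = torch.compile(
    matmul,
    options={
        "max_autotune": True,
        "max_autotune_gemm_backends": "ATEN,TRITON,CUTLASS,CK",
    },
)
```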
Tracked Regressions
torch.device(0) makes CUDA init fail in subprocess
There is a known regression (#144152) where torch.device(0) makes CUDA init fail in a subprocess since PyTorch 2.5.0.
There was an attempt to fix the regression, but it caused some complications and was reverted.
An easy workaround is to use torch.device('cuda') or torch.device('cuda:0') instead.
Regression in the compilation of the torch.all operation with out= variant
A regression (#145220) was reported for PyTorch 2.6.0 with compilation of the out= variant of the torch.all operator. This should be a rare use case; a workaround is to rewrite the model code to avoid the out= variant.
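For example, one possible rewrite (a sketch; adapt it to your model code):

```python
import torch

x = torch.randn(4, 4) > 0
out = torch.empty((), dtype=torch.bool)

# torch.all(x, out=out)      # out= variant: hits the 2.6.0 compile regression
out.copy_(torch.all(x))      # equivalent rewrite that avoids the out= variant
```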
Backwards Incompatible changes
Flip default torch.load to weights_only (#137602, #138225, #138866, #139221, #140304, #138936, #139541, #140738, #142153, #139433)
We are closing the loop on the deprecation that started in 2.4 and have flipped torch.load to use weights_only=True by default.
When this flag is set, instead of using the usual pickle module, torch.load uses a custom unpickler constrained to call only functions and classes needed for loading state dictionaries and basic types.
While this change is disruptive for users serializing more than basic types, we expect the increased security by default to be a tradeoff that is worth it. Do note that, even though this default is safer, we still recommend only loading trusted checkpoints and relying on more constrained (and even safer) formats like safetensors for untrusted checkpoints.
For full details, please refer to this dev-discuss post.
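In practice the change looks like this (a brief sketch; add_safe_globals is the documented way to allow-list additional classes):

```python
import torch

torch.save({"weights": torch.randn(2, 2)}, "ckpt.pt")

# New default: equivalent to torch.load("ckpt.pt", weights_only=True).
state = torch.load("ckpt.pt")

# Checkpoints containing arbitrary Python objects now require either explicitly
# opting out (only for checkpoints you fully trust):
#     torch.load("ckpt.pt", weights_only=False)
# or allow-listing the specific globals the unpickler may call:
#     torch.serialization.add_safe_globals([MyCheckpointClass])
```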
Anaconda deprecation in CD. Remove anaconda dependency in Magma builds (#141024) (#141281) (#140157) (#139888) (#140141) (#139924) (#140158) (#142019) (#142276) (#142277) (#142282)
PyTorch will stop publishing Anaconda packages that depend on Anaconda’s default packages. We are directing users to our official wheel packages from download.pytorch.org or PyPI, or to switch to conda-forge (pytorch) packages if they would like to continue to use conda. For more details refer to this announcement.
Added Manylinux 2.28 prototype support and CXX11_ABI=1 for following binaries: Linux CUDA 12.6, Linux aarch64 CPU, Linux aarch64 GPU CUDA 12.6, ROCm 6.2.4, Linux XPU (#139894) (#139631) (#139636) (#140743) (#137696) (#141565) (#140681) (#141609) (#141704) (#141423) (#141609)
The PyTorch binaries shipped with CUDA 12.6.3 are built with CXX11_ABI=1 and use the Manylinux 2.28 build platform. If you are building PyTorch extensions with custom C++ or CUDA code, please update those builds to use CXX11_ABI=1 as well and report any issues you encounter. For the next PyTorch 2.7 release we plan to switch all Linux builds to Manylinux 2.28 and CXX11_ABI=1; please see [RFC] PyTorch next wheel build platform: manylinux-2.28 for the details and discussion.
Deprecations
Releng
Removed CUDA 12.1 support in CI/CD (#141271) (#142177)
The full release compatibility matrix can be found in release.md
Deprecated c10d::onCompletionHook (#142390)
- In PT 2.5 and before, users can do:
      pg = dist.init_process_group()

      def hook(work_info: torch._C._distributed_c10d.WorkInfo):
          # do something
          ...

      pg._register_on_completion_hook(hook)
      # The hook will be triggered after the collective completes
      pg.broadcast([tensor]).wait()
- Starting from PT 2.6, when users write the code above, they will get a warning message: “ProcessGroupNCCL OnCompletion hook will be deprecated in favor of Flight Recorder”
Inductor
Deprecate TORCHINDUCTOR_STACK_ALLOCATION (#139147)
Instead of setting TORCHINDUCTOR_STACK_ALLOCATION, update your torch.compile call: torch.compile(options={"aot_inductor.allow_stack_allocation": True})(foo).
New features
Python Frontend
- Introduce a device-agnostic runtime API design (#132204)
- Add validation for ambiguous behavior in Tensor.dim_order() (#141632)
- Add type check for the ord argument of torch.linalg.{vector,matrix}_norm() (#137463)
- FlexAttention support for NJT (#136792, #140723)
Miscellaneous
- Enable forward AD in functional.affine_grid (#135494)
- Added SVE support for ARM CPUs (#119571)
- User buffer registration via MemPool API (#133603)
- Add in_order flag for data loader, allowing out-of-order dataloading (#141833)
Optim
- Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict (#134107)
- Support tensor betas in Adam and AdamW (#134171)
Distributed
- c10d
  - Made ProcessGroup initialization non-blocking when device_id is given (#138527)
  - Allowed sub group to be eagerly inited even if default one is not (#138665)
  - Supported group_dst/group_src in c10d collectives (#140460, #139677, #140827, #140843, #140847)
  - Enabled Flight Recorder buffer for all users (#142260)
  - Registered Intel distributed Backend (XCCL) in PyTorch distributed package (#141856)
- Pipeline
- FSDP2
  - Moved FSDP2 to public (#141868)
Dynamo
- Add torch.compiler.set_stance to dynamically change torch.compile behavior without needing to re-apply torch.compile (#137504)
- Profile guided optimization for automatic_dynamic - automatically save and load automatic dynamic decisions to reuse on future runs (#139001)
- skip_guard_eval_unsafe compiler stance option for power users - skip guard checks when it is known to be safe to do so (#140251)
Releng
- Added support for CUDA 12.6 in CI/CD (#142335) (#136321) (#138417) (#138563) (#138562) (#139909) (#138899) (#141365) (#141433) (#141805) (#141976) (#139988) (#140143) (#141377) (#142064)
- Intel GPU enablement in CI/CD. Upgrade XPU support packages to Intel® Deep Learning Essentials 2025.0. Add prototype Linux and Windows binary builds with XPU runtime pypi packages dependencies. (#138189) (#139050) (#139604) (#139775) (#140373) (#141546) (#141775) (#141135) (#142210) (#135638) (#142298)
- Added Python 3.13 in CI/CD support and prototype support for Python 3.13t in CD (Only Linux and Linux aarch64 torch binaries) (#136001) (#137396) (#138037) (#138629) (#140137) (#138095) (#141572) (#140733) (#141264) (#142294) (#137142) (#137127) (#139533) (#140733)
ROCM
- Added AMDSMI support for UUID input (#129741)
- Added faster HW support for packed bfloat16 and fp16 for MI300 (#135770)
- Improved performance of reductions on 1D and 2D tensors. (#137737)
XPU
- Add torch.xpu.mem_get_info API: Introduces a new API to retrieve memory information for XPU devices. (#141230)
- Add architecture property to XPU device: Adds new properties to XPU devices to query architecture details. (#138186)
- Add elapsed_time method for XPU events: Introduces a method to measure elapsed time between XPU events. (#140865)
- Add torch.xpu.get_arch_list and torch.xpu.get_gencode_flags: Introduces new APIs to retrieve architecture lists and code generation flags for XPU. (#137773)
- Add quantized convolution support for XPU backend (#133080)
- Enable XPU device support for LSTMCell operators (#140246)
Profiler
- Hide ProfilerStep Alignment behind Experimental Config (#137668)
- Add functionality to call dump function of NCCL profiler plugin (#137523)
Export
- Add torch.export.export_for_training() API to perform export that can run training. Note that this replaces the non-documented capture_pre_autograd_graph feature (#135374, #135918, #135549, #143224)
- New packaging APIs for AOTInductor: torch._inductor.aoti_compile_and_package
  - Previously, AOTInductor (through torch._export.aot_compile) would return a path to a .so. However, this did not make for a great user experience, as other files are used along with the .so, for example .cubin files and serialized extern kernels. So, we introduce a new package format, “PT2 archive”, which is what we intend to have AOTInductor return. This is essentially a zipfile of all the files that need to be used by AOTInductor, and allows users to send everything needed to other environments. There is also functionality to package multiple models into one artifact, and to store additional metadata inside of the package.
- AOTInductor Minifier. If you encounter an error while using AOTInductor APIs such as torch._inductor.aoti_compile_and_package, torch._inductor.aoti_load_package, or while running the loaded model from aoti_load_package on some inputs, you can use the AOTInductor Minifier to create a minimal nn.Module that reproduces the error. (#139351, #140999, #141159, #141156)
- AOTInductor: ABI-compatible mode code generation. In order to guarantee ABI backward compatibility, we have carefully defined a set of stable C interfaces in libtorch and made sure AOTInductor generates code that only refers to this specific set of APIs and nothing else in libtorch. We will keep the set of C APIs stable across PyTorch versions and thus provide BC guarantees for AOTInductor-compiled models.
- Add export.export_for_inference and export.exported_program.core_aten_decompositions APIs. export_for_inference returns a functional, post-dispatch ATen IR. (#135912)
Inductor
- Move stack allocation related configs in AOTI (#139093). All stack allocation related configs now have an aot_inductor prefix, so torch.compile(options={"use_minimal_arrayref_interface": True})(foo) is now torch.compile(options={"aot_inductor.use_minimal_arrayref_interface": True})(foo) and torch.compile(options={"allow_stack_allocation": True})(foo) is now torch.compile(options={"aot_inductor.allow_stack_allocation": True})(foo).
- Move torch._utils.is_compiling to torch.compiler.is_compiling (#127690). Rewrite torch._utils.is_compiling() to torch.compiler.is_compiling().
- Added option autotune_num_choices_displayed to control the number of kernel options displayed (#138788)
- Added option force_pointwise_cat for concat support through Inductor using pointwise kernels (#141966). This forces concat to be generated as a pointwise op with masked loads.
- New config option annotate_training that adds Inductor annotations to NVTX. (#130429)
- Introduces an option triton_kernel_default_layout_constraint to tweak stride settings for user-defined Triton kernels, enhancing customization and flexibility (#135530).
- Users can patch the Inductor config to enable strict custom kernel layout constraints via torch.compile(options={"triton_kernel_default_layout_constraint": "needs_fixed_stride_order"})(foo) (#135581).
- External callable registration API register_external_matmul for Matmul tuning candidates in Inductor (#130774).
for Matmul tuning candidates in Inductor (#130774). - Adds support for Windows Arm64 to enhance platform compatibility (#133088).
- Integrates support for AMD triton stream pipeliner in ROCm to enhance performance (#139881).
- Adds support for TRITON_INTERPRET in Inductor (#140841).
- Adds update_constant_buffer pybind support in AOTInductor (#140755).
- Provides an option package_constants_in_so to exclude weights from .so files in AOTInductor (#141997).
- Adds load_constants to the package API (#142246).
- Enables auto functionalize v2 by default (#136685).
to the package API (#142246). - Enables auto functionalize v2 by default (#136685).
- Adds raise_error_on_ignored_optimization to the aoti config (#138035).
- Adds stats summary (mean/min/max, etc) for jit inductor tensor value printing (#135887).
Improvements
Python Frontend
- Add support for fp16 and bf16 to torch.special.i1 (#137899)
- Add option to disable checksum computation in torch.save (#137735)
- Speed up fp16 tensor printing (#141927)
- Add support for fp16 for torch.adaptive_pool3d on CPU (#136091)
- Add support for fp8* to torch.masked_select (#141928)
- Add support for complex fp16 to fill_empty_deterministic_ (#137488)
- Remove dependency on numpy for serialization for XLA/open registration devices without numpy (#137444, #137600)
- Fix torch.{linalg.}norm complex half support (#133661)
NN Frontend
- Allow global module hook to accept keyword arguments (#137403)
- Add APIs to separate norm calculation and gradient scaling in nn.utils.clip_grad_norm_ (#139662)
- Add Half support for reflection and replication padding on CPU (#135931)
- Add weight argument to MSELoss, HuberLoss and L1Loss (#132049)
- Gaussian NLL loss scalar variance support (#138931)
- Added validation for input types for torch.nn.Linear and torch.nn.Bilinear (#135596)
Optim
- Improve ReduceLROnPlateau and Optimizer.add_param_group interaction by auto-updating min_lrs (#137637)
- Allow SequentialLR to include ChainedScheduler (#133450)
Composability
Decompositions, FakeTensor and meta tensors
Operator decompositions, FakeTensors and meta tensors are used to trace out a graph in torch.compile and torch.export. They received several improvements:
- Several operator decomps received improvements/bugfixes:
  - aten.split_with_sizes (#135728)
  - aten.max_unpool2d/aten.max_unpool3d (#133146)
  - aten.dot (#138596)
  - aten.layer_norm (#140557)
  - aten.scaled_dot_product_attention (#135297)
  - aten.matmul (#134568)
  - aten._embedding_bag (#136774)
  - aten.native_group_norm/aten.native_layer_norm (#137079)
  - aten.to(..., non_blocking=True) (#136513)
  - aten.addmm (#138520)
- General fixes: out= dtype checks for unary ops (#140288)
- New decompositions for a few PyTorch operators: aten.diagonal_copy (#136730)
- Several meta implementations of operators received improvements/bugfixes:
- New meta tensor implementations for a few pytorch operators:
Dynamic shapes
We made many improvements and bugfixes to dynamic shapes in torch.compile
- Minor error message improvements (#136671, #138310)
- Make native_layer_norm_backward work with unbacked SymInts (#136798)
- Make masked_fill work with unbacked SymInts (#137060)
- Improve tracing speed of torch.cat with large numbers of symbolic variables (#139653)
- Improve performance of canonicalize_bool_expr (#135621)
- Improve performance of sympy_generic_le (#135622)
- Simplify expr before getting implications in _maybe_evaluate_static (#135499)
- Use a fast expand algorithm (#135999, #136163)
- Fix calling Add._from_args and Mul._from_args (#136143)
- Dynamic shape logging improvements in tlparse (#136508, #141068, #140867)
- Avoid some quadratic behavior of dynamic shapes involving aliasing + mutation of graph inputs (#136857)
- Tensorify compute on Python scalars (#136674)
- Delay mul/pow expansion for _SympyT to enable more folding (#138235)
- Fix bug in unbacked_bindings for a*u0 (#138136)
- Remove parallel_and and parallel_or (#138135)
- Explicitly avoid recording when should_record_events is false in record_shapeenv_event (#138965)
- Better support for dynamic shapes with tensor subclasses (#125941)
- Support symfloats in translation validation (#139457)
- Add trunc to z3 validator (#140886)
- Refactor ShapeGuardPrinter for future C++ addition (#140968)
- Fix another item memo loss location + bool specialization bug (#139587)
- Optimize increment summations (#140822)
- Only compute new_untracked_symbols and new_unbacked_bindings if needed (#140083)
- Use has_free_unbacked_symbols instead of bool(free_unbacked_symbols) (#140027)
- Try to simplify FloorDiv axioms implications when needed during evaluations (#141267)
- Fix AttributeError: 'int' object has no attribute 'node' due to constant prop (#141250)
- Update tensorify pass to specialize symfloats we didn't tensorify away (#139564)
- Add TORCHDYNAMO_EXTENDED_ADVICE (#137159) (#137196)
- Do not try to optimize new implications in get_implications (#139738)
Custom operators
We improved the existing torch.library APIs and added new ones.
- Add new torch.library.triton_op API (#141880)
- Fix partitioner behavior on user triton kernels (#136878)
- Add links to new Custom Ops Landing Page (#137933, #139634)
- Fix torch.library.register_vmap to work with nested vmap (#137306)
- No-op torch.library.custom_op APIs on torch.deploy (#139509)
- Optimize mutable torch.library.custom_op overhead (#139513)
- Improve torch.library.opcheck and register_autograd docs (#141883)
Distributed
- c10d
  - Added FP8 support to NaN checker (#135891, #135961, #136115)
  - Added support for cuStreamWriteValue32 (#136488)
  - Improved the detection robustness in CudaDMAConnectivityDetector (#137530)
  - Simplified barrier implementation and further decoupled CPU/GPU synchronization (#137516)
  - Threw a value error if passing world_size=0 to TCPStore (#137792)
  - Retried connection timeout failures in socket (#138003)
  - Added an API to get the future result (success or failure) of a collective and customized error handling (#137799)
  - Disabled watchdog thread in blockingWait mode (#138001)
  - Added default value for nccl_nonblocking_timeout (#138374)
  - Ensured nccl comm is ready before all accesses (#138384)
  - Used a promise to delay watchdog shutdown (#138828)
  - Supported optional backend if device_id provided (#140963)
  - Supported group ranks in P2POp and batch_isend_irecv (#141054)
  - Enabled CudaEventCache by default and added multi device support (#140975)
  - Added an API to retrieve default distributed backend from device (#140536)
  - Supported rank, world size, group name/desc overrides for PyProcessGroup (#141529)
  - Added detection of accelerator type when backend is not specified (#142216)
  - Used task submitter TLS in gloo working threads (#142184)
  - Added _reduce_scatter_base to c10d::ProcessGroupUCC (#138021)
- DDP
- FSDP
- FSDP2
  - Added _set_unshard_async_op (#135523)
  - Added module, mp policy to fsdp_pre_all_gather (#136129)
  - Added check for contiguous parameters (#137000)
  - Relaxed even sharding requirement for all-gather extensions (#137005)
  - Used stream and event based on device (#136843)
  - Added shard_placement_fn arg (#137496)
  - Added set_unshard_in_backward(bool) (#137922)
  - Made module-to-state mapping use weakrefs (#139650)
  - Removed CUDA-like device check in FSDP2 (#139539)
- DTensor
- Pipeline
  - Made PipelineStage support meta initialization (#136243)
  - Allowed non-0 stages to accept kwargs (#136416)
  - Added schedule simulator and chrometrace dump (#138134)
  - Supported separate dI / dW and V-schedules (#131762)
  - Updated schedules to use I, B actions (#138886)
  - Added type checking to _backward functions (#140019)
  - Allowed multiple backward grads (#140981)
  - Improved schedule csv loading (#142009)
- TorchElastic
- Checkpoint
  - Throw an error when state_dict and saved tensors are different sizes (#141571)
Profiler
- Create Auto-Trace Frontend for Trace ID (#139310)
- Add skip_first_wait to profiler.schedule (#141512)
- Add CUDA Overhead to Auto-trace (#142271)
Nested Tensor
- Added NJT operator support: rms_norm(), embedding_bag(), record_stream(), rad2deg(), embedding() backward, activation functions (#135872, #135888, #140736, #138627, #137099, #140290)
- Mixed NJT, dense binary pointwise broadcasting support (#133021)
- Allow any single non-batch dim to be ragged for NJT (#137125)
- Add bfloat16 support to torch.bmm(NST, NST) (#141380)
- Add missing fp classification functions for NST (#139890)
Functorch
- Add vmap support for torch.scatter_reduce (#135547)
- Add vmap support for native_dropout_backward (#140140)
- Allow optional positional arguments for torch.func.functional_call (#134643)
Quantization
- Add uint16 support for observer (#136238)
- Change flatten recipe for X86InductorQuantizer (#136298)
- Update choose_qparams_per_token op to output correct shape for scales and zp (#136807)
- Make QAT fused modules torchscriptable (#136285)
- Add missing mappings to support torch.uint16 in quantization and export (#136547)
- Default to use training IR (#137804)
- Remove redundant method in X86 Quantizer (#139161)
- Add bfloat16 support for per tensor/channel cpu/cuda fake quantize ops (#139306)
- Add linear_dynamic_fp16 ops for OneDNN (#140376)
- Annotate and convert for linear_dynamic_fp16 for X86 (#141480)
Releng
- Updated CUDNN to 9.5.1.17 for CUDA 12.6 builds, Linux and Windows (#137978)
- Upgraded CI/CD to ROCm 6.2.4 (#141423)
Cuda
- Extend cuda_flip to unsigned types (#137781)
- SDPA Priority Manager accepts ordering (#140467)
- cuDNN Attention memory layout handling improvements (#141147) (#138354)
Mps
- Add native im2col (#135706)
- Add upsample_bicubic2d as Metal op (#136123)
- Add scatter_reduce.two (#141948)
- Add i0 op (#137849)
- Add torch.special.i1 op (#140196)
- Add unfold_backward on MPS (#135411)
- Add isposinf and isneginf (#136689)
- Add MetalShaderLibrary::getFunctionNames() (#141499)
- Add tri[lu]_indices (#137648)
- Fix Gamma for bfloat16 dtypes (#136981)
- Extend fmin/fmax/copysign and nextafter to bfloat16 (#136982)
- Enable bucketization for bfloat16 (#136983)
- Fix bfloat16 to complex casts (#137070)
- Enable arange to bfloat16 (#136754)
- Enable torch.linalg.cross for bfloat16 (#136984)
- Enable Renorm for bfloat16 (#136985)
- Enable nan_to_num for bfloat16 (#136986)
- Add support for bfloat16 autocast (#139390)
- Eliminate c10::value_or_else (#138818)
- Compile kernels into Metallib (#138636)
- Write/invoke Metal shaders from C++ (#141547)
- Support torch.Event for MPS (#142468)
- Add CompileShader method (#141478)
- Reintroduce support for convolutions with output_channels > 65536 (#140726)
ROCM
- Improve PyTorch build speed in ROCm environments by downloading AOTriton from GitHub unless AOTRITON_INSTALL_FROM_SOURCE=1 is set (#136603)
- Enable gfx110x architecture for hipblaslt (#137317)
XPU
- Improves the device index bound checking mechanism for XPU. (#120768)
- Use default context on Windows for Intel GPU: Improves XPU device handling on Windows by using the default context. (#138049)
- Add device guard for XPU structured operators in torchgen (#138802)
- Generalize device-bias code to align XPU unroll reduction with CUDA (#142348)
- Generalize CUDA C++ wrapper for reuse by XPU (#135312)
Miscellaneous
- Add torch.float8_e4m3fn dtype support to semi-structured sparse (#136397)
- Faster BatchSampler (#137423)
- Init threadpool with user defined num_threads before default (#136793, #137051)
Dynamo
- automatic_dynamic_shapes_mark_as - adds an option to cause automatic dynamic shapes to trigger unbacked SymInts rather than backed SymInts (#141415)
- Propagate detailed source location information of shape guards to guards/recompiles output (#136917)
- torch.compile support for Python 3.13 (#139533)
- Trace through dynamic callables on tensor variables (#137940)
- Trace through dataclasses (#141294)
- Graph region tracking for deduplication (i.e. common subgraph extraction) (#141381)
- Scan higher order op (#134102)
- Trace subclasses of namedtuple type (#140534)
- Trace dict subclasses (#143548)
Export
- Preserve the call signature for a module when it was called multiple times (#137999, #138669)
- Let export preserve the node.meta["custom"] field (#138266)
- Add neg and pos operators to serde/serialize (#138309, #143343)
- Update min_val and max_val to Optional[int] in serialization and allow the schema to express infinity (#139394)
Fx
- Bypass custom setattr in Node.init (#135733)
- Add new replacement_callback to materialize a replacement just in time (#135553)
- Minor optimization in create_arg (#135821)
- Replace _snake_case with a regexp (#135822)
- Update _inline_module util function to work with both args and kwargs (#136631)
- Fx graph always returns tuple in fuse_as_graphmodule (#139236)
- Change fx graph _replace_hook to a list of Callable (#142006)
- Avoid generation of empty merge cpu submodule by splitter v2 (#140794)
- Make split_module work with keep_original_order=True and no-op graph (#141340)
- Add output_node util function to fx.Graph (#139770)
- Fix stride in TensorMetadata to always be a Tuple[int, ...] (#141106)
- Enhance from_node node meta to track source recursively (#142066)
- Support linear/BN fusion and follow the API guideline (#141585)
- Enable fuse_by_partitions to always return output as tuple (#142056)
- Add safer check for isatty in fx/_utils.py (#140876)
Inductor
- Switch GPU codegen to one-pass in AOTI (#141980)
- Fix multi-kernel codegen when using one-pass in AOTI (#142333)
- Fix an issue when fallback op does not return a value in AOTI (#142339)
- Improve the stride preservation logic of user-visible outputs (#136732)
- Add workspace to TritonTemplates (#138050)
- Enable Cpp wrapper for Intel GPU (#135318)
- Flip custom_op_default_layout_constraint in Inductor to optimize tensor layout (#135239).
- Enables coordinate descent tuning with max-autotune in Inductor (#136867).
- Adds relu_nan_to_num option for handling NaNs in pre-grad passes in AOTInductor (#138545).
- Enables cooperative and persistent reductions in Inductor (#138533).
- Introduces multi-kernel support alongside cooperative reductions in Inductor (#138893).
- Adds new configs env_name_default and env_name_force for better configuration management (#138956).
- Adjusts loop split optimization heuristic (#137550).
- Enhances numerical precision for fp32 in FlexAttention on ROCm devices using IEEE (#135702).
- Enables SDPA pattern matching in Inductor for CUDA, enhancing optimization capabilities (#137085).
- Updates Inductor's support for Triton AttrsDescriptor (#137757).
- Update C++ runner API to take a const vector (#139955)
Bug fixes
Python Frontend
- Fix torch.mean(..., out=) for fp16 and bf16 on CPU (#135174)
- Fix serialization for torch.uint16, torch.uint32, torch.uint64 (#137184)
- Fix Tensor preservation logic to not lose user-defined attributes in some cases (#137267)
- Fix memory leak in torch.utils.module_tracker.ModuleTracker (#141960)
NN Frontend
- Fix nn.functional.softshrink returning 0 on NaN input (#138421)
- Fix flex_decode to build offsets off of strides (#139516)
Autograd Frontend
- Fix torch.nn.EmbeddingBag when per_sample_weights is differentiable but embedding weights are not (#142338)
- Determine autograd engine ready queue based on InputMetadata instead of InputBuffer (#135633)
Composability
- Fixed a correctness issue when torch.compiling torch.scaled_dot_product_attention, in the case where the scale argument is a dynamic shape (#141728)
- Fixed a correctness issue when torch.compiling torch.rrelu, in the case where it mutates any module buffers (#136008)
Distributed
- c10d
  - Fixed extra context on device 0 (#135273)
  - Fixed bugs in non-blocking mode (#137741)
  - Fixed P2P data corruption in non-blocking mode (#138860)
  - Made sure not to use split for P2P comm creation (#139013)
  - Used long/short wait for different non-blocking calls (#142291)
  - Recorded device index for GPU guarding during NCCLComm method calls (#141270)
  - Fixed the behavior of destroy_process_group (#141510)
  - Reworked NCCLComm destructor to avoid clash with CUDA driver shutdown (#141511)
  - Removed Options for ProcessGroup and exposed backend Options to reflect the correct code structure (#132931) (#135653)
  - Fixed prefix store segmentation fault (#136872)
  - Fixed a race condition in one-shot all-reduce (#137257)
  - Enforced contiguity for all-reduce (#137345)
  - Fixed data corruption bug after CUDAEventCache is enabled (#138040)
  - Enforced contiguity for alltoall (#141816)
  - Fixed sequence numbers for coalesced operations (#135132)
  - Fixed color value for comm split being negative (#137855)
  - Fixed the logic of using ncclCommSplit (#138781)
  - Caught tensor.numel() == 0 in NaN detector (#140741)
  - Fixed a breakage in IntraNodeComm::rendezvous() (#141200)
  - Fixed _are_we_tracing() in dynamo for functional collectives (#142075)
- DeviceMesh
  - Fixed from_group when passing a tensor mesh (#137713)
- DTensor
- DistributedStateDict (DSD)
  - Initialize lr as a tensor if it is originally a tensor (#141620)
- FSDP2
  - Fixed 2D mismatched grad placements (#136237)
  - Fixed test_all_gather_extensions_monkey_patch (#136130)
  - Fixed mistargeted backward prefetch (#137348)
  - Fixed incorrect tensor meta after .to(dtype) (#137593)
  - Gated dynamo import for torch deploy (#137203)
  - Fixed CUDA sync for bf16 HSDP AR, fp32 params (#140044)
  - Fixed backward-compatible imports (#142419)
  - Gated PT2 code for torch deploy (#142456)
- Pipeline
  - Fixed py ref cycle in stage_backward (#136507)
  - Fixed more leaks and check leaks in tests (#136584)
  - Removed modifications to autograd nodes in Zero Bubble schedule (#136678)
  - Fixed extra memory usage in Zero Bubble (#138119)
  - Fixed last backward counting for dI / dW (#139415)
  - Forward fix for _validate_schedule (#142211)
  - Allowed schedules to run with a single stage (#138925)
  - Freed memory earlier in last stage (#138504)
- TorchElastic
- Checkpoint
  - Fix fsspec transaction failure cleanup in multithreaded environments (#135541)
Dynamo
- Fix tracing of NumPy 2 ops (#138686)
- Don’t graph break on inner torch.compile (#135819)
- Various closure/cell variable/mutation related fixes (#136891, #139339, #140155)
- Stop importing some third party libraries (#136334, #142502, #142503)
Nested Tensor Frontend
- Fix NJT operator support: sum(), unsqueeze(), to() on non-contiguous NJTs, where(), select(), chunk(), reductions (#131945, #141392, #137124, #141500, #139317, #141506, #141604)
- Fix NJT linear_backward() memory usage using a more efficient formula (#141163)
- Fix NJT serialization (#137031)
Cuda
- Add missing boundary checks to cunn_SoftMaxForward (#140682)
- Fix CTC cuda backend out-of-bound access (#141607)
- Fixed cuda sanitizer and as_subclass calls (#138218)
Mps
- Allow NaN mean reduction in nll_loss (#135434)
- Fix AvgPool2d for float16 (#136822)
- Error checking/bfloat16 support for torch.normal (#136863)
- Fix reduction ops outputs for empty tensors (#139446)
- Restrict MSELoss to floating types (#139960)
- Fix conv backward pass for channels last (#141009)
- Add autocast rule for SDPA (#141776)
- Release MetalShaderLibrary cached resources (#142053)
- Fixes SiLU on non-contiguous tensors (#139006)
- Fix channels_last_3d in nn.Conv3d (#141780)
- Guard on flash attention SymFloat scale instead of incorrectly casting to float (#141725)
- Fix memory leak from unreleased NSProcessInfo (#142052)
ROCM
- Fixed out of memory errors on AMD triton backend (#139883)
- Correct numerical issues in layer norm backwards kernel (#140259)
XPU
- Resolves an issue with duplicated build environments in XPU Linux CI. (#141546)
- Fix XPU support packages version: Corrects the versioning of XPU support packages. (#138189)
- Fix c10::Event unit test failure on XPU backend (#141800)
- Fix mismatched tensor metadata between FakeTensor and XPU concrete tensor in F.logsigmoid (#141333)
- Fix memory stats error on XPU: Corrects an error in memory statistics for XPU devices. (#135818)
- Fix XPU CMake typo: Corrects a typo in XPU CMake configuration. (#140374)
- Fix an issue causing endless code regeneration in non-XPU environments. (#140438)
- Fix incorrect device check before skipping concat linear in Inductor XPU. (#140916)
Profiler
- Clear Out Dangling AppendOnlyLists after collection (#137450)
- Fix UnicodeDecodeError: 'utf-8' codec can't decode byte (#139062)
- Fix ASAN overflow issues (#140441)
- Fix devices Parameter Type in benchmark_utilization Function (#138774)
Quantization
- Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058)
- Fix re-export custom metadata (#135282, #135634, #135720)
- Move eps in torchao/quantization/utils.py to the targeted device to avoid device mismatch issues (#135204)
- Add type check for dilation in torch.quantized_max_pool3d() (#137845)
- Pass all arguments when quantizing embedding bag from float (#137697)
- Fix for split gates enabled quantizable LSTM subclass (#140818)
- Fix ReLU fusion when conv/linear has > 1 user for XNNPACK (#140846)
- Fix RecursionError in prepare_pt2e when the graph has concat of the same node (#141651)
Sparse Frontend
- Fix memory leak in MaskedTensor when using autograd (#137890)
- Fix
bmm(COO, dense)
illegal memory access for some shapes (#131977) - Fix MaskedTensor binary ops for
sparse_csr
layout (#134335)
Miscellaneous
- Fix PyBind 2.10.4 compatibility issue (#141456)
- Correctly keep track of processed tensors for foreach reductions (norm, max) (#140103)
- Fixes to torch.package for Python 3.13 (#141409)
Export
- Do not deserialize arguments with default values as kwargs (#136036)
- Fix _get_non_persistent_buffers for duplicate submodules (#136552)
- Fix lifted constants order for 0-input graphs (#136658)
- Handle attribute assignment detection and registered buffer assignments in make_fx (#137240)
- Fix specialization bug in unflatten + preserve_module_call_signature (#137363)
- Fix export for constant outputs (#137547, #137993)
- Fix param and buffer mapping for state_dict when there are state_dict hooks (#137609)
- Fix export retracing (#137733)
- Fix non-strict retracing with kwargs (#138927)
- Fix assigning tensor with requires_grad as constant in export (#137997)
- Fix issue with runtime_assertions in export_for_training (#138292)
- Fix issue in move pass for copying Parameter (#138855)
- Fix unflatten with HOPs (#138978)
- Fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule (#137013)
- Allow autocast in training IR export (#137287)
- Fix unlift to preserve aliased constants (#137310)
- Handle AttrProxy._modules when module is overwritten as None (#139957)
- Fix joint graph metadata (#136011)
- Fix mapping issue with torch.Size (#137465)
- Fix test_lazy_module_kwargs (#137705)
- Propagate ShapeEnv during lowering (#138362)
- Plumb is_export flag to FunctionalTensorMode in analysis pass (#138836)
Fx
- Add __init__.py to shape inference folder (#135461)
- Handle sympy.oo in bitwise_and/or value_ranges (#141522)
- Fixes issue with enums in a tuple for dynamo (#133123)
- Add output node to split_module subgraphs (#139275)
- Fix deep copy of empty graph (#141660)
Inductor
- Fix a bug with not enabling the Python dispatcher in AOTInductor (#135933)
- Don't run reshape pattern match on dynamic shape size tensor (#136100)
- Make DtypeView work with cpp_wrapper without abi_compatible (#136233)
- Check size hints to determine indexing dtype in Triton (#137234)
- Fix an error in _dynamo.compiled_autograd.reset() (#137889)
- Fix out-of-bounds array access in atomic_add_vec (#138744)
- Update zero size computation in clone_preserve_strides (#139224, #139458)
- Fix for gcc10 torch.compile compiler error when march=aarch64+sve (#137795)
- Fix a cubin file path issue (#139848)
- Fix caching issue with AOTI packaging (#140022)
- Fix a two-pass kernel mismatch in AOTI (#141041)
- Fix performance bug by removing copy_misaligned_inputs from AOTI (#142136)
- Fix mask bug in torch.cat kernel (#140838)
- Fixed max-autotune in FlexAttention to reset kernel options appropriately (#138733)
- Don't set XBLOCK larger than xnumel (#138730)
- Fix inductor CPU masked() body codegen when result dtype is bool and operator is where (#138486)
- Fix typo in codegen_dynamic_scalar (#138760)
- Fix ReinterpretView call in TMADescriptor IR (#138759)
- Fix free symbol handling in FlexAttention (#138794)
- Fix codegen for tl.constexpr globals (#138757)
- Force strides for efficient attention backward (#138879)
- Make AOT inductor treat None args correctly (#139114)
- Fix a bug with arg ordering in handling dynamic shapes (#139777)
- Fixing missing ck package warning when the backend is disabled (#139790)
- Force contiguous layout for implicit fallback (#140996)
- Fix another IMA with captured buffers (#141164)
- Inductor dtype propagation fixes (#141495)
- Fix broadcast logic for Triton (#141027) (#141693)
- Fix grid codegen for configs with empty kwargs (#141824)
- Fix issue in CPP GEMM Template Prune Tensor (#141798)
- Fix max-autotune bug with captured buffer grads (#141531)
- TritonTemplate dtype fixes (#141991)
- Fix device error for NopKernelSchedulerNode (#141372)
- Resolves an issue where try_solve fails when both symbols are unknown and their product is zero (#137919).
- Resolves an issue where a fallback operation returned None, preventing potential errors in AOTI initialization (#135997).
- Resolves test failures following the update of pybind11 to version 2.13.6 (#136280).
- Corrects the maximum autotuning for single-thread dynamic shapes in Inductor (#136418).
- Fixes FMA codegen for Halide backend to ensure correct operation behavior (#136810).
- Corrects max-autotune behavior when dealing with View nodes in FlexAttention (#137204).
- Adjusts BlockMask handling when reused from a larger sequence length (#137255).
- Corrects triton_reshape by properly expanding the Min keyword in code generation (#137357).
- Corrects reduction_hint behavior for single-element sums (#137754).
- Resolves a codecache write_atomic issue on Windows (#138331).
- Fixes AOTI data type codegen for symbolic integers (#138106).
- Resolves an issue where passing None arguments to user-defined Triton kernels caused errors (#138472).
- Correctly sets keyword arguments when creating Buffers in ROCmTemplate for proper initialization (#138521).
Jit
- Unbreak vec128_half_neon comparison without FP16 hardware support (#139558)
- Isolate the locale for NNC’s IRPrinter (#136458)
- Fix misuse of offset param in seek (#140633)
Performance
Dynamo
- Attempt to use previously compiled code when Dynamo cache limit is hit (#136655)
- Don’t convert Python frame local C buffer into Python dict until necessary (#140063)
Mps
- Dispatch to SDP-math-mps for non-contiguous Tensors (#139791)
- Avoid creating spurious instances of FUSED_ADAM_OPS (#141090)
ROCM
- Improve torch.sum performance by increasing max_values_per_thread (#135397)
- Turn on fast path for index_put on new ROCm version (#136136)
Sparse Frontend
- Speedup broadcasting of sparse_coo Tensors (#142364)
- Speedup addmm(dense, BSR) for some int8 shapes on A100 (#136088)
- Fuse scaling with addmm(dense, BSR) for some int8 shapes on A100 (#136104)
- Fuse dtype conversion with addmm(dense, BSR) for some int8 shapes on A100 (#136626)
Miscellaneous
- Speed up fp16/bf16 AMP casts on H100+ (#137053)
- c10d
  - Improved efficiency of NaN checker (#135414)
- Improves performance by avoiding atomic add operations in scatter_add for XPU (#137966)
Inductor
- Turn on TORCHINDUCTOR_REORDER_FOR_PEAK_MEMORY by default (#137205). If the old behavior is desired, add "reorder_for_peak_memory": False to the options in your torch.compile call.
- Cache weight tiles in L1D for AMX int8 WoQ GEMM (#136688)
- Add and use borrow_arrayref_tensor_as_tensor (#142183)
- Support for accelerated sorting with x86-simd-sort (#127936)
- Enable extended MMA shapes in CUTLASS (#133686)
- Port ExecuTorch bfdot improvement back to ATen BlasKernel (#136331, #137377)
- Build ReducedPrecisionFloatGemvFastPathKernel & entry points for non-ARM architectures too (#137917)
- Hook up fp16_gemv_trans to gemv fast path for non-aarch64 architectures (#138005)
- Add Vectorized<c10::BFloat16> specialization for ARM (#139090)
- Build bf16 gemv fast path & entry points for non-ARM architectures too (#139208)
- Hook up bf16_gemv_trans to x86 bf16 GEMM (#139220)
- Don't go through dispatch for *_dot_with_fp32_arith (#140834)
- Add efficient isnan for NEON float/half (#139082, #139083)
- Hook up fp16_gemv_trans to x86 fp16 GEMM (#137918)
- Support non-zero beta in fp16_gemv_trans (#138275)
- Port X86_F16 from ExecuTorch half to PyTorch half (#140720)
- Reserve vector for NT GEMM Matmul (#141130)
- Add CK grouped conv2d fwd kernels to ROCm codegen (#137947)
- Expand quantization conv-binary(-unary) pattern fusion inside Inductor (#138051)
- Stop force realizing to prevent recursion errors unless it's much bigger (#138881)
- Constant folding for lifted graph (#135060)
- Add host-side TMA support to AOTInductor (#138878)
- Allow inplacing buffer when other users are inconsequential (#138383)
- Don't fuse two nodes if likely increase peak memory (#138756)
- Add oneDNN BRGEMM config for Half cpp gemm template (#136255)
- Enable the oneDNN Linear fusion for special case (#139172)
- Remove uses of deleted operations (#139447)
- Enable scaled mm with bias in gemm max autotune with CK backend (#140674)
- Support linear+binary folding for freezing path (#138807)
- Simplify & rectify dequantized B buffer loading for AMX GEMM micro-kernel for WoQ int8 case (#140258)
- Improve parallelization by collapsing vectorized loop (#128812)
- qconv at XPU backend (#133080)
- Dont use constant mask if y numel potentially overflows y grids (#139751)
- Add batched gemms into gemm max autotune with CK backend (#141520)
- Add lowering to persistent-TMA device kernel for _scaled_mm (#142045)
- Add fusion pass for linear_dynamic_fp16 with ReLU (#141556)
- Reverts runtime numeric check in Inductor to reduce compilation time (#137324).
- Optimizes ARM64 performance by utilizing 128-bit vectors (#137426).
- Adjusts score_fusion_memory_threshold application strategy in Inductor (#138970).
- Enhances reduction operations with cooperative multi-kernel support in Inductor (#138893).
- Disables sanitize_overflow in Inductor kernels (#139502).
- Implements caching for get_operation_names and get_buffer_names (#135446).
- Reorders scheduler nodes after fusion to reduce peak memory usage (#134874).
- Optimize WOQ INT8 weight dequantization in AMX GEMM template (#136630).
- Uses scalar for f64 constants in Triton codegen (#136858).
- Reduces block sizes for improved performance when using the Triton CPU backend (#136612).
- Optimizes CPU copies during autotuning by restricting them to CUDA devices (#137509).
- Adds host-side Triton TMA support (#137950).
- Optimizes the can_fuse_vertical() function (#135788).
Documentation
Distributed
- c10d
  - Added some code documentation for TCPStore and TCPStoreLibUvBackend code (#130496)
  - Added more examples for c10d collectives gather and scatter (#130427)
  - Fixed comments in ProcessGroupGloo (#137746)
  - Added more inline comments to CUDAEventCache code (#138079)
  - Added documentation for PG APIs with some cleanups (#140853)
  - Updated backend arg documentation (#142404)
- DTensor
  - Updated DTensor readme to use the new import path (#138625)
- FSDP2
- Pipeline
  - Added small comments and variable renames (#138735)
- TP
  - Updated link in distributed.tensor.parallel.rst (#136103)
- Checkpoints
  - Add links to tutorial and TorchTitan checkpointing to DCP docs (#139776)
Inductor
- Update the OSS tutorial (#139956)
- Add README for torch._inductor.runtime (#141492)
- Improve OSSProxyExecutor error messages (#141501)
- Enhances documentation for the bundled autotune cache to provide clearer guidance (#138298).
Mps
- Update MPS_ERROR_RUNTIME_TOO_LOW message (#139427)
- Fix MPS conv1d error message for output 2**16 (#134770)
- Modify missing op message (#141314)
- Update error message for supported autocast type (#139192)
NN Frontend
- Fix formula in RMSNorm documentation (#136727)
- Remove incorrect bias initialization in RMSNorm documentation (#139620)
- Add reference to pad_packed_sequence in pack_padded_sequence documentation (#137294)
- Improve documentation of register_module_forward_hook (#140379)
- Correct reference link for triplet margin loss (#142071)
- Changed 'standard-deviation' to 'variance' in normalization documentation (#141982)
- Fix broadcasting error in example in nn.functional.scaled_dot_product_attention documentation (#135427)
- Point to transformer building blocks tutorial in transformer documentation (#144425)
Optim
- Removes confusing note about closure grad modification (#137535)
- Minorly reorder optim kwargs in docs (#137531, #137528)
- RMSprop docs: add missing input "epsilon" (#137854)
- Add missing input "eps" to adam docs (#135191)
- Corrected AMSGrad max equation in Adam and AdamW (#142051)
- Documentation Update: Fix Missing Whitespace in Optimizer Docs (#138321)
Python Frontend
- Fix return type of torch.nansum example (#135435)
- Fix torch.cat doc (#135698)
- Fix multiple function parameters docstring (#136097, #140089)
- Clarify that NaNs are not equal to each other (#137386)
- Fix description in torch.save docs to show default for pickle_protocol instead of variable name (#138153)
- Fix docs for logcumsumexp formula (#139768)
- Clarify meaning of rate parameter in Gamma distribution (#134847)
- Updated docstrings referring to torch.expand to point to torch.Tensor.expand (#140045)
- Update documentation for torch.mean() to note behavior with empty tensors (#142039)
- Improve torch.squeeze parameter type in docstring (#137485)
- Improve torch.isclose docstring (#138459, #139724)
- Clarify torch.sum dtype promotion behavior (#140939)
- Clarify torch.arange floating-point rounding behavior (#141655)
- Fix torch.trapezoid docstring (#141459)
- Clarify when the optional opt-einsum dependency is used (#137596)
- Clarify torch.linalg.vector_norm input aliasing behavior (#136921)
- Fix torch.linalg.svd V* shape (#142037)
Miscellaneous
- Small rendering fix to our torch.compile FakeTensor documentation (#138281)
- Document that load_inline requires having a C++ compiler installed (#137521)
- Fix error message in torch._scaled_mm (#140343)
- Revamp torch.compile troubleshooting doc (#138620)
- Fix doc for export.export() API (#135551)
- Fix the example in fx/interpreter (#139368)
- Add new PT2 troubleshooting doc (#138620)
- Update "Getting Started with XPU" documentation. (#137479)
Developers
Composability
- Make maybe_aliasing_or_mutating a proper tag (#131990)
Distributed
- c10d
  - Added wait counter for nccl abort (#136067)
  - Added wait counter for time spent in object to tensor and tensor to object (#140414)
  - Added trace operations for TCPStoreLibUvBackend (#136320)
  - Cast device index to int before logging (#135405)
  - Logged WorkNCCL exception string to C10dLogger (#137736)
  - Made Formatter avoid throwing exceptions in socket.cpp (#137745)
  - Recorded world size in the log of flight recorder (#138044)
  - Differentiated timeout errors from nccl errors (#138240)
  - Added more appropriate socket errors and debug messages (#130347)
  - Reordered cpp stack dump and FR dump and added log prefix to loggings (#138368)
  - Reordered GIL checker and C++ stack trace print with comments (#138734)
  - Enabled watchdog to print call-time traceback when reporting NCCL watchdog timeout (#139659)
  - Added type information for FakeProcessGroup (#133211)
  - Added a wait counter for dump function (#140823)
  - Switched all timer logging in c10d to wait_counter (#141154)
  - Improved Flight Recorder efficacy (#142178)
  - Changed back vlog(2) to LOG(INFO) for Flight Recorder (#142441)
  - Changed profiling title from “NCCL barrier, nccl:all_reduce” to “nccl:all_reduce_barrier” (#140785)
  - Adopted better error message for flight recorder status (#142505)
  - Fixed the wrong error msg in ProcessGroupNCCL (#135423)
  - Added some missing spaces in barrier msg (#137721)
  - Added thread-safety initialization warning (#139638)
  - Added the log of started work numel (#139773)
  - Improved messaging of ProcessGroupNCCL destructor (#142297)
- TorchElastic
  - Passed FileTimerRequests.to_json() to log_debug_info_for_expired_timers for a better debugging experience (#135913)
Export
- Prototype _swap_modules API that can be used to swap submodules of an exported program (#136190, #139126)
- Avoid debug name crash for dim hints (#139104)
Inductor
- Remove the non-ABI-compatible mode (#138009, #138047)
- Move use_minimal_arrayref_interface logic (#138250)
- Refactor ir.Layout into ir.OutputSpec (#140910)
- Refactor dependencies.extract_loop_body_with_args (#141404)
- Modest code motion in compile_fx (#141574)
- Move post compile steps into post_compile1/post_compile2 methods (#141656)
- Inline FxGraphCache.load into its sole call site (#141681)
- Hoist set_feature_use out of conditional, rename some variables (#141683)
- Unify cache disable and cache bypass paths (#141685)
- Unify post_compile1 and the CompiledFxGraph constructor (#141689)
- Inline compile_to_fn at its only call site (#141691)
- Move block pointer analysis to a new module (#141733)
- Factor _fx_graph_cache_key and _time_taken_ns into a common base class (#141878)
- codecache: pull out some Graph serialization code into common helpers (#141502)
- Refactor optional graph module into CompiledFxGraphConstants (#141897)
- Adds a compiler bisector tool to aid in debugging and development processes within PyTorch (#131936).
Optim
- Add back optim type hints that were lost when *.pyi files were removed (#136185)
- Ensure SWA boundary conditions w.r.t. definition (#133773)
Quantization
- Add unaligned attributes to q8gemm/4x4c2-sse2.c (#140188)
- Add more support for QuantizedPrivateuse1 backends (#139860)
- Make move_exported_model_to_train/eval idempotent (#142239)
Releng
- Deprecate usage of pytorch/builder repository (#142156) (#142277) (#142282) (#142482) (#138103) (#139815) (#140020) (#142382)
- Add inductor micro benchmark on x86 metal runner (#135042) (#136052) (#135780)
- Migrated PyTorch Dev Infra Runners to Amazon Linux 2023 (#136540) (#136544)
- Migrated HUD backend database from Rockset to Clickhouse (#139296) (#139322) (#137207) (#139922) (#140574)
- Release engineering tooling, CI fixes and additional CI tests . Workflows, Trymerge, Bot Labeler, Mergebot (#136060) (#140185) (#135582) (#135644) (#136061) (#135342) (#136043) (#134356) (#136208) (#136610) (#136791) (#136239) (#135342) (#136794) (#137104) (#137168) (#137176) (#137170) (#137169) (#135390) (#137614) (#137802) (#137791) (#138178) (#138054) (#138232) (#138263) (#138178) (#138752) (#138204) (#138714) (#138874)
XPU
- Remove unnecessary Triton dependencies for XPU wheel builds. (#143983)
- Update Docker builds workflow with a new XPU image name. (#142298)
- Restore Triton build support for XPU. (#141775)
- Update Triton XPU version pinning. (#135638)
- Improve exception handling for XPU device initialization. (#141658)
- Enhance unit tests for XPU memory allocation. (#141325)
- Make XPU libraries publicly accessible for developers. (#136974)
- Improve code formatting for XPU oneDNN integration. (#139721)
- Make XPU oneDNN headers publicly available for documentation purposes. (#139177)
- Ensure XPU compiler version control in CMake for backward compatibility. Users should align their XPU compiler version with supported versions in PyTorch. (#139258)