PyTorch 2.10.0 Release Notes
- Highlights
- Backwards Incompatible Changes
- Deprecations
- New Features
- Improvements
- Bug Fixes
- Performance
- Documentation
- Developers
- Security
Highlights
- Python 3.14 support for torch.compile(); Python 3.14t (free-threaded build) is experimentally supported as well
- Reduced kernel launch overhead with combo-kernels horizontal fusion in TorchInductor
- A new varlen_attn() op providing support for ragged and packed sequences
- Efficient eigenvalue decompositions with DnXgeev
- torch.compile() now respects use_deterministic_mode
- DebugMode for tracking dispatched calls and debugging numerical divergence, making it simpler to track down subtle numerical bugs
For more details about these highlighted features, see the release blog post. The full release notes for this release follow below.
Backwards Incompatible Changes
Dataloader Frontend
- Removed the unused `data_source` argument from `Sampler` (#163134). This is a no-op unless you have a custom sampler that uses this argument; please update your custom sampler accordingly.
- Removed deprecated imports for `torch.utils.data.datapipes.iter.grouping` (#163438). `from torch.utils.data.datapipes.iter.grouping import SHARDING_PRIORITIES, ShardingFilterIterDataPipe` is no longer supported. Please import from `torch.utils.data.datapipes.iter.sharding` instead.
torch.nn
- Remove Nested Jagged Tensor support from `nn.attention.flex_attention` (#161734)
ONNX
- `fallback=False` is now the default in `torch.onnx.export` (#162726). The exporter now uses the `dynamo=True` option without fallback. This is the recommended way to use the ONNX exporter. To preserve 2.9 behavior, manually set `fallback=True` in the `torch.onnx.export` call.
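For example, to preserve the 2.9 behavior, pass the flag explicitly (a minimal sketch; the model, inputs, and output path are placeholders):

import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(1, 4),)

# In 2.10 the dynamo-based exporter no longer falls back to the legacy
# TorchScript exporter by default; opt back in explicitly if needed.
torch.onnx.export(model, args, "model.onnx", fallback=True)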
Release Engineering
- Rename pytorch-triton package to triton (#169888)
Deprecations
Distributed
- DeviceMesh
- Added a warning for slicing flattened dim from root mesh and types for _get_slice_mesh_layout (#164993)
We decided to deprecate the existing behavior of slicing a flattened dim from the root mesh, because it goes against the PyTorch design principle of explicit over implicit.
Version <=2.9:

import torch
from torch.distributed.device_mesh import init_device_mesh

device_type = (
    acc.type
    if (acc := torch.accelerator.current_accelerator(check_available=True))
    else "cpu"
)
mesh_shape = (2, 2, 2)
mesh_3d = init_device_mesh(
    device_type, mesh_shape, mesh_dim_names=("dp", "cp", "tp")
)
mesh_3d["dp", "cp"]._flatten()
mesh_3d["dp_cp"]  # This comes with no warning

Version >=2.10:

import torch
from torch.distributed.device_mesh import init_device_mesh

device_type = (
    acc.type
    if (acc := torch.accelerator.current_accelerator(check_available=True))
    else "cpu"
)
mesh_shape = (2, 2, 2)
mesh_3d = init_device_mesh(
    device_type, mesh_shape, mesh_dim_names=("dp", "cp", "tp")
)
mesh_3d["dp", "cp"]._flatten()
mesh_3d["dp_cp"]  # This now comes with a warning because it implicitly changes the state of the original mesh. We will eventually remove this behavior in a future release; users should do the bookkeeping of flattened meshes explicitly.

Ahead-Of-Time Inductor (AOTI)
- Move `from`/`to` to `torch::stable::detail` (#164956)
JIT
- `torch.jit` is not guaranteed to work in Python 3.14. Deprecation warnings have been added to the user-facing `torch.jit` API (#167669). `torch.jit` should be replaced with `torch.compile` or `torch.export`.
ONNX
- The `dynamic_axes` option in `torch.onnx.export` is deprecated (#165769). Users should supply the `dynamic_shapes` argument instead. See https://docs.pytorch.org/docs/stable/export.html#expressing-dynamism for more documentation.
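For example, a minimal sketch of migrating from dynamic_axes to dynamic_shapes (the model, input, and dimension name are placeholders):

import torch

model = torch.nn.Linear(4, 2)
args = (torch.randn(8, 4),)

batch = torch.export.Dim("batch")
torch.onnx.export(
    model,
    args,
    "model.onnx",
    dynamo=True,
    # instead of dynamic_axes={"input": {0: "batch"}}
    dynamic_shapes={"input": {0: batch}},
)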
Profiler
- Deprecate the `export_memory_timeline` method (#168036). The `export_memory_timeline` method in `torch.profiler` is being deprecated in favor of the newer memory snapshot API (`torch.cuda.memory._record_memory_history` and `torch.cuda.memory._export_memory_snapshot`). This change adds the `deprecated` decorator from `typing_extensions` and updates the docstring to guide users to the recommended alternative.
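A minimal sketch of the recommended snapshot workflow (the file name and workload are placeholders):

import torch

torch.cuda.memory._record_memory_history(max_entries=100000)

x = torch.randn(1024, 1024, device="cuda")
y = x @ x  # the workload whose allocations you want to inspect

torch.cuda.memory._export_memory_snapshot("snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording

The resulting snapshot file can be inspected in the memory visualizer at https://pytorch.org/memory_viz.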
New Features
Autograd
- Allow setting grad_dtype on leaf tensors (#164751)
- Add Default Autograd Fallback for PrivateUse1 in PyTorch (#165315)
- Add API to annotate disjoint backward for use with `torch.utils.checkpoint.checkpoint` (#166536)
Complex Frontend
- Add `ComplexTensor` subclass (#167621)
Composability
- Support autograd in torch.cond (#165908)
cuDNN
- BFloat16 support added to cuDNN RNN (#164411)
- [cuDNN][submodule] Upgrade to cuDNN frontend 1.16.1 (#170591)
Distributed
- LocalTensor:
  - `LocalTensor` is a powerful debugging and simulation tool in PyTorch's distributed tensor ecosystem. It allows you to simulate distributed tensor computations across multiple SPMD (Single Program, Multiple Data) ranks in a single process. This is incredibly valuable for: 1) debugging distributed code without spinning up multiple processes; 2) understanding DTensor behavior by inspecting per-rank tensor states; 3) testing DTensor operations with uneven sharding across ranks; 4) rapid prototyping of distributed algorithms. Note that LocalTensor is designed for debugging purposes only; it has significant overhead and is not suitable for production distributed training.
  - `LocalTensor` is a `torch.Tensor` subclass that internally holds a mapping from rank IDs to local tensor shards. When you perform a PyTorch operation on a `LocalTensor`, the operation is applied independently to each local shard, mimicking distributed computation (`LocalTensor` simulates collective operations locally without actual network communication). `LocalTensorMode` is the context manager that enables `LocalTensor` dispatch; it intercepts PyTorch operations and routes them appropriately. The `@maybe_run_for_local_tensor` decorator is essential for handling rank-specific logic when implementing distributed code.
  - To get started with `LocalTensor`, users import from `torch.distributed._local_tensor`, initialize a fake process group, and wrap their distributed code in a `LocalTensorMode` context. Within this context, DTensor operations automatically produce LocalTensors; see the hedged sketch at the end of this Distributed section.
  - PRs: (#164537, #166595, #168110, #168314, #169088, #169734)
- c10d:
  - New `shrink_group` implementation to expose the `ncclCommShrink` API (#164518)
  - New
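A hedged sketch of the LocalTensor workflow described in the LocalTensor item above; the LocalTensorMode constructor argument and the fake process group setup shown here are assumptions rather than a verified API:

import torch
import torch.distributed as dist
from torch.distributed._local_tensor import LocalTensorMode
from torch.testing._internal.distributed.fake_pg import FakeStore  # assumption: test-only helper

world_size = 4
# Assumption: the "fake" backend lets collectives be simulated in-process.
dist.init_process_group("fake", rank=0, world_size=world_size, store=FakeStore())

with LocalTensorMode(world_size):  # assumption: takes the number of simulated ranks
    # DTensor operations issued here produce LocalTensors whose per-rank
    # shards can be inspected for debugging, without spawning processes.
    ...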
Dynamo
- `torch.compile` now fully works in Python 3.14 (#167384)
- Add option to error or disable applying side effects (#167239)
- Config flag (`skip_fwd_side_effects_in_bwd_under_checkpoint`) to allow eager and compile activation-checkpointing divergence for side effects (#165775)
- `torch._higher_order_ops.print` for enabling printing without graph breaks or reordering (#167571)
FX
- Added node metadata annotation API
  - Disable preservation of node metadata when `enable=False` (#164772)
  - Annotation should be mapped across submod (#165202)
  - Annotate bw nodes before eliminating dead code (#165782)
  - Add logging for debugging annotation (#165797)
  - Override metadata on regenerated node in functional mode (#166200)
  - Skip copying custom meta for gradient accumulation nodes; tag with `is_gradient_acc=True` (#167572)
  - Add metadata hook for all nodes created in runtime_assert pass (#169497)
  - Update `gm.print_readable` to include annotations (#165397)
  - Add annotation to assertion nodes in export (#167171)
- Add debug mode to print meta in fx graphs (#165874)
Inductor
- Add experimental Pallas TorchInductor backend. (#166822)
- Add Pallas TPU backend support. (#167774)
- Add Flash Attention support to FlexAttention. (#161118)
- Add deterministic mode for Inductor compilation. (#163589) (#165950) (#164532)
- Enable custom op autotune decompositions and parameter tuning. (#164212) (#167193)
- Expose `torch.compiler.config.force_disable_caches` as a public API. (#166699)
- Add HOP for additional control dependencies to enforce explicit scheduling. (#164568)
- Add Inductor Lite Mode (#167115)
- Add distributed autotuning support (#163369)
- Add Native matmul support to inductor (#157743)
Ahead-Of-Time Inductor (AOTI)
MPS
- MPS sparse backend is functional (#162349, #162007, #162910, #162885, #163011, #163694, #164961, #165102, #166708, #166711, #167013, #169125, #165232, #168154, #169368, #167908, #168112)
torch.nn
- Add `nn.functional.scaled_mm` (#164142)
- Add `nn.functional.scaled_grouped_mm` (#165154)
- Add `nn.attention.varlen_attn` (#164502, #164504)
- Add `nn.functional.grouped_mm` (#168298)
ONNX
- A new testing module `torch.onnx.testing` with a testing utility `assert_onnx_program` (#162495)
Profiler
- Add scope for `RecordFunctionFast` (#162661)
Quantization
- Add `_scaled_mm_v2` API (#164141)
- Add `scaled_grouped_mm_v2` and python API (#165154)
- Add `embedding_bag_byte_prepack_with_rowwise_min_max` and `embedding_bag_{2/4}bit_prepack_with_rowwise_min_max` (#162924)
- Add `MXFP4` support for `_scaled_grouped_mm_v2` via FBGEMM kernels (#166530)
Release Engineering
- Enabled auto-revert on PyTorch CI (#163858, #164911, #165459)
- Add PEP 517 compliant Python source distribution package to release process (#157815)
- Add Pallas CI testing infrastructure with CPU and GPU tests (#167143, #167428, #169687, #169494, #169802)
ROCm
- Enable grouped GEMM via regular GEMM fallback (#162419)
- Enable grouped GEMM via CK (#166334, #167403)
- Enable ATen GEMM overload for FP32 output from FP16/BF16 inputs (#162600)
- Support torch.cuda._compile_kernel (#162510)
- Enhanced Windows support
  - load_inline (#162577)
  - Enable AOTriton runtime compile (#165538)
  - AOTriton scaled_dot_product_attention (#162330)
- Add gfx1150 gfx1151 to hipblaslt-supported GEMM lists (#164744)
- Add scaled_mm v2 support. (#165528)
- Add torch.version.rocm, distinct from torch.version.hip (#168097)
XPU
- Support ATen operators `scaled_mm` and `scaled_mm_v2` for Intel GPU (#166056)
- Support ATen operator `_weight_int8pack_mm` for Intel GPU (#160938)
- Extend SYCL support in the PyTorch CPP Extension API to allow users to implement new custom operators on Windows (#162579)
- Add API `torch.xpu.get_per_process_memory_fraction` for Intel GPU (#165511)
- Add API `torch.xpu.set_per_process_memory_fraction` for Intel GPU (#165510)
- Add API `torch.xpu.is_tf32_supported` for Intel GPU (#163141)
- Add API `torch.xpu.can_device_access_peer` for Intel GPU (#162705)
- Add API `torch.accelerator.get_memory_info` for Intel GPU (#162564); a hedged sketch of the new memory APIs follows this list
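A hedged sketch of the new memory APIs on Intel GPU, assuming they mirror the CUDA equivalents (the signatures shown are assumptions):

import torch

if torch.xpu.is_available():
    # Cap this process at 50% of the device's memory (assumed signature:
    # fraction, optional device index).
    torch.xpu.set_per_process_memory_fraction(0.5)
    print(torch.xpu.get_per_process_memory_fraction())
    # Assumed to return (free_bytes, total_bytes) for the current accelerator.
    print(torch.accelerator.get_memory_info())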
Improvements
Build Frontend
- Abort explicitly requested CUDA build if toolkit could not be found (#166982)
- RISC-V build improvements (#166602, #167071, #165717)
- Allow building with arbitrary BLAS library (#166333)
- Allow building with LeakSanitizer (#158686)
Composability
- If you are using the `torch.compile(backend="aot_eager")` backend, it should now give bitwise-equivalent results to eager. Previously it sometimes would not, due to extra compile-only decompositions running (#165910)
- Some dynamic shape errors were changed to recommend using `torch._check` over `torch._check_is_size` (#164889); see the sketch after this list
- Some unbacked (dynamic shape) improvements (#162652, #169612)
- Some bugfixes for symbolic float handling in compile (#166573, #162788)
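A minimal sketch of the recommended pattern (the function and values are placeholders): assert properties of data-dependent values with torch._check rather than torch._check_is_size.

import torch

def take_first_n(x, counts):
    n = int(counts[0])            # data-dependent value
    torch._check(n >= 0)          # preferred over torch._check_is_size(n)
    torch._check(n <= x.size(0))
    return x[:n]

take_first_n(torch.arange(10), torch.tensor([3]))

Under torch.compile, the same checks also feed the symbolic reasoning for unbacked (data-dependent) sizes.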
C++ Frontend
- Changed `TORCH_CHECK_{COND}` behavior to be non-fatal (#167004)
- Migrated `TypeTraits`, `TypeList`, `Metaprogramming`, `DeviceType`, `MemoryFormat`, `Layout`, `version.h`, and `CppTypeToScalarType` to `torch::headeronly` (#167386, #163999, #168034, #165153, #164381, #167610)
- Bumped `libfmt` submodule version to `12.0.0` (#163441)
CUDA
- Make `torch.cuda.set_rng_state` and `torch.cuda.get_rng_state` work in CUDA graph capture. (#162505)
- Enable templated kernels (#162875)
- Enable pre-compiled kernels (#162972)
- Add CUDA headers automatically (#162634)
- Remove outdated `header_code` argument (#163165)
- Prevent copies of std::vector in CUDA ForeachOps (#163416)
- Implement cuda-python CUDA stream protocol (#163614)
- Remove outdated checks and docs for cuBLAS determinism (#161749)
- Cleanup old workaround code in `launch_logcumsumexp_cuda_kernel` (#164567)
- Add a compile-time flag to trigger verbose logging for device-side asserts (#166171)
- Support SM 10.3 in custom CUTLASS matmuls (#162956)
- Enable CUTLASS matmuls on Thor (#164836)
- Add `per_process_memory_fraction` option to `PYTORCH_CUDA_ALLOC_CONF` (#161035)
- Support nested memory pools (#168382)
- Upgrade cuDNN to 9.15.1 for CUDA 13 builds (#169412)
Distributed
- c10d
- Context Parallel
  - Introduced ContextParallel plan for `parallelize_module` (#162542)
  - Replaced the context_parallel context manager with functional APIs (#164500)
  - Introduced `flex_cp_forward` custom op for FlexAttention CP (#163185)
  - Add `_templated_ring_attention` to the backward compatibility stub (#166991)
  - Added `_LoadBalancer` classes and a load-balance interface to Context Parallel APIs with process-time based Round-Robin load balancing (#161062, #163617)
  - Added python bindings for NCCL CTA policies (#164309)
- DeviceMesh
  - Adopted CuTe layout for DeviceMesh internal bookkeeping with a shared 1D _rank_map tensor and related code cleanups (#162413, #162534, #163212, #163288, #163928, #163930, #164750, #164954, #164510, #166264, #167581, #162690, #163367, #166614)
  - Implemented `_unflatten` on top of CuTe layout bookkeeping (#161224, #165521)
  - Added support of `_rank` for use with non-global PGs (#162439)
- FullyShardDataParallel (FSDP1 and FSDP2)
- DTensor
  - Extended conv ops to 3D (#165241, #167402)
  - Added an explicit mode (ExplicitRedistributionContext) for DTensor redistribute (#166593, #167370, #169452)
  - Reduced DTensor CPU overhead by moving logic into C++ and more optimizations to sharding propagation and cache (#162508, #163820, #162990, #166750, #166989, #166990, #167051, #166372, #166808, #167475, #167588, #168264, #169519, #168051, #168983, #166132, #167580, #168269)
  - Enable per-rank RNG state collect/set for XPU devices in DTensor (#169410)
  - Added `_foreach_pow`, `logsumexp` and `masked_fill_.Scalar` to the sharding propagation list. (#162895, #163879, #169668)
- SymmetricMemory
  - Added MemPool support to CUDA backend and get_mem_pool API (#169740, #170008, #169739)
  - Added op `multimem_one_shot_reduce_out` (#164517)
  - Added op `multi_root_tile_reduce` (#162243, #164757)
  - Added op to get remote tensors (#167779)
  - Added `symm_mem_sync` Triton kernel to `torch.ops.symm_mem` (#168917)
  - Added an NVSHMEM-based one-sided API (#159837, #163194)
  - Skipped multicast initialization if it fails (#163750)
  - Supported copy-engine-based all-gather and all-to-all (#170344, #170265)
  - Added `set_signal_pad_size` API for SymmetricMemory (#169156)
- Pipeline Parallelism
  - Made runtime dbg log print custom actions (#167113)
  - Moved profiler record_function in schedule and improved visualizer (#164976, #160474)
  - Enabled inspect of schedule IR with comms (#162996)
  - Use default export mode (non-strict) for pipeline parallelism (#164045)
  - Enabled PP split of BlockMask into micro-BlockMask (#164111)
  - Migrate other schedules to use `PipelineScheduleRuntime` (#164777)
  - Improved composability with FSDP: FSDP reduce-scatters moved to the end of the step and backward_counter moved to the schedule class (#165106, #165513)
  - Added optional argument to not save outputs (#165822)
  - Added PP runtime features for supporting graph-based execution (#167277)
  - Used same dtype for receive and send tensors when initializing p2p communication (#165539)
  - Support `OVERLAP_F_B` in schedule (#161072)
  - Support custom callback functions in schedule (#162016)
- torchelastic
Dynamo
- Turn on `capture_scalar_outputs` and `capture_dynamic_output_shape_ops` when `fullgraph=True` (#163121, #163123)
- Improved tracing for `dict` key hashing (#169204)
- Tracing support for `torch.cuda.stream` (#166472)
- Improved tracing of `torch.autograd.Function`s (#166788)
- Miscellaneous smaller tracing support additions:
  - Extend `collections.defaultdict` support with `*args`, `**kwargs` and custom `default_factory` (#166793)
  - Support for bitwise xor (#166065)
  - Support `repr` on user-defined objects (#167372)
  - Support new typing union syntax `X | Y` (#166599)
Export
- Improved fake tensor leakage detection in export (#163516)
- Improved support for tensor subclasses (#163770)
FX
- Add tensor subclass printing support in fx/graph.py (#164403)
- Update Node.is_impure check if subgraph contains impure ops (#166609, #167443)
- Explicitly remove call_mod_node_to_replace after inlining the submodule in `const_fold._inline_module` (#166871)
- Add strict argument validation to Interpreter.boxed_run (#166784)
- Use stable topological sort in fuse_by_partitions (#167397)
Inductor
- Pruned failed compilations from Autotuning candidates (#162673)
- Extend triton_mm auto-tune options for HIM shapes (#163273)
- Various fixes for AOTI-FX backend
- Solve for undefined symbols in dynamic input shapes (#163044)
- Support symbol and dynamic scalar graph inputs and outputs (#163596)
- Support unbacked symbol definitions (#163729)
- Generalize FloorDiv conversion to handle more complex launch grids. (#163828)
- Don't flatten constant args (#166144)
- Support SymInt placeholder (#167757)
- Support torch.cond (#163234)
- Add tanh, exp, and sigmoid activations for Cutlass backend. (#162535) (#162536)
- Hardened the experimental horizontal fusion `torch._inductor.config.combo_kernels` (#162442) (#166274) (#162759) (#167781) (#168127) (#168946) (#168109) (#164918)
- Enable TMA store for TMA matmul templates on Triton. (#160480)
- Add Blackwell GPU templates (persistent matmul, FP8 scaled persistent + TMA GEMMs, CuTeDSL grouped GEMM, FlexFlash forward, FlexAttention configs). (#162916) (#163147) (#167340) (#167040) (#165760)
- Support `qconv_pointwise.tensor` and `qconv2d_pointwise.binary_tensor` quantized operations. (#166608)
- Support `out_dtype` argument for matmul operations. (#163393)
- Add support for bound methods in pattern matcher. (#167795)
- Add way to register custom rules for graph partitioning. (#166458) (#163310)
- Add codegen support for `fast_tanhf` on ROCm. (#162052)
- Support deepseek-style FP8 scaling in Inductor. (#164404)
- Enable int64 indexing in convolution and matmul templates. (#162506)
- Add SDPA patterns for T5 variants when batch size is 1. (#163252)
- Add mechanism to get optimal autotune decision for FlexAttention. (#165817)
- Add fallback config `fallback_embedding_bag_byte_unpack`. (#163803)
- Expose config for FX bucket all_reduces. (#167634)
- Add in-kernel NaN check support. (#166008)
- Enable `pad_mm` and `decompose_mm_pass` passes on Intel GPU. (#166618) (#166613)
- Improve CUDA support for int8pack_mm weight-only quantization pattern. (#161680) (#161848) (#163461)
- Improve heuristics for pointwise kernels on ROCm. (#163197)
- Enable mix-order reduction fusion earlier and allow fusing more nodes. (#168209)
- Make mix order reduction work with dynamic shapes (#168117)
- Better use of memory tracking (#168121)
- Turn on LOAF (for OSS) by default. (#162030)
- Log kernel autotuning results to CSV. (#164191)
- Add warning for CUDA graph re-recording from dynamic shapes. (#162696)
- Quiesce triton compile workers by default. (#169485)
- Support masked vectorization for tail loops with integer and bool datatypes. (#165885)
- Support tile-wise (1x128) FP8 scaling in Inductor. (#165132)
- Support fallback for all GEMM-like operations. (#165755)
- Enable Triton kernels with unbacked inputs. (#164509)
- Add AVX512-VNNI-based micro kernel for CPU GEMM template. (#166846)
- Support mixed dtype in the `native_layer_norm_backward` meta function. (#159830)
- Add tech specs for MI350 GPU. (#166576)
- Add `assume_32bit_indexing` inductor config option. (#167784)
- Wire up mask_mod and blockmask to FlexFlash implementation. (#166359)
- More aggressive mix order reduction for better fusion. (#166382)
- Mix order reduction heuristics and tuning. (#166585)
- CuteDSL flat indexer needs to be colexicographic in coordinate space (#166657)
MPS
- Add `embedding_bag` operator (#163012, #163931, #163281)
- Continue ops migration to Metal and add complex support (#169478, #166903, #167755, #167826, #166216, #166670, #169407, #166210, #166090, #168120, #167569)
- Asynchronously report out-of-bounds access errors for `embedding_bag` and `index_select` ops (#166615, #168930, #166468)
Nested Tensor (NJT)
- Added NJT support for `share_memory_` (#162272)
torch.nn
- Support batch size 0 for flash attention in `scaled_dot_product_attention` (#166318)
- Raise an error when using a sliced `BlockMask` in `nn.functional.flex_attention` (#164702)
ONNX
- Improved graph capture logic to preserve dynamic shapes and improve conversion success rate
- Cover all FX passes into backed size oblivious (#166151)
- Set prefer_deferred_runtime_asserts_over_guards to True (#165820)
- Various warning and error messages improvements (#162819, #163074, #166412, #166558, #166692)
- Improved operator translation logic
- Update weight tensor initialization in RMSNormalization (#166550)
- Support enable_gqa when dropout is non-zero (#162771)
- Implement `tofile()` in ONNX IR tensors for more efficient ONNX model serialization (#165195)
Optimizer
- Make `Adam` and `AdamW` work with nonzero-dim Tensor betas (#149939)
Profiler
- Expose Kineto event metadata in PyTorch Profiler events (#161624)
- Add `user_metadata` display to memory visualizer (#165939)
- Add warning for clearing profiler events at the end of each cycle (#168066)
Python Frontend
- Improved `torch.library` and custom ops to support view functions (#164520)
- Rework PyObject preservation to make it thread safe, significantly simpler, and better able to handle some edge cases (#167564)
- Remove reference cycle in torch.save to improve memory usage (#165204)
- Add `generator` arg to `rand*_like` APIs (#166160); see the sketch after this list
- Support negative index arguments to torch.take_along_dim (#152161)
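A minimal sketch of the new generator argument on the *_like random factories (the shapes and seed are placeholders):

import torch

g = torch.Generator().manual_seed(0)
x = torch.empty(2, 3)
a = torch.rand_like(x, generator=g)   # reproducible given the seeded generator
b = torch.randn_like(x, generator=g)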
Quantization
- `half` and `bf16` support for `fused_moving_avg_obs_fake_quant` (#162620, #164175)
- `bf16` support for `fake_quantize_learnable_per_channel_affine` (#165098)
- `bf16` support for backward of `torch._fake_quantize_learnable_per_tensor_affine` (#165362)
- Add `NVFP4` two-level scaling to `scaled_mm` (#165774)
- Add support for `fp8_input`/`fp8_weight`/`bf16_bias` and `bf16_output` for fp8 qconv on CPU (#167611)
- Make the `torch.float4_e2m1fn_x2` dtype support equality comparisons (#169575)
- Add `copy_` support for `torch.float4_e2m1fn_x2` dtype (#169595)
Release Engineering
- Add support for CUDA 13.0 in CI/CD including binary builds, inductor benchmarks, and upgrade to CUDA 13.0.2 (#162455, #162425, #163787, #164383, #164607, #163239, #165029, #168091, #169902, #163988)
- Add B200 GPU support with symmetric memory testing and smoke tests (#162988, #168990)
- Improve CUDA builds for aarch64, Windows, and legacy driver support (#162566, #163956, #164470, #165013, #163029, #167769, #167046)
- Upgrade to ROCm 7.0 and 7.1 (#163860, #163883, #163937, #163140, #164201, #165756, #166665, #166730, #166693, #167390, #166764)
- Add support for MI355, MI300, gfx1100, gfx1150, gfx1151, and gfx950 GPU architectures (#160215, #167587, #148355, #165103, #165326, #165658, #165699, #167299, #167225, #169427, #166544)
- Migrate ROCm CI to Ubuntu noble images and expand CI coverage (#168230, #168202, #168088, #167593, #167379, #162649, #162721, #163014, #163339, #163776, #164244, #164585, #164279, #164616, #164769, #165674, #165821, #166575, #166645, #166870, #166915, #166961, #167220, #167262, #167483, #168359, #169300, #169679, #168104)
- Upgrade XPU support package to 2025.3 (#166829)
- Upgrade XPU build infrastructure to GCC 13 and Ubuntu 24.04 (#162474, #162475, #164127)
- Expand XPU testing coverage with profiler tests, inductor benchmarks, and additional unit tests (#166289, #166954, #166047, #165423, #169799)
- Improve vLLM integration with nightly builds, aarch64 support, and test infrastructure (#162371, #162664, #163232, #163383, #166146)
ROCm
- Allow custom OpenBLAS library name for CMake build (#166333)
- Add gfx1150 gfx1151 to binary build targets (#164782, #164854, #164763)
- hipSPARSELt support - Update cuda_to_hip_mappings.py (#167335)
- New implementation of upsample_bilinear2d_backward (#164572)
- Remove env var HIPBLASLT_ALLOW_TF32 from codebase, TF32 always allowed (#162998)
- Enable multi-arch compilation and unit tests for AOT Inductor (#166357)
- Fix miopen batchnorm changing output format (#162112)
- Autotune support for persistent reduction kernels in inductor (#163908)
Sparse Frontend
- Add MPS support for sparse_mask backward and sparse sum backward (#166260, #169240)
- Add exp support for COO on CPU, CUDA and MPS (#166801)
- Remove old CUDA 11 sparse code (#166048, #164531, #164199)
XPU
- Support the `--nproc-per-node` torchrun option for Intel GPU (#159474)
- Support complex dtype for the ATen operator matmul on Intel GPU (#160867)
- Add SYCL-TLA implementation for ATen flash attention (#169101)
Bug Fixes
Autograd
- Fix custom autograd Function memory leak when saving mutated view (#164407)
- Fix unused gradient tracking to respect create_graph (#168295)
- Fix NaN gradients in atan2_backward when both inputs are zero (#166787)
- Bugfix to forward autodiff causing a different datatype (#165784)
Build Frontend
- Fix build targets order (#169905, #169994, #164165)
- Do not restrict optimization flags (#164894)
- Fix linking issue for Linux-aarch64 target (#169723)
C++ Frontend
- Fixed C++ extension distributed warning spew (#162764)
CPU
- Fix clang-21 warnings (#166859)
CUDA
- Handle python floats as double in CUDA C++ (#162626)
- Use libnvrtc.so path based on CUDA version used by torch (#163642)
- Fix `torch.nonzero_static` crash on CUDA when the input is an empty tensor (#162578)
- Fix caller source location in `C10_CUDA_CHECK` error messages (#162808)
- Fix channels-last dimension mapping in CUDA `parallel_cat` (#165023)
- 64-bit indexing on CUDA:
- Remove erroneous `const_cast` in CUDA `memcpy` call (#168165)
- Handle large shared memory in `torch.cuda._compile_kernel` (#162647)
- Fix `torch.unique_consecutive` crash on CUDA (#162950)
- Fix correctness of `parallel_cat` (#165446)
- Fix race condition and make `torch.kthvalue` deterministic on CUDA (#165762)
- Fix shared memory race in `reduce_kernel` (#162995)
- Fix `Tensor.__dlpack__(stream=None)` support during CUDA Graph capture (#163242)
- Remove explicit casting of complex nansum during accumulation to avoid precision loss (#165494)
- Disable jiterator for complex tan and tanh due to numerical issues (#165250)
- Fix a few issues with `out_dtype` overload for addmm/baddbmm (#167931)
- Fix safety issues when calling cuBLAS from multiple threads (#167248)
cuDNN
- Disable cuDNN for 3D convolutions with kernel size != 1 for cuDNN 9.8+ due to a numerical issue (#163581)
Dataloader Frontend
- Fix pin memory return type when input is a tuple (#169690)
Distributed
- c10d
- Context Parallel
  - Fixed cuDNN Context Parallel LSE dimension bug (#163231)
- DistributedDataParallel (DDP)
  - Fixed complex datatype handling in DDP (#166863)
- DistributedStateDict
  - Fixed KeyError when loading a parameter with unsaved optimizer state (#165228)
- DTensor
  - Fixed `foreach_max` op (#169667)
  - Fixed
- FullyShardDataParallel (FSDP1 and FSDP2)
- Pipeline Parallelism
- SymmetricMemory
  - Fixed memory allocation hold-up (#162680)
Distributed Checkpointing
- Avoid multiple storage writer resets in async save (#159448)
- DTensor slice dequantization with proper block alignment (#163532)
- Add option to use PrefixStore to create checkpoint background process (#166560)
Dynamo
- Fixed `cProfile` usage with `torch.compile` in Python 3.12+ (#170013)
- Fix memory leak in tensor subclass metadata guard (#167352)
FX
- Fix splitter for empty subgraph case (#161716)
- Use tuples to have a deterministic ordering in shape prop. (#164851)
Inductor
- Fix some edge cases (#162295)
- Fix TMA transpose logic to handle 1D shapes + string differences (#163966)
- Fix flex attention eager: don't round down scores to low precision (closes #163588) (#163986)
- Fix a condition error in torch/_inductor/codegen/debug_utils.py (#165033)
- Thread deterministic config vars to subproc compilation (#165729)
- Fix identity expansion. (#165066)
- Fix FP8 activation quantization for duplicate forward outputs. (#163364)
- Fix decomposition issues (`repeat_interleave` out-of-bounds indices, `divmod` error, `alpha`/`beta` handling). (#165368) (#163482) (#167317)
- Fix dynamic shaped heads check in FlexFlash. (#165866)
- Fix `argmin`/`argmax` returning incorrect indices for non-contiguous tensors. (#165983)
- Fix unbacked float symbol handling in kernel codegen. (#166890)
- Fix `static_input_indices` subclass remapping under training. (#167127)
- Fix `torch.cond` HOP device handling in inductor. (#167354)
- Fix `CppTile2DKernel` for FP8 datatype. (#167451)
- Fix user-defined Triton kernel output with `.cpu()` correctness issue. (#168281)
- Fix viewed outputs getting padded incorrectly. (#163398)
- Fix lowering issues (`as_strided` with `.view(dtype)` inputs, symbolic shapes in FlexAttention, `sym_size_`/`sym_stride`). (#163319) (#168383) (#167565)
- Fix error from custom CUDA allocators. (#163422)
- Fix `copy_` for scalar in inductor. (#164167)
- Fix bug with serialization after AOTAutogradCache hit. (#165474)
- Fix `searchsorted` for non-dense tensors. (#165064)
- Fix constant folder issues. (#166655)
- Fix constant creation issues. (#167398)
- Fix picking wrong contiguous node. (#168371)
- Fix inner reduction decision logic. (#168391)
- Fix device determination logic in Conditional. (#169199)
- Fix pattern matcher `FailedMatch` format string. (#169611)
- Fix SyntaxError from truncated Unicode escape in Windows compile-time auto-tuning block. (#169286)
- Optimize sum reduction heuristics. (#163144)
- Optimize scalar `welford_reduce`. (#162709)
- Disable mixed-order reduction for cpp-wrapper. (#169859)
- Capture Triton timeout errors without crashing the job. (#169064)
- Correctly set max_num warps in coordinate descent tuner. (#159146)
- Fix Triton `group_m` config. (#169514)
- Fix convolution autotune check when `groups != 1`. (#163094)
- Fix constant shape for float constants. (#164241)
- Fix Diode/exhaustive autotune crash on AMD. (#169225)
- Fix `get_raw_stream` undefined error. (#163707)
- Fix runtime error in context on a CPU-only machine when compiling for GPU. (#165220)
- Fix AMD CPU max-autotune breakage. (#168079)
- Fix bad merge duplicate pre pass. (#165917)
- Fix layout constraint for `weight_norm` backward. (#167667)
- Fix cross-process group overlap. (#169384)
- Fix WeakDeps (WAR deps) handling during fusion. (#162316)
- Fix unbacked replacement where LHS is purely backed expr and RHS is unbacked expr. (#164013)
- Fix stride rounding on Blockwise128x128 to accommodate for small shapes. (#164953)
- Fix loop pipelining for 2D/2D case of Triton grouped MM. (#165265)
- Fix persistent rblock statically_known_leq error cases. (#165657)
- Fix bugs in emulate_precision_casts (#163520)
- Support python slicing with tensor inputs. (#165074)
Ahead-Of-Time Inductor (AOTI)
- Bugfix for doing negative padding (#161639)
- Fix unbounded number of substitutions when equality checks contain Max expr (#163685)
- Use atomic API when trying to apply size hints to input tensor strides. (#163660)
- Fix a mixed-device bug for scatter_add (#167341)
- Fix a small buffer mutation issue (#169347)
- Fix `aot_compile` typing. (#168320)
MPS
- Fix empty tensors handling for `median`/`nanmedian`/`mv`, `dot` (#162846, #166561, #165237)
- Fix dlpack exports/imports of sliced tensors (#169272)
- Fix large tensors silent correctness for `fill` and `cat` operations (#164108, #165373, #166556, #164416)
- `torch.compile` bugfixes (#169648, #163021, #162776, #163452)
- Silent correctness/input validation fixes (#163036, #165254, #165267, #167777, #165058, #167961, #169261, #165871, #163507, #168332)
Nested Tensor (NJT)
- Fixed NJT min / max operations on integer dtypes (#162273)
torch.nn
- Fix silent correctness when backpropagating to `score_mod` in `nn.functional.flex_attention` (#163677)
- Fix bug in `nn.Module.load_state_dict` for singleton tensor (#166335)
ONNX
- Native ONNX ops (`torch.onnx.ops`)
  - Fix rotary_embedding_23 implementation (#162865)
  - Create fake implementations for onnx ops; fix boolean mask in attention (#165780)
- Fix onnx export on big endian machines (#167816)
Optimizer
- Fix `SWALR.state_dict` and `load_state_dict` to serialize properly with `weights_only=True` (#163122)
- Prevent problematic tensor aliasing in LRScheduler (#163098, #163120)
- Fix `LBFGS` Wolfe max iteration (#161488)
Profiler
- Fix `ProfilerState` typo ('Disable' → 'Disabled') and expose `PRIVATEUSE1` in `ActiveProfilerType` (#169166)
ROCm
- Fix hardsigmoid op (#162758)
- Fix GEMM carveout feature (#164303)
- Disable `__builtin_amdgcn_rcpf` for gfx90a (#166454)
- ROCm 7.0 BC-breaking preparations in JIT support (#160587, #166147)
Sparse Frontend
- Fix mul(COO, COO) on MPS for hybrid COO variants (#166164)
- Update torch.sparse_coo_tensor error message to include more information about input tensor properties (#161900)
- Fix GradTrackingTensor sparse layout propagation (#165765)
XPU
- Fix oneDNN deconvolution with `output_padding` on Intel GPU (#169176)
- Fix conv1d precision error on Intel GPU (#162944)
- Fix incorrect FLOPs counting of `convolution_overrideable` on Intel GPU (#166839)
- Fix performance drop in AOTI on Intel GPU (#163315)
Performance
Benchmark
- Add attention benchmarking numbers to pytorch operator microbenchmarks (#164155)
CPU (AArch64)
- Improved aarch64 performance with optimizations for type conversions (bfloat16, FP16, bool), erf function, and autovectorization enhancements (#166049, #166262, #166306, #166330, #166594, #166641, #166739, #166880, #166958)
CUDA
- Integrate NVIDIA cuSolver backend into ATen/Linalg (initial implementation for eig/eigval) (#166715)
- Reduce register pressure in `radix_sort_pairs` to improve torch.sort performance (#167094)
- Add Flash Attention 4 to SDPA (#167348)
- Vectorize stores in cat for all dtypes on CUDA (#162440)
- Expose `pinned_reserve_segment_size_mb` to speed up pinned memory allocation (#164501)
- torch.topk: refactor global histogram/cumsum into a dedicated kernel to improve performance on CUDA (#164459)
- Vectorize 8 elements on 16-bit data types for sum/mean to improve performance (#165055)
- Switch order of blocked reduce when vectorizing loads to improve performance (#165178)
- Reduce register pressure to improve `torch.EmbeddingBag` performance (#167834)
- Make clamp kernel branchless to improve performance (#167889)
cuDNN
- Reenable cuDNN for 64-bit depthwise convolutions (#168364)
Distributed Checkpointing
- Add timeout for checkpoint background process join (#162828)
- Disable GC in process based async checkpointing (#169613)
- Optimize global save-plan validation (#166820)
- state dict staging fixes (#166025)
Dynamo
- Faster tracing of some pytree functions (#168342)
FX
- Move Node._prepend/Node._remove_from_list to C++ (#165882)
- Optimize torch.fx.Node.replace_all_uses_with (#165889)
Inductor
- Naive foreach autotune support (#162053)
- Invert unary read and write for better fusion. (#161404)
- Generate fused RMS/layer norm backward. (#165370)
- Optimize cold compile time when cudagraphs-partition is enabled. (#167132)
- Reduce cold compilation time caused by duplicated user-defined Triton kernels. (#168292)
- Optimize identity permute in `empty_permuted` decomposition. (#169731)
- Properly enlarge XBLOCK/set num_warps=1 for B200 inner persistent reductions. (#168335)
- Improved heuristic for operator reordering for peak memory. (#161810)
- Add more configs for pointwise kernels on ROCm. (#166470)
- Improve A16W8 performance for CPU GEMM template. (#162479)
- Make mix-order-reduction split size not depend on split-reduction heuristics (#166461)
- Less aggressive persistent reduction when it could induce large masking with dynamic shapes. (#163365)
- Improve FlexAttention backward configs for B200 (#163318)
Quantization
- Make prepare and convert faster by caching (#162550)
- Add onednn context cache for CPU qlinear to improve performance (#168150)
Release Engineering
- Add operator microbenchmarks for attention, convolution, and optimizer operations to CI (#165915, #166331, #168101)
- Add HuggingFace LLM benchmarks and cleanup benchmark model configurations (#156967, #164815, #164816)
ROCm
- Use hipSolver instead of MAGMA for Cholesky (#163977)
- Layer norm now uses __builtin_amdgcn_rcpf(x) instead of 1.f/x (#165589)
- OffsetCalc Unroll Optimization (#161700)
- Improve perf for elementwise broadcast with mixed dtype (#163562)
- Implement float32 copy kernel (#163869)
- Improve non stride-one backwards indexing for small index sets (#164409)
- Adjust grid size for non-unit stride backwards indexing (#165026)
- Normalization update to block size (#165941)
- Deserialize loads in planer sum portion of reduce() of norm. (#165927)
- Deserialize loads in planer sum portion of stats() of norm (#166021)
- Specialized binary elementwise broadcast kernel for mixed dtypes with float/bfloat16/half (#167233)
- Roll kernel as grid stride loop (#169474)
- Inductor performance improvements via configs, heuristics, and/or codegen (#163908, #161280, #166470, #162052, #163197)
torch.func
- 20x less memory use and 37.25% speedup in min_cut_rematerialization_partition when using the new dp knapsack solver, compared to existing default one (dp) (#160914)
Documentation
Autograd
- Add `inference_mode` hint message to use `eval` with inference. (#163619)
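A minimal illustration of the documented pairing (the module is a placeholder):

import torch

model = torch.nn.Dropout(p=0.5)
model.eval()                     # disable training-only behavior such as dropout
with torch.inference_mode():     # no autograd tracking for inference-only code
    out = model(torch.randn(4))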
CUDA
- Add Documentation for Device APIs (#162834)
- Adding aliases for CUDA and XPU API documentation (#162984)
- Clarify safety of CUDA graph memory pool sharing across graphs in documentation (#166975)
Distributed
- c10d
- Complete documentation for all distributed c10d APIs (#165194)
Dynamo
- Updated documentation for `tlparse` (#171339). `tlparse` is a compilation report tool that processes `TORCH_TRACE` logs to generate interactive HTML reports showing how your model was compiled. When reporting bugs to PyTorch developers, we encourage you to attach the trace log or `tlparse` output to provide critical debugging information to help us bisect the issue.
FX
- Add docs for torch.fx.experimental.unification (#167334)
- Fix the split_module tutorial code (#166154)
Inductor
- Updated documentation for `tlparse` (#171339) (#162975). `tlparse` is a compilation report tool that processes `TORCH_TRACE` logs to generate interactive HTML reports showing how your model was compiled. When reporting bugs to PyTorch developers, we encourage you to attach the trace log or `tlparse` output to provide critical debugging information to help us bisect the issue.
- Update FlexConfig documentation. (#162533)
Ahead-Of-Time Inductor (AOTI)
- [AOTI] Update AOTInductor tutorial (#163808)
torch.nn
- Update CTCLoss docs: float32 input required for cuDNN (#162042)
- Update LPPool docs to clarify ceil_mode padding semantics when ceil_mode=True (#163186)
ONNX
- Update export docstring (#162622)
- Fix incorrect attention example in ONNX exporter docstring (#167646)
Profiler
- Add documentation for `FunctionEvent` (#167688)
Quantization
- Document some quantization public APIs (#165160)
- Add missing method docstrings for PyTorch quantization classes (#165199)
XPU
- Add new supported client GPU Panther Lake in "Get Started with XPU" page (#170517)
Security
Developers
Composability
- Removed guard_size_oblivious from internal code, replacing most usages with guard_if_{false|true}. Both APIs are used in framework code that gets traced through to make it more friendly to unbacked symints, but the new APIs are more intuitive (#164664, #164665, #167232)
Distributed
- c10d
  - Added TCPStore based debug page and fr trace analysis with py-spy support (#169095, #169144, #169147, #167871)
  - Modernized the c10d code base, removing Python code targeting versions older than 3.10 (#163613, #163456, #163440, #167173)
  - Enabled FlightRecorder for torchft with dynamic dumping path and a reset API (#164752, #164988, #164591, #165639, #166970, #166182)
  - Improvements to FakeProcessGroup: direct construction error and error out if comms are invoked (#162841, #163665)
- DTensor
- torchelastic
FX
- Refactor proxy_tensor (#165266)
- Fix invalid symbol definition emitted in fx_graph_runnable.py (#166529)
- Add debug-level logging to Interpreter.run_node (#117351) (#166622)
- Fix an unsafe indexing in fx exception handling (#169140)
- Type annotations for torch/_higher_order_ops/flat_apply.py (#168933)
- Add recompute tags (from AC) into GraphModule.print_readable() by default (#167735)
- Apply ruff UP035 rule (#165214, #163744)
- Add model code stack trace to cuda.memory._snapshot (#166676)
- Add model code stack trace to torch.profile (#167110)
- Move enrich_profiler_metadata config import out of gm.recompile() (#167114)
Inductor
- Add API for scheduling overlap from inductor configs. (#169693)
- Make `LOCK_TIMEOUT` in codecache configurable. (#165030)
- Add debug output for specific pattern matching. (#169603)
- Add overridable env var for disabling FX graph cache. (#166138)
- Add subsystem support to pattern matcher. (#163922)
- Add pre-grad graph bisecting support. (#166344)
- Decouple flags for optimization and debug symbols. (#167385) (#167575)
- Introduce HOP for inductor compiled regions to allow torch dispatch. (#167844)
- Record triton kernels, run-to-run determinism checks (#167028)
Release Engineering
- Migrate from setup.py to modern Python build tools (pip install and python -m build) (#156711, #156712)
ROCm
- Add Rocm to Operator Microbenchmark CI (#164173)
- Enable TD for all ROCm default and distributed config workflows (#168225)
- Expand trunk.yml coverage for ROCm (#168162)
- CUDA graph trees unit test fixes (#163592)
- test_convolution.py uses MIOpen immediate mode (#164598)
- Keep the amdgpu-coerce-illegal-types flag if the ROCm version is less than 7.2 (#165789)
- Use a ROCm version string without hash. (#166336)
- Dynamo benchmarks: remove outdated flaky models and enable deterministic algorithms (#169024)
XPU
- Upgrade Intel GPU software stack package to intel-deep-learning-essentials-2025.3 (#166829)