Highlights
ExecuTorch v1.2.0 expands on-device AI to more models and more hardware. This release adds real-time speech inference with Voxtral Realtime, promotes Cortex-M to a first-class embedded target, delivers major backend improvements, and reduces binary size for resource-constrained deployments.
- Aligned with PyTorch 2.11, TorchAudio 2.11, TorchVision 0.26, TorchCodec 0.11, TorchAO 0.17, and PyTorch-Tokenizers 1.2.
- New model support - Voxtral Realtime (streaming speech), NVIDIA NeMo Sortformer (speaker diarization), Silero VAD (voice activity detection), and Qwen3.5 are now exportable and runnable across multiple backends.
- Cortex-M as a first-class target - Dedicated backend with CMSIS-NN integration, quantized int8 batch matmul and pad, improved pattern matching, and portable kernel usage for broader operator support.
- Metal backend - 4-bit quantized inference via MLX-derived GEMM kernels, native causal SDPA with GQA, GPU buffer pool with LRU eviction, dispatch pipelining for lower latency, and mmap weight prefetching for faster model loading.
- CUDA backend - SlimTensor integration for lightweight tensor management, CUDA stream sharing across methods for skip-copy optimization, and pybind integration with automatic CUDA detection.
- Vulkan backend - Comprehensive int8 quantized inference with new linear, convolution, and fused operators, layout-flexible shaders, and improved device compatibility including fp16 fallback and Vulkan 1.0 support.
- Arm backend (Ethos-U / TOSA) - LLM support via TOSA and Ethos-U backends, VGF image classification flow, Ethos-U SDK 26.02 with Vela 5.0.0, aarch64-linux-musl build support, and new operators including grouped transposed convolutions and boolean attention masks.
- Qualcomm AI Engine Direct (QNN) - Backend-aware quantizer for hardware-targeted quantization, CDSP Direct Mode for lower-latency DSP dispatch, SLC allocator, and attention sink for long-context support.
- NXP backend - Migration to the eIQ Neutron SDK, selective kernel registration for a smaller footprint, weight prefetching from external memory to SRAM, QAT support, and 14 new operators.
- Faster serialization - New flatbuffer Program serialization in EXIR improves export performance.
- Smaller binaries - Compile-time optimizations (`.eh_frame` suppression, constexpr kernel constructors, log disabling) reduce binary size for embedded targets.
- LoRA and multi-method - Structured LoraConfig and MultimethodLoraConfig enable on-device model personalization. Multi-method support is now wired through export, runner, and ETRecord.
- MPS backend deprecated - The MPS backend is deprecated and will be removed in a future release (v1.4.0).
New Models
- Voxtral Realtime - Mistral's real-time speech model with true streaming inference on XNNPACK, Metal, and CUDA. Includes int4 quantization and fp32/bf16 support, live microphone input, and pre-exported `.pte` files for quick start.
- Qwen3.5 - Export support for 0.8B, 2B, and 4B variants. Also added Qwen2.5-Coder.
- NVIDIA NeMo Parakeet - Now runs on XNNPACK by default, with int4 quantization, Metal int4 via HQQ, bfloat16, and CUDA support.
- NVIDIA NeMo Sortformer - Speaker diarization model on XNNPACK and CUDA.
- Silero VAD - Voice Activity Detection on XNNPACK for speech endpoint detection.
Core Components
EXIR / Export
- Flatbuffer Program serialization for faster export and model loading.
- CSE (Common Subexpression Elimination) pass eliminates redundant computations in exported models, reducing model size and improving inference speed. Includes hierarchical infrastructure for nested graph optimization.
- Unified fusion passes across backends for more consistent optimization.
- QAT ConvBN fuse pass and quantize fused ConvBN bias pass for improved quantization-aware training workflows.
- Multi-method support in export and ETRecord for edge dialect programs.
- PTE diff utility for comparing exported programs across versions.
- Various other bugfixes and improvements.
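The CSE pass above can be illustrated with a toy version over a flat, SSA-style op list. This is a sketch of the idea only, not the EXIR implementation, and it assumes all ops are pure (no side effects):

```python
# Toy common subexpression elimination: identical (op, inputs) pairs are
# computed once, and later uses are rewired to the first result.
def cse(ops):
    """ops: list of (dest, op_name, input_names) tuples, in program order."""
    seen = {}    # (op_name, inputs) -> dest of the first occurrence
    alias = {}   # eliminated dest -> the dest that replaces it
    out = []
    for dest, op_name, inputs in ops:
        # Rewire inputs through any aliases created so far.
        inputs = tuple(alias.get(i, i) for i in inputs)
        key = (op_name, inputs)
        if key in seen:
            alias[dest] = seen[key]   # redundant op: reuse earlier result
        else:
            seen[key] = dest
            out.append((dest, op_name, inputs))
    return out
```

A duplicated `add(x, y)` collapses to one op, and downstream consumers see the surviving name.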
Runtime
- `MmapUseMlockIgnoreErrors` as the default load mode for more robust memory-mapped loading.
- `LoadBackendOptionsMap` - New API for passing backend-specific configuration (performance hints, memory limits) when loading models, without modifying the exported program.
- Shared state in Module for more efficient multi-method execution.
- MemoryFormatOpsPass fix - Preserves input dim_order for `clone`/`to_copy` with no `memory_format` kwarg.
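The idea behind a load-time backend options map can be sketched in a few lines. This is hypothetical Python for illustration only; the real API is C++ and its names and shape may differ:

```python
# Sketch: per-backend key/value options supplied at load time, so the
# exported program itself never changes. Names here are illustrative.
class Backend:
    def __init__(self, name, defaults):
        self.name = name
        self.options = dict(defaults)

    def apply_load_options(self, options_map):
        # Only options addressed to this backend by name are applied.
        self.options.update(options_map.get(self.name, {}))

def load_model(backends, load_options=None):
    for b in backends:
        b.apply_load_options(load_options or {})
    return {b.name: b.options for b in backends}
```

The exported artifact stays fixed; deployment-specific tuning travels separately through the load call.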
Kernels & Operators
- Constexpr `Kernel` and `KernelKey` constructors for compile-time initialization and smaller binaries.
- Direct function pointers for `et_copy_index` and `et_view` in prim_ops, reducing dispatch overhead.
- ARM embedded platform compatibility added to all portable operators.
- Fix for int overflow in `Tensor.numel()`.
- Fix for uninitialized memory in the portable batch norm kernel.
Binary Size
- Roughly 10-15 KB reduction in core runtime code size, depending on toolchain.
- Suppress `.eh_frame` generation when `EXECUTORCH_OPTIMIZE_SIZE` is ON.
- Export `ET_LOG_ENABLED=0` as a public compile definition on CMake targets.
- Constexpr constructors enable compile-time initialization, eliminating static constructor overhead.
SDK / Pybindings
- Pybindings for `TextLLMRunner` for Python-based LLM inference.
- Custom data loader hooks, allowing you to provide custom data loaders in Python.
- Fixed and completed `pybindings.pyi` type stubs for better IDE support.
Backend Delegates
Arm (Ethos-U / TOSA)
- Ethos-U SDK 26.02, Vela 5.0.0, FVP version bumps for Corstone-300/320.
- VGF image classification flow with Ethos-U support and end-to-end documentation.
- LLM extension - Added TOSA and Ethos-U backend support for running LLMs on Arm.
- New operator support - `aten.erfinv`, PAD dialect op, TOSA shape ops, boolean attention masks, grouped transposed convolutions.
- aarch64-linux-musl build support for lightweight Linux deployments.
- Zephyr examples reorganized under `zephyr/samples/` with an STM Nucleo board example.
- Operator documentation for all supported ops.
Cadence/HiFi
- Channel-last conv kernels, im2row and transpose kernel fixes.
CoreML
- Deprecation warning added for the `to_edge` + `to_backend` workflow.
Cortex-M
- Cortex-M as a first-class target in `aot_arm_compiler` with dedicated documentation.
- Quantized int8 batch matmul and pad via CMSIS-NN.
- Improved PatternMatcher algorithm for more efficient quantizer pattern matching.
- Enabled portable kernel usage to broaden operator support.
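The pattern-matching idea behind the quantizer can be shown with a toy scan over a linear op sequence. The real PatternMatcher operates on graph nodes; this simplified list version only illustrates the matching step:

```python
# Toy quantizer pattern matching: find every contiguous occurrence of a
# fusable op pattern (e.g. conv2d -> batch_norm -> relu) in an op list.
def find_patterns(ops, pattern):
    """Return the start indices where `pattern` occurs as a contiguous run."""
    n, m = len(ops), len(pattern)
    return [i for i in range(n - m + 1) if ops[i:i + m] == pattern]
```

Each match site is then a candidate for replacement with a single fused, quantized kernel.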
CUDA
- SlimTensor integration - Replaced ETensor with SlimTensor as the internal tensor representation in the CUDA backend, providing lightweight tensor management and improved memory operations.
- CUDA stream sharing - Added `use_shared_cuda_stream` to share a single stream across multiple methods (e.g., encoder, decoder, sampler), ensuring proper ordering for the skip-copy optimization and removing unnecessary synchronization.
- Pybind integration - Integrated CUDA backend into the pybind build system with automatic CUDA detection.
- Bug fixes and platform support - Fixed Triton SDPA NaN with sparse boolean masks, output stride mismatch during delegate copy-back, and improved Windows/MinGW cross-compilation support.
Metal
- 4-bit quantized linear - Ported quantized GEMM kernels from MLX for int4 inference.
- Causal SDPA - Native causal scaled dot-product attention with GQA and hoisted mask computation, reducing memory and improving throughput for LLM inference.
- Buffer pool - Best-fit matching with LRU eviction for reduced memory allocation overhead.
- Dispatch pipelining - `commitAndContinue` overlaps GPU command encoding and execution, reducing latency for multi-op models.
- Linear-with-bias decompose pass for broader coverage.
- Voxtral Realtime with streaming mode and int4, bf16, and fp32 support.
- Build support for macOS SDK < 15.
- mmap weight prefetching - Prefetches weight blobs to eliminate page fault bottleneck, significantly reducing load time for large models.
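The best-fit-with-LRU-eviction strategy described for the buffer pool can be sketched in Python. This is an illustration of the policy, not the Metal backend's actual code:

```python
from collections import OrderedDict

class BufferPool:
    """Sketch of a buffer pool: best-fit reuse of freed buffers, with
    least-recently-used eviction when the pool is over capacity."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.free = OrderedDict()  # buf_id -> size, oldest (LRU) first
        self._next_id = 0

    def release(self, buf_id, size):
        # Returned buffers go to the back of the LRU list for reuse.
        self.free[buf_id] = size

    def acquire(self, size):
        # Best fit: the smallest free buffer that is large enough.
        fits = [(s, b) for b, s in self.free.items() if s >= size]
        if fits:
            _, buf_id = min(fits)
            del self.free[buf_id]
            return buf_id
        # No fit: evict least-recently-used free buffers to make room.
        while self.used + size > self.capacity and self.free:
            _, evicted_size = self.free.popitem(last=False)
            self.used -= evicted_size
        self.used += size
        self._next_id += 1
        return self._next_id
```

Reusing a freed buffer avoids both an allocation and, on the GPU side, the driver overhead that comes with it.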
NXP
- Neutron Kernel Registration: The Neutron backend now supports selective kernel registration, allowing you to register only the kernels needed for your specific models to decrease the memory footprint of the Neutron kernel library.
- Migration to eIQ Neutron SDK: The backend has transitioned to the self-contained eIQ Neutron SDK (eiq_neutron_sdk) package, replacing the legacy neutron_converter_SDK_<MCUX-ver>. The new SDK includes both the Neutron Converter and Neutron runtime for all NXP-supported platforms, simplifying backend deployment and removing the dependency on MCUXPresso SDK.
- Weight Prefetching: The Neutron backend supports fetching weights from external memory (such as flash) on NeutronC platforms. This feature enables model deployment on MCU-class SoCs where models are too large to fit in SRAM.
- Quantization Aware Training (QAT): The Neutron backend now supports Quantization Aware Training.
- Example Runtime for Neutron Backend: Introduced nxp_executor_runner, an example runner for the Neutron backend utilizing the Neutron simulator (eIQ NSYS).
- Channel-Last dim-order: Minor improvements in channel-last dim-order support.
- New Operations: `aten.avg_pool1d.default`, `aten.clamp.default`, `aten.div.Tensor`, `aten.leaky_relu.default`, `aten.neg.default`, `aten.prelu.default`, `aten.slice.default`, `aten._softmax.default`, `aten.squeeze.default`, `aten.squeeze.dim`, `aten.squeeze.dims`, `aten.unsqueeze`, `aten.upsample_bilinear2d.vec`, `aten.upsample_nearest2d.vec`
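At the heart of QAT is the fake-quantization step: values are rounded through the integer grid during the forward pass so the network learns around the quantization error. A minimal scalar sketch, purely illustrative:

```python
# Fake-quantize a float through an int8 grid and dequantize it back.
# The forward pass sees the quantization error; training adapts to it.
def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    q = max(qmin, min(qmax, q))          # clamp to the int8 range
    return (q - zero_point) * scale      # dequantize back to float
```

Values outside the representable range saturate at the clamp, which is exactly the behavior the trained model must tolerate on device.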
OpenVINO
- NNCF data-aware compression algorithms for `OVQuantizer`.
Qualcomm AI Engine Direct (QNN)
- Backend-awareness quantizer - Makes quantization decisions based on target hardware capabilities for better accuracy/performance trade-offs.
- CDSP Direct Mode - Direct compute DSP dispatch for lower-latency inference.
- SLC allocator for optimized memory management on Qualcomm SoCs.
- Attention sink - Support for attention sinks in long-context use cases.
- Bug fixes
  - Fixed KeyError in the InsertIOQDQ pass for LLM quantization.
  - Removed the workaround used to clean kwargs for the constant operation.
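The attention-sink cache policy can be sketched as follows: when the KV cache fills, keep the first few "sink" tokens plus the most recent window and drop the middle. This is an illustration of the policy only; the QNN delegate implements it internally:

```python
# Given the token positions currently cached, return the positions kept
# under an attention-sink policy: the first `num_sinks` tokens plus the
# trailing `window` tokens survive; everything in between is evicted.
def attention_sink_keep(positions, num_sinks, window):
    if len(positions) <= num_sinks + window:
        return list(positions)
    return list(positions[:num_sinks]) + list(positions[-window:])
```

Keeping the initial tokens matters because attention heads tend to dump probability mass on them; evicting them degrades long-context generation.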
Samsung
- Exynos 2600 (E9965) support documented.
Vulkan
- New Operators
  - Added limited support for the `aten.index.Tensor` op (1D self tensor with a single index tensor).
  - Added `aten.where.self` and `aten.bitwise_and.Tensor` ops; extended binary ops to support integer data types and comparison ops to correctly output bool dtype.
- Static Int8 Quantized Linear (q8ta_linear)
  - Added `q8ta_linear` operator for int8 quantized linear inference.
  - Added `q8ta_linear_gemv` specialized op for batch-1 int8 linear, with tree reduction for improved throughput.
- Static Int8 Quantized Convolution (q8ta_conv2d)
  - Rewrote quantized conv2d with layout-flexible shaders for depthwise, pointwise, im2col, and general convolution paths.
  - Added auto-selection logic to prefer the im2col vs. general path based on kernel parameters.
  - Added dynamic `PACKED_INT8_CONV2D` memory layout for device-adaptive conv2d performance.
  - Enabled im2col to handle grouped convolution and non-unit stride widths.
  - Added a software fallback for `dotPacked4x8AccSatEXT` on devices lacking the extension.
- Static Int8 Quantized Fusion & Unary Ops
  - Added fused `q8ta_relu` unary operator for int8x4 tensors.
  - Added `apply_relu` fusion support in q8ta conv operators.
  - Added layout-flexible quantize/dequantize operators.
  - Added layout-flexible quantized binary ops with support for mixed input layouts.
- Device Compatibility
  - Added float16 → float32 fallback for devices without 16-bit buffer support.
  - Added Vulkan 1.0 compatibility for extension feature querying.
  - Added fix for the Raspberry Pi 5 GPU.
- Performance & Profiling
  - Added additional profiling blocks for finer-grained performance analysis.
- Bug Fixes
  - Fixed mixed-dtype binary ops and comparison op padding bugs.
  - Fixed softmax NaN and depthwise conv correctness bugs.
  - Fixed missing memory barrier for first-use writes on aliased tensors.
- Export & Partitioning
  - Added per-operator dtype constraints to the op registry for more accurate partitioning.
  - Improved partitioning to skip unsupported dtypes.
  - Added support for `auto_functionalized_v2` in the export flow.
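The im2col path used by the quantized conv2d shaders can be illustrated in pure Python (single channel, stride 1): each output pixel's receptive field becomes a row, so convolution turns into a matrix product with the flattened kernel. The Vulkan shaders implement the same idea on the GPU with packed layouts:

```python
# Unfold an image into im2col rows, then convolve via dot products.
def im2col(img, kh, kw):
    h, w = len(img), len(img[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([img[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

def conv2d_via_im2col(img, kernel):
    kh, kw = len(kernel), len(kernel[0])
    flat_k = [kernel[i][j] for i in range(kh) for j in range(kw)]
    return [sum(a * b for a, b in zip(row, flat_k))
            for row in im2col(img, kh, kw)]
```

The trade-off is that im2col materializes overlapping patches (more memory traffic) in exchange for a single dense matmul, which is why an auto-selection heuristic between im2col and the general path pays off.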
XNNPACK
- Voxtral Realtime, Sortformer, Silero VAD, and Parakeet model support.
- Un-fused `batchnorm1d`/`batchnorm2d` via decomposition.
- Fix for `mul + ReLU` fusion.
- Fix for infinite loop in `XNNExecutor::resize_outputs`.
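The batchnorm decomposition rests on the fact that an inference-time batch norm is just a per-channel affine transform, y = x * s + b, with s and b precomputed from the trained statistics. A scalar sketch of the math (not the XNNPACK lowering itself):

```python
import math

def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """Reference inference-time batch norm for one channel."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

def decompose_batchnorm(gamma, beta, mean, var, eps=1e-5):
    """Fold the statistics into a single scale s and bias b: y = x * s + b."""
    s = gamma / math.sqrt(var + eps)
    return s, beta - mean * s
```

Once decomposed, the op lowers to a plain multiply-add that existing fused kernels already handle.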
Platforms
Android
- Zero-copy ByteBuffer image prefill API for `LlmModule`, reducing memory copies during multimodal inference.
- Java ASR Module binding for on-device speech recognition.
- Thread safety fix for LLM token buffer in JNI.
- Fix for JNI scalar EValue input reading and string serialization.
- `NThreadsGuard` support for controlling thread count.
- Removed deprecated `jcenter()` repository.
Windows
- Windows wheel builds with OpenSSL submodule support.
- End-to-end CUDA Windows CI for Voxtral Realtime and Sortformer.
- Windows schannel and git SSL backend fixes.
LLM & Model Enablement
LLM Runners
- LoRA - New `LoraConfig` and `MultimethodLoraConfig` for structured on-device model personalization.
- Multi-method support in export and runner for serving multiple model methods from a single `.pte` file.
- Unified `MultimodalRunner` under `IRunner` with multimodal prefill support.
- `--num_bos`/`--num_eos` wired into `GenerationConfig`.
- Transform passes configurable via `to_edge_transform_and_lower` in etllm.
Speech / Audio
- Consolidated `AsrModule` API - A single synchronous `transcribe()` method simplifies on-device speech recognition.
- Fix for mel spectrogram preprocessor allocating gigabytes of planned memory.
DevTools
- Devtools debuggability tutorial for getting started with model debugging.
- Support for recording intermediate outputs of in-place ops, for debugging in-place operations.
- Stack trace information available in the `calculate_numeric_gap` API result.
- XNNPACK backend support in the devtools example runner.
- ETDump out-of-memory message improvements.
- Updated torch-export guide.
Security Fixes
Users processing untrusted inputs (especially audio) should upgrade promptly.
- Heap-buffer-overflow in WAV file loader - Could cause crashes or memory corruption when loading malformed WAV files.
- Stack buffer overflow in `prepare_input_tensors()` - Unchecked memcpy size could overflow stack buffers with oversized inputs.
- Infinite loop in `XNNExecutor::resize_outputs()` - Could cause hangs during model execution with certain input shapes.
- Int overflow in `Tensor.numel()` - Could cause incorrect behavior for very large tensors.
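The numel overflow class of bug comes from multiplying dimension sizes in a fixed-width integer without a bound check. A sketch of the overflow-safe pattern (Python ints don't overflow, so `INT32_MAX` stands in for the C++ limit):

```python
# Multiply dimension sizes with an explicit bound check instead of letting
# a fixed-width product silently wrap around.
INT32_MAX = 2**31 - 1

def checked_numel(sizes):
    n = 1
    for s in sizes:
        n *= s
        if n > INT32_MAX:
            raise OverflowError("numel overflows int32")
    return n
```

In C++ the same check is typically done by dividing the limit by the running product before each multiply, since the wrapped value itself is unusable.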
Deprecations
- MPS backend deprecated - The MPS backend is deprecated and will be removed in a future release (v1.4.0).
- CoreML `to_edge` + `to_backend` workflow - A deprecation warning has been added; migrate to the new workflow.
Documentation
- Cortex-M backend documentation added to the docs site.
- Arm backend operator support documentation for all supported ops.
- Ethos-U and VGF image classification examples documented.
- NXP backend documentation updated for v1.2, including QAT guide.
- Voxtral Realtime README with CPU/CUDA/Metal workflows and demo videos.
- Devtools debuggability tutorial.
Contributors
We welcome 19 first-time contributors to ExecuTorch in this release:
@ares89, @corporateshark, @sarah-blades, @NickJLange, @rezimeta, @kamalkraj, @mohammed-saalim, @abdelaziz-mahdy, @GeorgeTzoupis, @annietllnd, @elpdumont, @tjamula, @julianchan-meta, @NickCao, @KamilMatejuk, @ahmet-f-gumustas, @nefainl, @Phineas1500, @sneakyHulk
Full Changelog: v1.1.0...v1.2.0