ExecuTorch v1.2.0

Highlights

ExecuTorch v1.2.0 expands on-device AI to more models and more hardware. This release adds real-time speech inference with Voxtral Realtime, promotes Cortex-M to a first-class embedded target, delivers major backend improvements, and reduces binary size for resource-constrained deployments.

Aligned with PyTorch 2.11, TorchAudio 2.11, TorchVision 0.26, TorchCodec 0.11, TorchAO 0.17, and PyTorch-Tokenizers 1.2.

New model support - Voxtral Realtime (streaming speech), NVIDIA NeMo Sortformer (speaker diarization), Silero VAD (voice activity detection), and Qwen3.5 now exportable and runnable across multiple backends.

Cortex-M as a first-class target - Dedicated backend with CMSIS-NN integration, quantized int8 batch matmul and pad, improved pattern matching, and portable kernel usage for broader operator support.

Metal backend - 4-bit quantized inference via MLX-derived GEMM kernels, native causal SDPA with GQA, GPU buffer pool with LRU eviction, dispatch pipelining for lower latency, and mmap weight prefetching for faster model loading.

CUDA backend - SlimTensor integration for lightweight tensor management, CUDA stream sharing across methods for skip-copy optimization, and pybind integration with automatic CUDA detection.

Vulkan backend - Comprehensive int8 quantized inference with new linear, convolution, and fused operators, layout-flexible shaders, and improved device compatibility including fp16 fallback and Vulkan 1.0 support.

Arm backend (Ethos-U / TOSA) - LLM support via TOSA and Ethos-U backends, VGF image classification flow, Ethos-U SDK 26.02 with Vela 5.0.0, aarch64-linux-musl build support, and new operators including grouped transposed convolutions and boolean attention masks.

Qualcomm AI Engine Direct (QNN) - Backend-awareness quantizer for hardware-targeted quantization, CDSP Direct Mode for lower-latency DSP dispatch, SLC allocator, and attention sink for long context support.

NXP backend - Migration to eIQ Neutron SDK, selective kernel registration for smaller footprint, weight prefetching from external memory to SRAM, QAT support, and 14 new operators.

Faster serialization - New flatbuffer Program serialization in EXIR improves export performance.

Smaller binaries - Compile-time optimizations (.eh_frame suppression, constexpr kernel constructors, log disabling) reduce binary size for embedded targets.

LoRA and multi-method - Structured LoraConfig and MultimethodLoraConfig enable on-device model personalization. Multi-method support is now wired through export, runner, and ETRecord.

MPS backend deprecated - The MPS backend is deprecated and will be removed in a future release (v1.4.0).


New Models

  • Voxtral Realtime - Mistral's real-time speech model with true streaming inference on XNNPACK, Metal, and CUDA. Includes int4 quantization and fp32/bf16 support, live microphone input, and pre-exported .pte files for quick start.
  • Qwen3.5 - Export support for 0.8B, 2B, and 4B variants. Also added Qwen2.5-Coder.
  • NVIDIA NeMo Parakeet - Now runs on XNNPACK by default, with int4 quantization, Metal int4 via HQQ, bfloat16, and CUDA support.
  • NVIDIA NeMo Sortformer - Speaker diarization model on XNNPACK and CUDA.
  • Silero VAD - Voice Activity Detection on XNNPACK for speech endpoint detection.

Core Components

EXIR / Export

  • Flatbuffer Program serialization for faster export and model loading.
  • CSE (Common Subexpression Elimination) pass eliminates redundant computations in exported models, reducing model size and improving inference speed. Includes hierarchical infrastructure for nested graph optimization.
  • Unified fusion passes across backends for more consistent optimization.
  • QAT ConvBN fuse pass and quantize fused ConvBN bias pass for improved quantization-aware training workflows.
  • Multi-method support in export and ETRecord for edge dialect programs.
  • PTE diff utility for comparing exported programs across versions.
  • Various other bugfixes and improvements.
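
The CSE pass above deduplicates structurally identical subexpressions in the exported graph. A minimal sketch of the idea in plain Python over a toy node list (not ExecuTorch's actual pass, which operates on FX graphs):

```python
def cse(nodes):
    """Deduplicate structurally identical (op, *args) nodes.

    nodes: list of (name, op, arg_names) tuples in topological order.
    Returns an alias map (name -> canonical name) and the kept nodes.
    """
    canonical = {}   # (op, resolved_args) -> canonical node name
    alias = {}       # name -> canonical name
    kept = []
    for name, op, args in nodes:
        resolved = tuple(alias.get(a, a) for a in args)
        key = (op, resolved)
        if key in canonical:
            alias[name] = canonical[key]   # reuse the earlier result
        else:
            canonical[key] = name
            alias[name] = name
            kept.append((name, op, resolved))
    return alias, kept

# x + y is computed twice; the second add collapses onto the first.
alias, kept = cse([
    ("t1", "add", ("x", "y")),
    ("t2", "add", ("x", "y")),
    ("t3", "mul", ("t1", "t2")),
])
```

Here `t2` aliases to `t1`, and `t3` is rewritten to `mul(t1, t1)`, shrinking the graph by one node.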

Runtime

  • MmapUseMlockIgnoreErrors is now the default load mode, for more robust memory-mapped loading.
  • LoadBackendOptionsMap - New API for passing backend-specific configuration (performance hints, memory limits) when loading models, without modifying the exported program.
  • Shared state in Module for more efficient multi-method execution.
  • MemoryFormatOpsPass fix - Preserves input dim_order for clone/to_copy with no memory_format kwarg.

Kernels & Operators

  • Constexpr Kernel and KernelKey constructors for compile-time initialization and smaller binaries.
  • Direct function pointers for et_copy_index and et_view in prim_ops, reducing dispatch overhead.
  • ARM embedded platform compatibility added to all portable operators.
  • Fix for int overflow in Tensor.numel().
  • Fix for uninitialized memory in portable batch norm kernel.
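
The `Tensor.numel()` fix addresses silent overflow when multiplying dimension sizes in a signed 64-bit accumulator. A conceptual sketch of the overflow-checked multiply (Python ints don't overflow, so this simulates the int64 bound; it is not the actual kernel code):

```python
INT64_MAX = 2**63 - 1

def checked_numel(sizes):
    """Multiply dimension sizes, raising instead of silently
    wrapping a signed 64-bit accumulator."""
    n = 1
    for s in sizes:
        if s < 0:
            raise ValueError("negative dimension size")
        if s != 0 and n > INT64_MAX // s:
            raise OverflowError("numel overflows int64")
        n *= s
    return n
```

The pre-multiplication division check (`n > INT64_MAX // s`) detects overflow without ever computing an out-of-range product.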

Binary Size

  • Roughly a 10-15 KB reduction in core runtime code size, depending on toolchain.
  • Suppress .eh_frame generation when EXECUTORCH_OPTIMIZE_SIZE is ON.
  • Export ET_LOG_ENABLED=0 as public compile definition on CMake targets.
  • Constexpr constructors enable compile-time initialization, eliminating static constructor overhead.

SDK / Pybindings

  • Pybindings for TextLLMRunner for Python-based LLM inference.
  • Custom data loader hooks allow you to provide custom data loaders from Python.
  • Fixed and completed pybindings.pyi type stubs for better IDE support.

Backend Delegates

Arm (Ethos-U / TOSA)

  • Ethos-U SDK 26.02, Vela 5.0.0, FVP version bumps for Corstone-300/320.
  • VGF image classification flow with Ethos-U support and end-to-end documentation.
  • LLM extension - Added TOSA and Ethos-U backend support for running LLMs on Arm.
  • New operator support - aten.erfinv, PAD dialect op, TOSA shape ops, boolean attention masks, grouped transposed convolutions.
  • aarch64-linux-musl build support for lightweight Linux deployments.
  • Zephyr examples reorganized under zephyr/samples/ with STM Nucleo board example.
  • Operator documentation for all supported ops.

Cadence/HiFi

  • Channel-last conv kernels; fixes for im2row and transpose kernels.

CoreML

  • Deprecation warning added for to_edge + to_backend workflow.

Cortex-M

  • Cortex-M as first-class target in aot_arm_compiler with dedicated documentation.
  • Quantized int8 batch matmul and pad via CMSIS-NN.
  • Improved PatternMatcher algorithm for more efficient quantizer pattern matching.
  • Enabled portable kernel usage to broaden operator support.
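
The quantized int8 operators above rely on affine quantization of activations and weights. A minimal per-tensor symmetric sketch of the idea (conceptual only; real CMSIS-NN flows also carry zero points and requantization shifts):

```python
def quantize_int8(xs):
    """Per-tensor symmetric int8 quantization: map the largest
    magnitude onto [-127, 127]."""
    m = max(abs(x) for x in xs)
    scale = m / 127.0 if m > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate real values from int8 codes."""
    return [v * scale for v in q]
```

The quantization error per element is bounded by half the scale, which is what makes int8 matmul accurate enough for well-calibrated models.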

CUDA

  • SlimTensor integration - Replaced ETensor with SlimTensor as the internal tensor representation for the CUDA backend, enabling lightweight tensor management and improved memory operations.
  • CUDA stream sharing - Added use_shared_cuda_stream to share a single stream across multiple methods (e.g., encoder, decoder, sampler), ensuring proper ordering for the skip-copy optimization and removing unnecessary synchronization.
  • Pybind integration - Integrated CUDA backend into the pybind build system with automatic CUDA detection.
  • Bug fixes and platform support - Fixed Triton SDPA NaN with sparse boolean masks, output stride mismatch during delegate copy-back, and improved Windows/MinGW cross-compilation support.

Metal

  • 4-bit quantized linear - Ported quantized GEMM kernels from MLX for int4 inference.
  • Causal SDPA - Native causal scaled dot-product attention with GQA and hoisted mask computation, reducing memory and improving throughput for LLM inference.
  • Buffer pool - Best-fit matching with LRU eviction for reduced memory allocation overhead.
  • Dispatch pipelining - commitAndContinue overlaps GPU command encoding and execution, reducing latency for multi-op models.
  • Linear with bias decompose pass for broader coverage.
  • Voxtral Realtime with streaming mode and int4, bf16, and fp32 support.
  • Build support for macOS SDK < 15.
  • mmap weight prefetching - Prefetches weight blobs to eliminate page fault bottleneck, significantly reducing load time for large models.
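
The buffer pool's best-fit-with-LRU-eviction strategy can be sketched in a few lines. This is a toy illustration of the policy described above; the names and structure are assumptions, not the Metal backend's actual implementation:

```python
from collections import OrderedDict

class BufferPool:
    """Toy GPU buffer pool: best-fit reuse with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0                # bytes in live + pooled buffers
        self.free = OrderedDict()    # buf_id -> size, oldest first

    def release(self, buf_id, size):
        """Return a buffer to the pool (most recently used at the end)."""
        self.free[buf_id] = size

    def acquire(self, size):
        """Best fit: reuse the smallest pooled buffer that is large
        enough; otherwise evict LRU buffers and allocate fresh."""
        fits = [(s, b) for b, s in self.free.items() if s >= size]
        if fits:
            _, buf_id = min(fits)
            del self.free[buf_id]
            return buf_id
        while self.used + size > self.capacity and self.free:
            _, evicted_size = self.free.popitem(last=False)
            self.used -= evicted_size
        self.used += size
        return f"new-{size}"
```

Best-fit matching minimizes wasted bytes when reusing a buffer, while LRU eviction frees the buffers least likely to be requested again.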

NXP

  • Neutron Kernel Registration: The Neutron backend now supports selective kernel registration, letting you register only the kernels your models need to shrink the footprint of the Neutron kernel library.
  • Migration to eIQ Neutron SDK: The backend has transitioned to the self-contained eIQ Neutron SDK (eiq_neutron_sdk) package, replacing the legacy neutron_converter_SDK_<MCUX-ver>. The new SDK includes both the Neutron Converter and Neutron runtime for all NXP-supported platforms, simplifying backend deployment and removing the dependency on MCUXPresso SDK.
  • Weight Prefetching: The Neutron backend supports fetching weights from external memory (such as flash) on NeutronC platforms. This feature enables model deployment on MCU-class SoCs where models are too large to fit in SRAM.
  • Quantization Aware Training (QAT): The Neutron backend now supports Quantization Aware Training.
  • Example Runtime for Neutron Backend: Introduced nxp_executor_runner, an example runner for the Neutron backend that uses the Neutron simulator (eIQ NSYS).
  • Channel-Last dim-order: Minor improvements in channel-last dim-order support.
  • New Operations: `aten.avg_pool1d.default`, `aten.clamp.default`, `aten.div.Tensor`, `aten.leaky_relu.default`, `aten.neg.default`, `aten.prelu.default`, `aten.slice.default`, `aten._softmax.default`, `aten.squeeze.default`, `aten.squeeze.dim`, `aten.squeeze.dims`, `aten.unsqueeze`, `aten.upsample_bilinear2d.vec`, `aten.upsample_nearest2d.vec`

OpenVINO

  • NNCF data-aware compression algorithms for OVQuantizer.

Qualcomm AI Engine Direct (QNN)

  • Backend-awareness quantizer - Makes quantization decisions based on target hardware capabilities for better accuracy/performance trade-offs.
  • CDSP Direct Mode - Direct compute DSP dispatch for lower-latency inference.
  • SLC allocator for optimized memory management on Qualcomm SoCs.
  • Attention sink - Support for attention sinks to enable long-context use cases.
  • Bug Fixes
    • Fix KeyError in InsertIOQDQ pass for LLM quantization
    • Remove the workaround used to clean the kwargs for the constant operation
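
An attention sink keeps the first few "sink" tokens in the KV cache permanently while the rest of the cache slides, which preserves generation quality at long context lengths. A conceptual sketch of the retention policy (not the QNN backend's implementation):

```python
def attention_sink_positions(cache_len, num_sinks, window):
    """Which KV-cache positions to keep under an attention-sink
    policy: always retain the first `num_sinks` tokens plus a
    sliding window of the most recent `window` tokens."""
    if cache_len <= num_sinks + window:
        return list(range(cache_len))     # everything still fits
    recent = range(cache_len - window, cache_len)
    return list(range(num_sinks)) + list(recent)
```

For example, with 2 sink tokens and a window of 4, a 10-token cache keeps positions 0-1 and 6-9, bounding memory regardless of sequence length.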

Samsung

  • Exynos 2600 (E9965) support documented.

Vulkan

  • New Operators
    • Added limited support for aten.index.Tensor op (1D self tensor with a single index tensor)
    • Added aten.where.self and aten.bitwise_and.Tensor ops; extended binary ops to support integer data-types and comparison ops to correctly output bool dtype
  • Static Int8 Quantized Linear (q8ta_linear)
    • Added q8ta_linear operator for int8 quantized linear inference
    • Added q8ta_linear_gemv specialized op for batch-1 int8 linear, with tree reduction for improved throughput
  • Static Int8 Quantized Convolution (q8ta_conv2d)
    • Rewrote quantized conv2d with layout-flexible shaders for depthwise, pointwise, im2col, and general convolution paths
    • Added auto-selection logic to prefer im2col vs. general path based on kernel parameters
    • Added dynamic PACKED_INT8_CONV2D memory layout for device-adaptive conv2d performance
    • Enabled im2col to handle grouped convolution and non-unit stride widths
    • Added software fallback for dotPacked4x8AccSatEXT on devices lacking the extension
  • Static Int8 Quantized Fusion & Unary Ops
    • Added fused q8ta_relu unary operator for int8x4 tensors
    • Added apply_relu fusion support in q8ta conv operators
    • Added layout-flexible quantize/dequantize operators
    • Added layout-flexible quantized binary ops with support for mixed input layouts
  • Device Compatibility
    • Added float16 → float32 fallback for devices without 16-bit buffer support
    • Added Vulkan 1.0 compatibility for extension feature querying
    • Added fix for Raspberry Pi 5 GPU
  • Performance & Profiling
    • Added additional profiling blocks for finer-grained performance analysis
  • Bug Fixes
    • Fixed mixed-dtype binary ops and comparison op padding bugs
    • Fixed softmax NaN and depthwise conv correctness bugs
    • Fixed missing memory barrier for first-use writes on aliased tensors
  • Export & Partitioning
    • Added per-operator dtype constraints to op registry for more accurate partitioning
    • Improved partitioning to skip unsupported dtypes
    • Added support for auto_functionalized_v2 in export flow
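
The tree reduction used by q8ta_linear_gemv sums a dot product in O(log n) pairwise levels rather than one long sequential chain, which is the shape GPUs need to expose parallelism. A scalar Python sketch of the pattern (not the Vulkan shader itself):

```python
def tree_sum(xs):
    """Pairwise (tree) reduction: halve the list each level."""
    vals = list(xs)
    while len(vals) > 1:
        if len(vals) % 2:            # odd length: pad with identity
            vals.append(0)
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0

def gemv(matrix, vector):
    """y = A @ x, with each row's dot product reduced as a tree."""
    return [tree_sum(a * b for a, b in zip(row, vector)) for row in matrix]
```

In a shader, each level of the tree runs in parallel across invocations, so an n-element reduction takes log2(n) steps instead of n.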

XNNPACK

  • Voxtral Realtime, Sortformer, Silero VAD, and Parakeet model support.
  • Un-fused batchnorm1d/batchnorm2d via decomposition.
  • Fix for mul + ReLU fusion.
  • Fix for infinite loop in XNNExecutor::resize_outputs.
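
The batchnorm decomposition works because at inference time batch norm is an affine transform that folds into one per-channel multiply and add. A sketch of the math (the folding idea only, not XNNPACK's pass):

```python
import math

def batchnorm_scale_shift(mean, var, gamma, beta, eps=1e-5):
    """Fold inference-time batch norm into per-channel scale/shift:
        y = gamma * (x - mean) / sqrt(var + eps) + beta
          = scale * x + shift
    """
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    shift = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return scale, shift
```

Once folded, the un-fused batchnorm lowers to elementwise multiply and add ops the backend already supports.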

Platforms

Android

  • Zero-copy ByteBuffer image prefill API for LlmModule, reducing memory copies during multimodal inference.
  • Java ASR Module binding for on-device speech recognition.
  • Thread safety fix for LLM token buffer in JNI.
  • Fix for JNI scalar EValue input reading and string serialization.
  • NThreadsGuard support for controlling thread count.
  • Removed deprecated jcenter() repository.

Windows

  • Windows wheel builds with OpenSSL submodule support.
  • End-to-end CUDA Windows CI for Voxtral Realtime and Sortformer.
  • Windows schannel and git SSL backend fixes.

LLM & Model Enablement

LLM Runners

  • LoRA - New LoraConfig and MultimethodLoraConfig for structured on-device model personalization.
  • Multi-method support in export and runner for serving multiple model methods from a single .pte file.
  • Unified MultimodalRunner under IRunner with multimodal prefill support.
  • --num_bos/--num_eos wired into GenerationConfig.
  • Transform passes configurable via to_edge_transform_and_lower in etllm.
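
LoRA personalizes a model by adding a low-rank update to selected weight matrices: W' = W + alpha * (B @ A), where B is (out, r) and A is (r, in) with small rank r. A plain-Python sketch of that standard LoRA math (LoraConfig itself just describes where such adapters attach; this is not ExecuTorch's code):

```python
def lora_merge(W, A, B, alpha=1.0):
    """Merge a LoRA adapter into a base weight: W' = W + alpha * (B @ A).
    W is (out, in), B is (out, r), A is (r, in)."""
    out, inner = len(W), len(W[0])
    r = len(A)
    merged = [row[:] for row in W]       # copy, don't mutate the base
    for i in range(out):
        for j in range(inner):
            merged[i][j] += alpha * sum(B[i][k] * A[k][j] for k in range(r))
    return merged
```

Because r is small, the adapter stores far fewer parameters than W while still shifting the model's behavior; at deploy time the update can be merged as above or kept separate per method.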

Speech / Audio

  • Consolidated AsrModule API - Single synchronous transcribe() method simplifies on-device speech recognition.
  • Fix for mel spectrogram preprocessor allocating gigabytes of planned memory.

DevTools

  • Devtools debuggability tutorial for getting started with model debugging.
  • Support inplace-op intermediate output recording for debugging in-place operations.
  • Stacktraces information available in calculate_numeric_gap API result.
  • XNNPACK backend support in devtool example runner.
  • ETDump out-of-memory message improvements.
  • Updated torch-export guide.

Security Fixes

Users processing untrusted inputs (especially audio) should upgrade promptly.

  • Heap-buffer-overflow in WAV file loader - Could cause crashes or memory corruption when loading malformed WAV files.
  • Stack buffer overflow in prepare_input_tensors() - Unchecked memcpy size could overflow stack buffers with oversized inputs.
  • Infinite loop in XNNExecutor::resize_outputs() - Could cause hangs during model execution with certain input shapes.
  • Int overflow in Tensor.numel() - Could cause incorrect behavior for very large tensors.

Deprecations

  • MPS backend deprecated - The MPS backend is deprecated and will be removed in a future release (v1.4.0).
  • CoreML to_edge + to_backend workflow - A deprecation warning has been added. Migrate to the new workflow.

Documentation

  • Cortex-M backend documentation added to the docs site.
  • Arm backend operator support documentation for all supported ops.
  • Ethos-U and VGF image classification examples documented.
  • NXP backend documentation updated for v1.2, including QAT guide.
  • Voxtral Realtime README with CPU/CUDA/Metal workflows and demo videos.
  • Devtools debuggability tutorial.

Contributors

We welcome 19 first-time contributors to ExecuTorch in this release:

@ares89, @corporateshark, @sarah-blades, @NickJLange, @rezimeta, @kamalkraj, @mohammed-saalim, @abdelaziz-mahdy, @GeorgeTzoupis, @annietllnd, @elpdumont, @tjamula, @julianchan-meta, @NickCao, @KamilMatejuk, @ahmet-f-gumustas, @nefainl, @Phineas1500, @sneakyHulk

Full Changelog: v1.1.0...v1.2.0
