ExecuTorch v1.2.0

Highlights

ExecuTorch v1.2.0 expands on-device AI to more models and more hardware. This release adds real-time speech inference with Voxtral Realtime, promotes Cortex-M to a first-class embedded target, delivers major backend improvements, and reduces binary size for resource-constrained deployments.

Aligned with PyTorch 2.11, TorchAudio 2.11, TorchVision 0.26, TorchCodec 0.11, TorchAO 0.17, and PyTorch-Tokenizers 1.2.

New model support - Voxtral Realtime (streaming speech), NVIDIA NeMo Sortformer (speaker diarization), Silero VAD (voice activity detection), and Qwen3.5 now exportable and runnable across multiple backends.

Cortex-M as a first-class target - Dedicated backend with CMSIS-NN integration, quantized int8 batch matmul and pad, improved pattern matching, and portable kernel usage for broader operator support.

Metal backend - 4-bit quantized inference via MLX-derived GEMM kernels, native causal SDPA with GQA, GPU buffer pool with LRU eviction, dispatch pipelining for lower latency, and mmap weight prefetching for faster model loading.

CUDA backend - SlimTensor integration for lightweight tensor management, CUDA stream sharing across methods for skip-copy optimization, and pybind integration with automatic CUDA detection.

Vulkan backend - Comprehensive int8 quantized inference with new linear, convolution, and fused operators, layout-flexible shaders, and improved device compatibility including fp16 fallback and Vulkan 1.0 support.

Arm backend (Ethos-U / TOSA) - LLM support via TOSA and Ethos-U backends, VGF image classification flow, Ethos-U SDK 26.02 with Vela 5.0.0, aarch64-linux-musl build support, and new operators including grouped transposed convolutions and boolean attention masks.

Qualcomm AI Engine Direct (QNN) - Backend-awareness quantizer for hardware-targeted quantization, CDSP Direct Mode for lower-latency DSP dispatch, SLC allocator, and attention sink for long context support.

NXP backend - Migration to eIQ Neutron SDK, selective kernel registration for smaller footprint, weight prefetching from external memory to SRAM, QAT support, and 14 new operators.

Faster serialization - New flatbuffer Program serialization in EXIR improves export performance.

Smaller binaries - Compile-time optimizations (.eh_frame suppression, constexpr kernel constructors, log disabling) reduce binary size for embedded targets.

LoRA and multi-method - Structured LoraConfig and MultimethodLoraConfig enable on-device model personalization. Multi-method support is now wired through export, runner, and ETRecord.

MPS backend deprecated - The MPS backend is deprecated and will be removed in a future release (v1.4.0).


New Models

  • Voxtral Realtime - Mistral's real-time speech model with true streaming inference on XNNPACK, Metal, and CUDA. Includes int4 quantization and fp32/bf16 support, live microphone input, and pre-exported .pte files for quick start.
  • Qwen3.5 - Export support for 0.8B, 2B, and 4B variants. Also added Qwen2.5-Coder.
  • NVIDIA NeMo Parakeet - Now runs on XNNPACK by default, with int4 quantization, Metal int4 via HQQ, bfloat16, and CUDA support.
  • NVIDIA NeMo Sortformer - Speaker diarization model on XNNPACK and CUDA.
  • Silero VAD - Voice Activity Detection on XNNPACK for speech endpoint detection.

Core Components

EXIR / Export

  • Flatbuffer Program serialization for faster export and model loading.
  • CSE (Common Subexpression Elimination) pass eliminates redundant computations in exported models, reducing model size and improving inference speed. Includes hierarchical infrastructure for nested graph optimization.
  • Unified fusion passes across backends for more consistent optimization.
  • QAT ConvBN fuse pass and quantize fused ConvBN bias pass for improved quantization-aware training workflows.
  • Multi-method support in export and ETRecord for edge dialect programs.
  • PTE diff utility for comparing exported programs across versions.
  • Various other bugfixes and improvements.
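
The CSE pass above deduplicates structurally identical subexpressions in the exported graph. A minimal sketch of the idea in plain Python over a toy node list (not ExecuTorch's actual pass, which operates on FX graphs):

```python
def cse(nodes):
    """Deduplicate structurally identical (op, *args) nodes.

    nodes: list of (name, op, arg_names) tuples in topological order.
    Returns an alias map (name -> canonical name) and the kept nodes.
    """
    canonical = {}   # (op, resolved_args) -> canonical node name
    alias = {}       # name -> canonical name
    kept = []
    for name, op, args in nodes:
        resolved = tuple(alias.get(a, a) for a in args)
        key = (op, resolved)
        if key in canonical:
            alias[name] = canonical[key]   # reuse the earlier result
        else:
            canonical[key] = name
            alias[name] = name
            kept.append((name, op, resolved))
    return alias, kept

# x + y is computed twice; the second add collapses onto the first.
alias, kept = cse([
    ("t1", "add", ("x", "y")),
    ("t2", "add", ("x", "y")),
    ("t3", "mul", ("t1", "t2")),
])
```

Here `t2` aliases to `t1`, and `t3` is rewritten to `mul(t1, t1)`, shrinking the graph by one node.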

Runtime

  • MmapUseMlockIgnoreErrors is now the default load mode, for more robust memory-mapped loading.
  • LoadBackendOptionsMap - New API for passing backend-specific configuration (performance hints, memory limits) when loading models, without modifying the exported program.
  • Shared state in Module for more efficient multi-method execution.
  • MemoryFormatOpsPass fix - Preserves input dim_order for clone/to_copy with no memory_format kwarg.

Kernels & Operators

  • Constexpr Kernel and KernelKey constructors for compile-time initialization and smaller binaries.
  • Direct function pointers for et_copy_index and et_view in prim_ops, reducing dispatch overhead.
  • ARM embedded platform compatibility added to all portable operators.
  • Fix for int overflow in Tensor.numel().
  • Fix for uninitialized memory in portable batch norm kernel.
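
The `Tensor.numel()` fix addresses silent overflow when multiplying dimension sizes in a signed 64-bit accumulator. A conceptual sketch of the overflow-checked multiply (Python ints don't overflow, so this simulates the int64 bound; it is not the actual kernel code):

```python
INT64_MAX = 2**63 - 1

def checked_numel(sizes):
    """Multiply dimension sizes, raising instead of silently
    wrapping a signed 64-bit accumulator."""
    n = 1
    for s in sizes:
        if s < 0:
            raise ValueError("negative dimension size")
        if s != 0 and n > INT64_MAX // s:
            raise OverflowError("numel overflows int64")
        n *= s
    return n
```

The pre-multiplication division check (`n > INT64_MAX // s`) detects overflow without ever computing an out-of-range product.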

Binary Size

  • Roughly a 10-15 KB reduction in core runtime code size, depending on toolchain.
  • Suppress .eh_frame generation when EXECUTORCH_OPTIMIZE_SIZE is ON.
  • Export ET_LOG_ENABLED=0 as public compile definition on CMake targets.
  • Constexpr constructors enable compile-time initialization, eliminating static constructor overhead.

SDK / Pybindings

  • Pybindings for TextLLMRunner for Python-based LLM inference.
  • Custom data loader hooks allow you to provide custom data loaders from Python.
  • Fixed and completed pybindings.pyi type stubs for better IDE support.

Backend Delegates

Arm (Ethos-U / TOSA)

  • Ethos-U SDK 26.02, Vela 5.0.0, FVP version bumps for Corstone-300/320.
  • VGF image classification flow with Ethos-U support and end-to-end documentation.
  • LLM extension - Added TOSA and Ethos-U backend support for running LLMs on Arm.
  • New operator support - aten.erfinv, PAD dialect op, TOSA shape ops, boolean attention masks, grouped transposed convolutions.
  • aarch64-linux-musl build support for lightweight Linux deployments.
  • Zephyr examples reorganized under zephyr/samples/ with STM Nucleo board example.
  • Operator documentation for all supported ops.

Cadence/HiFi

  • Channel-last conv kernels; fixes for im2row and transpose kernels.

CoreML

  • Deprecation warning added for to_edge + to_backend workflow.

Cortex-M

  • Cortex-M as first-class target in aot_arm_compiler with dedicated documentation.
  • Quantized int8 batch matmul and pad via CMSIS-NN.
  • Improved PatternMatcher algorithm for more efficient quantizer pattern matching.
  • Enabled portable kernel usage to broaden operator support.
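
The quantized int8 operators above rely on affine quantization of activations and weights. A minimal per-tensor symmetric sketch of the idea (conceptual only; real CMSIS-NN flows also carry zero points and requantization shifts):

```python
def quantize_int8(xs):
    """Per-tensor symmetric int8 quantization: map the largest
    magnitude onto [-127, 127]."""
    m = max(abs(x) for x in xs)
    scale = m / 127.0 if m > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate real values from int8 codes."""
    return [v * scale for v in q]
```

The quantization error per element is bounded by half the scale, which is what makes int8 matmul accurate enough for well-calibrated models.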

CUDA

  • SlimTensor integration - Replaced ETensor with SlimTensor as the internal tensor representation for the CUDA backend, enabling lightweight tensor management and improved memory operations.
  • CUDA stream sharing - Added use_shared_cuda_stream to share a single stream across multiple methods (e.g., encoder, decoder, sampler), ensuring proper ordering for the skip-copy optimization and removing unnecessary synchronization.
  • Pybind integration - Integrated CUDA backend into the pybind build system with automatic CUDA detection.
  • Bug fixes and platform support - Fixed Triton SDPA NaN with sparse boolean masks, output stride mismatch during delegate copy-back, and improved Windows/MinGW cross-compilation support.

Metal

  • 4-bit quantized linear - Ported quantized GEMM kernels from MLX for int4 inference.
  • Causal SDPA - Native causal scaled dot-product attention with GQA and hoisted mask computation, reducing memory and improving throughput for LLM inference.
  • Buffer pool - Best-fit matching with LRU eviction for reduced memory allocation overhead.
  • Dispatch pipelining - commitAndContinue overlaps GPU command encoding and execution, reducing latency for multi-op models.
  • Linear with bias decompose pass for broader coverage.
  • Voxtral Realtime with streaming mode and int4, bf16, and fp32 support.
  • Build support for macOS SDK < 15.
  • mmap weight prefetching - Prefetches weight blobs to eliminate page fault bottleneck, significantly reducing load time for large models.
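
The buffer pool's best-fit-with-LRU-eviction strategy can be sketched in a few lines. This is a toy illustration of the policy described above; the names and structure are assumptions, not the Metal backend's actual implementation:

```python
from collections import OrderedDict

class BufferPool:
    """Toy GPU buffer pool: best-fit reuse with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0                # bytes in live + pooled buffers
        self.free = OrderedDict()    # buf_id -> size, oldest first

    def release(self, buf_id, size):
        """Return a buffer to the pool (most recently used at the end)."""
        self.free[buf_id] = size

    def acquire(self, size):
        """Best fit: reuse the smallest pooled buffer that is large
        enough; otherwise evict LRU buffers and allocate fresh."""
        fits = [(s, b) for b, s in self.free.items() if s >= size]
        if fits:
            _, buf_id = min(fits)
            del self.free[buf_id]
            return buf_id
        while self.used + size > self.capacity and self.free:
            _, evicted_size = self.free.popitem(last=False)
            self.used -= evicted_size
        self.used += size
        return f"new-{size}"
```

Best-fit matching minimizes wasted bytes when reusing a buffer, while LRU eviction frees the buffers least likely to be requested again.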

NXP

  • Neutron Kernel Registration: The Neutron backend now supports selective kernel registration, letting you register only the kernels your models need to shrink the footprint of the Neutron kernel library.
  • Migration to eIQ Neutron SDK: The backend has transitioned to the self-contained eIQ Neutron SDK (eiq_neutron_sdk) package, replacing the legacy neutron_converter_SDK_<MCUX-ver>. The new SDK includes both the Neutron Converter and Neutron runtime for all NXP-supported platforms, simplifying backend deployment and removing the dependency on MCUXPresso SDK.
  • Weight Prefetching: The Neutron backend supports fetching weights from external memory (such as flash) on NeutronC platforms. This feature enables model deployment on MCU-class SoCs where models are too large to fit in SRAM.
  • Quantization Aware Training (QAT): The Neutron backend now supports Quantization Aware Training.
  • Example Runtime for Neutron Backend: Introduced nxp_executor_runner, an example runner for the Neutron backend that uses the Neutron simulator (eIQ NSYS).
  • Channel-Last dim-order: Minor improvements in channel-last dim-order support.
  • New Operations: `aten.avg_pool1d.default`, `aten.clamp.default`, `aten.div.Tensor`, `aten.leaky_relu.default`, `aten.neg.default`, `aten.prelu.default`, `aten.slice.default`, `aten._softmax.default`, `aten.squeeze.default`, `aten.squeeze.dim`, `aten.squeeze.dims`, `aten.unsqueeze`, `aten.upsample_bilinear2d.vec`, `aten.upsample_nearest2d.vec`

OpenVINO

  • NNCF data-aware compression algorithms for OVQuantizer.

Qualcomm AI Engine Direct (QNN)

  • Backend-awareness quantizer - Makes quantization decisions based on target hardware capabilities for better accuracy/performance trade-offs.
  • CDSP Direct Mode - Direct compute DSP dispatch for lower-latency inference.
  • SLC allocator for optimized memory management on Qualcomm SoCs.
  • Attention sink - Support for attention sinks to enable long-context use cases.
  • Bug Fixes
    • Fix KeyError in InsertIOQDQ pass for LLM quantization
    • Remove the workaround used to clean the kwargs for the constant operation
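
An attention sink keeps the first few "sink" tokens in the KV cache permanently while the rest of the cache slides, which preserves generation quality at long context lengths. A conceptual sketch of the retention policy (not the QNN backend's implementation):

```python
def attention_sink_positions(cache_len, num_sinks, window):
    """Which KV-cache positions to keep under an attention-sink
    policy: always retain the first `num_sinks` tokens plus a
    sliding window of the most recent `window` tokens."""
    if cache_len <= num_sinks + window:
        return list(range(cache_len))     # everything still fits
    recent = range(cache_len - window, cache_len)
    return list(range(num_sinks)) + list(recent)
```

For example, with 2 sink tokens and a window of 4, a 10-token cache keeps positions 0-1 and 6-9, bounding memory regardless of sequence length.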

Samsung

  • Exynos 2600 (E9965) support documented.

Vulkan

  • New Operators
    • Added limited support for aten.index.Tensor op (1D self tensor with a single index tensor)
    • Added aten.where.self and aten.bitwise_and.Tensor ops; extended binary ops to support integer data-types and comparison ops to correctly output bool dtype
  • Static Int8 Quantized Linear (q8ta_linear)
    • Added q8ta_linear operator for int8 quantized linear inference
    • Added q8ta_linear_gemv specialized op for batch-1 int8 linear, with tree reduction for improved throughput
  • Static Int8 Quantized Convolution (q8ta_conv2d)
    • Rewrote quantized conv2d with layout-flexible shaders for depthwise, pointwise, im2col, and general convolution paths
    • Added auto-selection logic to prefer im2col vs. general path based on kernel parameters
    • Added dynamic PACKED_INT8_CONV2D memory layout for device-adaptive conv2d performance
    • Enabled im2col to handle grouped convolution and non-unit stride widths
    • Added software fallback for dotPacked4x8AccSatEXT on devices lacking the extension
  • Static Int8 Quantized Fusion & Unary Ops
    • Added fused q8ta_relu unary operator for int8x4 tensors
    • Added apply_relu fusion support in q8ta conv operators
    • Added layout-flexible quantize/dequantize operators
    • Added layout-flexible quantized binary ops with support for mixed input layouts
  • Device Compatibility
    • Added float16 → float32 fallback for devices without 16-bit buffer support
    • Added Vulkan 1.0 compatibility for extension feature querying
    • Added fix for Raspberry Pi 5 GPU
  • Performance & Profiling
    • Added additional profiling blocks for finer-grained performance analysis
  • Bug Fixes
    • Fixed mixed-dtype binary ops and comparison op padding bugs
    • Fixed softmax NaN and depthwise conv correctness bugs
    • Fixed missing memory barrier for first-use writes on aliased tensors
  • Export & Partitioning
    • Added per-operator dtype constraints to op registry for more accurate partitioning
    • Improved partitioning to skip unsupported dtypes
    • Added support for auto_functionalized_v2 in export flow
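
The tree reduction used by q8ta_linear_gemv sums a dot product in O(log n) pairwise levels rather than one long sequential chain, which is the shape GPUs need to expose parallelism. A scalar Python sketch of the pattern (not the Vulkan shader itself):

```python
def tree_sum(xs):
    """Pairwise (tree) reduction: halve the list each level."""
    vals = list(xs)
    while len(vals) > 1:
        if len(vals) % 2:            # odd length: pad with identity
            vals.append(0)
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0

def gemv(matrix, vector):
    """y = A @ x, with each row's dot product reduced as a tree."""
    return [tree_sum(a * b for a, b in zip(row, vector)) for row in matrix]
```

In a shader, each level of the tree runs in parallel across invocations, so an n-element reduction takes log2(n) steps instead of n.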

XNNPACK

  • Voxtral Realtime, Sortformer, Silero VAD, and Parakeet model support.
  • Un-fused batchnorm1d/batchnorm2d via decomposition.
  • Fix for mul + ReLU fusion.
  • Fix for infinite loop in XNNExecutor::resize_outputs.
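
The batchnorm decomposition works because at inference time batch norm is an affine transform that folds into one per-channel multiply and add. A sketch of the math (the folding idea only, not XNNPACK's pass):

```python
import math

def batchnorm_scale_shift(mean, var, gamma, beta, eps=1e-5):
    """Fold inference-time batch norm into per-channel scale/shift:
        y = gamma * (x - mean) / sqrt(var + eps) + beta
          = scale * x + shift
    """
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    shift = [b - m * s for b, m, s in zip(beta, mean, scale)]
    return scale, shift
```

Once folded, the un-fused batchnorm lowers to elementwise multiply and add ops the backend already supports.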

Platforms

Android

  • Zero-copy ByteBuffer image prefill API for LlmModule, reducing memory copies during multimodal inference.
  • Java ASR Module binding for on-device speech recognition.
  • Thread safety fix for LLM token buffer in JNI.
  • Fix for JNI scalar EValue input reading and string serialization.
  • NThreadsGuard support for controlling thread count.
  • Removed deprecated jcenter() repository.

Windows

  • Windows wheel builds with OpenSSL submodule support.
  • End-to-end CUDA Windows CI for Voxtral Realtime and Sortformer.
  • Windows schannel and git SSL backend fixes.

LLM & Model Enablement

LLM Runners

  • LoRA - New LoraConfig and MultimethodLoraConfig for structured on-device model personalization.
  • Multi-method support in export and runner for serving multiple model methods from a single .pte file.
  • Unified MultimodalRunner under IRunner with multimodal prefill support.
  • --num_bos/--num_eos wired into GenerationConfig.
  • Transform passes configurable via to_edge_transform_and_lower in etllm.
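
LoRA personalizes a model by adding a low-rank update to selected weight matrices: W' = W + alpha * (B @ A), where B is (out, r) and A is (r, in) with small rank r. A plain-Python sketch of that standard LoRA math (LoraConfig itself just describes where such adapters attach; this is not ExecuTorch's code):

```python
def lora_merge(W, A, B, alpha=1.0):
    """Merge a LoRA adapter into a base weight: W' = W + alpha * (B @ A).
    W is (out, in), B is (out, r), A is (r, in)."""
    out, inner = len(W), len(W[0])
    r = len(A)
    merged = [row[:] for row in W]       # copy, don't mutate the base
    for i in range(out):
        for j in range(inner):
            merged[i][j] += alpha * sum(B[i][k] * A[k][j] for k in range(r))
    return merged
```

Because r is small, the adapter stores far fewer parameters than W while still shifting the model's behavior; at deploy time the update can be merged as above or kept separate per method.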

Speech / Audio

  • Consolidated AsrModule API - Single synchronous transcribe() method simplifies on-device speech recognition.
  • Fix for mel spectrogram preprocessor allocating gigabytes of planned memory.

DevTools

  • Devtools debuggability tutorial for getting started with model debugging.
  • Support inplace-op intermediate output recording for debugging in-place operations.
  • Stacktraces information available in calculate_numeric_gap API result.
  • XNNPACK backend support in devtool example runner.
  • ETDump out-of-memory message improvements.
  • Updated torch-export guide.

Security Fixes

Users processing untrusted inputs (especially audio) should upgrade promptly.

  • Heap-buffer-overflow in WAV file loader - Could cause crashes or memory corruption when loading malformed WAV files.
  • Stack buffer overflow in prepare_input_tensors() - Unchecked memcpy size could overflow stack buffers with oversized inputs.
  • Infinite loop in XNNExecutor::resize_outputs() - Could cause hangs during model execution with certain input shapes.
  • Int overflow in Tensor.numel() - Could cause incorrect behavior for very large tensors.

Deprecations

  • MPS backend deprecated - The MPS backend is deprecated and will be removed in a future release (v1.4.0).
  • CoreML to_edge + to_backend workflow - A deprecation warning has been added. Migrate to the new workflow.

Documentation

  • Cortex-M backend documentation added to the docs site.
  • Arm backend operator support documentation for all supported ops.
  • Ethos-U and VGF image classification examples documented.
  • NXP backend documentation updated for v1.2, including QAT guide.
  • Voxtral Realtime README with CPU/CUDA/Metal workflows and demo videos.
  • Devtools debuggability tutorial.

Contributors

We welcome 19 first-time contributors to ExecuTorch in this release:

@ares89, @corporateshark, @sarah-blades, @NickJLange, @rezimeta, @kamalkraj, @mohammed-saalim, @abdelaziz-mahdy, @GeorgeTzoupis, @annietllnd, @elpdumont, @tjamula, @julianchan-meta, @NickCao, @KamilMatejuk, @ahmet-f-gumustas, @nefainl, @Phineas1500, @sneakyHulk

Full Changelog: v1.1.0...v1.2.0
