ExecuTorch 1.1 Release Notes

Highlights

  • CUDA Backend for NVIDIA GPUs: New experimental backend enables GPU inference with AOTInductor compilation, Triton SDPA kernels, INT4 weight quantization, and async memory allocation achieving a 3-9x latency reduction—validated on Voxtral, Gemma3, and Whisper, with Windows support
  • Metal Backend for Apple GPUs: New experimental backend enables GPU inference on Apple Silicon using Metal Performance Shaders and a custom SDPA kernel achieving a 2-3x speedup—validated on Voxtral, Whisper, and Parakeet models
  • Vulkan Static Int8 Quantization: Complete infrastructure for statically quantized CNNs with packed int8 tensors, specialized memory layouts for conv/matmul, and fully quantized conv2d variants—includes extensive shader-level performance optimizations and SDPA fusion for LLM inference
  • Cadence DSP for Vision and LLMs: New optimized Vision module kernel library for Cadence/Xtensa DSPs with quantized implementations targeting vision workloads; added LLM operator support including RoPE, group-quantized embedding, and batched matmul for transformer models
  • 4-bit Weight Quantization: Arm Ethos-U NPU now supports A8W4 quantization (4-bit weights, 8-bit activations) providing 2x weight compression for conv2d, conv3d, depthwise conv, and linear operators with automatic runtime unpacking
  • Multimodal Vision-Language Models: Qualcomm backend adds complete infrastructure for VLMs including SmolVLM-500M and InternVL3-1B with vision encoder quantization, multimodal AOT compilation, and Adreno GPU backend support
  • Dynamic Control Flow: torch.cond and while loops now supported across Arm (Ethos-U, VGF, TOSA) and LLM pipelines, enabling dynamic execution patterns like early exit, adaptive computation, and iterative refinement in both quantized and floating-point models
  • Python 3.13 Support: ExecuTorch now fully supports Python 3.13 (requires coremltools 9.0+) with wheels built and tested for Python 3.10-3.13
  • Embedded Platform Expansion: Zephyr RTOS integration enables ExecuTorch as an external module with Kconfig options and Corstone-300/Ethos-U acceleration; new Raspberry Pi Pico2 deployment tutorial for RP2040 microcontrollers with <400KB memory footprint; Cortex-M backend adds CMSIS-NN accelerated kernels for linear, conv2d, depthwise conv2d, avg_pool2d, softmax, and other operations enabling CNNs and transformers on microcontrollers
  • Security Hardening: Addressed multiple CVEs, including an integer overflow in HierarchicalAllocator and a stack buffer overflow in resize_tensor, and added overflow-safe arithmetic in nbytes/numel using c10::mul_overflows()
  • Parakeet ASR Model: Added Parakeet TDT 0.6B automatic speech recognition model to examples with dedicated runner, export scripts, and timestamp support for production-quality speech-to-text applications

Core Components

EXIR/Export

  • Scan operator support: Full implementation of torch.scan higher-order op with loop constructs, carry state management, and et_copy_index for output stacking—enables recurrent and stateful model patterns (#16028)
  • Shared state memory planning: New option to co-locate shared buffers across entry points; automatically propagates mutability if a buffer is mutable in any entry point (#14230)
  • Custom external constant filenames: external_constants config now accepts a Callable[[Node], Optional[str]] to route constants to custom files (e.g., lambda x: "weights" → "weights.ptd"); see the sketch after this list (#15862)
  • Dynamic shape fixes: create_arg now handles torch.SymInt arguments; dim_order_from_stride uses guard_or_false for symbolic strides (#16774, #15472)
  • Multi-partitioner fix: Each partitioner now processed individually with its specific op requirements, fixing conflicts when partitioners have different decomposition expectations (#14458)
  • SpecPropPass guard fix: Fixed double-tracing issue where tensor specs didn't capture shape guards—now generates specs from meta values (#15485)
  • Constant deduplication fix: When constants share data but have different FQNs, both FQNs are now properly saved (#15139)
  • PTEFile class: deserialize_pte_binary() now returns a PTEFile object; access the program via .program attribute (#15864)
  • Removed qnnpack backend: The deprecated qnnpack backend has been deleted; migrate to XNNPACK (#14663)
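
For orientation, a minimal sketch of a constant-routing callable for the external_constants config mentioned above; the name-based heuristic is purely illustrative, not the only supported pattern:

```python
from typing import Optional

import torch.fx


def route_constants(node: torch.fx.Node) -> Optional[str]:
    """Route embedding/weight constants to a separate "weights.ptd" file.

    Returning a string such as "weights" maps the constant to "weights.ptd";
    returning None keeps the constant inline in the .pte.
    """
    if "embedding" in node.name or "weight" in node.name:
        return "weights"
    return None
```

The callable is then supplied wherever you currently set external_constants in your export configuration.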

Runtime

  • Security fixes: Addressed CVE in HierarchicalAllocator with integer overflow check, fixed stack buffer overflow in resize_tensor (bounds check for dimension limit), and added overflow-safe arithmetic in nbytes() and numel() using c10::mul_overflows() (#16103, #15626, #16804, #15581)
  • Caching CPU memory allocator: New CPUCachingAllocator class that reuses freed memory blocks, reducing allocation overhead for repeated inference (#16120)
  • Windows 64-bit mmap support: Fixed mmap to use _fstat64 and uint64_t offsets, enabling files larger than 2GB on Windows (#15538)
  • Flexible tensor views: make_tensor_ptr() now supports custom sizes, dim order, and strides for squeeze/unsqueeze operations; views properly keep source tensor alive via custom deleter (#14944, #15056)
  • Execution state checking: New in_progress() API detects mid-execution state; execute() returns InvalidState if called during step() and auto-resets on error (#16531)
  • Large model support: FreeableBuffer now supports uint64_t data pointers for cross-core memory and models larger than 2GB (#14570)

Kernels

  • Quantization performance: Parallelized op_choose_qparams and added ARM NEON SIMD + multithreading for op_quantize, significantly accelerating quantized KV-cache prefill in LLMs (#15767, #15768, #15769)
  • Parallel slice_copy: Added multithreading to slice_copy for large workloads, reducing RoPE computation overhead by 5-10% in LLM prefills (#16125)
  • Fast clone_dim_order: Added memcpy fast path when dim order is unchanged—benchmarked at 4.4x faster (27.9ms → 6.4ms) (#15815)
  • New operator: grid_sampler_2d: Full portable kernel implementation supporting bilinear/nearest interpolation with all padding modes (#16051)
  • New operators: bitwise shift: Added bitwise_left_shift and bitwise_right_shift for Tensor and Scalar variants (#15893)
  • Quantized embedding int32 indices: Sub-8-bit quantized embedding ops now support int32 indices in addition to int64 (#16518)
  • Security fixes: Fixed fuzzer-discovered vulnerabilities including heap-buffer-overflow in constant_pad_nd, integer overflow in pixel_shuffle, division by zero in reduction ops, and out-of-bounds access in pad2d (#16468, #16138, #16462, #15865)

SDK/Pybindings

  • Bundled program support: Added Python bindings for loading .bpte (bundled program) and .ptd (program with data) files (#14678)
  • Thread introspection: Exposed get_num_threads() to Python for debugging and profiling (#15944)

Backend Delegates

Arm

Ethos-U NPU (U55/U85)

  • 16-bit activation quantization (A16W8): New get_symmetric_a16w8_quantization_config() enables 16-bit activations with 8-bit weights for improved accuracy on precision-sensitive models—supports linear, conv2d, sigmoid, tanh, permute, LayerNorm, LSTM, rsqrt, add, sub, and mul operators; a usage sketch for the new A16W8/A8W4 configs follows this list (#14258, #15101, #15256, #16015, #15524)
  • 4-bit weight quantization (A8W4): New get_symmetric_a8w4_quantization_config() enables 4-bit weights with 8-bit activations for 2x weight compression—supports conv2d, conv3d, depthwise conv, and linear; runtime automatically expands packed 4-bit outputs to int8 tensors (#16577, #15588)
  • QAT BatchNorm folding: Quantization-Aware Training now supports Conv1d+BatchNorm1d and Conv2d+BatchNorm2d folding with in-place activation fusion (hardtanh, relu) for optimized quantized models (#16001)
  • Vela compiler 4.5.0: Updated Ethos-U Vela compiler with bug fixes, removing multiple operator xfails that previously required workarounds (#16075)
  • Output layout mismatch handling: Runtime now automatically detects and corrects Vela's output padding and packing, ensuring tensor byte layouts match ExecuTorch expectations without manual intervention (#15588)
  • FPU embedded target: Added embedded target variant with floating-point unit support for mixed-precision workloads (#15202)
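
As a rough illustration of how the new configs slot in, here is a minimal sketch; only the two config function names come from this release, while the import path and the set_global usage mirror the usual Arm PT2E flow and may differ between versions:

```python
# Assumed import path; check your ExecuTorch version for the exact module.
from executorch.backends.arm.quantizer.arm_quantizer import (
    get_symmetric_a16w8_quantization_config,
    get_symmetric_a8w4_quantization_config,
)

# 16-bit activations, 8-bit weights: higher accuracy for precision-sensitive models.
a16w8_config = get_symmetric_a16w8_quantization_config()

# 8-bit activations, 4-bit weights: ~2x weight compression, unpacked to int8 at runtime.
a8w4_config = get_symmetric_a8w4_quantization_config()

# Either config drops into the existing Ethos-U PT2E flow, e.g.:
#   quantizer.set_global(a8w4_config)
#   prepared = prepare_pt2e(exported_module, quantizer)
#   ... run calibration inputs ...
#   quantized = convert_pt2e(prepared)
```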

VGF GPU Backend

  • Mixed INT+FP profile as default: VGF backend now defaults to supporting both quantized (INT) and floating-point operations in the same model—enables partial quantization where performance-critical layers stay in float while memory-bound layers use int8, validated on MobileNetV2 and Llama models (#16176, #15773, #16311)
  • Conv3d operator support: Added 3D convolution for video and volumetric workloads with FP32, Int8, and A16W8 quantization support (#16093)
  • Stable Diffusion model testing: VGF backend validated on Stable Diffusion transformer modules, demonstrating support for generative AI workloads (#14655)
  • PyPI package installation: VGF backend dependencies can now be installed via pip install executorch[arm-vgf] for easier setup (#15737)
  • ETDump profiling support: DevTools integration enables on-device profiling and performance analysis for VGF-delegated models (#15221)

TOSA Infrastructure

  • TOSA 2025.11.0: Updated to latest TOSA specification with improved operator coverage and compatibility (#16424)
  • New operators: Added support for fill_.Scalar, bitwise_not, split_copy, remainder, copy, floor_divide, mean, sum, min/max with unset dim, select_scatter, masked_fill_.Scalar, clamp.Tensor, log1p, and tan (#14501, #14460, #14717, #15409, #15363, #15380, #14933, #15972, #16272, #16273)
  • 6D tensor and pixel shuffle: Support for rank-6 tensors and pixel shuffle/unshuffle operations for super-resolution and image processing models (#14626)
  • Per-channel rescale: TOSA.RESCALE now supports per-channel quantization for improved accuracy in quantized convolutions (#15267)
  • TOSA pybindings: Python bindings for tosa_serialization library enable programmatic inspection and manipulation of TOSA graphs (#15356)
  • Large model compilation speedup: Cherry-picked TOSA patch significantly reduces compilation time for large models (#15592)
  • FVP PMU trace profiling: Performance Monitoring Unit trace output now exposed from FVP runs, enabling detailed cycle-level profiling and Model Explorer visualization overlays (#14401)
  • Validated models: SmolLM2-135M (LLM) and DeiT-Tiny (Vision Transformer) added to CI model testing with accuracy evaluators (#14722, #14579)

Control Flow

  • torch.cond and while loops: Full support for conditional execution (torch.cond) and while loops in quantized and floating-point models—enables dynamic control flow patterns like early exit, adaptive computation, and iterative refinement; a minimal torch.cond example follows this list (#15549, #15849, #16287)
  • Submodule serialization: Control flow submodules are properly tagged, partitioned, and serialized for correct delegation (#15364, #15381)
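
To make the pattern concrete, here is a small example of exporting a model that branches with torch.cond; the toy module and threshold are illustrative, and lowering to Ethos-U/VGF/TOSA then follows the normal Arm partitioning flow:

```python
import torch


class EarlyExit(torch.nn.Module):
    """Toy early-exit block: take a cheap path when the input is already 'confident'."""

    def forward(self, x):
        def cheap(x):
            return torch.relu(x)

        def expensive(x):
            return torch.tanh(x) * 2.0

        # Both branches take the same operands and return matching shapes/dtypes;
        # the predicate is evaluated at runtime.
        return torch.cond(x.abs().mean() > 0.5, cheap, expensive, (x,))


model = EarlyExit().eval()
exported = torch.export.export(model, (torch.randn(1, 16),))
print(exported.graph_module.code)  # the cond shows up as a higher-order op in the graph
```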

MLSDK CPU Runtime

  • Model validation: MobileNetV2, DeepLabV3, and Conformer models validated on MLSDK runtime (#15098)
  • Updated pip packages: ai_ml_* packages updated to 0.8.0 with pip installation now the default; build-from-source remains available as an option (#16796)

Cadence

  • 16-bit activation quantization (A16W8): New quantizers enable 16-bit activations with 8-bit weights for improved accuracy in precision-sensitive models—supported for linear, conv1d, conv2d, matmul, and softmax via CadenceWith16BitLinearActivationsQuantizer, CadenceWith16BitConvActivationsQuantizer, CadenceWith16BitMatmulActivationsQuantizer, and CadenceWithSoftmaxQuantizer; a usage sketch follows this list (#15010, #15997, #16007, #16008)
  • Mixed quantization (W8A32): Experimental support for 8-bit weight-only quantization with 32-bit float activations for linear, conv, and GRU operations—provides model compression while maintaining full-precision activations (#14134, #15137, #15171, #15209)
  • LLM operator support: Added RoPE (rotary positional embeddings) custom op for transformer position encoding, group-quantized embedding for memory-efficient token lookups, and batched matmul for parallel attention heads (#14399, #14916, #14956)
  • Vision module: New optimized kernel library targeting vision workloads with quantized implementations of softmax, conv2d, linear, matmul, layer_norm, and im2row operations (#12480, #14518)
  • ETDump profiling: CadenceETDump class now available in OSS for on-device profiling—extracts execution cycles, dumps intermediate tensors, and prints performance summaries including framework overhead breakdown (#14616)
  • Composable quantizer system: Multiple specialized quantizers can be combined for fine-grained control: CadenceDefaultQuantizer (A8W8), CadenceWithLayerNormQuantizer, CadenceFusedConvReluQuantizer (fused conv+relu patterns), CadenceWakeWordQuantizer (add/cat support), and CadenceRmsNormNopQuantizer (#16117)
  • Strongly-typed quantize/dequantize ops: Type-specific variants (quantize_per_tensor_asym8s, _asym16s, _asym32s, etc.) enable better optimization and clearer semantics in the graph (#14268, #15165)
  • Broadcast semantics: Quantized ops now support tensor broadcasting, enabling more flexible model architectures without explicit reshape operations (#16283)
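
A hedged sketch of using one of the new Cadence quantizers in a standard PT2E flow; the quantizer class name comes from this release, while the import path and the zero-argument constructor are assumptions, so check the Cadence AOT module in your tree before copying:

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Assumed import path; the Cadence quantizers ship with the AOT flow.
from executorch.backends.cadence.aot.quantizer.quantizer import (
    CadenceWith16BitLinearActivationsQuantizer,
)


class TinyHead(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(32, 8)

    def forward(self, x):
        return self.fc(x)


model, example_inputs = TinyHead().eval(), (torch.randn(1, 32),)

# A16W8: 16-bit activations with 8-bit weights for linear layers.
quantizer = CadenceWith16BitLinearActivationsQuantizer()

# Capture, annotate, calibrate, convert (some versions prefer export_for_training here).
graph_module = torch.export.export(model, example_inputs).module()
prepared = prepare_pt2e(graph_module, quantizer)
prepared(*example_inputs)  # calibration pass
quantized = convert_pt2e(prepared)
```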

Core ML

  • Custom pipeline passes: New pass_names parameter in generate_compile_specs() allows specifying custom CoreML optimization passes instead of the default pipeline—useful for fine-tuning optimizations or debugging compilation; see the sketch after this list (#16118)
  • Configurable storage directories: Asset, trash, and database directories now configurable via compile-time macros (EXECUTORCH_COREML_ASSETS_DIRECTORY_PATH, EXECUTORCH_COREML_TRASH_DIRECTORY_PATH, EXECUTORCH_COREML_DATABASE_DIRECTORY_PATH) for apps with specific storage requirements (#5daa6e3f3a95)
  • CoreMLQuantizer migration: Quantizer implementation migrated into ExecuTorch using torchao's PT2E quantization APIs while maintaining coremltools compatibility (#847d70de2fce)
  • Improved asset management: Refactored asset lifecycle with staging directory workflow, better trash cleanup, file protection attributes, and transaction-aware operations (#aeee757953f4)
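
A minimal sketch of how pass_names might be used; only the parameter itself comes from this release, while the pass name strings and the partitioner wiring are illustrative assumptions (coremltools pass names vary by version):

```python
import coremltools as ct

from executorch.backends.apple.coreml.compiler import CoreMLBackend
from executorch.backends.apple.coreml.partition import CoreMLPartitioner

# Restrict the CoreML pipeline to an explicit list of passes instead of the
# default pipeline; the names below are illustrative coremltools MIL passes.
compile_specs = CoreMLBackend.generate_compile_specs(
    compute_unit=ct.ComputeUnit.ALL,
    minimum_deployment_target=ct.target.iOS17,
    pass_names=[
        "common::const_elimination",
        "common::fuse_conv_batchnorm",
    ],
)

partitioner = CoreMLPartitioner(compile_specs=compile_specs)
# ... then hand `partitioner` to to_edge_transform_and_lower() as usual.
```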

Cortex-M

  • Quantized operator expansion: Added CMSIS-NN accelerated kernels for linear (fully-connected), conv2d, depthwise conv2d, avg_pool2d, softmax (int8, dim=-1), maximum, minimum, mul, and permute—enabling common CNN and transformer building blocks on Cortex-M processors (#14252, #15896, #16233, #16178, #16152, #15872, #15591, #15848)
  • McuQuantizer: New composable quantizer with SharedQspecQuantizer for multi-input ops sharing scales, IO quantizers for boundary ops, and per-tensor/per-channel weight quantization support (#15459, #15872, #15590)
  • Activation fusion: New pass fuses ReLU, Hardtanh, Hardsigmoid, and Clamp with preceding conv/linear ops; Hardswish decomposed into fusable primitives for better performance (#15917, #16016)
  • Channels broadcasting: ADD and MUL ops now support broadcasting across channel dimensions (#16131)
  • CI integration: Cortex-M tests now run in trunk CI workflow with dedicated pass manager infrastructure (#15690, #14986)

CUDA

New experimental backend for NVIDIA GPU acceleration — The CUDA backend enables GPU inference using AOTInductor compilation with Triton kernel support.

  • Core infrastructure: Complete backend with CUDA partitioner, AOTInductor-based export, and comprehensive shim layer for tensor operations (create, copy, reinterpret, destroy) (#44f3740563a9, #87e9c160b64f)
  • Triton SDPA kernel: Custom Triton-optimized scaled dot-product attention kernel generated via KernelAgent, replacing Edge SDPA and enabling model export that was previously blocked (#15877, #16167)
  • INT4 weight quantization: Support for _weight_int4pack_mm kernel enabling INT4 weight-only quantized LLM inference with on-the-fly dequantization (#15089)
  • Async memory allocation: Switched from cudaMallocManaged to cudaMallocAsync/cudaFreeAsync, achieving 3-9x latency reduction on Voxtral benchmark by eliminating synchronization overhead (#14976)
  • Multi-method support: Method names tracked in compile specs enabling models with multiple entry points to store artifacts independently (#14715)
  • Windows support: Platform abstraction layer for cross-platform compatibility; CUDA backend now builds and runs on Windows (#15183, #15711)
  • Model support: Voxtral, Gemma3, and Whisper models validated with CI testing and benchmarking (#14875, #15323)

MediaTek

No significant changes in this release.

Metal

New experimental backend for Apple GPU acceleration — The Metal backend enables GPU inference on Apple Silicon using Metal Performance Shaders (MPS) and custom Metal kernels. This backend is experimental and APIs may change in future releases.

  • Core infrastructure: Complete backend with Python partitioning, AOT compilation via AOTInductor, and Objective-C++ runtime featuring ETMetalStream, ETMetalShaderLibrary, and kernel dispatch primitives (#15015, #15019, #15020, #15021, #15022, #15024)
  • Operator support: Matrix multiplication (mm, addmm), convolution, and scaled dot-product attention with bfloat16/float32 support (#15023)
  • Native SDPA kernel: Custom Metal shader implementation (ported from MLX) for scaled dot-product attention achieving 2-3x speedup over MPSGraph-based implementation; supports grouped query attention (GQA) and floating-point attention masks (#16086)
  • MPSGraph caching: Compiled MPSGraph objects cached by operation signature, eliminating recompilation overhead for repeated matrix multiplication and convolution operations (#15346)
  • External weights: Model weights stored separately from .so file as binary blob for faster loading and better memory management (#15341)
  • Model support: Voxtral, Whisper, and Parakeet models validated with CI testing (#15233, #15685, #16562)

MPS

No significant changes in this release.

NXP

  • Quantization-Aware Training (QAT): Full QAT support in NeutronQuantizer with Conv+BatchNorm fusion optimization during training—enables higher accuracy quantized models compared to post-training quantization (#15692, #16246)
  • Per-channel quantization: Convolution layers now support per-channel weight quantization with different scale/zero-point per output channel for improved accuracy (#14061)
  • Channels-last optimization: Dim order support tracks tensor memory layouts and avoids unnecessary transpose operations when tensors are already in channels-last (NHWC) format (#16146, #16223)
  • New operators: Added support for conv_transpose2d, split (via slice decomposition), sub, mul, slice, permute_copy, unsqueeze, and clone operations (#15146, #16490, #14514, #15971, #15889, #15099, #16467, #15106)
  • Graph optimization passes: New Linear+Add fusion pass and Q/DQ removal pass enabling non-delegated CPU ops to run directly in int8, reducing quantization overhead (#14112, #15148)
  • Neutron SDK 25.09: Updated to latest NXP Neutron Software SDK with driver API improvements and additional error handling (#14591)

OpenVINO

  • Llama model support: End-to-end Llama export and execution with 4-bit weight-only quantization—integrated with ExecuTorch's LLM export infrastructure, includes Llama 3.2 example configuration and enhanced partitioner for LLM-specific operations (#e73144933ee8)
  • PTQ quantizer fix: Fixed critical bug in post-training quantization where weight quantization results were discarded due to missing return statement (#15891)

Qualcomm

  • GPU backend support: New Adreno GPU backend infrastructure alongside existing Hexagon DSP (HTP) support—includes GPU backend, context, device, and graph classes with platform-specific configurations for aarch64 and x86_64 (#12165)
  • Six new SoC platforms: Added support for SAR2230P, SA8255, SM8350 (Snapdragon 888), SM8850 (Snapdragon 8 Elite Gen 5), QCM6490, and SW6100 (#9c568d76e85f, #ed72daf5e415, #ea0c612122fd, #2ebed8855c10, #0ee2f491a12e, #04f1e4d22383)
  • Multimodal VLM support: Complete infrastructure for vision-language models including SmolVLM-500M and InternVL3-1B—adds vision encoder quantization, multimodal AOT compilation, and calibration support (#16292)
  • LLM quantization recipe system: Fine-grained per-layer quantization control for LLMs with dedicated quantization recipes providing better modularity and customization (#15807)
  • 16a4w_block QAT: Quantization-aware training support for 16-bit activations with 4-bit per-block weight quantization, enabling accuracy recovery with significant model compression (#a47a12257fe3)
  • Intermediate output debugger: New debugging tool compares QNN vs CPU intermediate tensor outputs with custom metrics (cosine similarity, MSE) and multi-format visualization (SVG, CSV) to pinpoint accuracy issues (#15735)
  • New operators: Added avg_pool3d, adaptive_avg_pool3d, adaptive_max_pool2d, max_pool3d, grid_sampler 2D/3D, floor_divide, triu, and linear with non-constant weights (#15460, #15371, #15897, #14888, #16014)
  • Model optimizations: Validated Gemma-2B, Granite3.3-2B, Gemma2-2B, GLM1.5B, CodeGen2-1B LLMs and ViT models with static quantization and lookahead attention support (#14459, #15808, #16624, #15691, #15408, #15696, #14412)

Samsung

  • A8W8 quantization: Full 8-bit activation and weight quantization support via new EnnQuantizer using TorchAO PT2E APIs—includes per-channel/per-tensor quantization, PTQ and QAT modes, and fused activation patterns (Conv+ReLU, Linear+ReLU) (#14464)
  • Model examples: Added 10 quantized export examples for MobileNet V2/V3, ResNet-18/50, Inception V3/V4, ViT, DeepLab V3, EDSR, and Wav2Letter (#14464)
  • E2E testing infrastructure: Device farm integration with runtime executor enabling CI testing on Samsung hardware; validates 12 vision/speech models and 25+ operators (#15731)

Vulkan

  • Static int8 quantization: Complete infrastructure for statically quantized CNNs with packed int8 tensors—includes new kInt8x4 dtype, specialized memory layouts for conv/matmul (kPackedInt8_4W4C, kPackedInt8_4H4W), and fully quantized conv2d with depthwise, pointwise, and tiled variants (#14609, #14668, #14669, #14670)
  • SDPA fusion for LLMs: New fusion pass combines update_cache + custom_sdpa into single sdpa_with_kv_cache operator, eliminating redundant cache operations and improving transformer inference efficiency (#15645)
  • Shader specialization: Conv2D shaders now use Vulkan specialization constants for kernel parameters and tensor sizes instead of runtime uniforms, enabling compile-time optimizations and improved pipeline performance (#16036)
  • New operators: Added pow (tensor-scalar), embedding (buffer impl), gather, rotary positional embeddings (RoPE), and to_dim_order_copy for dtype conversion—critical for transformer models (#15319, #15320, #15749, #15679, #15677)
  • Expanded dtype support: Int32 and bool (uint8) tensor support added to 15+ operators including clone, concat, gather, split, padding, and indexing operations (#15829)
  • Performance optimizations: Extensive shader-level optimizations for quantized operations including:
    • Quantized matmul: Devectorized shaders, 4x3 tiling, buffer-based weight storage, uint16→int conversion, reduced shift operations, and improved 4-bit unpacking (#788ef2f45053, #baa41c6607b6, #ccc5eb0a1a8c, #aa4ab0248a38, #a1081e6456c8)
    • Quantized conv2d: Reduced register/memory pressure in depthwise convolutions, improved tile sizes, optimized scale/offset calculations (#a0a627805245, #18c1c5b515c4)
    • Push constants migration: Moved operation parameters from uniform buffers to push constants for softmax, clone, binary scalar ops, and addmm—reducing memory access overhead (#919356627d79, #9cd840279be9, #56e131bad818, #b765e9989b88)
    • ALU optimization: Better GPU pipeline utilization through instruction reordering and split uint-to-float conversions (#7d18005b71d9)

XNNPACK

  • Workspace sharing: New backend option to control memory arena sharing between XNNPACK delegates with three modes—Disabled (max parallelism), PerModel (balanced), and Global (min memory)—reducing memory consumption on constrained devices (#a523306de447)
  • New operators: Added sin and cos trigonometric operators for fp32/fp16, and view_copy/static_reshape support with dynamic dimension handling (#14711, #15431, #7959)
  • MSVC/Windows build support: Fixed C99 designated initializer compatibility and compiler flags for building XNNPACK backend with Microsoft Visual C++ (#15224)
  • No-op clone removal: New optimization pass removes unnecessary clone operations from functional graphs, eliminating redundant memory copies and improving inference performance (#15884)
  • External data for quantization scales: Per-channel quantization scales now tagged for external storage alongside weights, enabling efficient program-data separation for large quantized models (#e570942c1fd3)
  • Quantization metadata propagation: New pass propagates external storage tags to q/dq nodes, fixing quantized weight externalization when using program-data separation (#14864)

Platforms

Android

  • Float16 tensor support: New Tensor.fromBlob(short[], long[]) and Tensor.allocateHalfBuffer() APIs for creating half-precision (FP16) tensors directly from IEEE-754 encoded short arrays (#15479)
  • Tensor factory methods: Added Tensor.ones(shape, dtype) and Tensor.zeros(shape, dtype) for conveniently creating initialized tensors across all numeric dtypes (#15388)
  • Backend metadata in MethodMetadata: getMethodMetadata() now includes a backends field showing which backends executed the method (e.g., "XnnpackBackend") (#14397)
  • Enhanced exception diagnostics: Exceptions now automatically include detailed runtime logs; new getDetailedError() method for programmatic access (#14557, #16193)

Apple/iOS

  • Type-casting tensor copy: New copy(to:) method allows copying tensors while converting data types, e.g., floatTensor.copy(to: Double.self) (#15511)
  • Flexible tensor views: Tensor views now support overriding shape, dimension order, and strides, enabling reshape operations without copying data (#15512)
  • Logging crash fix: Fixed null pointer crash in logging when filename or message was nil (#14388)

LLM & Model Enablement

Audio/Speech Models

  • Parakeet ASR support: Added Parakeet TDT 0.6B model to examples with runner, export scripts, and timestamp support (#16349, #16545)
  • WAV audio loader: New utility for loading and normalizing PCM audio from .wav files (16-bit and 32-bit formats) for speech models (#14923)
  • Voxtral and Whisper improvements: Metal backend support, CUDA multimodal runner, and various fixes (#14918, #15273, #15740)

Multimodal

  • MultimodalRunner consolidation: Unified Llava execution through MultimodalRunner, removing separate LlavaRunner (#14250, #14356)
  • Token input support: MultimodalInput now supports tokenizer-encoded input as vectors of token IDs (#14451)
  • Float32 image input: Multimodal runner now accepts float32 image tensors (#14359)

LoRA

  • LoRA weight export: Export LoRA weights to separate .ptd files for flexible adapter loading (#15061)
  • LoRA quantization: Support for quantizing LoRA linear layers (lora_a, lora_b) alongside base weights (#15935)
  • MLP and Unsloth support: Extended LoRA to MLP layers and added Unsloth compatibility (#15132)

Static Attention

  • StaticAttentionIOManager: New manager with optional callback on prefill logits, precomputed RoPE support, and flexible module initialization (#14336, #15200)
  • Partial logits generation: Added generate_full_logits=False support to skip full vocabulary logits generation for memory/performance optimization (#16171)
  • Attention skip support: TransformerBlock now supports skipping attention layers for efficient inference (#14826, #16104)
  • CoreML iOS26 fix: Fixed numerics issues with static attention on CoreML (#16144)

Advanced LLM Features

  • torch.cond support with custom ops: New executorch::alias and executorch::update_cross_attn_cache custom ops enable torch.cond for conditional KV cache updates in cross-attention (required since torch.cond doesn't support aliasing/mutations) (#16366)
  • Conv1d to Conv2d decomposition: CUDA backend now automatically decomposes conv1d operations to conv2d for better hardware utilization (#15092)
  • Export simplifications: Removed deprecated fairseq and sharded checkpoint dependencies from export_llama, streamlining the export process (#15968, #16052)

Quantization & Performance

  • HQQ as default: HQQ is now the default post-training quantization method in ExecuTorch (#14834)
  • Quantized SDPA optimization: Reduced allocation overhead and fixed flakiness in quantized scaled dot-product attention (#16119, #15766)
  • Temp allocator for scratch memory: LLM ops now use temp allocator for scratch memory, reducing overhead (#16121)

Runner Infrastructure

  • JNI multiple .ptd files: Android JNI layer now supports loading multiple .ptd data files (#14769)
  • Tokenizer updates: Multiple tokenizer pin bumps and pytorch-tokenizers integration (#15492)
  • MSVC support: LLM runner now builds with MSVC on Windows (#15250)

DevTools

  • Custom numerical comparators: Inspector.calculate_numeric_gap() now accepts custom NumericalComparatorBase subclasses in addition to built-in "MSE", "L1", "SNR" metrics—enables domain-specific comparison logic; see the sketch after this list (#15969)
  • Partial debug handle matching: AOT-to-runtime comparison now uses subset matching and searches ancestor nodes, fixing failures when custom ops (e.g., int4_matmul) cause node loss during graph rewriting (#14306)
  • Multi-output operator fix: Fixed false negatives when comparing operators like dropout or layer_norm that produce multiple outputs—now matches by shape and dtype instead of taking last element (#16813)
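
A hedged sketch of plugging a custom comparator into calculate_numeric_gap(); the import path for NumericalComparatorBase and the name of the overridden method are assumptions here, so confirm the base-class hook in the devtools sources first:

```python
import torch

from executorch.devtools import Inspector
# Assumed import path for the comparator base class; adjust to your tree.
from executorch.devtools.inspector.numerical_comparator import NumericalComparatorBase


class MaxAbsError(NumericalComparatorBase):
    """Domain-specific metric: worst-case absolute difference between AOT and runtime tensors."""

    # Assumed override point; check NumericalComparatorBase for the exact method name.
    def compare(self, a: torch.Tensor, b: torch.Tensor) -> float:
        return (a.float() - b.float()).abs().max().item()


inspector = Inspector(etdump_path="etdump.etdp", etrecord="etrecord.bin")
gap = inspector.calculate_numeric_gap(MaxAbsError())  # instead of "MSE" / "L1" / "SNR"
```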

Build/CI

Platform & Toolchain Support

  • Python 3.13 support: ExecuTorch now supports Python 3.13 (requires coremltools 9.0+); wheels are built and tested for Python 3.10-3.13 (#16004)
  • MSVC build support: ExecuTorch can now be built with Microsoft Visual C++ compiler on Windows; added MSVC-compatible compiler flags across CMake files and new CI workflow (#14970, #15361)
  • Android NDK r28c: Upgraded from NDK r27b to r28c with native 16KB page size support for better Android 15+ compatibility (#14522)
  • CUDA 12.9 support: Added CUDA 12.9 to supported versions (12.6, 12.8, 12.9, 13.0) with CI testing (#15818)

Build System

  • CMake workflow presets: New cmake --workflow llm-release command combines configure, build, and install into a single step—simplifies builds for LLM workloads with CUDA and Metal variants (#15821)
  • Editable install fix: Fixed pip install -e . writing version.py to wrong location, and fixed license field format for newer setuptools compatibility (#15941)
  • CMake install paths: Swept major CMake files to use CMAKE_INSTALL_INCLUDEDIR/CMAKE_INSTALL_LIBDIR instead of hardcoded include/lib paths (#12792)

Docker & Containers

  • Docker gcc11 upgrade: Default CI container upgraded from gcc9 to gcc11; added new gcc9-nopytorch variant for faster binary size tests (#16227)
  • Arm dependencies in Docker: Arm backend dependencies now pre-installed in Docker images (#15812)

CI Infrastructure

  • Backend Tester refactor: Split backend test workflows into per-backend files (XNNPACK, CoreML, Vulkan, QNN) for better CI visibility and faster debugging (#14137)
  • Dependency management: Switched from npm to pnpm for faster, more disk-efficient Node.js dependency handling (#16019)

Documentation

New Tutorials & Guides

  • Raspberry Pi Pico2 deployment: End-to-end tutorial for deploying PyTorch models to the RP2040 microcontroller—covers model export, firmware build, and flash deployment with <400KB memory footprint (#15188)
  • Export LLMs with Optimum: Guide for using HuggingFace's Optimum-ExecuTorch to export LLMs directly from HuggingFace Hub with built-in quantization (8da4w, 4w) and custom SDPA/KV-cache optimizations (#15062)
  • Android LlamaDemo with QNN backend: Step-by-step tutorial for running Llama on Android with Qualcomm's QNN backend, including AAR rebuild, APK installation, and troubleshooting common issues (#16011)
  • CUDA backend guide: New documentation covering CUDA backend setup, AOTInductor integration, Triton kernel support, and INT4 quantization for GPU inference (#16780)
  • DevTools debugging tutorial: Comprehensive 500-line tutorial for debugging ExecuTorch models using Inspector, ETDump, and numerical comparison tools (#16802)

Getting Started Improvements

  • Windows native build support: Integrated Windows build steps directly into main documentation; now explicitly supports Windows x86_64 with Visual Studio 2022+ (#15311)
  • Build from source restructuring: Consolidated and simplified build documentation with use-case organized tables, reducing redundancy by ~30% (#15311)
  • Platform requirements update: Clarified support for Linux (x86_64/ARM64), macOS (ARM64), and Windows (x86_64) with specific toolchain requirements (#15311)

Success Stories & Ecosystem

  • Success Stories page: New page featuring production deployments at Meta (Instagram, WhatsApp, Quest, Ray-Ban Smart Glasses), Liquid AI, NimbleEdge, and PrivateMind (#15236)
  • Ecosystem integrations: Added documentation for HuggingFace Transformers, React Native ExecuTorch, torchao, Unsloth, Ultralytics YOLO, Arm ML Evaluation Kit, Alif Semiconductor, and Digica AI SDK (#15358, #16371)
  • New hardware support docs: Added QCS9100 to Qualcomm SoCs list, CentOS to supported host OS, updated GCC requirements for AOT compilation (#16332, #16021, #15789)

Examples/Demos

New Examples

  • Static Llama for CoreML: New export and runtime scripts for LLaMA models with static shapes targeting Apple's Neural Engine—enables optimized on-device LLM inference on iOS/macOS with ~14 tokens/second decode speed (#71ebc50a1cc0)
  • Pruning example for Arm: Jupyter notebook demonstrating neural network pruning for Ethos-U NPU deployment with performance benchmarks (#15851)
  • Partial quantization example: Updated VGF notebook showing how to selectively quantize layers while keeping others in floating-point for Arm backends (#16298)

Zephyr RTOS Integration

  • ExecuTorch as Zephyr module: Full integration allowing ExecuTorch to be built as an external Zephyr module without upstream changes—includes CMake integration, Kconfig options, and example executor runner targeting Corstone-300 with Ethos-U acceleration (#16294)
  • MVE/Helium SIMD support: Enabled ARM M-profile Vector Extension for CMSIS-NN in Zephyr builds, unlocking SIMD performance on Cortex-M55/M85 processors (#16621)

Model Explorer Visualization

  • Performance overlay: New --perf_overlay flag displays execution cycles per operator as a color-coded heatmap (green→red) using FVP PMU trace data (#15411)
  • .pte file visualization: Added --visualize_pte option to visualize ExecuTorch model files including delegate subgraphs, complementing existing TOSA graph visualization (#16290)

Runner Improvements

  • Single-thread optimization: Executor runner now uses NoThreadPoolGuard when -cpu_threads=1, eliminating thread pool overhead for single-threaded deployments (#15264)
  • HuggingFace tokenizer support: CoreML static runner now uses pytorch_tokenizer for HuggingFace/Unsloth LoRA model compatibility (#16606)

Breaking Changes

  • PTEFile API change: deserialize_pte_binary() now returns a PTEFile object instead of directly returning the program; access the program via the .program attribute; see the migration sketch after this list (#15864)
  • QNNPACK backend removed: The deprecated QNNPACK backend has been deleted; users should migrate to XNNPACK backend which provides better performance and wider operator coverage (#14663)
  • LlavaRunner removed: Separate LlavaRunner class has been removed; use MultimodalRunner for all multimodal model execution including Llava (#14250, #14356)
  • Cadence API rename: fuse_pt2() function renamed to apply_pre_edge_transform_passes() for clarity (#15724)
  • Multimodal field rename: image_encoder renamed to vision_encoder to align with HuggingFace Transformers naming conventions (#14392)
  • export_for_training removed: Training export API has been removed from ExecuTorch; use PyTorch's standard training workflows (#16434)
  • LLM export simplifications: Removed sharded checkpoint (#15968) and fairseq (#16052) dependencies from export_llama; these legacy dependencies are no longer supported
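
A minimal migration sketch for the PTEFile change; keep whichever module you already import deserialize_pte_binary from (the import below is only a placeholder for that existing path):

```python
from executorch.exir._serialize import deserialize_pte_binary  # placeholder: use your existing import

with open("model.pte", "rb") as f:
    pte_bytes = f.read()

# Before 1.1, the call returned the program directly:
#   program = deserialize_pte_binary(pte_bytes)

# From 1.1, it returns a PTEFile wrapper; read the program off .program.
pte_file = deserialize_pte_binary(pte_bytes)
program = pte_file.program
```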

Deprecations

  • Arm backend internal models: Internal models using aot_arm_compiler are deprecated; users should migrate to the standard TOSA-based compilation flow (#15302)

Contributors

We welcome 48 first-time contributors to ExecuTorch in this release:

9rum, Adi, Andrew, andrewor14, Ben Mehlow, chandanjain999, csoonie, Daksh Shami, dhkimxx, Fabrizio Milo, Felix Weilbach, gpires-meta, Ivaylo Enchev, Jae Ku, James Nicholson, Jeongmin Ha, jethroqti, Jing Wen, John Gibson, Julian, Liuchuan Yu, Marco Giordano, Matt Clayton, Matt Grimm, Meng Tan, Mengyang Liu, Michael Z. Lee, Mick Killianey, Mustafa Cavus, Nitin Jain, Nolan O'Brien, NostalgiaJohn, Onuralp SEZER, Pablo Marquez Tello, Patryk Ozga, Per Held, PetarTerziev-UL, Shubham Panchal, Sriram Bharadwaj, TadayukiOkada, tanvirislam-meta, Tare, tmi, Tomás Agustín González Orlando, Vaclav Novak, Xingguo Li, Yinrun Lyu, Young Han
