ExecuTorch v1.3.1 Release Notes

1.3.1 is a patch release and the first broadly published release in the 1.3 series; the preceding 1.3.0 release was published to Maven only.

Highlights

ExecuTorch v1.3.1 expands model and backend coverage across embedded, mobile, and GPU targets. This release brings major improvements to the Arm (including Cortex-M and VGF), NXP, Qualcomm, CUDA, Metal, MLX, Vulkan, and XNNPACK backends, and continues to broaden LLM and multimodal model support.

Arm backend (Ethos-U / TOSA / VGF / Cortex-M) - Major expansion in TOSA dialect lowering, quantization support, VGF infrastructure, Cortex-M test coverage, Ethos-U compatibility, and CI reliability.
New and expanded model support - Qwen3.5 MoE, Gemma 4 31B, LFM2.5, Voxtral Realtime/TTS, Llama4 export support, and additional Android LLM runner documentation.
CUDA backend - Qwen3.5 MoE export/runtime work, fused MoE and INT4 Triton kernels, CUDA graph capture/replay, GPU-side sampling, and memory/stat reporting.
Metal and MLX backends - Qwen3.5 MoE Metal path, Gemma 4 31B MLX support, MLX delegate progress, integer op support, top-k fallback, gated delta rule kernels, and hardened runtime code signing.
Vulkan backend - Cooperative matrix dispatch for linear/matmul, safer memory handling, integer overflow and heap-buffer fixes, improved partitioning, and VGF compatibility work.
Qualcomm AI Engine Direct (QNN) - Expanded ATen op support, LPAI and Direct Mode work, custom op and quantization annotation APIs, debugging improvements, performance dumping, and heap profiling.
NXP backend - Continued new Neutron flow rollout, eIQ Neutron SDK 3.1.1, stricter graph verification, QAT coverage, and additional operator support.
Runtime and packaging - Shared library install support, Android lifecycle/error handling fixes, and improved wheel/build reliability.

New Models and Model Enablement

Qwen3.5 MoE export and runner support across CUDA and Metal, including fused MoE kernels, INT4 paths, CUDA graph support, and runtime stats.
Gemma 4 31B export and runner support with CUDA and MLX backend support.
LFM2.5 export and runner support for MLX.
Voxtral Realtime and Voxtral TTS improvements, including streaming fixes, CUDA backend support, and documentation updates.
Android LLM runner documentation and Hugging Face-oriented usage docs.

Core Components

EXIR / Export

Safer ETRecord/export deserialization.
FP8 placeholder support in ExecuTorch serialization.
Fixes for torch.split alias annotations and partition grouping.
Cleanup migrating away from deprecated capture config paths.

Runtime

Backend and runtime option handling improvements.
Better contextual error reporting for debuggability.

Kernels and Operators

Portable adaptive average pool 2D kernel.
fp32 accumulation for Half/BFloat16 grid sampler bilinear paths.
Fused quantized add, mul, relu, hardswish, and bmm kernels.
Quantized max pool and quantized conv1d fixes.
Additional quantization and observer fixes across portable and backend kernels.

Backend Delegates

Arm / Ethos-U / TOSA / VGF / Cortex-M

Expanded TOSA op coverage including slice, max pool, avg pool, adaptive avg pool, shape ops, resize, LSTM decomposition, and additional dtype/layout handling.
FP16/BF16 operator coverage is now considered largely complete; most support landed incrementally over prior releases, and the 1.3 cycle adds test stabilization for FP16 MobileNetV3 and VGF bilinear paths.
Refactored dim-order and permute handling: removed the legacy tosa_dim_order path, added permute removal for TOSA ops, and improved permute fusion over elementwise ops, reducing unnecessary transposes in lowered graphs.
Improved quantization support, including composable quantizer work, constant folding fixes, TOSA/VGF linear quantization modes, and Cortex-M MVE/Helium int16 quantize/dequantize support.
VGF build/test infrastructure and documentation.
Cortex-M docs, TinyML validation, CMSIS-NN integration, linting, and CI improvements.
Ethos-U driver backwards compatibility and more robust FVP/toolchain download handling.

CUDA

Qwen3.5 MoE export/runtime path with fused MoE and INT4 Triton kernels.
Gemma 4 31B support on CUDA.
CUDA graph capture/replay and GPU-side Gumbel-max sampling.
Memory-efficient model export.
Runtime cross-method weight sharing.
GPU memory tracking and structured runtime stats.
CI and build reliability fixes.
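The Gumbel-max trick named above samples from softmax(logits) by adding independent Gumbel(0, 1) noise to every logit and taking the argmax. It needs only an elementwise op and a reduction, which is presumably why it suits GPU-side sampling. The sketch below is a plain Python illustration of the trick itself, not the ExecuTorch CUDA implementation; the function name gumbel_max_sample and its temperature parameter are hypothetical.

```python
import math
import random
from typing import Optional, Sequence


def gumbel_max_sample(logits: Sequence[float], temperature: float = 1.0,
                      rng: Optional[random.Random] = None) -> int:
    """Hypothetical helper: draw an index from softmax(logits / temperature)."""
    rng = rng or random.Random()
    best_idx, best_score = 0, -math.inf
    for i, logit in enumerate(logits):
        # Gumbel(0, 1) noise is -log(-log(U)) with U uniform on (0, 1).
        u = rng.random()
        while u == 0.0:  # random() may return 0.0, and log(0) is undefined
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        score = logit / temperature + gumbel
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(20000):
    counts[gumbel_max_sample([0.0, 1.0, 2.0], rng=rng)] += 1
# counts / 20000 should be close to softmax([0, 1, 2]), roughly [0.090, 0.245, 0.665]
print([c / 20000 for c in counts])
```

Because the argmax is taken after adding noise, no explicit softmax normalization or cumulative-sum search is needed, so the whole sample can stay on the device that produced the logits.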

Metal / MLX

Qwen3.5 MoE Metal export, build target, source transforms, and integration tests.
MLX delegate progress and integer op support.
Gemma 4 31B support through MLX.
Top-k fallback, gated delta rule kernels, SDPA updates, and hardened runtime code signing.

Vulkan

Cooperative matrix dispatch for linear/matmul.
Safer memory allocation and heap-buffer handling.
Integer overflow fixes and improved partitioning behavior.
VGF compatibility and Parakeet documentation/build fixes.

Qualcomm AI Engine Direct (QNN)

New QNN op support including isinf, isnan, rand, randn, log2, log10, log1p, trunc, acos, atan2, tan, avg_pool1d, remainder, reflection_pad, and scatter.src.
LPAI and Direct Mode support.
Custom op package and quantization annotation APIs.
Debugging/numeric discrepancy tooling and performance dump improvements.
Multimodal and LLM quantization guidance updates.

NXP

eIQ Neutron SDK updated to version 3.1.1. eIQ Neutron SDK 3.1.x introduced a new MLIR-based conversion flow with broader operator support for Neutron-C architectures.
Operators migrated to the MLIR flow: aten.avg_pool2d, aten.max_pool2d, aten.abs, aten.constant_pad_nd, aten.sigmoid, aten.leaky_relu, and aten.adaptive_avg_pool2d.
New ops: 1D MaxPool, aten.bmm, and aten.conv_transpose1d.
Unified test pipeline for operator and model testing.

XNNPACK

Weight cache and workspace sharing can now be enabled as runtime options.
Output padding support for transposed convolution.
Workspace sharing and weight cache concurrency fixes.

Platforms

Android

Improved JNI error reporting and lifecycle handling.
Tensor.copyDataInto added to the Java API.
LLM module lifecycle/thread-safety fixes and test coverage.
Fixes for building with modern Clang and for ETDump path overrides.

Apple / CoreML / iOS

CoreML profiler crash fix.
The CoreML partitioner now skips unsupported random and argmin/argmax patterns.
iOS 18 quantization error hints.
Shared library install support for consumers and backends.

Documentation

Android LLM runner docs.
Arm, Cortex-M, Ethos-U, and VGF documentation updates.
NXP QAT and Neutron flow documentation updates.
Vulkan, Parakeet, Voxtral, and LFM2.5 documentation updates.

Deprecations

The legacy MPS backend remains deprecated and is expected to be removed in a future release.

Contributors

We welcome 62 first-time human contributors to ExecuTorch in this release:

@s09g, @Froskekongen, @MahinAshraful, @qti-mmadhava, @quic-boyuc, @qti-horodnic, @rezaasjd, @ocvh, @twsl, @mvartani-meta, @ifed-ucsd, @tmtrademarked, @berndporr, @leixin, @amov-meta, @xiaodong705, @Hyungkeun-Park-Nota, @AlannaBurke, @kastopia, @zhaoxul-qti, @VasuAgrawal, @JorickvdHoeven, @dveremeevfb, @alexey-sidnev, @Lidang-Jiang, @notaJiminLee, @irtrukhina, @mvsfb, @SAY-5, @Jah-yee, @fgsiveone, @boomitsnoom, @jeanschmidt, @KevinUW114514, @aksharabhardwaj766-commits, @AlessandroVacca, @gkrulce, @AswaniSahoo, @OpenByteDev, @ishangodawatta, @WongJohnson, @ymrohit, @XAheli, @jiawei-lyu, @Nazim-fad, @yadferhad, @darknight054, @xuyanwen2012, @matt-cossins, @zeel2104, @john-rocky, @luhenry, @christine-long-meta, @hboyraz, @telgamal-1, @madhesh60, @omkar-334, @vacu9708, @Vasanthadithya-mundrathi, @ozgecinko, @wliuyx, @tomeuv

Full Changelog

Full Changelog: v1.2.0...v1.3.1

The executorch 1.3.1 package (v1.3.1) is available on PyPI.