🎉 LocalAI 3.10.0 Release! 🚀
LocalAI 3.10.0 is big on agent capabilities, multi-modal support, and cross-platform reliability.
We've added native Anthropic API support, launched a new Video Generation UI, introduced Open Responses API compatibility, and enhanced performance with a unified GPU backend system.
For a full tour, see below!
📌 TL;DR
| Feature | Summary |
|---|---|
| Anthropic API Support | Fully compatible `/v1/messages` endpoint for seamless drop-in replacement of Claude. |
| Open Responses API | Native support for stateful agents with tool calling, streaming, background mode, and multi-turn conversations, passing all official acceptance tests. |
| Video & Image Generation Suite | New video gen UI + LTX-2 support for text-to-video and image-to-video. |
| Unified GPU Backends | GPU libraries (CUDA, ROCm, Vulkan) packaged inside backend containers — works out of the box on Nvidia, AMD, and ARM64 (Experimental). |
| Tool Streaming & XML Parsing | Full support for streaming tool calls and XML-formatted tool outputs. |
| System-Aware Backend Gallery | Only see backends your system can run (e.g., hide MLX on Linux). |
| Crash Fixes | Prevents crashes on AVX-only CPUs (Intel Sandy/Ivy Bridge) and fixes VRAM reporting on AMD GPUs. |
| Request Tracing | Debug agents & fine-tuning with memory-based request/response logging. |
| Moonshine Backend | Ultra-fast transcription engine for low-end devices. |
| Pocket-TTS | Lightweight, high-fidelity text-to-speech with voice cloning. |
| Vulkan arm64 builds | We now build backends and images for Vulkan on arm64 as well. |
🚀 New Features & Major Enhancements
🤖 Open Responses API: Build Smarter, Autonomous Agents
LocalAI now supports the OpenAI Responses API, enabling powerful agentic workflows locally.
- Stateful conversations via `response_id` — resume and manage long-running agent sessions.
- Background mode: run agents asynchronously and fetch results later.
- Streaming support for tools, images, and audio.
- Built-in tools: Web search, file search, and computer use (via MCP integrations).
- Multi-turn interaction with dynamic context and tool use.
✅ Ideal for developers building agents that can browse, analyze files, or interact with systems — all on your local machine.
🔧 How to Use:
- Set `response_id` in your request to maintain session state across calls.
- Use `background: true` to run agents asynchronously.
- Retrieve results via `GET /api/v1/responses/{response_id}`.
- Enable streaming with `stream: true` to receive partial responses and tool calls in real time.
📌 Tip: Use `response_id` to build agent orchestration systems that persist context and avoid redundant computation.
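For a concrete starting point, here is a minimal Python sketch of the background-mode flow. The base URL, model name, and polling logic are assumptions; adapt them to your own deployment:

```python
# Minimal sketch of a background Responses API run against a local instance.
# Assumptions: LocalAI listens on localhost:8080 and a model named "gpt-4"
# is installed; adjust both to your setup.
import time
import requests

BASE = "http://localhost:8080"

# Kick off a background run; the request returns immediately.
resp = requests.post(
    f"{BASE}/v1/responses",
    json={
        "model": "gpt-4",
        "input": "Summarize the LocalAI 3.10.0 release in one sentence.",
        "background": True,
    },
).json()
response_id = resp["id"]

# Poll the retrieval endpoint until the run leaves the in-progress state.
while True:
    result = requests.get(f"{BASE}/api/v1/responses/{response_id}").json()
    if result.get("status") != "in_progress":
        break
    time.sleep(1)

print(result)
```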
Our support passes all the official acceptance tests.
🧠 Anthropic Messages API: Clone Claude Locally
LocalAI now fully supports the Anthropic Messages API.
- Use `https://api.localai.host/v1/messages` as a drop-in replacement for Claude.
- Full tool/function calling support, just like OpenAI.
- Streaming and non-streaming responses.
- Compatible with `anthropic-sdk-go`, LangChain, and other tooling.
🔥 Perfect for teams migrating from Anthropic to local inference with full feature parity.
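As a quick illustration, a minimal sketch of calling the endpoint with plain `requests` (the base URL and model name are placeholders for your own instance):

```python
# Minimal sketch of the Anthropic-compatible Messages endpoint on a local
# instance; base URL and model name are assumptions.
import requests

resp = requests.post(
    "http://localhost:8080/v1/messages",
    json={
        "model": "claude-substitute",  # any model installed in LocalAI
        "max_tokens": 256,
        "messages": [
            {"role": "user", "content": "Hello from the Anthropic API!"}
        ],
    },
).json()

# Anthropic-style responses return content as a list of blocks.
print(resp["content"][0]["text"])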
🎥 Video Generation: From Text to Video in the Web UI
- New dedicated video generation page with intuitive controls.
- LTX-2 model support.
- Supports text-to-video and image-to-video workflows.
- Built on top of `diffusers` with full compatibility.
📌 How to Use:
- Go to `/video` in the web UI.
- Enter a prompt (e.g., "A cat walking on a moonlit rooftop").
- Optionally upload an image for image-to-video generation.
- Adjust parameters like `fps`, `num_frames`, and `guidance_scale`.
⚙️ Unified GPU Backends: Acceleration Works Out of the Box
A major architectural upgrade: GPU libraries (CUDA, ROCm, Vulkan) are now packaged inside backend containers.
- Single image: you no longer need to pull a GPU-specific image. Any image works whether or not you have a GPU.
- No more manual GPU driver setup — just run the image and get acceleration.
- Works on Nvidia (CUDA), AMD (ROCm), and ARM64 (Vulkan).
- Vulkan arm64 builds enabled.
- Reduced image complexity, faster builds, and consistent performance.
🚀 This means latest/master images now support GPU acceleration on all platforms — no extra config!
Note: this is experimental; please help us by filing an issue if something doesn't work!
🧩 Tool Streaming & Advanced Parsing
Enhance your agent workflows with richer tool interaction.
- Streaming tool calls: receive partial tool arguments in real time (e.g., `input_json_delta`).
- XML-style tool call parsing: models that return tool calls in XML format (`<function>...</function>`) are now properly parsed alongside text.
- Works across all backends (llama.cpp, vLLM, diffusers, etc.).
💡 Enables more natural, real-time interaction with agents that use structured tool outputs.
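For example, here is a minimal sketch of consuming streamed tool-call deltas through the OpenAI-compatible chat endpoint, using the `openai` Python SDK pointed at a local instance (base URL, model name, and tool definition are assumptions):

```python
# Minimal sketch: stream partial tool-call arguments from a local instance.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="gpt-4",  # any tool-capable model installed in LocalAI
    messages=[{"role": "user", "content": "What's the weather in Rome?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
            },
        },
    }],
    stream=True,
)

# Tool-call arguments arrive incrementally as partial JSON fragments.
for chunk in stream:
    if not chunk.choices:
        continue
    for call in chunk.choices[0].delta.tool_calls or []:
        print(call.function.arguments or "", end="", flush=True)
```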
🌐 System-Aware Backend Gallery: Only Compatible Backends Show
The backend gallery now shows only backends your system can run.
- Auto-detects system capabilities (CPU, GPU, MLX, etc.).
- Hides unsupported backends (e.g., MLX on Linux, CUDA on AMD).
- Shows detected capabilities in the hero section.
🎤 New TTS Backends: Pocket-TTS
Add expressive voice generation to your apps with Pocket-TTS.
- Real-time text-to-speech with voice cloning support (requires HF login).
- Lightweight, fast, and open-source.
- Available in the model gallery.
🗣️ Perfect for voice agents, narrators, or interactive assistants.
❗ Note: Voice cloning requires HF authentication and a registered voice model.
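A minimal sketch of generating speech, assuming a local instance at `localhost:8080` and that the model is installed as `pocket-tts` (use the exact name from your model gallery):

```python
# Minimal sketch: synthesize speech via LocalAI's TTS endpoint and save it.
# The model name "pocket-tts" is an assumption; check your gallery.
import requests

resp = requests.post(
    "http://localhost:8080/tts",
    json={"model": "pocket-tts", "input": "Hello from LocalAI!"},
)
with open("hello.wav", "wb") as f:
    f.write(resp.content)
```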
🔍 Request Tracing: Debug Your Agents
Trace requests and responses in memory — great for fine-tuning and agent debugging.
- Enable via runtime setting or API.
- Logs are stored in memory and dropped once the maximum size is reached.
- Fetch logs via `GET /api/v1/trace`.
- Export to JSON for analysis.
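Putting the endpoint to work, a minimal sketch that fetches the trace log and exports it to JSON (the output filename is arbitrary):

```python
# Minimal sketch: dump the in-memory request/response trace for analysis.
import json
import requests

# Fetch the trace log from the endpoint documented above.
trace = requests.get("http://localhost:8080/api/v1/trace").json()

# Export to a JSON file, e.g. for debugging or fine-tuning datasets.
with open("trace.json", "w") as f:
    json.dump(trace, f, indent=2)
```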
🪄 New 'Reasoning' Field: Extract Thinking Steps
LocalAI now automatically detects and extracts thinking tags from model output.
- Supports both SSE and non-SSE modes.
- Displays reasoning steps in the chat UI (under "Thinking" tab).
- Fixes an issue where thinking content appeared as part of the final answer.
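A minimal sketch, assuming the extracted thinking steps surface as a `reasoning` field alongside `content` on the message object (the exact field placement may differ; check the API docs for your version):

```python
# Minimal sketch: read extracted reasoning next to the final answer.
# Assumptions: local instance on localhost:8080 and an installed
# reasoning-capable model; field name/placement per the 'reasoning' feature.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "a-thinking-model",  # placeholder model name
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    },
).json()

message = resp["choices"][0]["message"]
print("Reasoning:", message.get("reasoning"))
print("Answer:", message["content"])
```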
🚀 Moonshine Backend: Faster Transcription for Low-End Devices
LocalAI adds Moonshine, an ONNX-based transcription engine for fast, lightweight speech-to-text.
- Optimized for low-end devices (Raspberry Pi, older laptops).
- One of the fastest transcription engines available.
- Supports live transcription.
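A minimal sketch, assuming the OpenAI-compatible transcription endpoint and a Moonshine model installed from the gallery (the model name is a placeholder):

```python
# Minimal sketch: transcribe an audio file through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# "moonshine" is a placeholder; use the name shown in your model gallery.
with open("speech.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="moonshine", file=audio
    )
print(transcript.text)
```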
🛠️ Fixes & Stability Improvements
🔧 Prevent BMI2 Crashes on AVX-Only CPUs
Fixed crashes on older Intel CPUs (Ivy Bridge, Sandy Bridge) that lack BMI2 instructions.
- Now safely falls back to `llama-cpp-fallback` (SSE2 only).
- No more `EOF` errors during model warmup.
✅ Ensures LocalAI runs smoothly on older hardware.
📊 Fix Swapped VRAM Usage on AMD GPUs
`rocm-smi` output is now parsed correctly, so used and total VRAM are no longer swapped in the display.
- Fixes misreported memory usage on dual-Radeon setups.
- Handles `HIP_VISIBLE_DEVICES` properly (e.g., when using only the discrete GPU).
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Bug fixes 🐛
- fix(ui): correctly parse import errors by @mudler in #7726
- fix(cli): import via CLI needs system state by @mudler in #7746
- fix(amd-gpu): correctly show total and used vram by @mudler in #7761
- fix: add nil checks before mergo.Merge to prevent panic in gallery model installation by @majiayu000 in #7785
- fix: Usage for image generation is incorrect (and causes error in LiteLLM) by @majiayu000 in #7786
- fix: propagate validation errors by @majiayu000 in #7787
- fix: Failed to download checksums.txt when using launch to install localai by @majiayu000 in #7788
- fix(image-gen): fix scrolling issues by @mudler in #7829
- fix(llama.cpp/mmproj): fix loading mmproj in nested sub-dirs different from model path by @mudler in #7832
- fix: Prevent BMI2 instruction crash on AVX-only CPUs by @coffeerunhobby in #7817
- fix: Highly inconsistent agent response to cogito agent calling MCP server - Body "Invalid http method" by @majiayu000 in #7790
- fix(chat/ui): record model name in history for consistency by @mudler in #7845
- fix(ui): fix 404 on API menu link by pointing to index.html by @DEVMANISHOFFL in #7878
- fix: BMI2 crash on AVX-only CPUs (Intel Ivy Bridge/Sandy Bridge) by @coffeerunhobby in #7864
- fix(model): do not assume success when deleting a model process by @jroeber in #7963
- fix(functions): do not duplicate function when valid JSON is inside XML tags by @mudler in #8043
Exciting New Features 🎉
- feat: disable force eviction by @mudler in #7725
- feat(api): Allow tracing of requests and responses by @richiejp in #7609
- feat(UI): image generation improvements by @mudler in #7804
- feat(image-gen/UI): move controls to the left, make the page more compact by @mudler in #7823
- feat(function): Add tool streaming, XML Tool Call Parsing Support by @mudler in #7865
- chore: Update to Ubuntu24.04 (cont #7423) by @richiejp in #7769
- feat: package GPU libraries inside backend containers for unified base image by @Copilot in #7891
- feat(backends): add moonshine backend for faster transcription by @mudler in #7833
- feat: enable Vulkan arm64 image builds by @Copilot in #7912
- feat: Add Anthropic Messages API support by @Copilot in #7948
- feat: add tool/function calling support to Anthropic Messages API by @Copilot in #7956
- feat(api): support 'reasoning' api field by @mudler in #7959
- feat: Filter backend gallery by system capabilities by @Copilot in #7950
- feat(tts): add pocket-tts backend by @mudler in #8018
- feat(diffusers): add support to LTX-2 by @mudler in #8019
- feat(ui): add video gen UI by @mudler in #8020
- feat(api): add support for open responses specification by @mudler in #8063
🧠 Models
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7801
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7807
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7816
- Fix(gallery): Updated checksums for qwen3-vl-30b instruct & thinking by @Nold360 in #7819
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #7821
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7826
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7831
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7840
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7903
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7916
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7922
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7954
- chore(model gallery): add qwen3-coder-30b-a3b-instruct based on model request by @rampa3 in #8082
📖 Documentation and examples
- chore(AGENTS.md): Add section to help with building backends by @richiejp in #7871
- [gallery] add JSON schema for gallery model specification by @DEVMANISHOFFL in #7890
- chore(doc): put alert on install.sh until is fixed by @mudler in #8042
👒 Dependencies
- chore(deps): bump securego/gosec from 2.22.9 to 2.22.11 by @dependabot[bot] in #7774
- chore(deps): bump google.golang.org/grpc from 1.77.0 to 1.78.0 by @dependabot[bot] in #7777
- chore(deps): bump github.com/schollz/progressbar/v3 from 3.18.0 to 3.19.0 by @dependabot[bot] in #7775
- chore(deps): bump github.com/modelcontextprotocol/go-sdk from 1.1.0 to 1.2.0 by @dependabot[bot] in #7776
- chore(deps): bump dependabot/fetch-metadata from 2.4.0 to 2.5.0 by @dependabot[bot] in #7876
- chore(deps): bump github.com/labstack/echo/v4 from 4.14.0 to 4.15.0 by @dependabot[bot] in #7875
- chore(deps): bump protobuf from 6.33.2 to 6.33.4 in /backend/python/transformers by @dependabot[bot] in #7993
- chore(deps): bump github.com/mudler/go-processmanager from 0.0.0-20240820160718-8b802d3ecf82 to 0.1.0 by @dependabot[bot] in #7992
- chore(deps): bump github.com/onsi/gomega from 1.38.3 to 1.39.0 by @dependabot[bot] in #8000
- chore(deps): bump github.com/gpustack/gguf-parser-go from 0.22.1 to 0.23.1 by @dependabot[bot] in #8001
- chore(deps): bump fyne.io/fyne/v2 from 2.7.1 to 2.7.2 by @dependabot[bot] in #8003
- chore(deps): bump github.com/onsi/ginkgo/v2 from 2.27.3 to 2.27.5 by @dependabot[bot] in #8004
- chore(deps): bump torch from 2.3.1+cxx11.abi to 2.8.0 in /backend/python/rerankers in the pip group across 1 directory by @dependabot[bot] in #8066
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #7716
- chore: ⬆️ Update ggml-org/whisper.cpp to `6114e692136bea917dc88a5eb2e532c3d133d963` by @localai-bot in #7717
- chore: ⬆️ Update ggml-org/llama.cpp to `c18428423018ed214c004e6ecaedb0cbdda06805` by @localai-bot in #7718
- chore: ⬆️ Update ggml-org/llama.cpp to `85c40c9b02941ebf1add1469af75f1796d513ef4` by @localai-bot in #7731
- chore: ⬆️ Update ggml-org/llama.cpp to `7ac8902133da6eb390c4d8368a7d252279123942` by @localai-bot in #7740
- chore: ⬆️ Update ggml-org/llama.cpp to `a4bf35889eda36d3597cd0f8f333f5b8a2fcaefc` by @localai-bot in #7751
- chore: ⬆️ Update ggml-org/llama.cpp to `4ffc47cb2001e7d523f9ff525335bbe34b1a2858` by @localai-bot in #7760
- chore(ci): be more precise when detecting existing models by @mudler in #7767
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `4ff2c8c74bd17c2cfffe3a01be77743fb3efba2f` by @richiejp in #7771
- chore: ⬆️ Update ggml-org/llama.cpp to `c9a3b40d6578f2381a1373d10249403d58c3c5bd` by @localai-bot in #7778
- Revert "chore(deps): bump securego/gosec from 2.22.9 to 2.22.11" by @mudler in #7789
- feat(swagger): update swagger by @localai-bot in #7794
- chore: ⬆️ Update ggml-org/llama.cpp to `0f89d2ecf14270f45f43c442e90ae433fd82dab1` by @localai-bot in #7795
- chore: ⬆️ Update ggml-org/whisper.cpp to `e9898ddfb908ffaa7026c66852a023889a5a7202` by @localai-bot in #7810
- chore: ⬆️ Update ggml-org/llama.cpp to `13814eb370d2f0b70e1830cc577b6155b17aee47` by @localai-bot in #7809
- feat(swagger): update swagger by @localai-bot in #7820
- chore: ⬆️ Update ggml-org/llama.cpp to `ced765be44ce173c374f295b3c6f4175f8fd109b` by @localai-bot in #7822
- chore: ⬆️ Update ggml-org/llama.cpp to `706e3f93a60109a40f1224eaf4af0d59caa7c3ae` by @localai-bot in #7836
- feat(swagger): update swagger by @localai-bot in #7847
- chore: ⬆️ Update ggml-org/llama.cpp to `e57f52334b2e8436a94f7e332462dfc63a08f995` by @localai-bot in #7848
- chore(Makefile): refactor common make targets by @mudler in #7858
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `b90b1ee9cf84ea48b478c674dd2ec6a33fd504d6` by @localai-bot in #7862
- chore: ⬆️ Update ggml-org/llama.cpp to `4974bf53cf14073c7b66e1151348156aabd42cb8` by @localai-bot in #7861
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `c5602a676caff5fe5a9f3b76b2bc614faf5121a5` by @localai-bot in #7880
- chore: ⬆️ Update ggml-org/whisper.cpp to `679bdb53dbcbfb3e42685f50c7ff367949fd4d48` by @localai-bot in #7879
- chore: ⬆️ Update ggml-org/llama.cpp to `e443fbcfa51a8a27b15f949397ab94b5e87b2450` by @localai-bot in #7881
- chore(image-ui): simplify interface by @mudler in #7882
- chore(llama.cpp/flags): simplify conditionals by @mudler in #7887
- chore: ⬆️ Update ggml-org/llama.cpp to `ccbc84a5374bab7a01f68b129411772ddd8e7c79` by @localai-bot in #7894
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `9be0b91927dfa4007d053df72dea7302990226bb` by @localai-bot in #7895
- chore(dockerfile): drop driver-requirements section by @mudler in #7907
- chore(detection): detect GPU vendor from files present in the system by @mudler in #7908
- chore(ci): restore building of GPU vendor images by @mudler in #7910
- chore(Dockerfile): restore GPU vendor specific sections by @mudler in #7911
- fix(intel): Add ARG for Ubuntu codename in Dockerfile by @mudler in #7917
- chore: ⬆️ Update ggml-org/llama.cpp to `ae9f8df77882716b1702df2bed8919499e64cc28` by @localai-bot in #7915
- chore(ci): use latest jetpack image for l4t by @mudler in #7926
- chore(l4t-12): do not use python 3.12 (wheels are only for 3.10) by @mudler in #7928
- chore(docs): Add Crush and VoxInput to the integrations by @richiejp in #7924
- Optimize GPU library copying to preserve symlinks and avoid duplicates by @Copilot in #7931
- chore(uv): add --index-strategy=unsafe-first-match to l4t by @mudler in #7934
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `0e52afc6513cc2dea9a1a017afc4a008d5acf2b0` by @localai-bot in #7930
- chore(ci): roll back l4t-cuda12 configurations by @mudler in #7935
- Revert "chore(uv): add --index-strategy=unsafe-first-match to l4t" by @mudler in #7936
- chore(deps): Bump llama.cpp to '480160d47297df43b43746294963476fc0a6e10f' by @mudler in #7933
- chore(llama.cpp): propagate errors during model load by @mudler in #7937
- chore: ⬆️ Update ggml-org/llama.cpp to `593da7fa49503b68f9f01700be9f508f1e528992` by @localai-bot in #7946
- feat(swagger): update swagger by @localai-bot in #7964
- chore: ⬆️ Update ggml-org/llama.cpp to `b1377188784f9aea26b8abde56d4aee8c733eec7` by @localai-bot in #7965
- fix(l4t-12): use pip to install python deps by @mudler in #7967
- chore: ⬆️ Update ggml-org/llama.cpp to `0c3b7a9efebc73d206421c99b7eb6b6716231322` by @localai-bot in #7978
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `885e62ea822e674c6837a8225d2d75f021b97a6a` by @localai-bot in #7979
- chore(backends): do not bundle cuda target directory by @mudler in #7982
- chore(vulkan): bump vulkan-sdk to 1.4.335.0 by @mudler in #7981
- chore: ⬆️ Update ggml-org/llama.cpp to `bcf7546160982f56bc290d2e538544bbc0772f63` by @localai-bot in #7991
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `7010bb4dff7bd55b03d35ef9772142c21699eba9` by @localai-bot in #8013
- chore: ⬆️ Update ggml-org/whisper.cpp to `a96310871a3b294f026c3bcad4e715d17b5905fe` by @localai-bot in #8014
- chore: ⬆️ Update ggml-org/llama.cpp to `e4832e3ae4d58ac0ecbdbf4ae055424d6e628c9f` by @localai-bot in #8015
- chore: ⬆️ Update ggml-org/whisper.cpp to `47af2fb70f7e4ee1ba40c8bed513760fdfe7a704` by @localai-bot in #8039
- chore: ⬆️ Update ggml-org/llama.cpp to `d98b548120eecf98f0f6eaa1ba7e29b3afda9f2e` by @localai-bot in #8040
- fix: reduce log verbosity for /api/operations polling by @Divyanshupandey007 in #8050
- chore: ⬆️ Update ggml-org/whisper.cpp to `2eeeba56e9edd762b4b38467bab96c2517163158` by @localai-bot in #8052
- chore: ⬆️ Update ggml-org/llama.cpp to `785a71008573e2d84728fb0ba9e851d72d3f8fab` by @localai-bot in #8053
- fix(ci): use more beefy runner for expensive jobs by @mudler in #8065
- Revert "chore(deps): bump torch from 2.3.1+cxx11.abi to 2.8.0 in /backend/python/rerankers in the pip group across 1 directory" by @mudler in #8072
- chore: ⬆️ Update ggml-org/llama.cpp to `388ce822415f24c60fcf164a321455f1e008cafb` by @localai-bot in #8073
- chore: ⬆️ Update ggml-org/whisper.cpp to `f53dc74843e97f19f94a79241357f74ad5b691a6` by @localai-bot in #8074
- chore(ui): add video generation link by @mudler in #8079
- chore: ⬆️ Update ggml-org/llama.cpp to `2fbde785bc106ae1c4102b0e82b9b41d9c466579` by @localai-bot in #8087
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `9565c7f6bd5fcff124c589147b2621244f2c4aa1` by @localai-bot in #8086
New Contributors
- @majiayu000 made their first contribution in #7785
- @coffeerunhobby made their first contribution in #7817
- @DEVMANISHOFFL made their first contribution in #7878
- @jroeber made their first contribution in #7963
- @Divyanshupandey007 made their first contribution in #8050
Full Changelog: v3.9.0...v3.10.0
