🎉 LocalAI 3.10.0 Release! 🚀
LocalAI 3.10.0 is big on agent capabilities, multi-modal support, and cross-platform reliability.
We've added native Anthropic API support, launched a new Video Generation UI, introduced Open Responses API compatibility, and enhanced performance with a unified GPU backend system.
For a full tour, see below!
📌 TL;DR
| Feature | Summary |
|---|---|
| Anthropic API Support | Fully compatible `/v1/messages` endpoint for seamless drop-in replacement of Claude. |
| Open Responses API | Native support for stateful agents with tool calling, streaming, background mode, and multi-turn conversations, passing all official acceptance tests. |
| Video & Image Generation Suite | New video gen UI + LTX-2 support for text-to-video and image-to-video. |
| Unified GPU Backends | GPU libraries (CUDA, ROCm, Vulkan) packaged inside backend containers — works out of the box on Nvidia, AMD, and ARM64 (Experimental). |
| Tool Streaming & XML Parsing | Full support for streaming tool calls and XML-formatted tool outputs. |
| System-Aware Backend Gallery | Only see backends your system can run (e.g., hide MLX on Linux). |
| Crash Fixes | Prevents crashes on AVX-only CPUs (Intel Sandy/Ivy Bridge) and fixes VRAM reporting on AMD GPUs. |
| Request Tracing | Debug agents & fine-tuning with memory-based request/response logging. |
| Moonshine Backend | Ultra-fast transcription engine for low-end devices. |
| Pocket-TTS | Lightweight, high-fidelity text-to-speech with voice cloning. |
| Vulkan arm64 builds | We now build backends and images for Vulkan on arm64 as well. |
🚀 New Features & Major Enhancements
🤖 Open Responses API: Build Smarter, Autonomous Agents
LocalAI now supports the OpenAI Responses API, enabling powerful agentic workflows locally.
- Stateful conversations via `response_id` — resume and manage long-running agent sessions.
- Background mode: run agents asynchronously and fetch results later.
- Streaming support for tools, images, and audio.
- Built-in tools: Web search, file search, and computer use (via MCP integrations).
- Multi-turn interaction with dynamic context and tool use.
✅ Ideal for developers building agents that can browse, analyze files, or interact with systems — all on your local machine.
🔧 How to Use:
- Set `response_id` in your request to maintain session state across calls.
- Use `background: true` to run agents asynchronously.
- Retrieve results via `GET /api/v1/responses/{response_id}`.
- Enable streaming with `stream: true` to receive partial responses and tool calls in real time.
📌 Tip: Use `response_id` to build agent orchestration systems that persist context and avoid redundant computation.
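For a concrete starting point, here is a minimal Python sketch of the background-mode flow. The base URL, model name, and polling logic are assumptions; adapt them to your own deployment:

```python
# Minimal sketch of a background Responses API run against a local instance.
# Assumptions: LocalAI listens on localhost:8080 and a model named "gpt-4"
# is installed; adjust both to your setup.
import time
import requests

BASE = "http://localhost:8080"

# Kick off a background run; the request returns immediately.
resp = requests.post(
    f"{BASE}/v1/responses",
    json={
        "model": "gpt-4",
        "input": "Summarize the LocalAI 3.10.0 release in one sentence.",
        "background": True,
    },
).json()
response_id = resp["id"]

# Poll the retrieval endpoint until the run leaves the in-progress state.
while True:
    result = requests.get(f"{BASE}/api/v1/responses/{response_id}").json()
    if result.get("status") != "in_progress":
        break
    time.sleep(1)

print(result)
```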
Our support passes all the official acceptance tests.
🧠 Anthropic Messages API: Clone Claude Locally
LocalAI now fully supports the Anthropic Messages API.
- Use `https://api.localai.host/v1/messages` as a drop-in replacement for Claude.
- Full tool/function calling support, just like OpenAI.
- Streaming and non-streaming responses.
- Compatible with `anthropic-sdk-go`, LangChain, and other tooling.
🔥 Perfect for teams migrating from Anthropic to local inference with full feature parity.
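As a quick illustration, a minimal sketch of calling the endpoint with plain `requests` (the base URL and model name are placeholders for your own instance):

```python
# Minimal sketch of the Anthropic-compatible Messages endpoint on a local
# instance; base URL and model name are assumptions.
import requests

resp = requests.post(
    "http://localhost:8080/v1/messages",
    json={
        "model": "claude-substitute",  # any model installed in LocalAI
        "max_tokens": 256,
        "messages": [
            {"role": "user", "content": "Hello from the Anthropic API!"}
        ],
    },
).json()

# Anthropic-style responses return content as a list of blocks.
print(resp["content"][0]["text"])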
🎥 Video Generation: From Text to Video in the Web UI
- New dedicated video generation page with intuitive controls.
- LTX-2 model support.
- Supports text-to-video and image-to-video workflows.
- Built on top of `diffusers` with full compatibility.
📌 How to Use:
- Go to `/video` in the web UI.
- Enter a prompt (e.g., "A cat walking on a moonlit rooftop").
- Optionally upload an image for image-to-video generation.
- Adjust parameters like `fps`, `num_frames`, and `guidance_scale`.
⚙️ Unified GPU Backends: Acceleration Works Out of the Box
A major architectural upgrade: GPU libraries (CUDA, ROCm, Vulkan) are now packaged inside backend containers.
- Single image: you no longer need to pull a GPU-specific image. Any image works whether or not you have a GPU.
- No more manual GPU driver setup — just run the image and get acceleration.
- Works on Nvidia (CUDA), AMD (ROCm), and ARM64 (Vulkan).
- Vulkan arm64 builds enabled.
- Reduced image complexity, faster builds, and consistent performance.
🚀 This means latest/master images now support GPU acceleration on all platforms — no extra config!
Note: this is experimental; please help us by filing an issue if something doesn't work!
🧩 Tool Streaming & Advanced Parsing
Enhance your agent workflows with richer tool interaction.
- Streaming tool calls: receive partial tool arguments in real time (e.g., `input_json_delta`).
- XML-style tool call parsing: models that return tool calls in XML format (`<function>...</function>`) are now properly parsed alongside text.
- Works across all backends (llama.cpp, vLLM, diffusers, etc.).
💡 Enables more natural, real-time interaction with agents that use structured tool outputs.
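For example, here is a minimal sketch of consuming streamed tool-call deltas through the OpenAI-compatible chat endpoint, using the `openai` Python SDK pointed at a local instance (base URL, model name, and tool definition are assumptions):

```python
# Minimal sketch: stream partial tool-call arguments from a local instance.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="gpt-4",  # any tool-capable model installed in LocalAI
    messages=[{"role": "user", "content": "What's the weather in Rome?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
            },
        },
    }],
    stream=True,
)

# Tool-call arguments arrive incrementally as partial JSON fragments.
for chunk in stream:
    if not chunk.choices:
        continue
    for call in chunk.choices[0].delta.tool_calls or []:
        print(call.function.arguments or "", end="", flush=True)
```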
🌐 System-Aware Backend Gallery: Only Compatible Backends Show
The backend gallery now shows only backends your system can run.
- Auto-detects system capabilities (CPU, GPU, MLX, etc.).
- Hides unsupported backends (e.g., MLX on Linux, CUDA on AMD).
- Shows detected capabilities in the hero section.
🎤 New TTS Backends: Pocket-TTS
Add expressive voice generation to your apps with Pocket-TTS.
- Real-time text-to-speech with voice cloning support (requires HF login).
- Lightweight, fast, and open-source.
- Available in the model gallery.
🗣️ Perfect for voice agents, narrators, or interactive assistants.
❗ Note: Voice cloning requires HF authentication and a registered voice model.
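A minimal sketch of generating speech, assuming a local instance at `localhost:8080` and that the model is installed as `pocket-tts` (use the exact name from your model gallery):

```python
# Minimal sketch: synthesize speech via LocalAI's TTS endpoint and save it.
# The model name "pocket-tts" is an assumption; check your gallery.
import requests

resp = requests.post(
    "http://localhost:8080/tts",
    json={"model": "pocket-tts", "input": "Hello from LocalAI!"},
)
with open("hello.wav", "wb") as f:
    f.write(resp.content)
```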
🔍 Request Tracing: Debug Your Agents
Trace requests and responses in memory — great for fine-tuning and agent debugging.
- Enable via runtime setting or API.
- Logs are stored in memory and dropped once the maximum size is reached.
- Fetch logs via `GET /api/v1/trace`.
- Export to JSON for analysis.
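Putting the endpoint to work, a minimal sketch that fetches the trace log and exports it to JSON (the output filename is arbitrary):

```python
# Minimal sketch: dump the in-memory request/response trace for analysis.
import json
import requests

# Fetch the trace log from the endpoint documented above.
trace = requests.get("http://localhost:8080/api/v1/trace").json()

# Export to a JSON file, e.g. for debugging or fine-tuning datasets.
with open("trace.json", "w") as f:
    json.dump(trace, f, indent=2)
```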
🪄 New 'Reasoning' Field: Extract Thinking Steps
LocalAI now automatically detects and extracts thinking tags from model output.
- Supports both SSE and non-SSE modes.
- Displays reasoning steps in the chat UI (under "Thinking" tab).
- Fixes an issue where thinking content appeared as part of the final answer.
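A minimal sketch, assuming the extracted thinking steps surface as a `reasoning` field alongside `content` on the message object (the exact field placement may differ; check the API docs for your version):

```python
# Minimal sketch: read extracted reasoning next to the final answer.
# Assumptions: local instance on localhost:8080 and an installed
# reasoning-capable model; field name/placement per the 'reasoning' feature.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "a-thinking-model",  # placeholder model name
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
    },
).json()

message = resp["choices"][0]["message"]
print("Reasoning:", message.get("reasoning"))
print("Answer:", message["content"])
```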
🚀 Moonshine Backend: Faster Transcription for Low-End Devices
LocalAI adds Moonshine, an ONNX-based transcription engine for fast, lightweight speech-to-text.
- Optimized for low-end devices (Raspberry Pi, older laptops).
- One of the fastest transcription engines available.
- Supports live transcription.
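A minimal sketch, assuming the OpenAI-compatible transcription endpoint and a Moonshine model installed from the gallery (the model name is a placeholder):

```python
# Minimal sketch: transcribe an audio file through the OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# "moonshine" is a placeholder; use the name shown in your model gallery.
with open("speech.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="moonshine", file=audio
    )
print(transcript.text)
```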
🛠️ Fixes & Stability Improvements
🔧 Prevent BMI2 Crashes on AVX-Only CPUs
Fixed crashes on older Intel CPUs (Ivy Bridge, Sandy Bridge) that lack BMI2 instructions.
- Now safely falls back to `llama-cpp-fallback` (SSE2 only).
- No more `EOF` errors during model warmup.
✅ Ensures LocalAI runs smoothly on older hardware.
📊 Fix Swapped VRAM Usage on AMD GPUs
`rocm-smi` output is now parsed correctly, so used and total VRAM are no longer swapped in the display.
- Fixes misreported memory usage on dual-Radeon setups.
- Handles `HIP_VISIBLE_DEVICES` properly (e.g., when using only the discrete GPU).
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in replacement REST API compatible with OpenAI specifications for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in replacement for OpenAI's Responses API, supercharged with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge base management system providing persistent memory and storage capabilities for AI agents. Works alongside LocalAI and LocalAGI. |
❤️ Thank You
LocalAI is a true FOSS movement — built by contributors, powered by community.
If you believe in privacy-first AI:
- ✅ Star the repo
- 💬 Contribute code, docs, or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
📋 Click to expand full changelog
What's Changed
Bug fixes 🐛
- fix(ui): correctly parse import errors by @mudler in #7726
- fix(cli): import via CLI needs system state by @mudler in #7746
- fix(amd-gpu): correctly show total and used vram by @mudler in #7761
- fix: add nil checks before mergo.Merge to prevent panic in gallery model installation by @majiayu000 in #7785
- fix: Usage for image generation is incorrect (and causes error in LiteLLM) by @majiayu000 in #7786
- fix: propagate validation errors by @majiayu000 in #7787
- fix: Failed to download checksums.txt when using launch to install localai by @majiayu000 in #7788
- fix(image-gen): fix scrolling issues by @mudler in #7829
- fix(llama.cpp/mmproj): fix loading mmproj in nested sub-dirs different from model path by @mudler in #7832
- fix: Prevent BMI2 instruction crash on AVX-only CPUs by @coffeerunhobby in #7817
- fix: Highly inconsistent agent response to cogito agent calling MCP server - Body "Invalid http method" by @majiayu000 in #7790
- fix(chat/ui): record model name in history for consistency by @mudler in #7845
- fix(ui): fix 404 on API menu link by pointing to index.html by @DEVMANISHOFFL in #7878
- fix: BMI2 crash on AVX-only CPUs (Intel Ivy Bridge/Sandy Bridge) by @coffeerunhobby in #7864
- fix(model): do not assume success when deleting a model process by @jroeber in #7963
- fix(functions): do not duplicate function when valid JSON is inside XML tags by @mudler in #8043
Exciting New Features 🎉
- feat: disable force eviction by @mudler in #7725
- feat(api): Allow tracing of requests and responses by @richiejp in #7609
- feat(UI): image generation improvements by @mudler in #7804
- feat(image-gen/UI): move controls to the left, make the page more compact by @mudler in #7823
- feat(function): Add tool streaming, XML Tool Call Parsing Support by @mudler in #7865
- chore: Update to Ubuntu24.04 (cont #7423) by @richiejp in #7769
- feat: package GPU libraries inside backend containers for unified base image by @Copilot in #7891
- feat(backends): add moonshine backend for faster transcription by @mudler in #7833
- feat: enable Vulkan arm64 image builds by @Copilot in #7912
- feat: Add Anthropic Messages API support by @Copilot in #7948
- feat: add tool/function calling support to Anthropic Messages API by @Copilot in #7956
- feat(api): support 'reasoning' api field by @mudler in #7959
- feat: Filter backend gallery by system capabilities by @Copilot in #7950
- feat(tts): add pocket-tts backend by @mudler in #8018
- feat(diffusers): add support to LTX-2 by @mudler in #8019
- feat(ui): add video gen UI by @mudler in #8020
- feat(api): add support for open responses specification by @mudler in #8063
🧠 Models
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7801
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7807
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7816
- Fix(gallery): Updated checksums for qwen3-vl-30b instruct & thinking by @Nold360 in #7819
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #7821
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7826
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7831
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7840
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7903
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7916
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7922
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #7954
- chore(model gallery): add qwen3-coder-30b-a3b-instruct based on model request by @rampa3 in #8082
📖 Documentation and examples
- chore(AGENTS.md): Add section to help with building backends by @richiejp in #7871
- [gallery] add JSON schema for gallery model specification by @DEVMANISHOFFL in #7890
- chore(doc): put alert on install.sh until is fixed by @mudler in #8042
👒 Dependencies
- chore(deps): bump securego/gosec from 2.22.9 to 2.22.11 by @dependabot[bot] in #7774
- chore(deps): bump google.golang.org/grpc from 1.77.0 to 1.78.0 by @dependabot[bot] in #7777
- chore(deps): bump github.com/schollz/progressbar/v3 from 3.18.0 to 3.19.0 by @dependabot[bot] in #7775
- chore(deps): bump github.com/modelcontextprotocol/go-sdk from 1.1.0 to 1.2.0 by @dependabot[bot] in #7776
- chore(deps): bump dependabot/fetch-metadata from 2.4.0 to 2.5.0 by @dependabot[bot] in #7876
- chore(deps): bump github.com/labstack/echo/v4 from 4.14.0 to 4.15.0 by @dependabot[bot] in #7875
- chore(deps): bump protobuf from 6.33.2 to 6.33.4 in /backend/python/transformers by @dependabot[bot] in #7993
- chore(deps): bump github.com/mudler/go-processmanager from 0.0.0-20240820160718-8b802d3ecf82 to 0.1.0 by @dependabot[bot] in #7992
- chore(deps): bump github.com/onsi/gomega from 1.38.3 to 1.39.0 by @dependabot[bot] in #8000
- chore(deps): bump github.com/gpustack/gguf-parser-go from 0.22.1 to 0.23.1 by @dependabot[bot] in #8001
- chore(deps): bump fyne.io/fyne/v2 from 2.7.1 to 2.7.2 by @dependabot[bot] in #8003
- chore(deps): bump github.com/onsi/ginkgo/v2 from 2.27.3 to 2.27.5 by @dependabot[bot] in #8004
- chore(deps): bump torch from 2.3.1+cxx11.abi to 2.8.0 in /backend/python/rerankers in the pip group across 1 directory by @dependabot[bot] in #8066
Other Changes
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #7716
- chore: ⬆️ Update ggml-org/whisper.cpp to `6114e692136bea917dc88a5eb2e532c3d133d963` by @localai-bot in #7717
- chore: ⬆️ Update ggml-org/llama.cpp to `c18428423018ed214c004e6ecaedb0cbdda06805` by @localai-bot in #7718
- chore: ⬆️ Update ggml-org/llama.cpp to `85c40c9b02941ebf1add1469af75f1796d513ef4` by @localai-bot in #7731
- chore: ⬆️ Update ggml-org/llama.cpp to `7ac8902133da6eb390c4d8368a7d252279123942` by @localai-bot in #7740
- chore: ⬆️ Update ggml-org/llama.cpp to `a4bf35889eda36d3597cd0f8f333f5b8a2fcaefc` by @localai-bot in #7751
- chore: ⬆️ Update ggml-org/llama.cpp to `4ffc47cb2001e7d523f9ff525335bbe34b1a2858` by @localai-bot in #7760
- chore(ci): be more precise when detecting existing models by @mudler in #7767
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `4ff2c8c74bd17c2cfffe3a01be77743fb3efba2f` by @richiejp in #7771
- chore: ⬆️ Update ggml-org/llama.cpp to `c9a3b40d6578f2381a1373d10249403d58c3c5bd` by @localai-bot in #7778
- Revert "chore(deps): bump securego/gosec from 2.22.9 to 2.22.11" by @mudler in #7789
- feat(swagger): update swagger by @localai-bot in #7794
- chore: ⬆️ Update ggml-org/llama.cpp to `0f89d2ecf14270f45f43c442e90ae433fd82dab1` by @localai-bot in #7795
- chore: ⬆️ Update ggml-org/whisper.cpp to `e9898ddfb908ffaa7026c66852a023889a5a7202` by @localai-bot in #7810
- chore: ⬆️ Update ggml-org/llama.cpp to `13814eb370d2f0b70e1830cc577b6155b17aee47` by @localai-bot in #7809
- feat(swagger): update swagger by @localai-bot in #7820
- chore: ⬆️ Update ggml-org/llama.cpp to `ced765be44ce173c374f295b3c6f4175f8fd109b` by @localai-bot in #7822
- chore: ⬆️ Update ggml-org/llama.cpp to `706e3f93a60109a40f1224eaf4af0d59caa7c3ae` by @localai-bot in #7836
- feat(swagger): update swagger by @localai-bot in #7847
- chore: ⬆️ Update ggml-org/llama.cpp to `e57f52334b2e8436a94f7e332462dfc63a08f995` by @localai-bot in #7848
- chore(Makefile): refactor common make targets by @mudler in #7858
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `b90b1ee9cf84ea48b478c674dd2ec6a33fd504d6` by @localai-bot in #7862
- chore: ⬆️ Update ggml-org/llama.cpp to `4974bf53cf14073c7b66e1151348156aabd42cb8` by @localai-bot in #7861
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `c5602a676caff5fe5a9f3b76b2bc614faf5121a5` by @localai-bot in #7880
- chore: ⬆️ Update ggml-org/whisper.cpp to `679bdb53dbcbfb3e42685f50c7ff367949fd4d48` by @localai-bot in #7879
- chore: ⬆️ Update ggml-org/llama.cpp to `e443fbcfa51a8a27b15f949397ab94b5e87b2450` by @localai-bot in #7881
- chore(image-ui): simplify interface by @mudler in #7882
- chore(llama.cpp/flags): simplify conditionals by @mudler in #7887
- chore: ⬆️ Update ggml-org/llama.cpp to `ccbc84a5374bab7a01f68b129411772ddd8e7c79` by @localai-bot in #7894
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `9be0b91927dfa4007d053df72dea7302990226bb` by @localai-bot in #7895
- chore(dockerfile): drop driver-requirements section by @mudler in #7907
- chore(detection): detect GPU vendor from files present in the system by @mudler in #7908
- chore(ci): restore building of GPU vendor images by @mudler in #7910
- chore(Dockerfile): restore GPU vendor specific sections by @mudler in #7911
- fix(intel): Add ARG for Ubuntu codename in Dockerfile by @mudler in #7917
- chore: ⬆️ Update ggml-org/llama.cpp to `ae9f8df77882716b1702df2bed8919499e64cc28` by @localai-bot in #7915
- chore(ci): use latest jetpack image for l4t by @mudler in #7926
- chore(l4t-12): do not use python 3.12 (wheels are only for 3.10) by @mudler in #7928
- chore(docs): Add Crush and VoxInput to the integrations by @richiejp in #7924
- Optimize GPU library copying to preserve symlinks and avoid duplicates by @Copilot in #7931
- chore(uv): add --index-strategy=unsafe-first-match to l4t by @mudler in #7934
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `0e52afc6513cc2dea9a1a017afc4a008d5acf2b0` by @localai-bot in #7930
- chore(ci): roll back l4t-cuda12 configurations by @mudler in #7935
- Revert "chore(uv): add --index-strategy=unsafe-first-match to l4t" by @mudler in #7936
- chore(deps): Bump llama.cpp to '480160d47297df43b43746294963476fc0a6e10f' by @mudler in #7933
- chore(llama.cpp): propagate errors during model load by @mudler in #7937
- chore: ⬆️ Update ggml-org/llama.cpp to `593da7fa49503b68f9f01700be9f508f1e528992` by @localai-bot in #7946
- feat(swagger): update swagger by @localai-bot in #7964
- chore: ⬆️ Update ggml-org/llama.cpp to `b1377188784f9aea26b8abde56d4aee8c733eec7` by @localai-bot in #7965
- fix(l4t-12): use pip to install python deps by @mudler in #7967
- chore: ⬆️ Update ggml-org/llama.cpp to `0c3b7a9efebc73d206421c99b7eb6b6716231322` by @localai-bot in #7978
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `885e62ea822e674c6837a8225d2d75f021b97a6a` by @localai-bot in #7979
- chore(backends): do not bundle cuda target directory by @mudler in #7982
- chore(vulkan): bump vulkan-sdk to 1.4.335.0 by @mudler in #7981
- chore: ⬆️ Update ggml-org/llama.cpp to `bcf7546160982f56bc290d2e538544bbc0772f63` by @localai-bot in #7991
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `7010bb4dff7bd55b03d35ef9772142c21699eba9` by @localai-bot in #8013
- chore: ⬆️ Update ggml-org/whisper.cpp to `a96310871a3b294f026c3bcad4e715d17b5905fe` by @localai-bot in #8014
- chore: ⬆️ Update ggml-org/llama.cpp to `e4832e3ae4d58ac0ecbdbf4ae055424d6e628c9f` by @localai-bot in #8015
- chore: ⬆️ Update ggml-org/whisper.cpp to `47af2fb70f7e4ee1ba40c8bed513760fdfe7a704` by @localai-bot in #8039
- chore: ⬆️ Update ggml-org/llama.cpp to `d98b548120eecf98f0f6eaa1ba7e29b3afda9f2e` by @localai-bot in #8040
- fix: reduce log verbosity for /api/operations polling by @Divyanshupandey007 in #8050
- chore: ⬆️ Update ggml-org/whisper.cpp to `2eeeba56e9edd762b4b38467bab96c2517163158` by @localai-bot in #8052
- chore: ⬆️ Update ggml-org/llama.cpp to `785a71008573e2d84728fb0ba9e851d72d3f8fab` by @localai-bot in #8053
- fix(ci): use more beefy runner for expensive jobs by @mudler in #8065
- Revert "chore(deps): bump torch from 2.3.1+cxx11.abi to 2.8.0 in /backend/python/rerankers in the pip group across 1 directory" by @mudler in #8072
- chore: ⬆️ Update ggml-org/llama.cpp to `388ce822415f24c60fcf164a321455f1e008cafb` by @localai-bot in #8073
- chore: ⬆️ Update ggml-org/whisper.cpp to `f53dc74843e97f19f94a79241357f74ad5b691a6` by @localai-bot in #8074
- chore(ui): add video generation link by @mudler in #8079
- chore: ⬆️ Update ggml-org/llama.cpp to `2fbde785bc106ae1c4102b0e82b9b41d9c466579` by @localai-bot in #8087
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `9565c7f6bd5fcff124c589147b2621244f2c4aa1` by @localai-bot in #8086
New Contributors
- @majiayu000 made their first contribution in #7785
- @coffeerunhobby made their first contribution in #7817
- @DEVMANISHOFFL made their first contribution in #7878
- @jroeber made their first contribution in #7963
- @Divyanshupandey007 made their first contribution in #8050
Full Changelog: v3.9.0...v3.10.0
