🎉 LocalAI 4.2.0 Release! 🚀
LocalAI 4.2.0 is out!
This release teaches LocalAI to see and hear. New /v1/voice/* and /v1/audio/diarization endpoints, a full face-recognition pipeline with antispoofing, word-level timestamps for faster-whisper, and a client-cancellable Whisper. There is also a drop-in Ollama API, video generation in stable-diffusion.ggml, a redesigned chat with i18n and admin-configurable branding, eleven new backends, an interactive model config editor with autocomplete, and a hardened distributed mode v2. vLLM finally hits feature parity with llama.cpp and gets tensor-parallel distributed workers.
📌 TL;DR
| Feature | Summary |
|---|---|
| 🎙️ Voice Recognition | New `/v1/voice/*`. Verify, identify, embed and analyze speakers. |
| 👤 Face Recognition + Liveness | 1:1 verify, 1:N identify, detect, analyze, embed, and reject spoofed photos. |
| 🎬 Diarization | New `/v1/audio/diarization` endpoint, "who spoke when?" via sherpa-onnx + vibevoice.cpp. |
| 🗣️ Better Transcriptions | Word-level timestamps, client-cancellable Whisper, segments + duration + language on the stream-done event. |
| 🦙 Ollama API | Drop-in compatibility. Point your ollama client straight at LocalAI. |
| 🎬 Video Generation | stable-diffusion.ggml now generates video (i2v, first-last-frame). |
| 💬 Redesigned UI | Chat redesign, Nord palette, i18n (5 languages), admin-configurable branding. |
| ✏️ Interactive Model Editor | Autocomplete-driven config editor in the UI. |
| 📦 Universal Importer | Imports across most backends, not just llama.cpp. |
| 🚦 Concurrency Groups | Per-model exclusive groups for safe backend loading. |
| 🧪 11 New Backends | sglang, ik-llama-cpp, TurboQuant, sam.cpp, Kokoros, qwen3tts.cpp, tinygrad-multimodal, LocalVQE, vibevoice-cpp, insightface (liveness), voice-rec. |
| ⚡ vLLM @ parity | Feature parity with llama.cpp + tensor-parallel distributed workers + full `engine_args`. |
| 🛰️ Distributed v2 | Hardened orchestrator, round-robin replicas, scoped Upgrade All, NATS install/upgrade split. |
🚀 New Features & Major Enhancements
🎙️ Voice Recognition
LocalAI is now ears-on. New /v1/voice/* endpoints let you verify, identify, analyze and embed speakers, powered by a SpeechBrain + ONNX Python backend.
- 1:1 Verify, "is this the same speaker?"
- 1:N Identify, "who is talking, out of my enrolled users?"
- Embeddings, voice fingerprints for your own pipelines
- Analyze, age, gender, emotion attributes per segment
🔥 Pairs naturally with the new diarization endpoint for full speaker pipelines.
voice.mp4
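As a sketch of how the new endpoints might be driven from your own code, the helper below base64-encodes two audio samples into a JSON body for a 1:1 verify call. The field names (`audio_1`, `audio_2`) are illustrative assumptions, not the documented `/v1/voice/*` schema; check the API reference for the real request shape.

```python
import base64
import json


def voice_verify_payload(sample_a: bytes, sample_b: bytes) -> str:
    """Build a JSON body for a hypothetical 1:1 voice-verify request.

    NOTE: the field names here are assumptions for illustration only;
    consult the LocalAI docs for the actual /v1/voice/verify schema.
    """
    return json.dumps({
        "audio_1": base64.b64encode(sample_a).decode("ascii"),
        "audio_2": base64.b64encode(sample_b).decode("ascii"),
    })
```

POST the resulting body to `http://localhost:8080/v1/voice/verify` with any HTTP client; the same encode-and-post pattern would apply to identify, embed and analyze.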
👤 Face Recognition & Antispoofing
A complete face-biometrics pipeline, built on InsightFace + ONNX.
- 1:1 Verify, match two faces
- 1:N Identify, resolve a face against an enrolled set
- Detection & Analysis, find faces, extract attributes (age, gender, emotion, race)
- Embeddings, facial fingerprints for your own stack
- 🆕 Antispoofing (liveness), reject spoofed photos and videos
✅ Samples never leave your machine. They go only to the running backend.
face.mp4
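To make "1:N identify" concrete, here is a minimal local sketch of the matching step: cosine similarity between a probe embedding and an enrolled set, with a rejection threshold. In a real pipeline the embeddings would come from the face-embedding endpoint; this matching code is purely illustrative and is not LocalAI's internal implementation.

```python
import math


def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def identify(probe, enrolled, threshold=0.5):
    """Resolve a probe embedding against enrolled users (name -> embedding).

    Returns the best-matching name if its similarity clears the threshold,
    otherwise None (unknown face).
    """
    name, emb = max(enrolled.items(), key=lambda kv: cosine(probe, kv[1]))
    return name if cosine(probe, emb) >= threshold else None
```

The threshold is what separates 1:N identify from a nearest-neighbor lookup: a probe that matches nobody well enough is rejected rather than mapped to the least-bad candidate.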
🎬 Diarization & a smarter audio pipeline
Audio is a first-class citizen now.
- `/v1/audio/diarization`, segments speech by speaker turn (sherpa-onnx + vibevoice.cpp)
- Word-level timestamps for faster-whisper
- Client cancellation for Whisper via the ggml `abort_callback`. Stop a transcription mid-flight and free the GPU.
- Stream-done metadata on `/v1/audio/transcriptions`: `segments`, `duration` and `language` on the final event.
- Audio transformations UI (LocalVQE), explore audio FX directly from the React UI
- Transcription error visibility, handler errors land in the access log and on the client
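Diarization turns plus word-level timestamps combine naturally into a per-speaker transcript. A small sketch of that join is below; the tuple shapes are assumptions for illustration, not the endpoints' actual response formats.

```python
def attribute_words(words, turns):
    """Assign each timestamped word to the speaker turn containing its midpoint.

    words: [(start, end, text)] — e.g. word-level timestamps from faster-whisper
    turns: [(start, end, speaker)] — e.g. speaker turns from diarization
    (Both shapes are assumed here for the sake of the example.)
    """
    out = []
    for ws, we, text in words:
        mid = (ws + we) / 2
        speaker = next((s for ts, te, s in turns if ts <= mid < te), "unknown")
        out.append((speaker, text))
    return out
```

Using the word midpoint (rather than its start) keeps words that straddle a turn boundary attached to the speaker who uttered most of them.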
🦙 Ollama drop-in API
Point your existing Ollama client at LocalAI. Everything keeps working. Another front door, same engine.
```shell
OLLAMA_HOST=http://localhost:8080 ollama run qwen3
```

🎬 Video Generation
The stable-diffusion.ggml backend now generates video, with curated gallery entries for Wan 2.1 FLF2V 14B 720P and Wan i2v 720p, plus a new stablediffusion-ggml-development meta backend to track the cutting edge.
🎨 React UI: total refresh
A massive UI cycle landed in 4.2:
- 💬 Chat redesign, cleaner layout, faster perceived latency, better message density
- 🎨 Editorial refresh with the Nord palette, calmer, more focused, dark-mode-first
- 🌍 Multilingual / i18n, English, Italiano, Español, Deutsch, 简体中文
- 🪪 Brandable instance, admin-configurable name, tagline, and assets (logo, favicon)
- ✏️ Interactive model config editor, autocomplete over known fields, live validation, automatic file-renaming on save
- 🧰 Backend management UX, revamped backend list with concrete versions
- 🛟 Better error UX, distributed backend management errors surface cleanly
💡 Self-host with your branding. The login page, sidebar, footer, and browser tab all pick up the instance name and logo.
chat.mp4
i18n.mp4
🔄 Backend & model lifecycle
- Backend versioning with automatic upgrade detection
- Pin models so they survive the reaper
- On-demand toggle per model to control auto-load
- Concurrency groups, per-model exclusive groups so heavy backends won't trample each other
- Universal importer, single flow that imports across most backends, with clean multi-shard GGUF handling and dedicated importers for vibevoice-cpp and whisper.cpp HF repos
importer.mp4
model-editor.mp4
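The concurrency-group idea can be pictured as one exclusive lock per group: models that share a group never load at the same time, so two heavy backends cannot trample the same GPU, while models in different groups still load in parallel. A generic sketch of the mechanism, illustrative only and not LocalAI's actual loader code:

```python
import threading
import time
from collections import defaultdict

# One lock per exclusive group: loads within a group are serialized,
# loads in different groups may proceed concurrently.
_group_locks = defaultdict(threading.Lock)


def load_exclusively(model: str, group: str, load_fn, log: list) -> None:
    """Run load_fn(model) while holding the group's exclusive lock."""
    with _group_locks[group]:
        log.append(f"loading {model}")
        load_fn(model)
        log.append(f"loaded {model}")
```

With this shape, configuring two VRAM-hungry models into the same group guarantees the second load only starts after the first has finished.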
🧪 New Backends!
| Backend | What it brings |
|---|---|
| sglang | High-throughput LLM serving + speculative decoding (EAGLE/EAGLE3/DFLASH/MTP) |
| ik-llama.cpp | ikawrakow's llama.cpp fork |
| TurboQuant | Quant-focused llama.cpp fork |
| sam.cpp | Segment Anything detection |
| Kokoros | Rust-native Kokoro TTS |
| qwen3tts.cpp | Qwen3 TTS |
| tinygrad-multimodal (experimental) | tinygrad-powered multimodal |
| vibevoice.cpp | Diarization-grade speech |
| LocalVQE | Audio transformations / FX |
| insightface | Face antispoofing |
| voice-rec | Speaker recognition / embeddings |
⚡ vLLM at parity (and beyond)
- vLLM parity with llama.cpp, same feature surface, same ergonomics
- vLLM `engine_args`, the full `AsyncEngineArgs` exposed via a generic YAML map
- Tensor-parallel distributed workers, fan a single model across nodes
- CUDA 13 builds for vLLM, vLLM-omni and sglang
- L4T arm64 (CUDA 13), vLLM/vLLM-omni/sglang variants for Jetson-class arm64
- MLX backend refactored, shared helpers and enhanced functionality
- llama.cpp `split_mode` for explicit multi-GPU placement
- Speculative decoding wired through for llama.cpp, Gemma 4 thinking support added
- Vision / mtmd marker propagated from the backend via `ModelMetadata`
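As an example of the generic map, a model config might pass vLLM's `AsyncEngineArgs` straight through. The placement of `engine_args` in the YAML below is an assumption for illustration; the option names themselves are standard vLLM engine arguments, and the model name and values are examples only.

```yaml
# Sketch of a LocalAI model config — key placement assumed, values are examples.
name: my-vllm-model
backend: vllm
parameters:
  model: Qwen/Qwen2.5-7B-Instruct
engine_args:
  tensor_parallel_size: 2       # fan the model across two workers/GPUs
  gpu_memory_utilization: 0.9
  max_model_len: 8192
```

Because the map is generic, new upstream vLLM options should be usable without waiting for LocalAI to add a dedicated config field.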
🛰️ Distributed Mode v2
Distributed mode keeps maturing. This release was a hardening pass across the orchestration loop:
- Orchestrator resilience, auto-upgrade routing, worker bind-wait, RAG-init crash, log-spam fixes
- Round-robin across replicas of the same model
- Upgrade All scoped to nodes that actually have the backend installed
- NATS install / upgrade split, `backend.upgrade` no longer piggybacks on install
- Cached-replica lookup honors NodeSelector, the reconciler no longer scales up empty backends
- VRAM/RAM reporting correct on NVIDIA unified-memory hosts
- Agent nodes, queue loops stop on teardown, dead-letter cap added
- Autoscaling, load-model extracted from `Route()` and applied during autoscale
🔐 Auth & Security
- Settings API, env-supplied `ApiKeys` are stripped before persisting (no accidental leaks)
- grpc-server hardening, removed unsafe `sprintf()` in the C++ grpc server
- OIDC, bumped `go-oidc/v3` to 3.18.0
- Security hardening pass across the codebase
- AI coding assistants policy, LocalAI now follows the Linux kernel's DCO/attribution guidelines (`Assisted-by:` trailer, no AI co-authors)
🖥️ Hardware & deployment
- CUDA 13 for vLLM, vLLM-omni, and sglang
- NVIDIA L4T arm64 (CUDA 13) for Jetson-class boards
- ROCm 7.x bumped to latest
- gfx1151 (Strix Halo / Ryzen AI MAX) support, `AMDGPU_TARGETS` exposed as a build-arg
- Intel GPU, latest oneapi-basekit (b70 support) across Intel images
- arm64 CI, cpu-whisperx and cpu-faster-whisper now ship arm64 images
- whisperx, ROCm/HIPBLAS target dropped (pinned to rocm6.4 wheels)
🛠️ Under the Hood
- Better CLI errors with actionable guidance
- golangci-lint baseline (`new-from-merge-base`) keeps drift in check
- Coding-agent discoverability, new APIs let coding agents introspect and configure LocalAI
- Autoparser, prefers backend-emitted chat deltas, correct logprob passthrough, strips partial reasoning tags during warm-up
- Reasoning + tools, no more empty content from thinking models in retry loops
- Streaming hygiene, deduped content, deduped tool calls, recovered reasoning, unique `tool_call` IDs in deferred flushes
- HTTP, handler-error status now visible in the access log + transcription error surface
- Backend monitor accepts `model` as a query parameter
- Config loader, YAML backup files are ignored
- GGUF thinking probe respects explicit `reasoning` config
- Inference defaults refreshed from Unsloth
- Embeddings on collection upload, dim changes handled gracefully
- Python backends, JIT subprocesses use `tempfile.gettempdir()` instead of hardcoded `/tmp`
- Draft model paths, relative paths now resolve against the models dir
- whisper-cpp: implement streaming transcription and context cancellation
🐞 Notable fixes
- Cascading user deletion on PostgreSQL, deleting a user removes all owned data
- Importer emits all shards for multi-part GGUF models
- Open Responses parses OpenAI-spec nested `tool_choice` and uses the correct setter
- llama-cpp: `server-chat.cpp` included in grpc-server TU, `common -> llama-common` rename, turboquant `common.h` detection
- ik-llama-cpp: adapted to `common_grammar` in `sampling.h`, patched `clip.cpp` for the new `ggml_quantize_chunk` signature
- Kokoros: trait stubs (`face_verify`, `face_analyze`, `audio_transcription_stream`), CI publish
- stable-diffusion.ggml: MP4 container forced in ffmpeg mux, new i2v options
- Gallery: orphaned meta-backend uninstall, gemma-4 URIs, flux-kontext param overrides, Wan dedup, z-image-turbo load, Qwen3.5 typo override, tag-casing normalization
- Streaming: content + tool-call dedup, reasoning recovery, unique tool-call IDs in deferred flush
- Realtime: consume ChatDeltas when the C++ autoparser clears `Response`
- Tool-calls: use `SetFunctionCallNameString` when forcing a specific tool
- Faster-whisper: cast segment timestamps to int after multiplication
- mlx-vlm: pinned to v0.4.4 to unblock CUDA builds
- vLLM: dropped flash-attn wheel to avoid torch 2.10 ABI mismatch
- Downloader: list supported URL schemes in `DownloadFile` errors
- Backend: resolve relative `draft_model` paths against the models dir
- CI: wire `AMDGPU_TARGETS` through the backend workflow, switch gallery-agent to `sigs.k8s.io/yaml`, recover rerankers + vllm-omni on aarch64, unbreak master CI for docs/kokoros/vibevoice-cpp ABI
🆕 Gallery additions
- Wan 2.1 FLF2V 14B 720P (video)
- Wan i2v 720p (image-to-video)
- stablediffusion-ggml-development meta backend
- chroma1-hd (diffusers)
- Gemma 4 (+ mmproj)
- EmbeddingGemma
- Qwen 3.5, Qwen-ASR, OCR entries for llama.cpp
- Qwen3-VL Reranker, Qwen3-VL Embedding (tagged)
- A steady stream of automated gallery-agent model additions throughout the cycle 🤖
🚀 The Complete Local Stack for Privacy-First AI
| Project | Description |
|---|---|
| LocalAI | The free, Open Source OpenAI alternative. Drop-in REST API compatible with OpenAI specs for local AI inferencing. No GPU required. |
| LocalAGI | Local AI agent management platform. Drop-in for OpenAI's Responses API, with advanced agentic capabilities and a no-code UI. |
| LocalRecall | RESTful API and knowledge-base management providing persistent memory and storage for AI agents. Pairs with LocalAI and LocalAGI. |
❤️ Thank You
LocalAI is a true FOSS movement, built by contributors, powered by community.
If you believe in privacy-first, self-hosted AI:
- ⭐ Star the repo
- 💬 Contribute code, docs, translations or feedback
- 📣 Share with others
Your support keeps this stack alive.
✅ Full Changelog
What's Changed
Bug fixes 🐛
- fix(autoscaling): extract load model from Route() and use as well when doing autoscale by @mudler in #9270
- fix(nodes): better detection if nodes goes down or model is not available by @mudler in #9274
- fix: try to add whisperx and faster-whisper for more variants by @mudler in #9278
- fix: thinking models with tools returning empty content (reasoning-only retry loop) by @mudler in #9290
- fix(streaming): deduplicate tool call emissions during streaming by @mudler in #9292
- fix(streaming): skip chat deltas for role-init elements to prevent first token duplication by @mudler in #9299
- Fix load of z-image-turbo by @thelittlefireman in #9264
- fix(agents): handle embedding model dim changes on collection upload by @mudler in #9365
- fix(gallery): correct gemma-4 model URIs returning 404 by @mvanhorn in #9379
- fix(ui): rename model config files on save to prevent duplicates by @mudler in #9388
- fix(ci): switch gallery-agent to sigs.k8s.io/yaml by @mudler in #9397
- fix(llama-cpp): rename linked target common -> llama-common by @mudler in #9408
- fix(vision): propagate mtmd media marker from backend via ModelMetadata by @mudler in #9412
- fix(turboquant): resolve common.h by detecting llama-common vs common target by @mudler in #9413
- fix(rocm): add gfx1151 support and expose AMDGPU_TARGETS build-arg by @keithmattix in #9410
- fix(kokoros): implement audio_transcription_stream trait stub by @mudler in #9422
- fix(turboquant): drop ignore-eos patch, bump fork to b8967-627ebbc by @mudler in #9423
- fix(distributed): stop queue loops on agent nodes + dead-letter cap by @mudler in #9433
- fix(gallery): allow uninstalling orphaned meta backends + force reinstall by @mudler in #9434
- fix(stable-diffusion.ggml): force mp4 container in ffmpeg mux by @mudler in #9435
- fix(settings): strip env-supplied ApiKeys from the request before persisting by @SAY-5 in #9438
- fix(api): remove duplicate /api/traces endpoint that broke React UI by @pjbrzozowski in #9427
- fix(distributed): pass ExternalURI through NATS backend install by @russell in #9446
- fix(ci): wire AMDGPU_TARGETS through backend build workflow by @russell in #9445
- fix(config): ignore yaml backup files in model loader by @leinasi2014 in #9443
- [gallery] Fix duplicate sha256 keys in Wan models by @sec171 in #9461
- fix(tests): update InstallBackend call sites for new URI/Name/Alias params by @mudler in #9467
- Fix: Add model parameter to neutts-air gallery definition by @localai-bot in #8793
- fix(gallery-agent): process blacklist command on recently-closed PRs by @mudler in #9473
- Respect explicit reasoning config during GGUF thinking probe by @leinasi2014 in #9463
- fix(streaming): dedupe content, recover reasoning, unique tool_call IDs in deferred flush by @mudler in #9470
- fix(backend-monitor): accept model as a query parameter by @Dennisadira in #9411
- fix(kokoros): Build and publish the backend images from CI/CD by @richiejp in #9487
- fix: remove unsafe sprintf() in grpc-server.cpp by @orbisai0security in #9486
- fix(kokoros): implement face_verify and face_analyze trait stubs by @mudler in #9499
- fix(ik-llama-cpp): adapt to common_grammar struct in sampling.h by @mudler in #9512
- fix(llama-cpp): include server-chat.cpp in grpc-server translation unit by @mudler in #9511
- fix(importer): emit all shards for multi-part GGUF models by @mudler in #9513
- fix(openresponses): parse OpenAI-spec nested tool_choice + use correct setter by @walcz-de in #9509
- fix: use SetFunctionCallNameString when forcing a specific tool (3 sites) by @Anai-Guo in #9526
- fix(ik-llama-cpp): patch clip.cpp for new ggml_quantize_chunk signature by @mudler in #9531
- fix(realtime): consume ChatDeltas when C++ autoparser clears Response by @richiejp in #9538
- fix: add hipblaslt library by @eglia in #9541
- fix(distributed): correct VRAM/RAM reporting on NVIDIA unified-memory hosts by @mudler in #9545
- fix(vllm): drop flash-attn wheel to avoid torch 2.10 ABI mismatch by @richiejp in #9557
- fix(mlx-vlm): pin upstream to v0.4.4 to unblock CUDA builds by @mudler in #9568
- fix(gallery): normalize inconsistent tag casing/plurals across gallery models by @Anai-Guo in #9574
- fix(gallery): correct Qwen3.5 typo in qwen3.5-27b-claude-4.6 model override (closes #9362) by @Anai-Guo in #9580
- fix(diffusers): drop compel from requirements to unblock pip resolver by @mudler in #9632
- fix(ci): fix AMDGPU_TARGETS empty-string bypass in hipblas builds by @russell in #9626
- fix(distributed): honor NodeSelector in cached-replica lookup, stop empty-backend reconciler scaleups by @localai-bot in #9652
- fix(distributed): orchestrator resilience — auto-upgrade routing, worker bind-wait, RAG-init crash, log spam by @localai-bot in #9657
- fix(faster-whisper): cast segment timestamps to int after multiplication by @arteven in #9674
- fix(python-backend): make JIT subprocesses work on hosts of any size by @richiejp in #9679
- fix(distributed): scope Upgrade All to nodes that have the backend installed by @mudler in #9678
- fix(backend): resolve relative draft_model paths against the models dir by @localai-bot in #9680
- fix: unbreak master CI (docs, kokoros, vibevoice-cpp ABI) by @localai-bot in #9682
- fix(ci): unbreak rerankers (torch bump) and vllm-omni on aarch64 by @localai-bot in #9688
- fix(distributed): round-robin replicas of the same model by @localai-bot in #9695
- fix(downloader): list supported URL schemes in DownloadFile error by @Anai-Guo in #9689
- fix(auth): cascade user deletion across all owned data on PostgreSQL by @localai-bot in #9702
- fix(http): make handler-error status visible in access log + transcription errors by @localai-bot in #9707
- fix(distributed): make backend upgrade actually re-install on workers by @localai-bot in #9708
- fix(distributed): split NATS backend.upgrade off install + dedup loads by @localai-bot in #9717
- fix(gallery): keep auto-upgrade off non-dev backends when -development is installed by @mudler in #9736
Exciting New Features 🎉
- feat(ui): Interactive model config editor with autocomplete by @richiejp in #9149
- feat: track files being staged by @mudler in #9275
- feat: Add Kokoros backend by @richiejp in #9212
- feat(api): add ollama compatibility by @mudler in #9284
- feat(sam.cpp): add sam.cpp detection backend by @mudler in #9288
- feat(swagger): update swagger by @localai-bot in #9300
- chore(qwen3-asr): pass prompt as context to transcribe by @mudler in #9301
- feat: Add toggle mechanism to enable/disable models from loading on demand by @neurocis in #9304
- feat: allow to pin models and skip from reaping by @mudler in #9309
- feat(swagger): update swagger by @localai-bot in #9310
- feat: backend versioning, upgrade detection and auto-upgrade by @mudler in #9315
- feat(swagger): update swagger by @localai-bot in #9318
- feat(qwen3tts.cpp): add new backend by @mudler in #9316
- feat(ux): backend management enhancement by @mudler in #9325
- feat(rocm): bump to 7.x by @mudler in #9323
- feat(backends): add ik-llama-cpp by @mudler in #9326
- feat(swagger): update swagger by @localai-bot in #9329
- feat(vllm): parity with llama.cpp backend by @mudler in #9328
- feat: refactor shared helpers and enhance MLX backend functionality by @mudler in #9335
- feat: wire transcription for llama.cpp, add streaming support by @mudler in #9353
- feat(backend): add turboquant llama.cpp-fork backend by @mudler in #9355
- feat(swagger): update swagger by @localai-bot in #9356
- feat(backend): add tinygrad multimodal backend (experimental) by @mudler in #9364
- feat(backends): add sglang by @mudler in #9359
- refactor(tinygrad): reuse tinygrad.apps.llm instead of vendored Transformer by @mudler in #9380
- feat(stable-diffusion.ggml): add support for video generation by @mudler in #9420
- feat(distributed): sync state with frontends, better backend management reporting by @mudler in #9426
- feat(swagger): update swagger by @localai-bot in #9431
- feat(gallery): add Wan 2.1 FLF2V 14B 720P by @mudler in #9440
- feat(gallery): add wan i2v 720p by @mudler in #9457
- feat: improve CLI error messages with actionable guidance by @localai-bot in #8880
- chore(whisperx): drop ROCm/hipblas build target by @mudler in #9474
- feat(face-recognition): add insightface/onnx backend for 1:1 verify, 1:N identify, embedding, detection, analysis by @mudler in #9480
- feat(importer): expand importer flow to almost all backends by @mudler in #9466
- feat(swagger): update swagger by @localai-bot in #9498
- feat: voice recognition by @mudler in #9500
- feat(insightface): add antispoofing (liveness) detection by @mudler in #9515
- feat(swagger): update swagger by @localai-bot in #9518
- feat: add biometrics UI by @mudler in #9524
- feat: Add Sherpa ONNX backend for ASR and TTS by @richiejp in #8523
- [intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 by @arbrick in #9543
- feat(react-ui): editorial refresh with Nord palette and polished primitives by @mudler in #9550
- feat: surface distributed backend management errors by @mudler in #9552
- feat(backends): add CUDA 13 + L4T arm64 CUDA 13 variants for vllm/vllm-omni/sglang by @mudler in #9553
- feat(llama-cpp): expose split_mode option for multi-GPU placement by @mudler in #9560
- ci(backends): build cpu-whisperx and cpu-faster-whisper for linux/arm64 by @mudler in #9573
- [intel GPU support] Use latest oneapi-basekit image for Intel images to support b70 (in more places this time) by @arbrick in #9578
- feat: Log backend exit code by @richiejp in #9581
- feat(distributed): support multiple replicas of one model on the same node by @mudler in #9583
- feat(swagger): update swagger by @localai-bot in #9587
- feat: localai assistant chat modality by @mudler in #9602
- chore: add golangci-lint with new-from-merge-base baseline by @richiejp in #9603
- feat(swagger): update swagger by @localai-bot in #9607
- feat(vllm): expose AsyncEngineArgs via generic engine_args YAML map by @richiejp in #9563
- feat(vibevoice-cpp): add purego TTS+ASR backend by @mudler in #9610
- feat: react chat redesign by @mudler in #9616
- feat(llama-cpp): bump to d775992 and adapt to spec params refactor by @mudler in #9618
- feat(backends/python): use tempfile.gettempdir() instead of hardcoded /tmp by @Anai-Guo in #9629
- feat(importers): whisper.cpp HF repos pick a quant + nest under whisper/models by @mudler in #9630
- feat(branding): admin-configurable instance name, tagline, and assets by @mudler in #9635
- feat(swagger): update swagger by @localai-bot in #9643
- feat(react-ui): add multilingual (i18n) support by @mudler in #9642
- feat(ci): allow routing apt traffic through an alternate Ubuntu mirror by @mudler in #9650
- feat: add LocalVQE backend and audio transformations UI by @richiejp in #9640
- feat(swagger): update swagger by @localai-bot in #9660
- feat(concurrency-groups): per-model exclusive groups for backend loading by @mudler in #9662
- feat(api): add /v1/audio/diarization endpoint with sherpa-onnx + vibevoice.cpp by @mudler in #9654
- feat(vllm, distributed): tensor parallel distributed workers by @richiejp in #9612
- feat: support word-level timestamps for faster-whisper by @eglia in #9621
- feat(importers): add vibevoice-cpp importer for GGUF bundles by @localai-bot in #9685
- feat(gallery): Speed up load times and clean gallery entries by @richiejp in #9211
- feat(swagger): update swagger by @localai-bot in #9699
- feat(sglang): wire engine_args, add cuda13 build, ship MTP gallery demos by @richiejp in #9686
- feat(api/transcription): include segments + duration + language on stream done event by @localai-bot in #9709
- feat(whisper): honor client cancellation via ggml abort_callback by @localai-bot in #9710
- chore: Security hardening by @richiejp in #9719
- ci: phase 1-3 of GHA free tier migration (path filter, multi-arch split prep, /mnt disk relief) by @localai-bot in #9726
- feat(swagger): update swagger by @localai-bot in #9723
- ci: pilot per-arch split + manifest merge for faster-whisper and llama-cpp-quantization by @localai-bot in #9727
- ci: finish GHA free-tier migration (per-arch fan-out, image splits, retire self-hosted, fix provenance) by @localai-bot in #9730
- ci: consolidate llama-cpp-darwin into the matrix-driven Darwin flow by @mudler in #9731
- feat(whisper-cpp): implement streaming transcription by @localai-bot in #9751
🧠 Models
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9399
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9400
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9425
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9436
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9464
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9481
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9491
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9505
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9555
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9558
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9611
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9615
- Add tags to qwen3-vl-reranker and Qwen3-VL-Embedding to the gallery by @ER-EPR in #9628
- chore(model gallery): add chroma1-hd diffusers model by @Anai-Guo in #9646
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9653
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9681
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9703
- chore(model gallery): 🤖 add 1 new models via gallery agent by @localai-bot in #9720
📖 Documentation and examples
- docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9268
- docs(agents): capture vllm backend lessons + runtime lib packaging by @mudler in #9333
- chore(agents): Update the backend creation instructions to include Rust and extra tests by @richiejp in #9490
👒 Dependencies
- chore: ⬆️ Update ggml-org/llama.cpp to `66c4f9ded01b29d9120255be1ed8d5835bcbb51d` by @localai-bot in #9269
- chore(llama.cpp): bump to `d12cc3d1ca6bba741cd77887ac9c9ee18c8415c7` by @mudler in #9282
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `e8323cabb0e4511ba18a50b1cb34cf1f87fc71ef` by @localai-bot in #9281
- chore: ⬆️ Update ggml-org/llama.cpp to `d132f22fc92f36848f7ccf2fc9987cd0b0120825` by @localai-bot in #9302
- chore: ⬆️ Update PABannier/sam3.cpp to `01832ef85fcc8eb6488f1d01cd247f07e96ff5a9` by @localai-bot in #9311
- chore: ⬆️ Update ggml-org/llama.cpp to `e62fa13c2497b2cd1958cb496e9489e86bbd5182` by @localai-bot in #9312
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #9321
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `6b675a5ede9b0edf0a0f44191e8b79d7ef27615a` by @localai-bot in #9320
- chore: ⬆️ Update ggml-org/llama.cpp to `ff5ef8278615a2462b79b50abdf3cc95cfb31c6f` by @localai-bot in #9319
- chore: ⬆️ Update ggml-org/llama.cpp to `1e9d771e2c2f1113a5ebdd0dc15bafe57dce64be` by @localai-bot in #9330
- chore(deps): bump softprops/action-gh-release from 2 to 3 by @dependabot[bot] in #9336
- chore(deps): bump actions/upload-pages-artifact from 4 to 5 by @dependabot[bot] in #9337
- chore(deps): bump github.com/testcontainers/testcontainers-go from 0.41.0 to 0.42.0 by @dependabot[bot] in #9338
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #9346
- chore(deps): bump sentence-transformers from 5.2.3 to 5.4.0 in /backend/python/transformers by @dependabot[bot] in #9342
- chore: ⬆️ Update ggml-org/llama.cpp to `e97492369888f5311e4d1f3beb325a36bbed70e9` by @localai-bot in #9347
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to `55d3c05bf7b377deaa5dc84d255d9740a345a206` by @localai-bot in #9348
- chore(deps): bump github.com/google/go-containerregistry from 0.21.3 to 0.21.5 by @dependabot[bot] in #9343
- chore(deps): bump github.com/testcontainers/testcontainers-go/modules/nats from 0.41.0 to 0.42.0 by @dependabot[bot] in #9341
- chore(deps): bump github.com/swaggo/echo-swagger from 1.4.1 to 1.5.2 by @dependabot[bot] in #9344
- chore(deps): bump github.com/charmbracelet/glamour from 0.10.0 to 1.0.0 by @dependabot[bot] in #9340
- chore: ⬆️ Update ggml-org/llama.cpp to `fae3a28070fe4026f87bd6a544aba1b2d1896566` by @localai-bot in #9357
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #9358
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #9369
- chore: ⬆️ Update ggml-org/llama.cpp to `b3d758750a268bf93f084ccfa3060fb9a203192a` by @localai-bot in #9370
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to `1163af96cf6bb4a4b819f998f84c153a49768b99` by @localai-bot in #9368
- chore(deps): bump the npm_and_yarn group across 1 directory with 6 updates by @dependabot[bot] in #9373
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `c41c5ded7af85e01b7fe442ff7950c720706d53a` by @localai-bot in #9366
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #9384
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to `eaf83865a132f66e8f49efe0e78491625942f068` by @localai-bot in #9382
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `a564fdf642780d1df123f1c413b19961375b8346` by @localai-bot in #9383
- chore: ⬆️ Update TheTom/llama-cpp-turboquant to `45f8a066ed5f5bb38c695cec532f6cef9f4efa9d` by @mudler in #9385
- chore: ⬆️ Update ggml-org/llama.cpp to `4fbdabdc61c04d1262b581e1b8c0c3b119f688ff` by @localai-bot in #9381
- chore: bump inference defaults from unsloth by @github-actions[bot] in #9396
- chore(deps): bump dompurify from 3.3.2 to 3.4.0 in /core/http/react-ui in the npm_and_yarn group across 1 directory by @dependabot[bot] in #9376
- chore: ⬆️ Update ggml-org/whisper.cpp to `166c20b473d5f4d04052e699f992f625ea2a2fdd` by @localai-bot in #9403
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to `52efa12fdae390d1dca6ecd7ca00010fe51f651e` by @localai-bot in #9404
- chore: ⬆️ Update ggml-org/llama.cpp to `4f02d4733934179386cbc15b3454be26237940bb` by @localai-bot in #9415
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `7d33d4b2ddeafa672761a5880ec33bdff452504d` by @localai-bot in #9417
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to `8befd92ea5f702494ea9813fe42a52fb015db5fe` by @localai-bot in #9418
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `44cca3d626d301e2215d5e243277e8f0e65bfa78` by @localai-bot in #9428
- chore: ⬆️ Update ggml-org/llama.cpp to `4eac5b45095a4e8a1ff1cce4f6d030e0872fb4ad` by @localai-bot in #9429
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to `00ba208a5c036eee72d4a631b4f57c126095cb03` by @localai-bot in #9430
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to `d4824131580b94ffa7b0e91c955e2b237c2fe16e` by @localai-bot in #9447
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #9451
- chore: ⬆️ Update ggml-org/whisper.cpp to `fc674574ca27cac59a15e5b22a09b9d9ad62aafe` by @localai-bot in #9450
- chore: ⬆️ Update ggml-org/llama.cpp to `cf8b0dbda9ac0eac30ee33f87bc6702ead1c4664` by @localai-bot in #9448
- chore(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.97.1 to 1.99.1 by @dependabot[bot] in #9452
- chore(deps): bump github.com/containerd/containerd from 1.7.30 to 1.7.31 by @dependabot[bot] in #9453
- chore(deps): bump github.com/modelcontextprotocol/go-sdk from 1.4.1 to 1.5.0 by @dependabot[bot] in #9454
- chore(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.14 to 1.32.16 by @dependabot[bot] in #9456
- chore(deps): bump github.com/coreos/go-oidc/v3 from 3.17.0 to 3.18.0 by @dependabot[bot] in #9455
- chore: ⬆️ Update ggml-org/llama.cpp to `5a4cd6741fc33227cdacb329f355ab21f8481de2` by @localai-bot in #9479
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `c97702e1057c2fe13a7074cd9069cb9dd6edc1bf` by @localai-bot in #9495
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #9522
- chore: ⬆️ Update ggml-org/llama.cpp to `187a45637054881ecacf17f8e2f6f8f2ba7df1c7` by @localai-bot in #9520
- chore: ⬆️ Update leejet/stable-diffusion.cpp to `b8bdffc19962be7e5a84bfefeb2e31bd885b571a` by @localai-bot in #9521
- chore(deps): bump postcss from 8.5.8 to 8.5.10 in /core/http/react-ui in the npm_and_yarn group across 1 directory by @dependabot[bot] in #9544
- chore(deps): bump rustls-webpki from 0.103.10 to 0.103.13 in /backend/rust/kokoros in the cargo group across 1 directory by @dependabot[bot] in #9546
- chore: ⬆️ Update ggml-org/llama.cpp to
361fe72acb7b9bd79059cc177cbeda99b35b5db9by @localai-bot in #9548 - chore: ⬆️ Update ikawrakow/ik_llama.cpp to
cb58a561f0c49f68b6d125cdfda037ed80433821by @localai-bot in #9549 - chore: ⬆️ Update TheTom/llama-cpp-turboquant to
67559e580b10e4e47e9a6fd6218873997976886dby @localai-bot in #9497 - chore: ⬆️ Update ikawrakow/ik_llama.cpp to
3a945af45d45936341a45bbf7deda56776a4af26by @localai-bot in #9570 - chore: ⬆️ Update TheTom/llama-cpp-turboquant to
11a241d0db78a68e0a5b99fe6f36de6683100f6aby @localai-bot in #9571 - chore: ⬆️ Update ggml-org/llama.cpp to
dcad77cc3b0865153f486327064fb0320a57a476by @localai-bot in #9572 - chore: ⬆️ Update ggml-org/llama.cpp to
f53577432541bb9edc1588c4ef45c66bf07e4468by @localai-bot in #9577 - chore: ⬆️ Update ggml-org/llama.cpp to
665abc609740d397d30c0d8ef4157dbf900bd1a3by @localai-bot in #9584 - chore: ⬆️ Update ikawrakow/ik_llama.cpp to
d6f3e4e28fbf75e6181e6ea32e734de9ce9304fdby @localai-bot in #9585 - chore: ⬆️ Update leejet/stable-diffusion.cpp to
a81677f59c92d90343aebca51dfed7decf0a0cb0by @localai-bot in #9586 - chore(deps): bump github.com/testcontainers/testcontainers-go/modules/postgres from 0.41.0 to 0.42.0 by @dependabot[bot] in #9591
- chore(deps): bump github.com/onsi/ginkgo/v2 from 2.28.1 to 2.28.2 by @dependabot[bot] in #9593
- chore(deps): bump packaging from 24.1 to 26.2 in /backend/python/coqui by @dependabot[bot] in #9594
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to
453a027c17e4d63a7f16b871197a396240a65138by @localai-bot in #9608 - chore: ⬆️ Update leejet/stable-diffusion.cpp to
3d6064b37ef4607917f8acf2ca8c8906d5087413by @localai-bot in #9617 - chore: ⬆️ Update ikawrakow/ik_llama.cpp to
a8aecbf15933295af96504f9a693998322185b5cby @localai-bot in #9625 - chore: ⬆️ Update ggml-org/llama.cpp to
beb42fffa45eded44804a1fd4916146222371581by @localai-bot in #9624 - deps: update quic-go to v0.59.0 (fix session ticket panic) by @egyptianbman in #9655
- chore(model-gallery): ⬆️ update checksum by @localai-bot in #9661
- chore: ⬆️ Update vllm-project/vllm cu130 wheel to
0.20.1by @localai-bot in #9649 - chore(deps): bump docs/themes/hugo-theme-relearn from
f69a085to8bb66faby @dependabot[bot] in #9665 - chore: ⬆️ Update ggml-org/llama.cpp to
eff06702b2a52e1020ea009ebd86cb9f5acabab5by @localai-bot in #9637 - chore: ⬆️ Update ikawrakow/ik_llama.cpp to
45dfd80371785731bc2ed05a76252497a4e7a282by @localai-bot in #9644 - chore(deps): bump actions/checkout from 4 to 6 by @dependabot[bot] in #9663
- chore: ⬆️ Update ggml-org/llama.cpp to
bbeb89d76c41bc250f16e4a6fefcc9b530d6e3f3by @localai-bot in #9676 - chore: ⬆️ Update ikawrakow/ik_llama.cpp to
8b56d813a9ed04fa7b7fe2588fddd845cf64eccbby @localai-bot in #9677 - chore: ⬆️ Update TheTom/llama-cpp-turboquant to
69d8e4be47243e83b3d0d71e932bc7aa61c644dcby @localai-bot in #9638 - chore: ⬆️ Update ggml-org/whisper.cpp to
4bf733672b2871d4153158af4f621a6dd9104f4aby @localai-bot in #9636 - chore(model-gallery): ⬆️ update checksum by @localai-bot in #9700
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to
b93721902b4662f9b973b1c412006081c958d085by @localai-bot in #9697 - chore: ⬆️ Update ggml-org/llama.cpp to
2496f9c14965c39589f53eea31bdb6d762b1d360by @localai-bot in #9698 - chore: ⬆️ Update leejet/stable-diffusion.cpp to
90e87bc846f17059771efb8aaa31e9ef0cab6f78by @localai-bot in #9701 - chore(deps): bump openssl from 0.10.76 to 0.10.79 in /backend/rust/kokoros in the cargo group across 1 directory by @dependabot[bot] in #9694
- chore(deps): bump the go_modules group across 1 directory with 8 updates by @dependabot[bot] in #9705
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to
9a26522af234f8db079ae3735f35ab6c20fe2c66by @localai-bot in #9713 - chore: ⬆️ Update ggml-org/llama.cpp to
05ff59cb57860cc992fc6dcede32c696efea711cby @localai-bot in #9714 - chore: ⬆️ Update ggml-org/whisper.cpp to
c81b2dabbc45484dee2ca6658cfe39c841df5c70by @localai-bot in #9712 - chore(deps): bump LocalAGI for collection rehydrate-on-init-failure fix by @localai-bot in #9721
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to
98950267c67fd95937a54ebd6e3c66cf2679b710by @localai-bot in #9725 - chore: ⬆️ Update ggml-org/llama.cpp to
9f5f0e689c9e977e5f23a27e344aa36082f44738by @localai-bot in #9724 - chore: ⬆️ Update ikawrakow/ik_llama.cpp to
ab0f22b819ac57b7e7484f69c00c10fc755d5c6cby @localai-bot in #9734 - chore: ⬆️ Update ggml-org/llama.cpp to
00d56b11c3477b99bc18562dc1d1834f0d961778by @localai-bot in #9733
Other Changes
- ci: add pre-built base-grpc-builder image infrastructure (PR 1/2) by @localai-bot in #9737
- ci: refactor llama-cpp variant Dockerfiles to consume prebuilt base-grpc images (PR 2/2) by @localai-bot in #9738
- chore: ⬆️ Update ggml-org/llama.cpp to `1e5ad35d560b90a8ac447d149c8f8447ae1fcaa0` by @localai-bot in #9739
- docs(agents): update CI caching docs after the GHA-free-tier migration by @localai-bot in #9742
- ci: split backend-jobs into single-arch and multi-arch matrices by @localai-bot in #9746
- chore: ⬆️ Update ggml-org/llama.cpp to `2b2babd1243c67ca811c0a5852cedf92b1a20024` by @localai-bot in #9747
- chore: ⬆️ Update ikawrakow/ik_llama.cpp to `23127139cb6fa314899c3b5f4935b88b3374c56c` by @localai-bot in #9748
- chore: ⬆️ Update ggml-org/whisper.cpp to `c33c5618b72bb345df029b730b36bc0e369845a3` by @localai-bot in #9749
- chore: ⬆️ Update vllm-project/vllm cu130 wheel to `0.20.2` by @localai-bot in #9750
- chore: ⬆️ Update ggml-org/llama.cpp to `389ff61d77b5c71cec0cf92fe4e5d01ace80b797` by @localai-bot in #9752
New Contributors
- @neurocis made their first contribution in #9304
- @thelittlefireman made their first contribution in #9264
- @mvanhorn made their first contribution in #9379
- @keithmattix made their first contribution in #9410
- @SAY-5 made their first contribution in #9438
- @pjbrzozowski made their first contribution in #9427
- @russell made their first contribution in #9446
- @leinasi2014 made their first contribution in #9443
- @sec171 made their first contribution in #9461
- @Dennisadira made their first contribution in #9411
- @orbisai0security made their first contribution in #9486
- @Anai-Guo made their first contribution in #9526
- @arbrick made their first contribution in #9543
- @eglia made their first contribution in #9541
- @egyptianbman made their first contribution in #9655
- @arteven made their first contribution in #9674
Full Changelog: v4.1.3...v4.2.0
