🎉 LocalAI 4.6.0 Release! 🚀
LocalAI 4.6.0 is out!
This is a reliability-focused release: AMD ROCm backends now run on-GPU at full speed, distributed model loads no longer wedge when a worker dies, and realtime sessions warm up predictably. It also brings conversation forking to the built-in chat UI, a Prometheus counter for PII/audit events, and an SSRF fix for the model gallery.
Highlights:
- 🔴 AMD ROCm runs correctly - ggml audio backends offload to the GPU, hipBLASLt kernel-tuning data is bundled (no more slow generic kernels), rocm-vllm installs the right wheel, and the ASIC ID table is found.
- 🎙️ Predictable realtime - sessions eagerly warm the whole pipeline (VAD, ASR, LLM, TTS) up front, so the first turn no longer pays per-model cold-start stalls, plus a new POST /backend/load API and "Load into memory" UI button.
- 🌿 Forking chat - retry any assistant answer, branch a new chat from any point, duplicate, or copy the whole conversation, directly in the built-in UI.
- 🛡️ Distributed hardening - a dead worker can no longer pin the model-load advisory lock (the ~15-minute wedge is gone), and orphaned backend workers self-terminate instead of holding VRAM.
- 📊 PII/audit metrics - PII detections/masks/blocks are exported as a Prometheus counter, so you can alert when the filter stops firing.
- 🔒 Gallery SSRF fix - POST /models/apply config-URL fetches are validated against private/loopback/metadata addresses.
Plus idempotent backend installs, tool-calling and reasoning fixes across the vLLM and Python/MLX backends, cloud-proxy compatibility with the newest reasoning models, and the usual set of dependency updates.
📌 TL;DR
| Area | Summary |
|---|---|
| 🔴 AMD ROCm reliability | ggml audio backends now compile with -DGGML_HIP=ON and link HIP (real GPU offload); hipBLASLt TensileLibrary data bundled + HIPBLASLT_TENSILE_LIBPATH exported; rocm-vllm installs from the AMD wheel index on Python 3.12; amdgpu.ids symlinked so the ASIC table is found. |
| 🎙️ Realtime warm-up + load API | Sessions block-warm the full pipeline at start (errors surface up front); new POST /backend/load / POST /v1/backend/load, a "Load into memory" UI action, and a load_model MCP tool. Opt out per pipeline with disable_warmup: true. |
| 🌿 Forking chat | Regenerate any assistant answer (not just the last), branch a new chat from any turn, duplicate a chat, or copy it as Markdown - all client-side in the React UI. |
| 🛡️ Process & distributed lifecycle | A dead worker no longer pins the per-model PostgreSQL advisory lock (bounded load ceiling + context-scoped lock_timeout); backend workers self-terminate on parent death (LOCALAI_BACKEND_PARENT_WATCH); the watchdog stops logging optional Free() as an error. |
| ⚙️ Idempotent backend installs | POST /backends/apply and the LOCALAI_EXTERNAL_BACKENDS boot loop no longer re-pull an already-installed backend unless force: true. |
| 📊 PII/audit Prometheus counter | localai_pii_events_total{kind,origin,action,direction} on /metrics, complementing the /api/pii/events ring buffer. |
| 🔒 Gallery SSRF hardening | Gallery config URL fetches run through ValidateExternalURL, blocking private, loopback, link-local, and cloud-metadata addresses. |
| 🧩 Tool-calling & reasoning fixes | Non-streaming vLLM tool calls restored; MLX/Python backends decode tool-call arguments for chat templates and split closing-only </think> reasoning blocks. |
🚀 New Features & Major Enhancements
🔴 AMD ROCm backends run correctly on-GPU
Four coupled fixes make ROCm/hipBLAS backends actually run on AMD hardware, and at full speed, instead of silently falling back to CPU or slow generic kernels:
- GPU offload for ggml audio backends (#10667): rocm-qwen3-tts-cpp, rocm-omnivoice-cpp, acestep-cpp, and vibevoice-cpp were building CPU-only because their Makefiles passed the no-op -DGGML_HIPBLAS=ON (upstream ggml only understands -DGGML_HIP=ON) and the CMake link loop omitted hip. They now use the same hipblas recipe as llama-cpp and link the HIP backend.
- hipBLASLt kernel-tuning data (#10660, #10672): the packager bundled rocBLAS data but not the parallel hipBLASLt TensileLibrary_lazy_gfx*.dat files, so every arch silently used slow kernels and logged Cannot read "TensileLibrary_lazy_gfx*.dat". The data is now bundled and HIPBLASLT_TENSILE_LIBPATH is exported by the llama-cpp and turboquant run.sh.
- rocm-vllm installs the right wheel (#10642, #10651): the backend was pulling the CUDA-only PyPI vllm (fatal ModuleNotFoundError: No module named 'vllm' on AMD). It now pins CPython 3.12 and installs vLLM from the ROCm wheel index (https://wheels.vllm.ai/rocm/).
- ASIC ID table found (#10624, #10627): the compute-only hipblas image lacks /opt/amdgpu/share/libdrm/amdgpu.ids, so every model load warned. Ubuntu's libdrm-common copy is now symlinked into place.
🎙️ Realtime: eager pipeline warm-up + a load-into-memory API
Realtime voice sessions now warm the entire pipeline (VAD, transcription, LLM, TTS, sound detection, voice recognition) eagerly at session start, blocking until every stage is ready, instead of lazy-loading each sub-model on first use. The first turn no longer pays per-model cold-start stalls, and model-load errors surface at session start (as model_load_error) rather than mid-stream. Pipeline sub-models load concurrently, so a session warms in the time of its slowest stage rather than the sum of all stages, and a failed warm-up names every broken model in a single joined error.
This also adds a LocalAI-native POST /backend/load (and /v1/backend/load), the inverse of /backend/shutdown, exposed as a "Load into memory" UI action and a load_model MCP admin tool, so admins can pre-warm any model (including full pipelines) on demand. The --load-to-memory startup flag now routes through the same engine. Opt out per pipeline with disable_warmup: true.
🔗 PRs: #10662
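To illustrate the concurrent warm-up behaviour described above, here is a minimal sketch in Python. It is not LocalAI's implementation: `warm_pipeline`, `ModelLoadError`, and `fake_loader` are hypothetical names, and the loaders only sleep to stand in for real model loads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ModelLoadError(Exception):
    """Raised when one or more pipeline stages fail to warm up (hypothetical)."""


def warm_pipeline(loaders):
    """Run every stage loader concurrently; raise one joined error on failure.

    `loaders` maps a stage name (e.g. "vad") to a zero-argument callable.
    """
    failures = {}
    with ThreadPoolExecutor(max_workers=max(1, len(loaders))) as pool:
        futures = {name: pool.submit(load) for name, load in loaders.items()}
        for name, future in futures.items():
            try:
                future.result()  # blocks until this stage is ready
            except Exception as exc:
                failures[name] = exc
    if failures:
        # Name every broken stage, not just the first one that failed.
        joined = "; ".join(f"{name}: {exc}" for name, exc in sorted(failures.items()))
        raise ModelLoadError(joined)


def fake_loader(seconds, fail_with=None):
    """Stand-in for a real model load: sleeps, then optionally fails."""
    def load():
        time.sleep(seconds)
        if fail_with:
            raise RuntimeError(fail_with)
    return load
```

Because all loaders start at once, the call returns after roughly the slowest stage, and a caller sees every failing stage in one exception message instead of discovering them one turn at a time.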
🌿 Forking chat in the built-in UI
The React chat UI gains conversation-management tools: regenerate any assistant answer (not just the last), branch a new chat from any answer, duplicate a chat into an independent copy, or copy the whole conversation to the clipboard as Markdown. Retrying a mid-conversation answer correctly truncates the conversation before re-asking, both in the DOM and in the request payload (this also fixes a latent stale-closure bug where a mid-conversation retry sent the downstream turns back to the model). All client-side, no backend changes.
🔗 PRs: #10654
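The truncation rule for a mid-conversation retry can be sketched as follows. The real UI is React; this Python version only illustrates the logic, and the function names and the `role`/`content` message shape are assumptions rather than the UI's actual code.

```python
import copy


def truncate_for_retry(messages, answer_index):
    """Return the turns to resend when regenerating the answer at answer_index.

    Everything from the retried assistant answer onward is dropped, so
    downstream turns are not sent back to the model.
    """
    if messages[answer_index]["role"] != "assistant":
        raise ValueError("can only regenerate an assistant answer")
    return copy.deepcopy(messages[:answer_index])


def branch_from(messages, answer_index):
    """Return an independent copy of the chat up to and including answer_index."""
    return copy.deepcopy(messages[: answer_index + 1])
```

Deep copies keep the branched or retried conversation independent of the original, so editing one does not mutate the other.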
🛡️ Sturdier process and distributed lifecycle
- Dead-worker advisory-lock wedge (#10600): a distributed worker dying mid-load could pin a per-model PostgreSQL advisory lock and fail every subsequent request to that model with 55P03 for ~15 minutes. The detached load context is now bounded by a model-load ceiling, the install wait honors cancellation via singleflight.DoChan, and lock_timeout is scoped to the caller's context budget instead of a deployment-global GUC.
- Parent-death safety net (#10639): if LocalAI is SIGKILLed before teardown, spawned backend workers used to get reparented to init and linger, holding VRAM and their port. Each backend now polls its parent PID and self-terminates on reparenting. Configurable via LOCALAI_BACKEND_PARENT_WATCH (default on, auto-off on Windows) and LOCALAI_BACKEND_PARENT_WATCH_INTERVAL (default 2s). C++ coverage is llama-cpp for now; Python covers all backends.
- Quieter watchdog (#10602, #10607): the optional Free() RPC returns gRPC Unimplemented for many backends and the federation proxy, so the watchdog no longer logs a misleading Error freeing GPU resources on eviction. A new grpcerrors.IsUnimplemented helper distinguishes it from genuine failures.
- Idempotent backend installs (#10643): POST /backends/apply and the LOCALAI_EXTERNAL_BACKENDS boot loop no longer re-download and re-extract an already-installed backend on every apply/boot. Pass "force": true to reinstall (the UI's install button still does, doubling as "Reinstall").
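The parent-death safety net boils down to polling the parent PID. A minimal sketch of that idea in Python, assuming a POSIX system; `watch_parent` and its parameters are hypothetical and not the code shipped in the backends:

```python
import os
import time


def watch_parent(original_ppid, interval=2.0, get_ppid=os.getppid,
                 on_orphaned=lambda: os._exit(1), sleep=time.sleep):
    """Poll the parent PID and call on_orphaned once it changes.

    On POSIX, a process whose parent dies is reparented (typically to init
    or a subreaper), so its parent PID no longer matches the one recorded
    at startup.
    """
    while get_ppid() == original_ppid:
        sleep(interval)
    on_orphaned()
```

In a real worker this would typically run in a daemon thread started at boot, for example `threading.Thread(target=watch_parent, args=(os.getppid(),), daemon=True).start()`. The injectable `get_ppid`, `sleep`, and `on_orphaned` arguments exist only to make the loop testable.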
📊 PII/audit events as a Prometheus counter
The PII middleware / MITM audit pipeline now emits a single monotonic counter, localai_pii_events_total{kind, origin, action, direction}, on /metrics, instrumented at the EventStore.Record choke point. Labels are cardinality-bounded (no pattern or user IDs). This complements the capacity-bound /api/pii/events ring buffer and, crucially, makes silent filter failure alertable: rate() on the counter detects that the PII filter stopped firing after a deploy.
🔗 PRs: #10641
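In production the alert would normally be a Prometheus rule built on rate(). As a rough illustration of the same check outside Prometheus, the sketch below sums the counter from two /metrics scrapes and reports whether it stopped increasing. The helper names are hypothetical, and any label values used with it are made up for illustration.

```python
def pii_events_total(metrics_text, metric="localai_pii_events_total"):
    """Sum every sample of `metric` in a Prometheus text-format payload."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks, HELP and TYPE lines
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name != metric:
            continue
        # The sample value follows the label block (or the bare name).
        rest = line.rsplit("}", 1)[-1] if "}" in line else line[len(name):]
        total += float(rest.split()[0])
    return total


def filter_went_silent(previous_scrape, current_scrape):
    """True when the counter did not increase between two scrapes."""
    return pii_events_total(current_scrape) <= pii_events_total(previous_scrape)
```

Note that a process restart resets the counter, which this naive comparison would also report as silence; rate() in Prometheus accounts for resets.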
🔒 Gallery SSRF hardening
POST /models/apply with an empty id fetches the supplied url directly; in a default Docker setup (no API key) any reachable client could probe internal services or cloud-metadata endpoints (169.254.169.254) and exfiltrate a slice of the response via the job error. Gallery config fetches now run through the existing ValidateExternalURL guard (the same one protecting the CORS proxy and media downloads), blocking private, loopback, link-local, unspecified, and metadata addresses. Only plain http(s):// is validated; huggingface://, github:, oci://, ollama://, and file:// are untouched.
🔗 PRs: #10673
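The class of check involved can be sketched with Python's standard ipaddress module. This is an illustration of the technique, not a port of ValidateExternalURL: the function names are hypothetical, and the sketch only inspects literal IP hosts, leaving hostname resolution out of scope.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_address(host):
    """True if host is a literal IP in a range an SSRF guard should refuse."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not a literal IP; hostname resolution is out of scope here
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # includes 169.254.169.254 (cloud metadata)
        or ip.is_unspecified
    )


def check_gallery_url(url):
    """Raise ValueError for http(s) URLs that point at blocked addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return url  # other schemes are not validated by this check
    if not parts.hostname or is_blocked_address(parts.hostname):
        raise ValueError(f"refusing to fetch {url!r}")
    return url
```

A complete guard would also need to resolve hostnames and check every resolved address, since a public-looking name can point at an internal IP.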
🐛 Bug Fixes (recap)
- fix(vllm): restore non-streaming tool-call extraction that regressed after #10351 (a capability flag was mistaken for run state) - #10638
- fix(python-backends): decode tool-call arguments for chat templates (unbreaks MLX/Qwen3.5 agent loops) and split reasoning when a model emits only a closing </think> - #10658
- fix(cloud-proxy): drop temperature/top_p and send max_completion_tokens so routing to the newest reasoning models (Claude Opus 4.x, GPT-5.x) stops 400ing - #10640
- fix(config): revert defaulting swa_full:true for sliding-window-attention models (restores the memory-light reduced KV cache; still available as an explicit per-model opt-in) - #10674
- fix(kokoros): implement the AudioTranscriptionLive trait stub so the backend compiles against the updated proto - #10612
- fix(launcher): keep the desktop launcher's data/config under ~/.localai instead of the GUI's working directory - #10610, #10613
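As an illustration of the closing-only </think> case from the python-backends fix, the sketch below splits model output into reasoning and answer even when no opening tag was emitted. `split_reasoning` is a hypothetical helper written for this example, not the backend's actual function.

```python
def split_reasoning(text, close_tag="</think>", open_tag="<think>"):
    """Split model output into (reasoning, content).

    Handles the case where the model emits only the closing tag: everything
    before it is treated as reasoning, everything after as the answer.
    """
    if close_tag not in text:
        return "", text
    reasoning, content = text.split(close_tag, 1)
    reasoning = reasoning.strip()
    if reasoning.startswith(open_tag):
        reasoning = reasoning[len(open_tag):]
    return reasoning.strip(), content.strip()
```

Output with no closing tag is returned unchanged as content, so ordinary non-reasoning responses pass through untouched.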
👒 Dependencies
Submodule and backend bumps this cycle:
- ggml-org/llama.cpp x4
- ikawrakow/ik_llama.cpp x4
- CrispStrobe/CrispASR x4
- leejet/stable-diffusion.cpp x3
- vllm-metal (darwin) x3
- ggml-org/whisper.cpp x2
- mudler/parakeet.cpp x1
- localai-org/privacy-filter.cpp x1
- vllm-project/vllm cu130 wheel to 0.24.0
Plus new gallery models added via the gallery agent (#10663, #10644).
📖 Documentation
- Docs version bump for the release - #10614
🙌 New Contributors
- @alaningtrump made their first contribution in #10657
Full Changelog: v4.5.6...v4.6.0
