🎉 LocalAI 4.6.0 Release! 🚀

LocalAI 4.6.0 is out!

This is a reliability-focused release: AMD ROCm backends now run on-GPU at full speed, distributed model loads no longer wedge when a worker dies, and realtime sessions warm up predictably. It also brings conversation forking to the built-in chat UI, a Prometheus counter for PII/audit events, and an SSRF fix for the model gallery.

Highlights:

🔴 AMD ROCm runs correctly - ggml audio backends offload to the GPU, hipBLASLt kernel-tuning data is bundled (no more slow generic kernels), rocm-vllm installs the right wheel, and the ASIC ID table is found.
🎙️ Predictable realtime - sessions eagerly warm the whole pipeline (VAD, ASR, LLM, TTS) up front, so the first turn no longer pays per-model cold-start stalls, plus a new POST /backend/load API and "Load into memory" UI button.
🌿 Forking chat - retry any assistant answer, branch a new chat from any point, duplicate, or copy the whole conversation, directly in the built-in UI.
🛡️ Distributed hardening - a dead worker can no longer pin the model-load advisory lock (the ~15-minute wedge is gone), and orphaned backend workers self-terminate instead of holding VRAM.
📊 PII/audit metrics - PII detections/masks/blocks are exported as a Prometheus counter, so you can alert when the filter stops firing.
🔒 Gallery SSRF fix - POST /models/apply config-URL fetches are validated against private/loopback/metadata addresses.

Plus idempotent backend installs, tool-calling and reasoning fixes across the vLLM and Python/MLX backends, cloud-proxy compatibility with the newest reasoning models, and the usual set of dependency updates.

📌 TL;DR

Area	Summary
🔴 AMD ROCm reliability	ggml audio backends now compile with `-DGGML_HIP=ON` and link HIP (real GPU offload); hipBLASLt `TensileLibrary` data bundled + `HIPBLASLT_TENSILE_LIBPATH` exported; `rocm-vllm` installs from the AMD wheel index on Python 3.12; `amdgpu.ids` symlinked so the ASIC table is found.
🎙️ Realtime warm-up + load API	Sessions block-warm the full pipeline at start (errors surface up front); new `POST /backend/load` / `POST /v1/backend/load`, a "Load into memory" UI action, and a `load_model` MCP tool. Opt out per pipeline with `disable_warmup: true`.
🌿 Forking chat	Regenerate any assistant answer (not just the last), branch a new chat from any turn, duplicate a chat, or copy it as Markdown - all client-side in the React UI.
🛡️ Process & distributed lifecycle	A dead worker no longer pins the per-model PostgreSQL advisory lock (bounded load ceiling + context-scoped `lock_timeout`); backend workers self-terminate on parent death (`LOCALAI_BACKEND_PARENT_WATCH`); the watchdog stops logging optional `Free()` as an error.
⚙️ Idempotent backend installs	`POST /backends/apply` and the `LOCALAI_EXTERNAL_BACKENDS` boot loop no longer re-pull an already-installed backend unless `force: true`.
📊 PII/audit Prometheus counter	`localai_pii_events_total{kind,origin,action,direction}` on `/metrics`, complementing the `/api/pii/events` ring buffer.
🔒 Gallery SSRF hardening	Gallery config URL fetches run through `ValidateExternalURL`, blocking private, loopback, link-local, and cloud-metadata addresses.
🧩 Tool-calling & reasoning fixes	Non-streaming vLLM tool calls restored; MLX/Python backends decode tool-call arguments for chat templates and split closing-only `</think>` reasoning blocks.

🚀 New Features & Major Enhancements

🔴 AMD ROCm backends run correctly on-GPU

Four coupled fixes make ROCm/hipBLAS backends actually run on AMD hardware, and at full speed, instead of silently falling back to CPU or slow generic kernels:

GPU offload for ggml audio backends (#10667): rocm-qwen3-tts-cpp, rocm-omnivoice-cpp, acestep-cpp, and vibevoice-cpp were building CPU-only because their Makefiles passed the no-op -DGGML_HIPBLAS=ON (upstream ggml only understands -DGGML_HIP=ON) and the CMake link loop omitted hip. They now use the same hipblas recipe as llama-cpp and link the HIP backend.
hipBLASLt kernel-tuning data (#10660, #10672): the packager bundled rocBLAS data but not the parallel hipBLASLt TensileLibrary_lazy_gfx*.dat files, so every arch silently used slow kernels and logged Cannot read "TensileLibrary_lazy_gfx*.dat". The data is now bundled and HIPBLASLT_TENSILE_LIBPATH is exported by the llama-cpp and turboquant run.sh.
rocm-vllm installs the right wheel (#10642, #10651): the backend was pulling the CUDA-only PyPI vllm (fatal ModuleNotFoundError: No module named 'vllm' on AMD). It now pins CPython 3.12 and installs vLLM from the ROCm wheel index (https://wheels.vllm.ai/rocm/).
ASIC ID table found (#10624, #10627): the compute-only hipblas image lacks /opt/amdgpu/share/libdrm/amdgpu.ids, so every model load warned. Ubuntu's libdrm-common copy is now symlinked into place.
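
To confirm that the hipBLASLt tuning data is actually visible to a backend, a small check along these lines may help. It is a minimal sketch, not part of LocalAI: the helper names `find_tensile_data` and `check_tensile_env` are hypothetical, and it only assumes the file-name pattern and the `HIPBLASLT_TENSILE_LIBPATH` variable mentioned above.

```python
import glob
import os


def find_tensile_data(libpath):
    """Return the sorted hipBLASLt lazy Tensile data files found in libpath."""
    pattern = os.path.join(libpath, "TensileLibrary_lazy_gfx*.dat")
    return sorted(glob.glob(pattern))


def check_tensile_env(environ=os.environ):
    """Describe whether HIPBLASLT_TENSILE_LIBPATH points at tuning data."""
    libpath = environ.get("HIPBLASLT_TENSILE_LIBPATH")
    if not libpath:
        return "HIPBLASLT_TENSILE_LIBPATH is not set"
    files = find_tensile_data(libpath)
    if not files:
        return f"no TensileLibrary_lazy_gfx*.dat files under {libpath}"
    return f"found {len(files)} tuning file(s) under {libpath}"


if __name__ == "__main__":
    print(check_tensile_env())
```

Run inside the backend's environment, this reports whether the variable is set and whether any per-architecture data files sit under it.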

🔗 PRs: #10667, #10672, #10651, #10627

🎙️ Realtime: eager pipeline warm-up + a load-into-memory API

Realtime voice sessions now warm the entire pipeline (VAD, transcription, LLM, TTS, sound detection, voice recognition) eagerly at session start, blocking until every stage is loaded, instead of lazy-loading each sub-model on first use. The first turn no longer pays per-model cold-start stalls, and model-load errors surface up front at session start (as model_load_error) rather than mid-stream. Pipeline sub-models load concurrently, so a session warms in the time of its slowest stage, not the sum, and a failed warm-up names every broken model in a single joined error.
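
The concurrent warm-up with a joined error can be pictured roughly as follows. This is an illustrative sketch only, not LocalAI's implementation: `warm_pipeline`, `ModelLoadError`, and the loader callables are hypothetical names, and threads stand in for whatever concurrency the server really uses.

```python
from concurrent.futures import ThreadPoolExecutor


class ModelLoadError(Exception):
    """Raised when one or more pipeline stages fail to warm up."""


def warm_pipeline(loaders):
    """Warm every stage concurrently.

    loaders maps a stage name to a zero-argument callable that loads it.
    All stages are attempted; failures are collected and reported together.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=max(1, len(loaders))) as pool:
        futures = {name: pool.submit(load) for name, load in loaders.items()}
        for name, future in futures.items():
            try:
                future.result()  # blocks until this stage has finished
            except Exception as exc:
                failures.append(f"{name}: {exc}")
    if failures:
        # One error that names every broken stage, not just the first.
        raise ModelLoadError("; ".join(failures))
```

Because every loader is submitted before any result is awaited, the total wait is bounded by the slowest stage, and a session with two broken models reports both at once.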

This also adds a LocalAI-native POST /backend/load (and /v1/backend/load), the inverse of /backend/shutdown, exposed as a "Load into memory" UI action and a load_model MCP admin tool, so admins can pre-warm any model (including full pipelines) on demand. The --load-to-memory startup flag now routes through the same engine. Opt out per pipeline with disable_warmup: true.
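
A client-side call to the new endpoint might be prepared like this. Treat it as a sketch under assumptions: the release notes name the route but not the request body, so the `{"model": ...}` JSON shape is a guess to be checked against the API documentation, and `build_load_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_load_request(base_url, model):
    """Build (but do not send) a POST /backend/load request.

    ASSUMPTION: the body is {"model": "<name>"}; the release notes do not
    specify the payload, so verify it before relying on this.
    """
    body = json.dumps({"model": model}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/backend/load",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; keeping construction separate makes the request easy to inspect first.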

🔗 PRs: #10662

🌿 Forking chat in the built-in UI

The React chat UI gains conversation-management tools: regenerate any assistant answer (not just the last), branch a new chat from any answer, duplicate a chat into an independent copy, or copy the whole conversation to the clipboard as Markdown. Retrying a mid-conversation answer correctly truncates the conversation before re-asking, both in the DOM and in the request payload (this also fixes a latent stale-closure bug where a mid-conversation retry sent the downstream turns back to the model). All client-side, no backend changes.
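
The truncate-before-retry behaviour can be sketched as below. The real logic lives in the React UI; this Python version only illustrates the idea, and the function names and the `role`/`content` message shape are assumptions for the example.

```python
import copy


def truncate_for_retry(messages, answer_index):
    """Return the turns to resend when regenerating messages[answer_index]."""
    if messages[answer_index]["role"] != "assistant":
        raise ValueError("can only regenerate an assistant answer")
    # Drop the retried answer and everything after it, so downstream
    # turns are never sent back to the model.
    return copy.deepcopy(messages[:answer_index])


def branch_chat(messages, answer_index):
    """Return an independent copy of the chat up to and including one answer."""
    return copy.deepcopy(messages[: answer_index + 1])
```

Deep copies keep the branch independent of the original chat, so editing one never mutates the other.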

🔗 PRs: #10654

🛡️ Sturdier process and distributed lifecycle

Dead-worker advisory-lock wedge (#10600): a distributed worker dying mid-load could pin a per-model PostgreSQL advisory lock and fail every subsequent request to that model with 55P03 for ~15 minutes. The detached load context is now bounded by a model-load ceiling, the install wait honors cancellation via singleflight.DoChan, and lock_timeout is scoped to the caller's context budget instead of a deployment-global GUC.
Parent-death safety net (#10639): if LocalAI is SIGKILLed before teardown, spawned backend workers used to get reparented to init and linger, holding VRAM and their port. Each backend now polls its parent PID and self-terminates on reparenting. Configurable via LOCALAI_BACKEND_PARENT_WATCH (default on, auto-off on Windows) and LOCALAI_BACKEND_PARENT_WATCH_INTERVAL (default 2s). C++ coverage is llama-cpp for now; Python covers all backends.
Quieter watchdog (#10602, #10607): the optional Free() RPC returns gRPC Unimplemented for many backends and the federation proxy. The watchdog now treats that response as expected and no longer logs a misleading "Error freeing GPU resources" on eviction. A new grpcerrors.IsUnimplemented helper distinguishes it from genuine failures.
Idempotent backend installs (#10643): POST /backends/apply and the LOCALAI_EXTERNAL_BACKENDS boot loop no longer re-download and re-extract an already-installed backend on every apply/boot. Pass "force": true to re-pull anyway (the UI's install button still does, so it doubles as "Reinstall").
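
Scoping lock_timeout to the caller's remaining budget, as described in the first item above, could look roughly like this. It is a sketch under stated assumptions, not LocalAI's Go code: `lock_timeout_statement` and its `ceiling_ms` cap are hypothetical, and it only builds the transaction-scoped `SET LOCAL` statement that PostgreSQL accepts.

```python
import time


def lock_timeout_statement(deadline, now=None, ceiling_ms=60_000):
    """Build a transaction-scoped lock_timeout from the remaining budget.

    deadline and now are time.monotonic() values; ceiling_ms is a
    hypothetical upper bound on how long a lock wait may last.
    """
    if now is None:
        now = time.monotonic()
    remaining_ms = int((deadline - now) * 1000)
    if remaining_ms <= 0:
        # lock_timeout = 0 disables the timeout in PostgreSQL, so fail
        # here instead of emitting a statement that would wait forever.
        raise TimeoutError("caller's budget is already exhausted")
    timeout_ms = min(remaining_ms, ceiling_ms)
    return f"SET LOCAL lock_timeout = '{timeout_ms}ms'"
```

Because `SET LOCAL` applies only to the current transaction, each request waits no longer than its own budget and no server-wide setting is needed.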

🔗 PRs: #10600, #10639, #10607, #10643
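
The parent-death safety net boils down to polling the parent PID and exiting when it changes. The following is a minimal sketch of that pattern, not the code shipped in the backends: `start_parent_watch` and its parameters are hypothetical, with the 2-second default taken from the interval quoted above.

```python
import os
import threading


def start_parent_watch(interval=2.0, on_orphaned=None, getppid=os.getppid):
    """Poll the parent PID in a daemon thread and react when it changes.

    Returns an Event; set it to stop watching.
    """
    original_ppid = getppid()
    stop = threading.Event()
    if on_orphaned is None:
        # Exit immediately; a reparented worker has nobody left to serve.
        on_orphaned = lambda: os._exit(1)

    def watch():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            if getppid() != original_ppid:
                on_orphaned()
                return

    threading.Thread(target=watch, daemon=True).start()
    return stop
```

Polling is used here because it needs no platform-specific facility: when the parent is killed, the worker is reparented and `os.getppid()` starts returning a different PID.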

📊 PII/audit events as a Prometheus counter

The PII middleware / MITM audit pipeline now emits a single monotonic counter, localai_pii_events_total{kind, origin, action, direction}, on /metrics, instrumented at the EventStore.Record choke point. Labels are cardinality-bounded (no pattern or user IDs). This complements the capacity-bound /api/pii/events ring buffer and, crucially, makes silent filter failure alertable: rate() on the counter detects that the PII filter stopped firing after a deploy.

🔗 PRs: #10641

🔒 Gallery SSRF hardening

POST /models/apply with an empty id fetches the supplied url directly; in a default Docker setup (no API key) any reachable client could probe internal services or cloud-metadata (169.254.169.254) and exfiltrate a slice of the response via the job error. Gallery config fetches now run through the existing ValidateExternalURL guard (the same one protecting the CORS proxy and media downloads), blocking private, loopback, link-local, unspecified, and metadata addresses. Only plain http(s):// is validated; huggingface://, github:, oci://, ollama://, and file:// are untouched.
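
The address classes being blocked can be illustrated with Python's standard `ipaddress` module. This is a simplified sketch, not LocalAI's ValidateExternalURL: the function names are hypothetical, and it checks only IP literals, whereas a real guard must also resolve hostnames and check the resulting addresses.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_address(host):
    """Return True if host is an IP literal in a range the guard refuses."""
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # not an IP literal; a real guard would resolve it
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers 169.254.169.254 (cloud metadata)
        or ip.is_unspecified
    )


def validate_config_url(url):
    """Raise ValueError for http(s) URLs pointing at a blocked IP literal."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return  # other schemes are left untouched, as in the note above
    if not parts.hostname or is_blocked_address(parts.hostname):
        raise ValueError(f"refusing to fetch {url!r}")
```

Note that the metadata address needs no special case in this sketch, since 169.254.169.254 already falls inside the link-local range.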

🔗 PRs: #10673

🐛 Bug Fixes (recap)

fix(vllm): restore non-streaming tool-call extraction that regressed after #10351 (a capability flag was mistaken for run state) - #10638
fix(python-backends): decode tool-call arguments for chat templates (unbreaks MLX/Qwen3.5 agent loops) and split reasoning when a model emits only a closing </think> - #10658
fix(cloud-proxy): drop temperature/top_p and send max_completion_tokens so routing to the newest reasoning models (Claude Opus 4.x, GPT-5.x) stops 400ing - #10640
fix(config): revert defaulting swa_full:true for sliding-window-attention models (restores the memory-light reduced KV cache; still available as an explicit per-model opt-in) - #10674
fix(kokoros): implement the AudioTranscriptionLive trait stub so the backend compiles against the updated proto - #10612
fix(launcher): keep the desktop launcher's data/config under ~/.localai instead of the GUI's working directory - #10610, #10613
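
The closing-only `</think>` case from the python-backends fix can be pictured with a small splitter. This is a sketch of the idea rather than the backend code: `split_reasoning` is a hypothetical name, and it assumes the reasoning block, when present, comes first in the output.

```python
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_reasoning(text):
    """Split model output into (reasoning, content).

    Works whether or not the model emitted the opening <think> tag.
    """
    head, sep, tail = text.partition(CLOSE_TAG)
    if not sep:
        return "", text  # no closing tag: everything is content
    reasoning = head.strip()
    # Drop the opening tag if the model did emit it.
    if reasoning.startswith(OPEN_TAG):
        reasoning = reasoning[len(OPEN_TAG):].strip()
    return reasoning, tail.strip()
```

Splitting on the closing tag rather than matching a full `<think>...</think>` pair is what lets output that lacks the opening tag still separate cleanly.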

👒 Dependencies

Submodule and backend bumps this cycle:

ggml-org/llama.cpp x4
ikawrakow/ik_llama.cpp x4
CrispStrobe/CrispASR x4
leejet/stable-diffusion.cpp x3
vllm-metal (darwin) x3
ggml-org/whisper.cpp x2
mudler/parakeet.cpp x1
localai-org/privacy-filter.cpp x1
vllm-project/vllm cu130 wheel to 0.24.0

Plus new gallery models added via the gallery agent (#10663, #10644).

📖 Documentation

Docs version bump for the release - #10614

🙌 New Contributors

@alaningtrump made their first contribution in #10657

Full Changelog: v4.5.6...v4.6.0

mudler/LocalAI v4.6.0 on GitHub