🎉 LocalAI 4.5.0 Release! 🚀

LocalAI 4.5.0 is out!

This release widens what LocalAI can perceive, sharpens the realtime voice API, and makes multi-user serving fast with zero configuration. Four new backends land, the React UI redesign ships in full, and distributed mode gets a robustness pass.

Highlights:

👁️ See depth - new depth-anything backend (Depth Anything 3): monocular metric depth + camera pose, with a typed Depth RPC and POST /v1/depth.
🔊 Hear events - new ced backend tags 527 AudioSet sound classes (baby cry, glass breaking, alarms) over REST and a VAD-decoupled realtime stream.
🗣️ Speak on-device - new supertonic ONNX TTS backend: multilingual, espeak-free, fast cold start.
🛡️ Filter PII with NER - new privacy-filter.cpp engine adds named-entity token classification alongside a regex secret detector.
🎙️ Smarter realtime - sessions become speaker-aware (identity surfaced to the client and the LLM) and stay cheap on long calls through summarize-then-drop compaction.
⚙️ Concurrent by default - prefix caching, Blackwell-tuned batch sizes, and VRAM-scaled concurrency turn continuous batching on without any config.
🖼️ A redesigned UI - the UX overhaul lands end to end, while we keep improving user experience release after release.

Plus model aliases, word-level ASR timestamps, self-contained Vulkan backends, ds4 SSD streaming for 128 GB-class models, hardened distributed staging, and a broad set of fixes.

The redesigned Home: console with a built-in assistant and chat.

📌 TL;DR

Area	Summary
👁️ Depth perception	New `depth-anything` C++/ggml backend (Depth Anything 3) - metric depth + camera pose, typed `Depth` RPC + `POST /v1/depth`, 8 GGUFs. Plus Depth Anything V2 gallery models.
🔊 Sound-event tagging	New `ced` backend (CED AudioSet tagger, 527 classes) - `POST /v1/audio/classification` + VAD-decoupled realtime sound detection.
🗣️ On-device TTS	New `supertonic` ONNX backend - multilingual, no espeak/G2P, 10 voices, fast cold start (CPU).
🛡️ PII gets a NER tier	New `privacy-filter.cpp` backend - encoder/NER token classification scanning whole conversations, alongside a restricted-regex secret detector; NER-centric PII editor in the UI.
🎙️ Smarter realtime	Speaker-aware conversations (identity → client and LLM), conversation compaction (summarize-then-drop), and OpenAI `item.delete` / `item.truncate` / `input_audio_buffer.clear`.
⚙️ Multi-user serving by default	Prefix caching on by default, Blackwell batch (2048), VRAM-scaled `n_parallel` (continuous batching on out of the box) - concurrent throughput with no KV blow-up.
🔀 Model aliases	Redirect/rename a model name to another configured model, swappable live, no client reconfig.
⏱️ Word-level ASR timestamps	NeMo + CrispASR word timestamps, plumbed through the gRPC transcription path.
🖼️ The UI, redesigned	A calmer, sharper interface lands end to end: new design language, shell/nav, ops/admin data-viz, sortable/mobile tables, unsaved-changes guards, restructured Cluster Nodes.
🛰️ Distributed staging hardened	Cold-load staging detached from the request context (large models actually finish), staging progress broadcast across replicas, resumable downloader.

🚀 New Features & Major Enhancements

👁️ Depth Perception: `depth-anything`

A new native Go gRPC backend (#10352) dlopens depth-anything.cpp (a ggml port of Depth Anything 3) via purego - no Python at inference - for monocular metric depth + camera pose estimation on CPU. Depth has no native OpenAI endpoint, so the model is exposed three ways:

A typed Depth gRPC RPC + POST /v1/depth that returns the full output surface (depth map, stats, camera extrinsics 3×4 / intrinsics 3×3).
GenerateImage(src, dst) writes a min-max-normalized grayscale depth PNG.
Predict returns the depth + pose JSON blob.

Eight Depth Anything 3 GGUFs ship at mudler/depth-anything.cpp-gguf (base/small/large/giant + a monocular mono-large, q4_k/q8_0/f16/f32), with per-CPU-variant self-contained .so builds and the full hardware matrix (cpu, cuda12/13, intel-sycl, vulkan, l4t-arm64). This cycle also adds Depth Anything V2 gallery models (#10413, native version bump) and metric-large + nested metric entries (#10363).

🔗 PRs: #10352, #10413, #10363.

🔊 Sound-Event Classification: `ced`

A new backend (#10425) backed by ced.cpp - a C++/ggml port of CED (Xiaomi), a 527-class AudioSet tagger (baby cry, footsteps, glass breaking, alarms, dog bark...) with full PyTorch parity (f32 e2e 1.7e-7) and Apache-2.0 weights. CPU perf: f16 is ~1.55× faster than the PyTorch reference (~100× realtime), q8_0 uses 6.5× less memory.

REST: POST /v1/audio/classification (fully capability-registered: swagger, /api/instructions, auth feature, React capabilities.js, docs).
Realtime: opt-in pipeline.sound_detection emits conversation.item.sound_detection events, decoupled from VAD (a sound-only session runs with turn_detection: none, activating on sounds not speech), with client-driven or server-side windowing.
Gallery: 8 entries (ced-{base,tiny,mini,small}-{f16,q8}, 6 MB → 86 MB) at mudler/ced-gguf.

🔗 PR: #10425.

🗣️ On-Device TTS: `supertonic`

A new native Go gRPC TTS backend (#10342) runs Supertone's supertonic-3 flow-matching model (4 ONNX graphs) via ONNX Runtime - no Python, no espeak-ng / G2P (text preprocessing is NFKD + a Unicode-codepoint→token-id lookup). Upstream's MIT Go pipeline is vendored at a pinned commit and driven from a LocalAI gRPC server, mirroring sherpa-onnx's ONNX-runtime bundling - small image, fast cold start. Ships a supertonic-3 gallery entry (4 ONNX + 10 voice styles F1-F5/M1-M5, SHA256-pinned), with voice / language request mapping and steps/speed/silence knobs. CPU-only in this release; CUDA wiring is scaffolded for a follow-up.

🔗 PR: #10342.

🛡️ PII Filtering Gets a NER Tier: `privacy-filter.cpp`

PII filtering moves off the patched llama.cpp TokenClassify path onto a new standalone GGML backend, privacy-filter.cpp (#10360), serving OpenAI Privacy Filter NER token-classification models (CPU/CUDA/Vulkan). The filter is reworked to be NER-centric - an encoder/NER detection tier scans whole conversations as a single document - alongside a bounded restricted-regex secret-matching detector tier. Detections are labelled by source (ner vs pattern) with backend trace / confidence / debug observability, analyze/redact exposed as a synchronous API, and request filtering extended to completions, embeddings, edits and Ollama. The React UI gains a NER-centric PII editor, detector-models table, and middleware default-policy controls; the gallery gets a privacy-filter-multilingual token-classify model + an /import-model auto-detect importer. A post-merge pass (#10401) added live NER e2e coverage and review fixes.

🔗 PRs: #10360, #10401.

🎙️ Realtime Voice: Speaker-Aware and Self-Compacting

Speaker-aware conversations (#10424). The realtime voice-recognition gate now surfaces the recognized speaker to the client (a new conversation.item.speaker event - a non-breaking LocalAI extension) and feeds identity to the LLM for personalized replies (per-message OpenAI name field and/or a The current speaker is <Name>. system note). New pipeline.voice_recognition keys decouple surfacing from authorization: enforce: false resolves and surfaces a speaker without ever dropping a turn, while the gate still fails closed when enforcing. Multi-speaker histories stay correctly attributed (each user item carries its own speaker).

Conversation compaction - summarize-then-drop (#10446). Long realtime sessions used to either feed the whole growing buffer to the LLM (expensive on CPU as it grows) or silently forget old turns. Now the server can fold aged-out turns into a rolling summary instead, via an async, post-turn snapshot → summarize → commit compactor that never holds the conversation lock across the summarizer call and never evicts items without a summary replacing them. Plus the OpenAI-parity history events that were missing: conversation.item.delete, conversation.item.truncate, input_audio_buffer.clear.

pipeline:
  max_history_items: 6          # live window - recent turns kept verbatim
  compaction:
    enabled: true
    trigger_items: 12           # high-water mark; summarize overflow back down
    summary_model: ""           # optional small/cheap CPU model; default = pipeline LLM
    max_summary_tokens: 512

Also: configurable pipeline.max_history_items (#10331) and a WebRTC data-channel max-message-size raise + keep-alive fix (#10407).

🔗 PRs: #10424, #10446, #10331, #10407.

⚙️ Multi-User Serving, On by Default

Two related, config-only (no kernel) changes make concurrent serving fast without any tuning. Both only fill values the user left unset - explicit config always wins.

Hardware-tuned defaults (#10411). When batch: is unset, default n_batch/n_ubatch to 2048 on NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10 / DGX Spark) for a higher prefill ceiling. More importantly, the llama.cpp backend ships n_parallel = 1, which serializes concurrent requests and leaves continuous batching off - so multi-user serving was effectively disabled. This folds in a VRAM-scaled parallel-slot default:

VRAM	parallel slots
≥ 32 GiB	8
≥ 8 GiB	4
≥ 4 GiB	2
< 4 GiB / unknown	1 (unchanged)

Because the unified KV cache makes slots share the context budget, this is concurrency without multiplying KV memory. Works for both single-host (LocalGPU()) and distributed (the worker reports compute capability + VRAM on registration, and the router re-applies the heuristics for the selected node).

Prefix caching on by default (#10415). The backend ships n_cache_reuse = 0 (cross-request KV prefix reuse disabled). This enables it by default (256), so system prompts, RAG context, agent scaffolds and multi-turn chat aren't recomputed every request - a TTFT + throughput win for shared-prefix workloads, no-op for unique prompts. Same PR consolidates SetDefaults into clean domain-grouped tiers (ApplyInferenceDefaults / ApplyHardwareDefaults / ApplyServingDefaults / ApplyGenericDefaults), completed by a single-source-of-truth defaults refactor (#10418).

🔗 PRs: #10411, #10415, #10418.

🔀 Model Aliases

A new alias: field (#10414) makes a model config transparently route all its traffic to another configured model, so operators can rename or redirect a model without reconfiguring any clients - and swap the target live.

name: gpt-4
alias: my-llama-3

1:1 and runtime-swappable; strict (target must be an existing, enabled, non-alias model; alias→alias chains are rejected at load, request and create/swap time). Both names appear in GET /v1/models, usage accounting records requested=alias / served=target, and resolution lives in the universal request middleware so all modalities (chat, completions, embeddings, audio, images) inherit it - including composition with the Router.

🔗 PR: #10414.

⏱️ Word-Level ASR Timestamps, Everywhere

NeMo word-level timestamps (#10297) for ASR models.
CrispASR word-level timestamps (#10403), with the gRPC AudioTranscription wrapper now forwarding word timestamps end to end (#10402) and a filter for garbage words on the parakeet path (#10421).

🔗 PRs: #10297, #10403, #10402, #10421.

🧰 vLLM, ds4, Vulkan & Watchdog

vLLM progressive tool-call streaming (#10351) via parser.extract_tool_calls_streaming (follow-up to #10346), plus a fix for structured outputs silently ignored on vLLM ≥ 0.23 (GuidedDecodingParams removed upstream) (#10343).
ds4 SSD streaming + quality knobs (#10374). ds4's engine options are now reachable from model YAML through a declarative table - including SSD streaming (run a model larger than RAM by streaming routed MoE experts off the GGUF), so the 153 GB DeepSeek Flash quant runs on a 128 GB machine. Adds 128 GB-class DeepSeek gallery models.
Self-contained Vulkan backends (#10404). Vulkan backends now bundle their Mesa ICD driver .so + deps (rewriting library_path to bare sonames), so you can mix a CPU/native/Intel core image with a Vulkan backend and actually get the GPU instead of a silent CPU fallback.
Size-aware watchdog eviction (#9527). Opt-in mode evicts the largest resident model first (LRU as tiebreaker) instead of strict LRU, so a tiny embedding model isn't dropped while a 13 GB chat model stays put. --size-aware-eviction / LOCALAI_SIZE_AWARE_EVICTION, live-reloadable via POST /api/settings.

🔗 PRs: #10351, #10343, #10374, #10404, #9527.

🖼️ A Calmer, Sharper Interface

The React UI (core/http/react-ui) gets a top-to-bottom redesign this cycle - a calmer, more editorial look with the rough edges sanded off:

Design language + shell/nav + conversation/canvas (#10390). Keeps the Nord identity with a calmer, editorial, all-sans point of view: typography v2 (Geist), un-overloaded accent semantics (tint-only active nav, single AA focus ring), orchestrated page-reveal motion (reduced-motion safe), reusable primitives (PageHeader, SectionHeading, EmptyState, Skeleton, StatusPill), a rebuilt Home landing page, and PageHeader rolled across all ~29 pages with a navigation scroll-reset fix.
Ops/admin data-viz + tables (#10398). Distinct hues for prompt vs completion in usage charts, sortable accessible admin tables (users/traces), a ResponsiveTable that reflows into label/value cards on mobile, and an UnsavedChangesGuard protecting Settings / Agent / Fine-Tuning forms. 195/195 Playwright specs green.
Restructured Cluster Nodes (#10447). A calm one-line ClusterPulse header + conditional attention callout replace the metric-card grid; a NodePanel roster shows per-node models without a click (new GET /api/nodes/models); a deep-linkable /app/nodes/:id detail page replaces nested table drawers; Scheduling moves to its own /app/scheduling page. Nodes.jsx drops from ~1743 to ~360 lines.

Left: the conversation canvas. Right: the Operate console (system resources, sortable model tables).

- **More:** localized model strings + "Import" typo fix (#10341), paste images from the clipboard into chat (#10428), and console-based navigation + a drop-in API endpoint section (#10377).

🔗 PRs: #10390, #10398, #10447, #10341, #10428, #10377.

🛰️ Distributed Staging Robustness

Detach cold-load staging from the request context (#10438). A model staged lazily on the inference path was bound to the triggering request's context, so a browser refresh, LB idle-timeout, or round-robined retry cancelled multi-GB uploads mid-transfer - large models never finished staging. Cold loads now run on context.WithoutCancel(ctx) (keeping request values, dropping cancellation), each long step keeping its own bound.
Broadcast staging progress across replicas (#10440). Staging progress lived only in the originating replica's in-memory tracker, so the progress line flickered as /api/operations polls rotated between frontends. Now mirrored over NATS (staging.<model>.progress) with leading-edge debounce, TTL'd remote ops, and locally-owned ops staying authoritative - the same pattern as gallery-install progress.
Persist cancellable ops (#10454) so restarted in-flight gallery operations stay cancellable (with the e2e UpdateProgress signature updated in #10460), plus staging of backend companion assets to remote nodes (#10330).
Resumable, self-healing downloader (#10406). Big GGUF installs on slow/flaky links no longer hang forever or leak disk: a stall timeout (DownloadStallTimeout, 60s) turns an indefinite hang into a fast retryable error, cancellation keeps the .partial so the next attempt resumes via Range, and stale partials older than 24h are reaped on startup.

🔗 PRs: #10438, #10440, #10454, #10460, #10330, #10406.

🧩 Other Enhancements

Generic chat_template_kwargs (#10359). Pass arbitrary jinja chat-template variables (e.g. Qwen3's preserve_thinking) from model YAML (chat_template_kwargs:) or per-request via the OpenAI metadata field - no more hardcoded template levers in grpc-server.cpp. (Closes #10329.)
LocalAI User-Agent on registry pulls (#10434). OCI registry, Ollama registry and OCI blob pulls now identify themselves as LocalAI/<version> (via a new oci.UserAgent() helper) so operators can attribute traffic. (Implements #6258.)
More gallery voices (#10332): all Italian + all UK English sherpa-onnx Piper voices, and a fresh run of gallery-agent model additions + checksum refreshes.

🐛 Bug Fixes (recap)

🛰️ fix(distributed): detach cold-load staging from the request context - #10438
🛰️ fix(distributed): broadcast file-staging progress across replicas - #10440
🛰️ fix(distributed): stage backend companion assets to remote nodes - #10330
📦 fix(galleryop): persist cancellable so restarted in-flight ops stay cancellable - #10454
⬇️ fix(downloader): stall timeout, resume-safe cancel, and stale-partial reaping - #10406
🧰 fix(vllm): structured outputs silently ignored on vLLM >= 0.23 (GuidedDecodingParams removed) - #10343
⏱️ fix(grpc): forward word-level timestamps in AudioTranscription wrapper - #10402
🎙️ fix(crispasr): filter garbage words from parakeet word-level timestamps - #10421
🎙️ fix(whisperx): use whisperx.diarize.DiarizationPipeline with token kwarg - #10389
🧮 fix(diffusers): pin diffusers and transformers to a known-good pair - #10442
🏋️ fix: the trl backend's _do_training method directly initializes the trainer - #10422
🔊 fix(realtime): raise WebRTC data-channel max-message-size + keep sendLoop alive - #10407
⚙️ fix(settings): merge partial /api/settings updates instead of overwriting - #10463
🐕 fix(settings): start watchdog on cold-enable from the React UI - #10287
🪟 fix(ui): keep row action menu anchored and stop scroll snap on /app/manage - #10419
🪟 fix(react-ui): restore sidebar collapse in dev + stop Talk page auto-scroll - #10383
🚀 fix(launcher): truncate download status labels to stop progress dialog blowout - #10357
🧊 fix(backend): call vram.EstimateModelMultiContext (master build broken) - #10426
❄️ fix(nix flake): ensure nix flake builds successfully - #10399
🧠 fix(gallery): hide broken Gemma 4 QAT MTP entries - #10348

👒 Dependencies

Another steady bump cycle across submodules and Go/Python deps:

ggml-org/llama.cpp: 7 bumps · ikawrakow/ik_llama.cpp: 7 bumps
ggml-org/whisper.cpp: 5 bumps · leejet/stable-diffusion.cpp: 5 bumps
antirez/ds4: 3 bumps · mudler/parakeet.cpp: 2 bumps · CrispStrobe/CrispASR: 2 bumps · ServeurpersoCom/qwentts.cpp: 2 bumps
ServeurpersoCom/omnivoice.cpp: 1 bump · localai-org/privacy-filter.cpp: 1 bump
Python: grpcio 1.81.0→1.81.1 (vllm)
CI actions: actions/checkout 6→7
LocalRecall bump (fixes PostgreSQL collection name with :)
Model gallery: 11 gallery-agent model additions + checksum refreshes

📖 Documentation

docs: document all available backends and add "built by us" list - #10376
docs: document the privacy-filter.cpp backend - #10386
docs: mention apex-quant in the README - #10412
docs: add translated README links - #10353
fix(docs): use relearn notice shortcode instead of unsupported alert - #10364
docs: update docs version - #10333
Inline docs folded into the feature PRs above (depth, ced, supertonic, PII NER, realtime speaker/compaction, serving defaults, model aliases).

🙌 New Contributors

@vjsai made their first contribution in #10434
@SuperMarioYL made their first contribution in #9527
@Souheab made their first contribution in #10399
@dowithless made their first contribution in #10353

Enjoy!

Full Changelog: v4.4.3...v4.5.0

mudler/LocalAI v4.5.0 on GitHub