🎉 LocalAI 4.3.0 Release! 🚀

LocalAI 4.3.0 is out!

This release hardens the trust boundary and improves defaults for speed. Backend OCI images now ship with keyless cosign signatures and a per-gallery `verification:` policy, with an opt-in strict mode that fails closed.
The llama-cpp server-side prompt cache works by default: repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) collapse from minutes to seconds without touching YAML. Distributed mode gets another round of optimizations. Usage tracking grows a per-API-key + per-user Sources view so admins can finally answer "who is burning the GPU?". And, for everyone on a Jetson/DGX box, the L4T13 (cu130/aarch64) backends are back.

📌 TL;DR

| Feature | Summary |
|---|---|
| 🔐 Signed Backends | Keyless cosign + sigstore-go verification for backend OCI images, OCI 1.1 referrers, `not_before` revocation, opt-in strict mode. |
| ⚡ Prompt Cache by Default | `llama-cpp` server-side prompt cache works out of the box. Repeated system prompts go from 5-8 min to seconds. |
| 📊 Usage per API Key | New Sources tab attributes traffic to keys and users. Revoked keys stay readable in history. |
| 🛰️ Distributed v3 | Per-request replica routing, cached `probeHealth`, async per-node installs with streaming progress, unified backend-logs entry point. |
| 🩺 Traces UI Stays Snappy | `LOCALAI_TRACING_MAX_BODY_BYTES` caps API + backend trace payloads. Admin Traces page stops drowning in 40 MB embeddings. |
| 🧊 Nix Flake | Dockerless setup for NixOS users via `flake.nix` + dev shell. |
| 🦾 Jetson Thor Restored | `vllm` / `sglang` / `vllm-omni` L4T13 backends switched to PyPI aarch64+cu130 wheels (torch 2.10 ABI fix). |

🚀 New Features & Major Enhancements

🔐 Signed Backends with Keyless Cosign

LocalAI now verifies that backend OCI images came from our CI, not a compromised registry or MITM. This closes a real trust gap: the gallery YAML told LocalAI which image to pull, but nothing checked the bytes.

The producer side (.github/workflows/backend_merge.yml) signs every merged backend image (and every per-arch entry under the manifest list) with sigstore/cosign keyless via Fulcio + Rekor, using OCI 1.1 referrers (no legacy :tag.sig). The consumer side (pkg/oci/cosignverify, built on sigstore-go) verifies signatures against a per-gallery `verification:` policy:

verification:
  issuer_regex: "^https://token\\.actions\\.githubusercontent\\.com$"
  identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@.*$"
  not_before: "2026-05-22T00:00:00Z"

The TUF trusted root is cached process-wide, so N backends from one gallery trigger 1 fetch, not N.
not_before is the revocation lever: keyless Fulcio certs are ephemeral, so revocation is policy-side. Advance the date in the gallery YAML and every signature predating the cutoff is invalidated.
Digest pinning closes the TOCTOU window between verify and pull.
Strict mode: --require-backend-integrity (or LOCALAI_REQUIRE_BACKEND_INTEGRITY=true) escalates missing policy / empty SHA256 from warn to hard-fail.
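
To make the policy semantics above concrete, here is a minimal Go sketch of how such a check could be evaluated. It is not the code in pkg/oci/cosignverify: the `Policy`, `SignerInfo`, `Check` and `ExamplePolicy` names are hypothetical, the `@refs/heads/master` identity suffix is an illustrative value, and real verification also validates the certificate chain and the Rekor entry, which this sketch skips entirely.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"time"
)

// Policy mirrors the fields of the gallery verification block shown above.
type Policy struct {
	IssuerRegex   string
	IdentityRegex string
	NotBefore     time.Time
}

// SignerInfo is a hypothetical stand-in for what a verifier extracts from a
// signature's certificate and its transparency-log timestamp.
type SignerInfo struct {
	Issuer   string
	Identity string
	SignedAt time.Time
}

// ExamplePolicy returns the policy from the YAML snippet above.
func ExamplePolicy() *Policy {
	return &Policy{
		IssuerRegex:   `^https://token\.actions\.githubusercontent\.com$`,
		IdentityRegex: `^https://github\.com/mudler/LocalAI/\.github/workflows/backend_merge\.yml@.*$`,
		NotBefore:     time.Date(2026, time.May, 22, 0, 0, 0, 0, time.UTC),
	}
}

// Check returns nil when the signer satisfies the policy. A nil policy is
// tolerated unless strict is set, mirroring the warn-versus-fail behaviour
// described for strict mode.
func Check(p *Policy, s SignerInfo, strict bool) error {
	if p == nil {
		if strict {
			return errors.New("no verification policy and strict mode is on")
		}
		return nil // a real implementation would log a warning here
	}
	issuerOK, err := regexp.MatchString(p.IssuerRegex, s.Issuer)
	if err != nil {
		return err
	}
	identityOK, err := regexp.MatchString(p.IdentityRegex, s.Identity)
	if err != nil {
		return err
	}
	if !issuerOK || !identityOK {
		return errors.New("signer does not match policy")
	}
	// not_before is the revocation lever: older signatures are rejected.
	if s.SignedAt.Before(p.NotBefore) {
		return errors.New("signature predates not_before")
	}
	return nil
}

func main() {
	p := ExamplePolicy()
	s := SignerInfo{
		Issuer:   "https://token.actions.githubusercontent.com",
		Identity: "https://github.com/mudler/LocalAI/.github/workflows/backend_merge.yml@refs/heads/master",
		SignedAt: p.NotBefore.Add(24 * time.Hour),
	}
	fmt.Println("fresh signature:", Check(p, s, true))
	s.SignedAt = p.NotBefore.Add(-24 * time.Hour)
	fmt.Println("old signature:", Check(p, s, true))
	fmt.Println("no policy, strict:", Check(nil, s, true))
}
```

Advancing `NotBefore` past a signature's timestamp is enough to flip the result from accepted to rejected, which is the policy-side revocation described above.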

Rollout is backward-compatible: until a gallery ships a `verification:` block, installs proceed with a warning. The default backend/index.yaml will be populated next, and strict mode is opt-in. See .agents/backend-signing.md for the full producer + consumer story.

🔗 PRs: #9823 (consumer + producer + plumbing), #9957 (fix for current cosign releases).

⚡ Prompt Cache: On by Default

llama-cpp ships with a server-side prompt cache, but until now LocalAI was not enabling it by default. Repeated system prompts (agents, Claude-Code-style coding assistants, OpenAI-compatible CLIs with long instructions) were re-prefilled on every call. With this release, the same workload drops from minutes to seconds with no configuration on your side.

Two changes, one default flip each:

kv_unified=true by default in grpc-server.cpp. The previous false was silently force-disabling cache_idle_slots at server init (the host prompt cache was being allocated but never written across requests).
prompt_cache_all defaults to true at the YAML layer, matching upstream llama.cpp's own common.h default. The per-request cache_prompt knob is now on out of the box.

You can still opt out with options: ["kv_unified:false"] or prompt_cache_all: false, and there are new option keys (cache_idle_slots, checkpoint_every_nt) for tuning. Docs in docs/content/advanced/model-configuration.md got a worked example for the repeated-system-prompt workload and a proper explanation of how kv_unified, cache_ram, and cache_idle_slots interact.
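
The "tristate default" for prompt_cache_all can be pictured as a pointer-to-bool field, where unset, true and false are three distinct states. The Go sketch below is an illustration under that assumption; `ModelOptions` and `EffectivePromptCacheAll` are hypothetical names, not LocalAI's actual config types.

```go
package main

import "fmt"

// ModelOptions is a hypothetical slice of a model config. A *bool can be
// nil (key absent from YAML), true, or false, which gives three states.
type ModelOptions struct {
	PromptCacheAll *bool
}

// EffectivePromptCacheAll resolves the tristate: an absent key means "on",
// while an explicit value always wins.
func EffectivePromptCacheAll(o ModelOptions) bool {
	if o.PromptCacheAll == nil {
		return true
	}
	return *o.PromptCacheAll
}

func main() {
	off := false
	fmt.Println("unset:", EffectivePromptCacheAll(ModelOptions{}))
	fmt.Println("explicit false:", EffectivePromptCacheAll(ModelOptions{PromptCacheAll: &off}))
}
```

The point of the third state is that an explicit `prompt_cache_all: false` in existing YAML keeps working as an opt-out, while configs that never mentioned the key pick up the new default.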

🔗 PRs: #9925 (kv_unified + cache_idle_slots defaults + docs), #9951 (prompt_cache_all tristate default).

📊 Per-API-Key Usage Tracking

Closes #9862. The usage page now answers "who spent these tokens?", not just "how many tokens were spent".

usage_records gained Source (apikey / web / legacy), APIKeyID, APIKeyName, plus an idempotent backfill of pre-feature rows on InitDB.
Auth middleware plumbs the resolved *UserAPIKey and the request source through the Echo context. Usage middleware snapshots the key id + name, so revoked keys stay readable in history (rendered as (revoked)).
New endpoints: GET /api/auth/usage/sources (self, no legacy) and GET /api/auth/admin/usage/sources (admin, with user_id / api_key_id filters, 200-key truncation).
React Usage page gains a Sources tab with a source-mix ribbon, a top-7 + Other time chart, and a searchable/sortable table with drill-in chip.
Admin view (follow-up in #9935) also rolls up (source, user_id, user_name) so Web UI session traffic is split per user instead of lumped into one global "Web UI" row, and every named-key row shows the owning account.
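
Why snapshotting the key name matters is easiest to see in a small sketch. The Go code below is illustrative only: `UsageRecord` borrows the column names listed above, but `DisplayName`, `TotalsByName` and the `liveKeys` lookup are hypothetical helpers, not the actual middleware or query code.

```go
package main

import "fmt"

// UsageRecord carries the attribution columns described above. The key name
// is copied onto the row at request time instead of being joined later.
type UsageRecord struct {
	Source     string // "apikey", "web" or "legacy"
	APIKeyID   string
	APIKeyName string
	Tokens     int64
}

// DisplayName labels a row. Because the name was snapshotted, a key that no
// longer exists can still be shown by name, tagged as revoked.
func DisplayName(r UsageRecord, liveKeys map[string]bool) string {
	if r.Source != "apikey" {
		return r.Source
	}
	if liveKeys[r.APIKeyID] {
		return r.APIKeyName
	}
	return r.APIKeyName + " (revoked)"
}

// TotalsByName sums tokens per display label.
func TotalsByName(records []UsageRecord, liveKeys map[string]bool) map[string]int64 {
	totals := make(map[string]int64)
	for _, r := range records {
		totals[DisplayName(r, liveKeys)] += r.Tokens
	}
	return totals
}

func main() {
	live := map[string]bool{"k1": true} // k2 has been revoked
	records := []UsageRecord{
		{Source: "apikey", APIKeyID: "k1", APIKeyName: "laptop", Tokens: 10},
		{Source: "apikey", APIKeyID: "k2", APIKeyName: "ci-bot", Tokens: 30},
		{Source: "web", Tokens: 5},
	}
	for name, tokens := range TotalsByName(records, live) {
		fmt.Println(name, tokens)
	}
}
```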

Docs: features/authentication.md gained a full Usage Tracking section with the new tab, endpoints, response shape and migration notes.

🔗 PRs: #9920 (core + Sources tab), #9935 (per-user attribution in admin view).

🛰️ Distributed Mode v3

Distributed mode keeps hardening. This release fixes the two things that bit operators hardest in practice and lays the groundwork for the next round of UX.

Per-request routing across replicas (#9968) restores cross-node load balancing. The bug: ModelLoader.Load cached a *Model whose embedded InFlightTrackingClient was bound to a single (nodeID, replicaIndex). After the first request, every subsequent call reused that wrapper and pinned to whichever node won the first pick, even after the reconciler scaled the model out. The reproducer from the report:

dgx-spark1     loaded   in_flight=6
nvidia-thor1   loaded   in_flight=0       (← idle, never gets traffic)

Now SmartRouter.Route runs per request, the existing in_flight ASC, last_used ASC, available_vram DESC round-robin actually fires, and the replica-selection rule lives in one place (PickBestReplica) with a mirror spec asserting the SQL ORDER BY and the Go picker agree on a seeded dataset. probeHealth is now memoized per (nodeID, addr) with a 30s TTL and singleflight coalescing, so a burst of new requests doesn't stall on a HealthCheck that llama.cpp serializes against in-flight Predict.
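
The ordering rule named above (in_flight ASC, last_used ASC, available_vram DESC) can be sketched in Go as a three-level comparison. This is an illustration of the rule only: the `Replica` struct and `PickBest` function are hypothetical and do not reproduce the real PickBestReplica or its SQL counterpart.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Replica is a hypothetical view of one loaded model replica.
type Replica struct {
	NodeID        string
	InFlight      int
	LastUsed      time.Time
	AvailableVRAM uint64
}

// PickBest applies the ordering from the text: fewest in-flight requests
// first, then least recently used, then most free VRAM. It reports false
// when there are no replicas to choose from.
func PickBest(replicas []Replica) (Replica, bool) {
	if len(replicas) == 0 {
		return Replica{}, false
	}
	sorted := append([]Replica(nil), replicas...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.InFlight != b.InFlight {
			return a.InFlight < b.InFlight
		}
		if !a.LastUsed.Equal(b.LastUsed) {
			return a.LastUsed.Before(b.LastUsed)
		}
		return a.AvailableVRAM > b.AvailableVRAM
	})
	return sorted[0], true
}

func main() {
	now := time.Now()
	replicas := []Replica{
		{NodeID: "dgx-spark1", InFlight: 6, LastUsed: now},
		{NodeID: "nvidia-thor1", InFlight: 0, LastUsed: now.Add(-time.Minute)},
	}
	best, _ := PickBest(replicas)
	fmt.Println("next request goes to:", best.NodeID)
}
```

With the reproducer's numbers, the idle node wins on the first key alone, which is the behaviour the fix restores once routing runs per request rather than once per cached model.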

Async per-node installs via the gallery job queue (#9928). POST /api/nodes/:id/backends/install used to block the request for up to 3 minutes while the worker pulled the image, freezing the React UI's Backends picker. It now returns HTTP 202 + jobID immediately, scoped to a one-element targetNodeIDs allowlist, with a node-scoped opcache row so concurrent installs on different nodes don't collide. The Operations panel surfaces a nodeID field for attribution.
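
The accept-then-queue pattern described above looks roughly like the following Go sketch. It uses plain net/http rather than LocalAI's Echo handlers, and everything in it is assumed for illustration: the `queue` type, the `backend` request field, the `job-N` ID format and the `{"jobID": ...}` response shape are not taken from the real endpoint.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// installJob is a hypothetical queue entry scoped to a one-element node allowlist.
type installJob struct {
	ID            string
	TargetNodeIDs []string
	Backend       string
}

// queue stands in for the gallery job queue; a worker goroutine would drain jobs.
type queue struct {
	jobs chan installJob
	next int64
}

func (q *queue) enqueue(nodeID, backend string) installJob {
	id := fmt.Sprintf("job-%d", atomic.AddInt64(&q.next, 1))
	job := installJob{ID: id, TargetNodeIDs: []string{nodeID}, Backend: backend}
	q.jobs <- job
	return job
}

// installHandler queues the install and answers 202 immediately instead of
// blocking until the worker has pulled the image. A real router would read
// nodeID from the :id path parameter.
func installHandler(q *queue, nodeID string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req struct {
			Backend string `json:"backend"`
		}
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Backend == "" {
			http.Error(w, "backend is required", http.StatusBadRequest)
			return
		}
		job := q.enqueue(nodeID, req.Backend)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusAccepted)
		_ = json.NewEncoder(w).Encode(map[string]string{"jobID": job.ID})
	}
}

// demo sends one request to a fresh handler and returns the status and job ID.
func demo(body string) (int, string) {
	q := &queue{jobs: make(chan installJob, 8)}
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/api/nodes/n1/backends/install", strings.NewReader(body))
	installHandler(q, "n1")(rec, req)
	var out map[string]string
	_ = json.Unmarshal(rec.Body.Bytes(), &out)
	return rec.Code, out["jobID"]
}

func main() {
	code, id := demo(`{"backend":"llama-cpp"}`)
	fmt.Println(code, id)
}
```

The caller then polls or subscribes for the job's status using the returned ID, so the UI never waits on the image pull itself.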

Resilient backend installs with streaming progress (#9958). Two phases:

Phase 1: LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT / LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT env vars (default 15m, previously hardcoded 3m). A NATS round-trip timeout while the worker is still pulling no longer reports as a hard failure: per-node status becomes running_on_worker, the queue row stays alive without bumping Attempts, and ListBackends proactively clears install rows whose intent is satisfied (so the UI updates instantly instead of waiting up to 15m for the next reconciler tick).
Phase 2: workers publish debounced (~250ms) BackendInstallProgressEvent values on a transient nodes.<nodeID>.backend.install.<opID>.progress subject. The master subscribes for the duration of the request and forwards each event into OpStatus.UpdateStatus, so the admin UI gets per-byte progress for distributed installs the same way local-mode does, with no UI changes. Backward compatible: old workers stay silent, new masters tolerate silence.
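
As a rough picture of Phase 2, the Go sketch below builds the progress subject from the pattern quoted above and rate-limits events to one per interval. The names `ProgressSubject`, `Debouncer` and `at` are hypothetical, and the limiter is a simple leading-edge throttle standing in for whatever debouncing the workers actually use; a real publisher would presumably also make sure the final event is always sent.

```go
package main

import (
	"fmt"
	"time"
)

// ProgressSubject fills in the subject pattern quoted in the text.
func ProgressSubject(nodeID, opID string) string {
	return fmt.Sprintf("nodes.%s.backend.install.%s.progress", nodeID, opID)
}

// Debouncer lets an event through only if at least Interval has passed since
// the last one it allowed. Time is passed in so the logic is deterministic.
type Debouncer struct {
	Interval time.Duration
	last     time.Time
	started  bool
}

// Allow reports whether an event observed at now should be published.
func (d *Debouncer) Allow(now time.Time) bool {
	if d.started && now.Sub(d.last) < d.Interval {
		return false
	}
	d.started = true
	d.last = now
	return true
}

// at returns a fixed reference time plus ms milliseconds.
func at(ms int) time.Time {
	return time.Unix(0, 0).Add(time.Duration(ms) * time.Millisecond)
}

func main() {
	subject := ProgressSubject("node-1", "op-42")
	d := &Debouncer{Interval: 250 * time.Millisecond}
	for _, ms := range []int{0, 100, 200, 300, 560} {
		if d.Allow(at(ms)) {
			fmt.Printf("t=%dms publish on %s\n", ms, subject)
		} else {
			fmt.Printf("t=%dms dropped\n", ms)
		}
	}
}
```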

Unified backend-logs entry point (#9949). /app/backend-logs/:modelId is now a single, mode-aware route. In standalone it's the local WebSocket view, unchanged. In distributed it probes nodesApi.getModels, filters by model_name, then routes: 0 hits → empty state with a link to Nodes; 1 hit → <Navigate replace> to the per-node logs URL preserving the ?from= deep-link timestamp; N hits → a picker listing each hosting worker with node id, replica index and load state. Every view that links to backend logs now points at the same URL.

Bug-hunt harness. A new distributed test harness landed in tests/distributed/ to catch the kind of regressions the #9968 reproducer surfaced.

🔗 PRs: #9968, #9928, #9958, #9949, plus the tests(distributed): add bug-hunt harness commit.

🩺 Admin Traces UI: Stays Responsive Under Load

Two complementary caps fix the symptom where the admin Traces page sat in "loading" forever on a chatty agent-pool RAG deployment.

API-side (#9946): LOCALAI_TRACING_MAX_BODY_BYTES (default 64 KiB) caps each captured request/response body in the trace middleware. The full payload still flows to the real client; only the trace copy is bounded. body_truncated + original body_bytes are recorded so the dashboard can surface that truncation happened. Observed before the fix on a live deployment: /api/traces returned 44.6 MB (466 traces, 447 /embeddings, top body 1.38 MB). The Traces UI Clear button is also kept enabled during loading, which is exactly when you need it.
Backend-side (#9960): RecordBackendTrace walks the Data map and replaces any string value larger than the cap with <truncated: N bytes>. Producers (core/backend/llm.go, core/trace/audio_snippet.go) apply head-preserving truncation upstream so the UI still shows useful leading content. TTS / audio_transform traces drop the base64 snippet when the encoded blob exceeds the cap (truncated base64 is undecodable; the React WaveformPlayer already no-ops without it).
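
The two truncation styles described above can be sketched as follows. This Go code is an assumption-laden illustration, not RecordBackendTrace itself: `CapStrings` and `HeadTruncate` are hypothetical names, recursing into nested maps is a guess about the intended behaviour, and trimming back to a UTF-8 boundary is a choice made here to keep the output valid text.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// CapStrings replaces every string value longer than maxBytes with a
// "<truncated: N bytes>" marker. Nested maps are walked recursively.
func CapStrings(data map[string]any, maxBytes int) {
	for k, v := range data {
		switch val := v.(type) {
		case string:
			if len(val) > maxBytes {
				data[k] = fmt.Sprintf("<truncated: %d bytes>", len(val))
			}
		case map[string]any:
			CapStrings(val, maxBytes)
		}
	}
}

// HeadTruncate keeps at most the first maxBytes bytes of s, backing up to a
// rune boundary so a multi-byte character is never split.
func HeadTruncate(s string, maxBytes int) string {
	if maxBytes < 0 {
		maxBytes = 0
	}
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	trace := map[string]any{
		"prompt": HeadTruncate("You are a helpful assistant with a very long preamble", 16),
		"audio":  "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
		"tokens": 128,
	}
	CapStrings(trace, 32)
	fmt.Println(trace["prompt"])
	fmt.Println(trace["audio"])
}
```

Producers that care about readable leading content would apply something like `HeadTruncate` first, leaving the marker replacement as a last-resort backstop for values nobody trimmed upstream.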

Both knobs are live-tunable from the Traces settings panel.

🔗 PRs: #9946 (API side), #9960 (backend side).

🧊 Nix Flake for NixOS Users

New flake.nix + flake.lock ship a reproducible, dockerless setup for NixOS, plus a dev shell for hacking on LocalAI without a container.

🔗 PRs: #9851 (initial flake), #9894 (correct src path + dev shell).

🦾 Jetson Thor (L4T13) Backends Restored

The cuda13-nvidia-l4t-arm64-vllm / sglang / vllm-omni backends crashed at import with an undefined c10::MessageLogger symbol after the pypi.jetson-ai-lab.io/sbsa/cu130 mirror started shipping torch 2.11 next to vllm/sglang wheels built against torch 2.10. Per the PyTorch April 2026 announcement, all three backends now pull from PyPI's official aarch64 + cu130 wheels instead, with the L4T13 pyproject.toml retired in favor of the standard requirements-${profile}.txt pattern used everywhere else.

🔗 PR: #9950.

📎 Chat: File Attachments + Stream Usage + Selection

Three independent fixes that together make the chat experience visibly better:

Text-file attachments actually reach the model (regression from react-ui port) (#9896). .txt, .md, .csv, .json content was silently dropped in useChat.js (only image_url and audio_url branches added content; the else branch only pushed metadata). Home.jsx also never called file.text() for files attached from the home screen. Both fixed. PDF files still need a parser (PDF.js or server-side extraction) and are flagged as a known limitation.
Stream include_usage returns non-zero with tools (#9941). Fixes #9927. processTools discarded the cumulative TokenUsage from ComputeChoices, so the streaming trailer reported {0, 0, 0} whenever a tools array was present. The fix forwards the authoritative final usage via a sentinel chunk before close(responses), with the outer loop updated to capture before the empty-Choices skip. The OpenAI streaming spec contract is preserved (intermediate chunks still carry no usage).
Chat selection stops getting wiped every second (#9917). React 19 dropped the old lastHtml === nextHtml short-circuit in its DOM diff, so the 1s /api/operations poll re-assigning setOperations with a fresh array reference was collapsing text selection on every assistant message. Now JSON-compared and short-circuited. Bonus: the per-message copy button works over plain HTTP via a hidden-textarea + execCommand('copy') fallback when navigator.clipboard is unavailable.

🔧 llama.cpp Stability + Refactors

tensor_buft_overrides sentinel terminator (#9919). Mirror upstream common/arg.cpp:645-658: pad placeholders at the end of the main vector so back().pattern == nullptr holds, and append a single {nullptr, nullptr} to the draft vector when non-empty.
refactor(agents): bump skillserver, drop redundant Name (#9916). list_skills and search_skills now return the same shape (only id, no duplicated name). Adds a Ginkgo regression that drives the LocalAI FilesystemManager through an in-process MCP session.
Swagger refreshes (#9872, #9962) keep the OpenAPI surface in sync with the routes added this cycle.

🛠️ CI & Image Plumbing

Chronologically-orderable master tags. Master images are now also tagged master-<epoch>-<sha> so they sort by build time. The pre-existing master tag still moves with HEAD.
Backend signing CI: COSIGN_EXPERIMENTAL=1 is set for the oci-1-1 referrers mode in the backend-signing job to keep current cosign versions happy.

🐛 Bug Fixes (recap)

🔁 fix(distributed): route per request across loaded replicas + cache probeHealth - #9968
🧰 fix(distributed): make admin backend installs resilient and observable - #9958
⏳ fix(nodes): make per-node backend install async via gallery job queue - #9928
🧾 fix(traces): cap backend trace Data to keep admin UI responsive - #9960
🧾 fix(traces): cap captured body size to keep admin Traces UI responsive - #9946
🪟 fix(react-ui): unify backend-logs entry point for distributed mode - #9949
📎 fix: inject text-file content into chat completions messages - #9896
🧮 fix(openai): stream usage non-zero when tools are enabled - #9941
🧷 fix(react-ui/chat): stop wiping selection on every /api/operations poll - #9917
🧱 fix(llama-cpp): terminate tensor_buft_overrides with sentinel - #9919
🦾 fix(L4T13 backends): switch vllm/sglang/vllm-omni to PyPI aarch64+cu130 wheels - #9950
🧊 fix(nix): correct flake src path and add dev shell - #9894
🔐 Fix backend manifest merge signing on current cosign releases - #9957
🧯 [utils] Fail immediately on extraction errors - #9926

👒 Dependencies

Heavy bump cycle across submodules and Go/Python deps:

ggml-org/llama.cpp: 7 bumps (#9855, #9876, #9897, #9915, #9934, #9952, #9963)
ggml-org/whisper.cpp: 4 bumps (#9877, #9898, #9929, #9954)
ikawrakow/ik_llama.cpp: 7 bumps (#9866, #9875, #9899, #9914, #9930, #9953, #9966)
leejet/stable-diffusion.cpp: 4 bumps (#9907, #9933, #9955, #9965)
antirez/ds4: 5 bumps (#9864, #9900, #9909, #9932, #9964)
ace-step/acestep.cpp: bumped to ed53caf with wrapper API adapted (#9908, #9913)
Model gallery: checksum refreshes (#9901, #9910)
Go modules: alecthomas/kong 1.14→1.15 (#9881), aws/aws-sdk-go-v2 1.41.6→1.41.7 (#9892), onsi/ginkgo/v2 2.28.2→2.29.0 (#9882), golang.org/x/crypto 0.50→0.51 (#9886)
Python: transformers ≥5.8.1 (#9883), sentence-transformers 5.4→5.5 (#9888)

📖 Documentation

docs: ⬆️ update docs version mudler/LocalAI - #9863
Plus inline docs updates folded into the feature PRs above (prompt-cache explainer, authentication / usage tracking section, backend signing guide).

🙌 New Contributors

@Azteczek made their first contribution in #9851
@inquam made their first contribution in #9896
@RinZ27 made their first contribution in #9926

Enjoy!

Full Changelog: v4.2.6...v4.3.0
