LocalAI 4.3.0 Release!
LocalAI 4.3.0 is out!
This release hardens the trust boundary and improves defaults for speed. Backend OCI images now ship with keyless cosign signatures and a per-gallery verification: policy, with an opt-in strict mode that fails closed.
The llama-cpp server-side prompt cache works by default: repeated system prompts (agents, OpenAI/Anthropic-compatible CLIs, coding assistants) collapse from minutes to seconds without touching YAML. Distributed mode gets another round of optimizations. Usage tracking grows a per-API-key + per-user Sources view so admins can finally answer "who is burning the GPU?". And, for everyone on a Jetson/DGX box, the L4T13 (cu130/aarch64) backends are back.
TL;DR
| Feature | Summary |
|---|---|
| Signed Backends | Keyless cosign + sigstore-go verification for backend OCI images, OCI 1.1 referrers, `not_before` revocation, opt-in strict mode. |
| Prompt Cache by Default | llama-cpp server-side prompt cache works out of the box. Repeated system prompts go from 5-8 min to seconds. |
| Usage per API Key | New Sources tab attributes traffic to keys and users. Revoked keys stay readable in history. |
| Distributed v3 | Per-request replica routing, cached `probeHealth`, async per-node installs with streaming progress, unified backend-logs entry point. |
| Traces UI Stays Snappy | `LOCALAI_TRACING_MAX_BODY_BYTES` caps API + backend trace payloads. Admin Traces page stops drowning in 40 MB embeddings. |
| Nix Flake | Dockerless setup for NixOS users via `flake.nix` + dev shell. |
| Jetson Thor Restored | vllm / sglang / vllm-omni L4T13 backends switched to PyPI aarch64+cu130 wheels (torch 2.10 ABI fix). |
New Features & Major Enhancements
Signed Backends with Keyless Cosign
LocalAI now verifies that backend OCI images came from our CI, not a compromised registry or MITM. This closes a real trust gap: the gallery YAML told LocalAI which image to pull, but nothing checked the bytes.
The producer side (.github/workflows/backend_merge.yml) signs every merged backend image (and every per-arch entry under the manifest list) with sigstore/cosign keyless via Fulcio + Rekor, using OCI 1.1 referrers (no legacy :tag.sig). The consumer side (pkg/oci/cosignverify, built on sigstore-go) verifies signatures against a per-gallery verification: policy:
```yaml
verification:
  issuer_regex: "^https://token\\.actions\\.githubusercontent\\.com$"
  identity_regex: "^https://github\\.com/mudler/LocalAI/\\.github/workflows/backend_merge\\.yml@.*$"
  not_before: "2026-05-22T00:00:00Z"
```
- TUF trusted root cached process-wide, so N backends from one gallery do 1 fetch, not N.
- `not_before` is the revocation lever: keyless Fulcio certs are ephemeral, so revocation is policy-side. Advance the date in the gallery YAML and every signature predating the cutoff is invalidated.
- Digest pinning closes the TOCTOU window between verify and pull.
- Strict mode: `--require-backend-integrity` (or `LOCALAI_REQUIRE_BACKEND_INTEGRITY=true`) escalates missing policy / empty SHA256 from warn to hard-fail.
Rollout is backward-compatible: until a gallery ships a verification: block, installs proceed with a warning. The default backend/index.yaml will be populated next, and strict mode is opt-in. See .agents/backend-signing.md for the full producer + consumer story.
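To make the policy semantics concrete, here is a minimal Python sketch of how the three `verification:` fields could gate a signature. It is not the code in `pkg/oci/cosignverify`: the function name `signature_allowed`, the `POLICY` dict, and the idea that a single `signed_at` timestamp is compared against `not_before` are assumptions made for illustration, and the cryptographic verification itself is out of scope.

```python
import re
from datetime import datetime, timezone

# Hypothetical in-memory copy of a gallery `verification:` block.
POLICY = {
    "issuer_regex": r"^https://token\.actions\.githubusercontent\.com$",
    "identity_regex": r"^https://github\.com/mudler/LocalAI/\.github/workflows/backend_merge\.yml@.*$",
    "not_before": "2026-05-22T00:00:00Z",
}


def parse_rfc3339(ts):
    # Older Python versions reject a trailing "Z", so normalise it first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def signature_allowed(issuer, identity, signed_at, policy=POLICY):
    """Return True when issuer, identity and signing time all satisfy the policy.

    Assumption: the caller has already verified the signature cryptographically
    and extracted these three values from the certificate and log entry.
    """
    if not re.search(policy["issuer_regex"], issuer):
        return False
    if not re.search(policy["identity_regex"], identity):
        return False
    # Signatures that predate the cutoff are treated as revoked.
    return signed_at >= parse_rfc3339(policy["not_before"])


issuer = "https://token.actions.githubusercontent.com"
identity = "https://github.com/mudler/LocalAI/.github/workflows/backend_merge.yml@refs/heads/master"
after_cutoff = datetime(2026, 6, 1, tzinfo=timezone.utc)
before_cutoff = datetime(2026, 5, 1, tzinfo=timezone.utc)

print(signature_allowed(issuer, identity, after_cutoff))
print(signature_allowed(issuer, identity, before_cutoff))
```

Advancing `not_before` in this model rejects older signatures without touching any key material, which is the revocation behaviour the release notes describe.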
PRs: #9823 (consumer + producer + plumbing), #9957 (fix for current cosign releases).
Prompt Cache: On by Default
llama-cpp ships with a server-side prompt cache, but until now LocalAI was not enabling it by default. Repeated system prompts (agents, Claude-Code-style coding assistants, OpenAI-compatible CLIs with long instructions) were re-prefilled on every call. With this release, the same workload collapses to seconds without any specific configuration on your side.
Two changes, one default flip each:
- `kv_unified=true` by default in `grpc-server.cpp`. The previous `false` was silently force-disabling `cache_idle_slots` at server init (the host prompt cache was being allocated but never written across requests).
- `prompt_cache_all` defaults to `true` at the YAML layer, matching upstream `llama.cpp`'s own `common.h` default. The per-request `cache_prompt` knob is now on out of the box.
You can still opt out with options: ["kv_unified:false"] or prompt_cache_all: false, and there are new option keys (cache_idle_slots, checkpoint_every_nt) for tuning. Docs in docs/content/advanced/model-configuration.md got a worked example for the repeated-system-prompt workload and a proper explanation of how kv_unified, cache_ram, and cache_idle_slots interact.
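As a rough intuition for why the default flip matters, the toy Python model below counts how many prompt tokens still need prefill when a request shares a prefix with what is already cached. It is a conceptual sketch only: `tokens_to_prefill` is a hypothetical helper, token IDs are plain integers, and the real cache in llama.cpp works on KV state with its own slot and RAM limits.

```python
def tokens_to_prefill(cached, prompt):
    """Count prompt tokens that are NOT covered by the cached common prefix."""
    shared = 0
    for cached_tok, prompt_tok in zip(cached, prompt):
        if cached_tok != prompt_tok:
            break
        shared += 1
    return len(prompt) - shared


# A long, repeated system prompt followed by a short, changing user turn.
system_prompt = list(range(1000))
first_request = system_prompt + [5001, 5002, 5003]
second_request = system_prompt + [6001, 6002]

cold = tokens_to_prefill([], first_request)
warm = tokens_to_prefill(first_request, second_request)
print(cold, warm)
```

With the cache disabled every request behaves like the cold case; with it enabled only the tokens after the shared system prompt are processed again.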
PRs: #9925 (kv_unified + cache_idle_slots defaults + docs), #9951 (prompt_cache_all tristate default).
Per-API-Key Usage Tracking
Closes #9862. The usage page now answers "who spent these tokens?", not just "how many tokens were spent".
- `usage_records` gained `Source` (`apikey` / `web` / `legacy`), `APIKeyID`, `APIKeyName`, plus an idempotent backfill of pre-feature rows on `InitDB`.
- Auth middleware plumbs the resolved `*UserAPIKey` and the request source through the Echo context. Usage middleware snapshots the key id + name, so revoked keys stay readable in history (rendered as `(revoked)`).
- New endpoints: `GET /api/auth/usage/sources` (self, no legacy) and `GET /api/auth/admin/usage/sources` (admin, with `user_id` / `api_key_id` filters, 200-key truncation).
- React Usage page gains a Sources tab with a source-mix ribbon, a top-7 + Other time chart, and a searchable/sortable table with drill-in chip.
- Admin view (follow-up in #9935) also rolls up `(source, user_id, user_name)` so Web UI session traffic is split per user instead of lumped into one global "Web UI" row, and every named-key row shows the owning account.
Docs: features/authentication.md gained a full Usage Tracking section with the new tab, endpoints, response shape and migration notes.
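The "top-7 + Other" grouping used by the Sources chart can be sketched in a few lines of Python. The record shape here (`source`, `api_key_name`, `tokens`) and the function name `top_n_with_other` are assumptions for illustration; the actual response shape is documented in `features/authentication.md`.

```python
from collections import Counter


def top_n_with_other(records, n=7):
    """Sum tokens per source label, keep the n largest, fold the rest into 'Other'."""
    totals = Counter()
    for rec in records:
        # Named API keys get their own row; web and legacy traffic use the source name.
        label = rec["api_key_name"] if rec["source"] == "apikey" else rec["source"]
        totals[label] += rec["tokens"]

    ranked = totals.most_common()
    top = ranked[:n]
    remainder = sum(tokens for _, tokens in ranked[n:])
    if remainder:
        top.append(("Other", remainder))
    return top


records = [
    {"source": "apikey", "api_key_name": "ci-bot", "tokens": 10},
    {"source": "apikey", "api_key_name": "laptop", "tokens": 5},
    {"source": "web", "api_key_name": "", "tokens": 3},
    {"source": "legacy", "api_key_name": "", "tokens": 1},
]
rows = top_n_with_other(records, n=2)
print(rows)
```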
PRs: #9920 (core + Sources tab), #9935 (per-user attribution in admin view).
Distributed Mode v3
Distributed mode keeps hardening. This release fixes the two things that bit operators hardest in practice and lays the groundwork for the next round of UX.
Per-request routing across replicas (#9968) restores cross-node load balancing. The bug: ModelLoader.Load cached a *Model whose embedded InFlightTrackingClient was bound to a single (nodeID, replicaIndex). After the first request, every subsequent call reused that wrapper and pinned to whichever node won the first pick, even after the reconciler scaled the model out. The reproducer from the report:
dgx-spark1 loaded in_flight=6
nvidia-thor1 loaded in_flight=0 (← idle, never gets traffic)
Now SmartRouter.Route runs per request, the existing in_flight ASC, last_used ASC, available_vram DESC round-robin actually fires, and the replica-selection rule lives in one place (PickBestReplica) with a mirror spec asserting the SQL ORDER BY and the Go picker agree on a seeded dataset. probeHealth is now memoized per (nodeID, addr) with a 30s TTL and singleflight coalescing, so a burst of new requests doesn't stall on a HealthCheck that llama.cpp serializes against in-flight Predict.
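The ordering rule described above (`in_flight ASC, last_used ASC, available_vram DESC`) maps directly onto a sort key. The Python sketch below is an illustration of that rule, not the Go `PickBestReplica` implementation; the `Replica` dataclass and its field names are assumed for the example.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    node_id: str
    in_flight: int       # requests currently being served
    last_used: float     # timestamp of the last routed request
    available_vram: int  # free VRAM in bytes


def pick_best_replica(replicas):
    """Least-loaded first, then least recently used, then most free VRAM."""
    if not replicas:
        return None
    return min(replicas, key=lambda r: (r.in_flight, r.last_used, -r.available_vram))


# The reproducer from the report: the idle node should now win.
spark = Replica("dgx-spark1", in_flight=6, last_used=200.0, available_vram=40)
thor = Replica("nvidia-thor1", in_flight=0, last_used=100.0, available_vram=20)
choice = pick_best_replica([spark, thor])
print(choice.node_id)
```

Because the pick runs on every request, a replica that finishes its work becomes eligible again immediately instead of staying shadowed by a cached client.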
Async per-node installs via the gallery job queue (#9928). POST /api/nodes/:id/backends/install used to block the request for up to 3 minutes while the worker pulled the image, freezing the React UI's Backends picker. It now returns HTTP 202 + jobID immediately, scoped to a one-element targetNodeIDs allowlist, with a node-scoped opcache row so concurrent installs on different nodes don't collide. The Operations panel surfaces a nodeID field for attribution.
Resilient backend installs with streaming progress (#9958). Two phases:
- Phase 1: `LOCALAI_NATS_BACKEND_INSTALL_TIMEOUT` / `LOCALAI_NATS_BACKEND_UPGRADE_TIMEOUT` env vars (default 15m, previously hardcoded 3m). A NATS round-trip timeout while the worker is still pulling no longer reports as a hard failure: per-node status becomes `running_on_worker`, the queue row stays alive without bumping `Attempts`, and `ListBackends` proactively clears install rows whose intent is satisfied (so the UI updates instantly instead of waiting up to 15m for the next reconciler tick).
- Phase 2: workers publish debounced (~250ms) `BackendInstallProgressEvent` values on a transient `nodes.<nodeID>.backend.install.<opID>.progress` subject. The master subscribes for the duration of the request and forwards each event into `OpStatus.UpdateStatus`, so the admin UI gets per-byte progress for distributed installs the same way local-mode does, with no UI changes. Backward compatible: old workers stay silent, new masters tolerate silence.
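One way to picture the ~250ms rate limit on progress events is the small Python sketch below. It is an assumption-laden illustration rather than the worker code: `ProgressThrottle` is a hypothetical class, it uses a leading-edge throttle (the release notes say "debounced" without specifying the exact strategy), and the clock is injectable so the behaviour can be checked without sleeping.

```python
import time


class ProgressThrottle:
    """Forward at most one progress event per interval, plus the final one."""

    def __init__(self, publish, interval=0.25, clock=time.monotonic):
        self._publish = publish
        self._interval = interval
        self._clock = clock
        self._last_sent = None

    def update(self, event, final=False):
        now = self._clock()
        due = self._last_sent is None or now - self._last_sent >= self._interval
        if final or due:
            self._publish(event)
            self._last_sent = now
            return True
        return False  # dropped: too soon after the previous event


# Drive the throttle with a fake clock instead of real time.
fake_now = [0.0]
sent = []
throttle = ProgressThrottle(sent.append, clock=lambda: fake_now[0])

throttle.update("10%")
fake_now[0] = 0.10
throttle.update("20%")              # dropped, only 100ms elapsed
fake_now[0] = 0.30
throttle.update("60%")
fake_now[0] = 0.31
throttle.update("100%", final=True)  # always delivered
print(sent)
```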
Unified backend-logs entry point (#9949). `/app/backend-logs/:modelId` is now a single, mode-aware route. In standalone it's the local WebSocket view, unchanged. In distributed it probes `nodesApi.getModels`, filters by `model_name`, then routes: 0 hits → empty state with a link to Nodes; 1 hit → `<Navigate replace>` to the per-node logs URL preserving the `?from=` deep-link timestamp; N hits → a picker listing each hosting worker with node id, replica index and load state. Every view that links to backend logs now points at the same URL.
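The 0 / 1 / N decision in distributed mode can be expressed as a small pure function. The Python sketch below only models the branching; the real logic lives in the React UI, and the function name `route_backend_logs`, the view labels, and the dictionary keys other than `model_name` are hypothetical.

```python
def route_backend_logs(model_name, models):
    """Decide which view to show for a model's backend logs in distributed mode.

    `models` stands in for the list returned by the nodes API; only the
    `model_name` key is taken from the release notes, the rest is assumed.
    """
    hits = [m for m in models if m["model_name"] == model_name]
    if not hits:
        return ("empty", [])        # show empty state with a link to Nodes
    if len(hits) == 1:
        return ("redirect", hits)   # jump straight to that node's logs
    return ("picker", hits)         # let the user choose a hosting worker


models = [
    {"model_name": "qwen", "node_id": "dgx-spark1", "replica_index": 0},
    {"model_name": "qwen", "node_id": "nvidia-thor1", "replica_index": 0},
    {"model_name": "whisper", "node_id": "dgx-spark1", "replica_index": 0},
]
print(route_backend_logs("qwen", models)[0])
print(route_backend_logs("whisper", models)[0])
print(route_backend_logs("missing", models)[0])
```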
Bug-hunt harness. A new distributed test harness landed in tests/distributed/ to catch the kind of regressions the #9968 reproducer surfaced.
PRs: #9968, #9928, #9958, #9949, plus the `tests(distributed): add bug-hunt harness` commit.
Admin Traces UI: Stays Responsive Under Load
Two complementary caps fix the symptom where the admin Traces page sat in "loading" forever on a chatty agent-pool RAG deployment.
- API-side (#9946): `LOCALAI_TRACING_MAX_BODY_BYTES` (default 64 KiB) caps each captured request/response body in the trace middleware. The full payload still flows to the real client; only the trace copy is bounded. `body_truncated` + original `body_bytes` are recorded so the dashboard can surface that truncation happened. Observed before the fix on a live deployment: `/api/traces` returned 44.6 MB (466 traces, 447 `/embeddings`, top body 1.38 MB). The Traces UI Clear button is also kept enabled during loading, which is exactly when you need it.
- Backend-side (#9960): `RecordBackendTrace` walks the `Data` map and replaces any string value larger than the cap with `<truncated: N bytes>`. Producers (`core/backend/llm.go`, `core/trace/audio_snippet.go`) apply head-preserving truncation upstream so the UI still shows useful leading content. TTS / `audio_transform` traces drop the base64 snippet when the encoded blob exceeds the cap (truncated base64 is undecodable; the React `WaveformPlayer` already no-ops without it).
Both knobs are live-tunable from the Traces settings panel.
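The backend-side cap is easy to picture with a short Python sketch. This mirrors the described behaviour (oversized string values replaced by a `<truncated: N bytes>` marker) but is not `RecordBackendTrace` itself: the function name `cap_trace_data` is hypothetical, and whether the real code descends into nested maps is not stated, so this version only handles a flat dict.

```python
def cap_trace_data(data, max_bytes):
    """Return a copy of `data` with oversized string values replaced by a marker."""
    capped = {}
    for key, value in data.items():
        if isinstance(value, str):
            size = len(value.encode("utf-8"))
            if size > max_bytes:
                capped[key] = f"<truncated: {size} bytes>"
                continue
        capped[key] = value  # non-strings and small strings pass through unchanged
    return capped


trace = {"prompt": "x" * 10, "model": "qwen", "tokens": 5}
result = cap_trace_data(trace, max_bytes=4)
print(result)
```

The original `trace` dict is left untouched, which matches the note that only the trace copy is bounded while the real payload still reaches the client.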
Nix Flake for NixOS Users
New flake.nix + flake.lock ship a reproducible, dockerless setup for NixOS, plus a dev shell for hacking on LocalAI without a container.
PRs: #9851 (initial flake), #9894 (correct src path + dev shell).
Jetson Thor (L4T13) Backends Restored
The cuda13-nvidia-l4t-arm64-vllm / sglang / vllm-omni backends crashed at import with an undefined c10::MessageLogger symbol after the pypi.jetson-ai-lab.io/sbsa/cu130 mirror started shipping torch 2.11 next to vllm/sglang wheels built against torch 2.10. Per the PyTorch April 2026 announcement, all three backends now pull from PyPI's official aarch64 + cu130 wheels instead, with the L4T13 pyproject.toml retired in favor of the standard requirements-${profile}.txt pattern used everywhere else.
PR: #9950.
Chat: File Attachments + Stream Usage + Selection
Three independent fixes that together make the chat experience visibly better:
- Text-file attachments actually reach the model (regression from react-ui port) (#9896). `.txt`, `.md`, `.csv`, `.json` content was silently dropped in `useChat.js` (only `image_url` and `audio_url` branches added content; the `else` branch only pushed metadata). `Home.jsx` also never called `file.text()` for files attached from the home screen. Both fixed. PDF files still need a parser (PDF.js or server-side extraction) and are flagged as a known limitation.
- Stream `include_usage` returns non-zero with tools (#9941). Fixes #9927. `processTools` discarded the cumulative `TokenUsage` from `ComputeChoices`, so the streaming trailer reported `{0, 0, 0}` whenever a `tools` array was present. The fix forwards the authoritative final usage via a sentinel chunk before `close(responses)`, with the outer loop updated to capture before the empty-`Choices` skip. The OpenAI streaming spec contract is preserved (intermediate chunks still carry no `usage`).
- Chat selection stops getting wiped every second (#9917). React 19 dropped the old `lastHtml === nextHtml` short-circuit in its DOM diff, so the 1s `/api/operations` poll re-assigning `setOperations` with a fresh array reference was collapsing text selection on every assistant message. Now JSON-compared and short-circuited. Bonus: the per-message copy button works over plain HTTP via a hidden-textarea + `execCommand('copy')` fallback when `navigator.clipboard` is unavailable.
llama.cpp Stability + Refactors
- `tensor_buft_overrides` sentinel terminator (#9919). Mirror upstream `common/arg.cpp:645-658`: pad placeholders at the end of the main vector so `back().pattern == nullptr` holds, and append a single `{nullptr, nullptr}` to the draft vector when non-empty.
- `refactor(agents): bump skillserver, drop redundant Name` (#9916). `list_skills` and `search_skills` now return the same shape (only `id`, no duplicated `name`). Adds a Ginkgo regression that drives the LocalAI `FilesystemManager` through an in-process MCP session.
- Swagger refreshes (#9872, #9962) keep the OpenAPI surface in sync with the routes added this cycle.
CI & Image Plumbing
- Chronologically-orderable master tags. Master images are now also tagged `master-<epoch>-<sha>` so they sort by build time. The pre-existing `master` tag still moves with `HEAD`.
- Backend signing CI: `COSIGN_EXPERIMENTAL=1` is set for the oci-1-1 referrers mode in the backend-signing job to keep current cosign versions happy.
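For anyone scripting against the new `master-<epoch>-<sha>` tags, a minimal Python sketch for picking the most recent build could look like this. The helper name `newest_master_tag` and the sample tag values are made up for illustration, and the parsing assumes the epoch is a plain integer and the sha contains no hyphen.

```python
def newest_master_tag(tags):
    """Return the master-<epoch>-<sha> tag with the highest epoch, or None."""
    dated = []
    for tag in tags:
        parts = tag.split("-")
        # Skip the moving `master` tag, release tags, and anything malformed.
        if len(parts) == 3 and parts[0] == "master" and parts[1].isdigit():
            dated.append((int(parts[1]), tag))
    return max(dated)[1] if dated else None


tags = [
    "master",
    "master-1700000000-abc1234",
    "master-1700000500-def5678",
    "v4.3.0",
]
latest = newest_master_tag(tags)
print(latest)
```

Comparing the epoch as an integer avoids relying on string ordering, which would only be safe while every epoch has the same number of digits.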
Bug Fixes (recap)
- `fix(distributed): route per request across loaded replicas + cache probeHealth` - #9968
- `fix(distributed): make admin backend installs resilient and observable` - #9958
- `fix(nodes): make per-node backend install async via gallery job queue` - #9928
- `fix(traces): cap backend trace Data to keep admin UI responsive` - #9960
- `fix(traces): cap captured body size to keep admin Traces UI responsive` - #9946
- `fix(react-ui): unify backend-logs entry point for distributed mode` - #9949
- `fix: inject text-file content into chat completions messages` - #9896
- `fix(openai): stream usage non-zero when tools are enabled` - #9941
- `fix(react-ui/chat): stop wiping selection on every /api/operations poll` - #9917
- `fix(llama-cpp): terminate tensor_buft_overrides with sentinel` - #9919
- `fix(L4T13 backends): switch vllm/sglang/vllm-omni to PyPI aarch64+cu130 wheels` - #9950
- `fix(nix): correct flake src path and add dev shell` - #9894
- `Fix backend manifest merge signing on current cosign releases` - #9957
- `[utils] Fail immediately on extraction errors` - #9926
Dependencies
Heavy bump cycle across submodules and Go/Python deps:
- `ggml-org/llama.cpp`: 7 bumps (#9855, #9876, #9897, #9915, #9934, #9952, #9963)
- `ggml-org/whisper.cpp`: 4 bumps (#9877, #9898, #9929, #9954)
- `ikawrakow/ik_llama.cpp`: 7 bumps (#9866, #9875, #9899, #9914, #9930, #9953, #9966)
- `leejet/stable-diffusion.cpp`: 4 bumps (#9907, #9933, #9955, #9965)
- `antirez/ds4`: 5 bumps (#9864, #9900, #9909, #9932, #9964)
- `ace-step/acestep.cpp`: bumped to `ed53caf` with wrapper API adapted (#9908, #9913)
- Model gallery: checksum refreshes (#9901, #9910)
- Go modules: `alecthomas/kong` 1.14 → 1.15 (#9881), `aws/aws-sdk-go-v2` 1.41.6 → 1.41.7 (#9892), `onsi/ginkgo/v2` 2.28.2 → 2.29.0 (#9882), `golang.org/x/crypto` 0.50 → 0.51 (#9886)
- Python: `transformers` ≥ 5.8.1 (#9883), `sentence-transformers` 5.4 → 5.5 (#9888)
Documentation
- `docs: :arrow_up: update docs version mudler/LocalAI` - #9863
- Plus inline docs updates folded into the feature PRs above (prompt-cache explainer, authentication / usage tracking section, backend signing guide).
New Contributors
- @Azteczek made their first contribution in #9851
- @inquam made their first contribution in #9896
- @RinZ27 made their first contribution in #9926
Enjoy!
Full Changelog: v4.2.6...v4.3.0
