github mudler/LocalAI v4.1.0


🎉 LocalAI 4.1.0 Release! 🚀

LocalAI 4.1.0 is out! 🔥

Just weeks after the landmark 4.0, we're back with another massive drop. This release turns LocalAI into a production-grade AI platform: spin up a distributed cluster with smart routing and autoscaling, lock it down with built-in auth and per-user quotas, fine-tune models without leaving the UI, and much more. If 4.0 was the foundation, 4.1 is the control tower.

Feature Summary
  • 🌐 Distributed Mode: Run LocalAI as a cluster — smart routing, node groups, drain/resume, min/max autoscaling.
  • 🔐 Users & Auth: Built-in user management with OIDC, invite mode, API keys, and admin impersonation.
  • 📊 Quota System: Per-user usage quotas with predictive analytics and breakdown dashboards.
  • 🧪 Fine-Tuning (experimental): Fine-tune models with TRL, auto-export to GGUF, and import back — all from the UI.
  • ⚗️ Quantization (experimental): New backend for on-the-fly model quantization.
  • 🔧 Pipeline Editor: Visual model pipeline editor in the React UI.
  • 🤖 Standalone Agents: Run agents from the CLI with local-ai agent run.
  • 🧠 Smart Inferencing: Auto inference defaults from Unsloth, tool parsing fallback, and min_p support.
  • 🎬 Media History: Browse past generated images and media in Studio pages.

Full setup walkthrough (video): https://www.youtube.com/watch?v=cMVNnlqwfw4

🚀 Key Features

🌐 Distributed Mode: scaling LocalAI horizontally

Run LocalAI as a distributed cluster and let it figure out where to send your requests. No more single-node bottlenecks.

  • Smart Routing: Requests are routed to nodes ordered by available VRAM — the beefiest free GPU gets the job.
  • Node Groups: Pin models to specific node groups for workload isolation (e.g., "gpu-heavy" vs "cpu-light").
  • Autoscaling: Built-in min/max autoscaler with a node reconciler that manages the lifecycle automatically.
  • Drain & Resume: Gracefully drain nodes for maintenance and bring them back with a single API call.
  • Cluster Dashboard: See your entire cluster status at a glance from the home page.
  • Smart Model Transfer: Distribute models to nodes via S3 or peer-to-peer transfer.
distributed-mode.mp4
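Drain and resume are plain HTTP calls against the cluster API. The sketch below shows the shape of the flow; the `/v1/cluster/nodes/...` paths are illustrative assumptions, not the documented routes — check the LocalAI docs for the actual endpoints.

```python
import urllib.request

def node_action(base_url: str, node_id: str, action: str) -> urllib.request.Request:
    """Build a POST request for a node lifecycle action (drain or resume).

    NOTE: the /v1/cluster/nodes/... path is an assumption for illustration;
    consult the LocalAI distributed-mode docs for the real routes.
    """
    if action not in ("drain", "resume"):
        raise ValueError(f"unknown action: {action}")
    url = f"{base_url}/v1/cluster/nodes/{node_id}/{action}"
    return urllib.request.Request(url, method="POST")

# Drain a node for maintenance, then bring it back:
drain = node_action("http://localhost:8080", "gpu-node-1", "drain")
resume = node_action("http://localhost:8080", "gpu-node-1", "resume")
# urllib.request.urlopen(drain)  # execute against a running cluster
```

While a node is drained, the smart router simply stops considering it, so in-flight work elsewhere is unaffected.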

🔐 Users, Authentication & Quotas

LocalAI now ships with a complete multi-user platform — perfect for teams, classrooms, or any shared deployment.

  • User Management: Create, edit, and manage users from the React UI.
  • OIDC/OAuth: Plug in your identity provider for SSO — Google, Keycloak, Authentik, you name it.
  • Invite Mode: Restrict registration to invite-only with admin approval.
  • API Keys: Per-user API key management.
  • Admin Powers: Admins can impersonate users for debugging.
  • Quota System: Set per-user usage quotas and enforce limits.
  • Usage Analytics: Predictive usage dashboard with per-user breakdown statistics.

Users and quotas:

usersquota-1775167475876.mp4

Usage metrics per user:

usage.mp4
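Per-user API keys plug into the OpenAI-compatible endpoints with the standard Bearer scheme. A minimal sketch, assuming a key created in the React UI (the key value and model name below are placeholders):

```python
import json
import urllib.request

API_KEY = "sk-local-example"  # placeholder: a per-user key issued via the React UI

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request authenticated
    with a per-user LocalAI API key (standard Bearer scheme)."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = chat_request("http://localhost:8080", "llama-3.2-1b", "Hello!")
# urllib.request.urlopen(req)  # run against a live instance;
# usage is attributed to the key's owner and counted against their quota
```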

🧪 Fine-Tuning & Quantization

No more juggling external tools. Fine-tune and quantize directly inside LocalAI.

  • Fine-Tuning with TRL (Experimental): Train LoRA adapters with Hugging Face TRL, auto-export to GGUF, and import the result straight back into LocalAI. Includes a built-in evals framework to validate your work.
  • Quantization Backend: Spin up the new quantization backend to create optimized model variants on-the-fly.
quantize-fine-tune.mp4

🎨 UI

The React UI keeps getting better. This release adds serious power-user features:

  • Model Pipeline Editor: Visually wire up model pipelines — no YAML editing required.
  • Per-Model Backend Logs: Drill into logs scoped to individual models for laser-focused debugging.
  • Media History: Studio pages now remember your past generations — images, audio, and more.
  • Searchable Model/Backend Selector: Quickly find models and backends with inline search and filtering.
  • Structured Error Toasts: Errors now link directly to traces — one click from "something broke" to "here's why."
  • Tracing Settings: Inline tracing config restored with a cleaner UI.
talk.mp4

🤖 Agents & Inference

  • Standalone Agent Mode: Run agents straight from the terminal with local-ai agent run. Supports single-turn --prompt mode and pool-based configurations from pool.json.
  • Streaming Tool Calls: Agent mode tool calls now stream in real-time, with interleaved thinking fixed.
  • Inferencing Defaults: Automatic inference parameters sourced from Unsloth and applied to all endpoints and gallery models, so your models just work better out of the box.
  • Tool Parsing Fallback: When native tool call parsing fails, an iterative fallback parser kicks in automatically.
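The newly wired min_p sampler can be set per request. A sketch of an OpenAI-style payload carrying it; top-level placement alongside the other sampling options is an assumption modeled on parameters like temperature:

```python
import json

def completion_payload(model: str, prompt: str, min_p: float = 0.05) -> bytes:
    """Chat payload carrying the min_p sampling parameter.

    min_p keeps only tokens whose probability is at least min_p times that
    of the most likely token. Top-level placement here is an assumption
    modeled on the other sampling options.
    """
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "min_p": min_p,
    }).encode()

body = completion_payload("qwen-2.5-7b", "Summarize this release.", min_p=0.1)
```

Compared with top_p, min_p adapts the cutoff to the model's confidence: when the top token dominates, fewer alternatives survive; when the distribution is flat, more do.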

🛠️ Under the Hood

  • Repeated Log Merging: Noisy terminals? Repeated log lines are now collapsed automatically.
  • Jetson/Tegra GPU Detection: First-class NVIDIA Jetson/Tegra platform detection.
  • Intel SYCL Fix: Auto-disables mmap for SYCL backends to prevent crashes.
  • llama.cpp Portability: Bundled libdl, librt, libpthread for improved cross-platform support.
  • HF_ENDPOINT Mirror: Downloader now rewrites HuggingFace URIs with HF_ENDPOINT for corporate/mirror setups.
  • Transformers >5.0: Bumped to HuggingFace Transformers >5.0 with generic model loading.
  • API Improvements: Proper 404s for missing models, unescaped model names, unified inferencing paths with automatic retry on transient errors.

🐞 Fixes & Improvements

  • Embeddings: Implemented encoding_format=base64 for the embeddings endpoint.
  • Kokoro TTS: Fixed phonemization model not downloading during installation.
  • Realtime API: Fixed Opus codec backend selection alias in development mode.
  • Gallery Filtering: Fixed exact tag matching for model gallery filters.
  • Open Responses: Fixed required ORItemParam.Arguments field being omitted; ORItemParam.Summary now always populated.
  • Tracing: Fixed settings not loading from runtime_settings.json.
  • UI: Fixed watchdog field mapping, model list refresh on deletion, backend display in model config, MCP button ordering.
  • Downloads: Fixed directory removal during fallback attempts; improved retry logic.
  • Model Paths: Fixed baseDir assignment to use ModelPath correctly.
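With encoding_format=base64, each embedding arrives as a base64 string rather than a JSON float array. Assuming the same wire format as OpenAI's API (a packed array of little-endian float32 values), decoding needs only the standard library:

```python
import base64
import struct

def decode_embedding(b64: str) -> list[float]:
    """Decode a base64-encoded embedding into a list of floats.

    Assumes the OpenAI wire format: the base64 payload is a packed
    array of little-endian float32 values.
    """
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

# Round-trip check with values exactly representable in float32:
vec = [0.5, -1.0, 2.25]
encoded = base64.b64encode(struct.pack(f"<{len(vec)}f", *vec)).decode()
assert decode_embedding(encoded) == [0.5, -1.0, 2.25]
```

The base64 form is noticeably smaller and faster to parse than a JSON array of floats, which is why clients such as the official OpenAI SDKs request it by default.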

❤️ Thank You

LocalAI is a community-powered FOSS movement. Every star, every PR, every bug report matters.

If you believe in privacy-first, self-hosted AI:

  • ⭐ Star the repo — it helps more than you think
  • 🛠️ Contribute code, docs, or feedback
  • 📣 Share with your team, your community, your world

Let's keep building the future of open AI — together. 💪


✅ Full Changelog


What's Changed

Bug fixes 🐛

  • fix: Change baseDir assignment to use ModelPath by @mudler in #9010
  • fix(ui): correctly map watchdog fields by @mudler in #9022
  • fix(api): unescape model names by @mudler in #9024
  • fix(ui): Add tracing inline settings back and create UI tests by @richiejp in #9027
  • Always populate ORItemParam.Summary by @tv42 in #9049
  • fix(ui): correctly display backend if specified in the model config, re-order MCP buttons by @mudler in #9053
  • fix(ui): Refresh model list on deletion by @richiejp in #9059
  • fix(openresponses): do not omit required field ORItemParam.Arguments by @tv42 in #9074
  • fix: Add tracing settings loading from runtime_settings.json by @localai-bot in #9081
  • fix: use exact tag matching for model gallery tag filtering by @majiayu000 in #9041
  • fix(realtime): Set the alias for opus so the development backend can be selected by @richiejp in #9083
  • fix(llama.cpp): bundle libdl, librt, libpthread in llama-cpp backend by @mudler in #9099
  • fix(download): do not remove dst dir until we try all fallbacks by @mudler in #9100
  • fix(auth): do not allow to register in invite mode by @mudler in #9101
  • fix(downloader): Rewrite full https HF URI with HF_ENDPOINT by @richiejp in #9107
  • fix: implement encoding_format=base64 for embeddings endpoint by @walcz-de in #9135
  • fix(coqui,nemo,voxcpm): Add dependencies to allow CI to progress by @richiejp in #9142
  • fix(voxcpm): Force using a recent voxcpm version to kick the dependency solver by @richiejp in #9150
  • fix: huggingface repo change the file name so Update index.yaml is needed by @ER-EPR in #9163
  • fix(kokoro): Download phonemization model during installation by @richiejp in #9165
  • fix(oauth/invite): do not register user (prending approval) without correct invite by @mudler in #9189
  • fix(inflight): count inflight from load model, but release afterwards by @mudler in #9194

Exciting New Features 🎉

  • feat: support streaming mode for tool calls in agent mode, fix interleaved thinking stream by @mudler in #9023
  • feat(ui): Per model backend logs and various fixes by @richiejp in #9028
  • feat(ui, gallery): Show model backends and add searchable model/backend selector by @richiejp in #9060
  • feat: add users and authentication support by @mudler in #9061
  • feat(ui, openai): Structured errors and link to traces in error toast by @richiejp in #9068
  • feat(ui): Add model pipeline editor by @richiejp in #9070
  • feat: add (experimental) fine-tuning support with TRL by @mudler in #9088
  • feat(ui): add predictor for usage, user-breakdown statistics by @mudler in #9091
  • feat: add quota system by @mudler in #9090
  • feat(quantization): add quantization backend by @mudler in #9096
  • feat: inferencing default, automatic tool parsing fallback and wire min_p by @mudler in #9092
  • feat: Merge repeated log lines in the terminal by @richiejp in #9141
  • feat: add distributed mode by @mudler in #9124
  • feat(ui): Add media history to studio pages (e.g. past images) by @richiejp in #9151
  • feat: add node reconciler, allow to schedule to group of nodes, min/max autoscaler by @mudler in #9186
  • feat(api): Return 404 when model is not found except for model names in HF format by @richiejp in #9133
  • feat(distributed): Avoid resending models to backend nodes by @richiejp in #9193
  • feat: add resume endpoint to undrain nodes by @mudler in #9197

👒 Dependencies

  • chore(deps): bump github.com/anthropics/anthropic-sdk-go from 1.26.0 to 1.27.0 by @dependabot[bot] in #9035
  • chore(deps): bump github.com/ebitengine/purego from 0.9.1 to 0.10.0 by @dependabot[bot] in #9034
  • chore(deps): bump actions/upload-artifact from 4 to 7 by @dependabot[bot] in #9030
  • chore(deps): bump github.com/google/go-containerregistry from 0.21.1 to 0.21.2 by @dependabot[bot] in #9033
  • chore(deps): bump playwright from 1.52.0 to 1.58.2 in /core/http/react-ui in the npm_and_yarn group across 1 directory by @dependabot[bot] in #9055
  • chore(deps): bump github.com/google/go-containerregistry from 0.21.2 to 0.21.3 by @dependabot[bot] in #9121
  • chore(deps): bump github.com/modelcontextprotocol/go-sdk from 1.4.0 to 1.4.1 by @dependabot[bot] in #9118
  • chore(deps): bump actions/checkout from 4 to 6 by @dependabot[bot] in #9110
  • chore(deps): bump peter-evans/create-pull-request from 7 to 8 by @dependabot[bot] in #9114
  • chore(deps): bump github.com/mudler/skillserver from 0.0.5 to 0.0.6 by @dependabot[bot] in #9116
  • chore(deps): bump actions/checkout from 4 to 6 by @dependabot[bot] in #9173
  • chore(deps): bump actions/deploy-pages from 4 to 5 by @dependabot[bot] in #9172
  • chore(deps): bump actions/configure-pages from 5 to 6 by @dependabot[bot] in #9174
  • chore(deps): bump google.golang.org/grpc from 1.79.1 to 1.79.3 by @dependabot[bot] in #9175
  • chore(deps): bump github.com/nats-io/nats.go from 1.49.0 to 1.50.0 by @dependabot[bot] in #9183
  • chore(deps): bump github.com/pion/webrtc/v4 from 4.2.9 to 4.2.11 by @dependabot[bot] in #9185
  • chore(deps): bump go.opentelemetry.io/otel/exporters/prometheus from 0.62.0 to 0.64.0 by @dependabot[bot] in #9178
  • chore(deps): bump actions/upload-pages-artifact from 3 to 4 by @dependabot[bot] in #9179
  • chore(deps): bump grpcio from 1.78.1 to 1.80.0 in /backend/python/transformers by @dependabot[bot] in #9180
  • chore(deps): bump grpcio from 1.78.1 to 1.80.0 in /backend/python/vllm by @dependabot[bot] in #9177
  • chore(deps): bump grpcio from 1.78.1 to 1.80.0 in /backend/python/coqui by @dependabot[bot] in #9182
  • chore(deps): bump grpcio from 1.78.1 to 1.80.0 in /backend/python/rerankers by @dependabot[bot] in #9181
  • chore(deps): bump grpcio from 1.78.1 to 1.80.0 in /backend/python/common/template by @dependabot[bot] in #9176

Other Changes

  • docs: ⬆️ update docs version mudler/LocalAI by @localai-bot in #9008
  • chore: ⬆️ Update ggml-org/llama.cpp to 3a6f059909ed5dab8587df5df4120315053d57a4 by @localai-bot in #9009
  • fix: Automatically disable mmap for Intel SYCL backends (#9012) by @localai-bot in #9015
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 862a6586cb6fcec037c14f9ed902329ecec7d990 by @localai-bot in #9019
  • chore: ⬆️ Update ggml-org/llama.cpp to 88915cb55c14769738fcab7f1c6eaa6dcc9c2b0c by @localai-bot in #9020
  • chore: refactor endpoints to use same inferencing path, add automatic retrial mechanism in case of errors by @mudler in #9029
  • chore: ⬆️ Update ggml-org/whisper.cpp to 79218f51d02ffe70575ef7fba3496dfc7adda027 by @localai-bot in #9037
  • chore: ⬆️ Update ggml-org/llama.cpp to 9b342d0a9f2f4892daec065491583ec2be129685 by @localai-bot in #9039
  • chore: ⬆️ Update ace-step/acestep.cpp to 15740f4301b3ec3020875f1fb975a6cfdb2f6767 by @localai-bot in #9038
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 545fac4f3fb0117a4e962b1a04cf933a7e635933 by @localai-bot in #9036
  • chore: ⬆️ Update ggml-org/llama.cpp to ee4801e5a6ee7ee4063144ab44ab4e127f76fba8 by @localai-bot in #9044
  • chore: ⬆️ Update ggml-org/whisper.cpp to dc9611662265870df22a7230b7586176a99c1955 by @localai-bot in #9045
  • chore: ⬆️ Update ace-step/acestep.cpp to ab020a9aefcd364423e0665da12babc6b0c7b507 by @localai-bot in #9046
  • feat: Add standalone agent run mode inspired by LocalAGI by @localai-bot in #9056
  • chore: ⬆️ Update ggml-org/whisper.cpp to ef3463bb29ef90d25dfabfd1e75993111c52412d by @localai-bot in #9062
  • chore: ⬆️ Update ggml-org/llama.cpp to 5744d7ec430e2f875a393770195fda530560773f by @localai-bot in #9063
  • docs: Add troubleshooting guide for embedding models (fixes #9064) by @localai-bot in #9065
  • feat(swagger): update swagger by @localai-bot in #9075
  • chore: ⬆️ Update ggml-org/whisper.cpp to 9386f239401074690479731c1e41683fbbeac557 by @localai-bot in #9077
  • chore(deps): bump llama-cpp to 'a0bbcdd9b6b83eeeda6f1216088f42c33d464e38' by @mudler in #9079
  • feat(swagger): update swagger by @localai-bot in #9085
  • chore: ⬆️ Update ggml-org/llama.cpp to 4cb7e0bd61e7e1101e8ab10db5dee70c5717a386 by @localai-bot in #9087
  • chore: ⬆️ Update ace-step/acestep.cpp to 7326a7bea0c2037982ec924f7364e998df70450c by @localai-bot in #9086
  • chore: ⬆️ Update ggml-org/whisper.cpp to 76684141a5d059be71cbe23dc2f0ed552213ba2d by @localai-bot in #9094
  • chore: ⬆️ Update ggml-org/llama.cpp to 990e4d96980d0b016a2b07049cc9031642fb9903 by @localai-bot in #9095
  • chore(transformers): bump to >5.0 and generically load models by @mudler in #9097
  • feat(swagger): update swagger by @localai-bot in #9103
  • chore: ⬆️ Update ggml-org/llama.cpp to 49bfddeca18e62fa3d39114a23e9fcbdf8a22388 by @localai-bot in #9102
  • chore: ⬆️ Update ggml-org/llama.cpp to 1772701f99dd3fc13f5783b282c2361eda8ca47c by @localai-bot in #9123
  • chore: ⬆️ Update ggml-org/llama.cpp to 9f102a1407ed5d73b8c954f32edab50f8dfa3f58 by @localai-bot in #9127
  • chore: ⬆️ Update ace-step/acestep.cpp to 6f35c874ee11e86d511b860019b84976f5b52d3a by @localai-bot in #9128
  • fix(docs): Use notice instead of alert by @richiejp in #9134
  • chore: ⬆️ Update ggml-org/llama.cpp to a970515bdb0b1d09519106847660b0d0c84d2472 by @localai-bot in #9137
  • feat(swagger): update swagger by @localai-bot in #9136
  • chore: ⬆️ Update ggml-org/llama.cpp to 59d840209a5195c2f6e2e81b5f8339a0637b59d9 by @localai-bot in #9144
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to f16a110f8776398ef23a2a6b7b57522c2471637a by @localai-bot in #9167
  • chore: ⬆️ Update ggml-org/whisper.cpp to 95ea8f9bfb03a15db08a8989966fd1ae3361e20d by @localai-bot in #9168
  • chore: ⬆️ Update ggml-org/llama.cpp to 7c203670f8d746382247ed369fea7fbf10df8ae0 by @localai-bot in #9160
  • chore(workers): improve logging, set header timeouts by @mudler in #9171
  • chore(ci): Scope tests extras backend tests by @richiejp in #9170
  • feat(swagger): update swagger by @localai-bot in #9187
  • chore: ⬆️ Update ggml-org/llama.cpp to 08f21453aec846867b39878500d725a05bd32683 by @localai-bot in #9190
  • stablediffusion-ggml: replace hand-maintained enum string arrays with upstream API calls by @Copilot in #9192
  • chore: ⬆️ Update leejet/stable-diffusion.cpp to 09b12d5f6d51d862749e8e0ee8baac8f012089e2 by @localai-bot in #9195
  • chore: ⬆️ Update ggml-org/llama.cpp to 0fcb3760b2b9a3a496ef14621a7e4dad7a8df90f by @localai-bot in #9196

New Contributors

Full Changelog: v4.0.0...v4.1.0
