You can use local LLMs with tools like Claude Code and Codex by connecting them to Unsloth’s API endpoint. This lets you run models like Qwen and Gemma locally, with additional features such as self-healing tool calling, code execution, and web search. Unsloth makes it easy to deploy a fast API inference endpoint that provides:
- Self-healing tool calling, which helps reduce broken or malformed tool calls by 50%.
- Code execution support, allowing Bash and Python execution for more accurate code outputs.
- Advanced web search that actually visits and reads webpages to gather in-depth information.
- Automatic inference settings for GGUF models (temperature, top-k, etc.).
Models loaded in Unsloth (including GGUFs) are exposed as an authenticated API via llama-server. A long API key is generated for security, similar to how OpenAI provides one. Your local models can then be used directly in your preferred AI agent, SDK, or chat client. Unsloth speaks two dialects on the same port:
- Anthropic-compatible: `/v1/messages` for Claude Code, OpenClaw, the Anthropic SDK, and any client that expects the Messages API.
- OpenAI-compatible: `/v1/chat/completions` and `/v1/responses` for the OpenAI SDK, OpenCode, Cursor, Continue, Cline, Open WebUI, SillyTavern, and any OpenAI-compatible tool.
- Both dialects support streaming, tool calling (OpenAI tools / Anthropic tools), and vision inputs.
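As a quick illustration, here is a minimal sketch of calling both dialects from Python. The port, model name, and key below are placeholders, not Unsloth defaults; use the endpoint URL and the generated API key that the server prints when it starts.

```python
# Minimal sketch: one local Unsloth server, two client dialects.
# Placeholder values: port 8000, model name "local-model", and the key.
from openai import OpenAI
from anthropic import Anthropic

BASE = "http://127.0.0.1:8000"  # local endpoint (placeholder port)
KEY = "sk-unsloth-placeholder"  # stand-in for the long generated key

# OpenAI dialect: /v1/chat/completions
oai = OpenAI(base_url=f"{BASE}/v1", api_key=KEY)
resp = oai.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Say hello from a local model."}],
)
print(resp.choices[0].message.content)

# Anthropic dialect: /v1/messages (same server, same key)
ant = Anthropic(base_url=BASE, api_key=KEY)
msg = ant.messages.create(
    model="local-model",
    max_tokens=128,
    messages=[{"role": "user", "content": "Say hello via the Messages API."}],
)
print(msg.content[0].text)
```

Agent tools work the same way: point Claude Code at the Anthropic base URL and OpenAI-compatible clients at `/v1`, passing the generated key as the API key.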
New models
We also have a handful of new models to run, including NVIDIA Nemotron 3 Nano Omni, IBM Granite 4.1, and Mistral 3.5 Medium. We helped Mistral solve some implementation issues in transformers and in the GGUFs.
Unsloth Updates
- Stopped Studio training runs can now resume from checkpoints.
- Chat threads now autosave and persist more reliably.
- DPO training hangs in multi-process setups were fixed.
- VLM GRPO support improved with MROPE updates.
- Studio’s stop button now properly stops generation.
- Chat templates no longer disappear after a browser refresh.
What's Changed in Unsloth
- Studio: use (gguf) context length before max seq length by @G07cha in #5111
- chore: fix typo cleanup across tests and backend strings by @luojiyin1987 in #5152
- fix: guard resolve_model_class fallback against unresolvable transformers AutoModel entries by @Etherll in #5155
- Studio: kill in-flight llama-server before spawning a new one by @danielhanchen in #5171
- Studio: stop currency escape from breaking inline LaTeX by @danielhanchen in #5170
- Studio: probe AMD GPUs in llama-server VRAM detection by @danielhanchen in #5172
- Studio: make stop button actually stop generation by @danielhanchen in #5069
- Studio: add github_repo seed reader and GitHub Support Bot recipe by @danielhanchen in #5169
- fix(studio): use endswith for mmproj F16 variant selection by @LeoBorcherding in #5184
- [pre-commit.ci] pre-commit autoupdate by @pre-commit-ci[bot] in #5204
- Fix Windows install when paths contain spaces or Python 3.14 is on PATH by @Etherll in #5201
- Studio: Preserve transparency in uploaded profile avatars by @Imagineer99 in #5200
- UX: single chat header error placement and selector alignment by @Imagineer99 in #5173
- Studio: Refine chat preset and group built-in presets by @Imagineer99 in #5159
- Studio: Fix image-only chat requests failing validation by @Imagineer99 in #5212
- Studio: fix 7 failing studio_unit_tests on main by @danielhanchen in #5216
- Patch checkpoint reload init functions to strip unsupported args by @Datta0 in #5167
- Studio: Fix clipped model selector text descenders by @Imagineer99 in #5210
- Fix DPO trainer multi process hang by @Datta0 in #5199
- Studio: Pin assistant-ui core for fresh installs by @Imagineer99 in #5229
- Fix local model scanner to handle ollama cloud models by @Anish9901 in #5220
- Fix Studio desktop tray installer and titlebar, plus bug fixes by @wasimysaid in #5179
- MROPE for VLM GRPO by @Datta0 in #5198
- install: overlay unsloth-zoo from git main on --local by @rolandtannous in #5242
- Studio: Fix chat template disappearing after browser refresh by @Imagineer99 in #5209
- studio: add --local to setup.sh + overlay unsloth-zoo from git main by @rolandtannous in #5252
- Fix/windowsprebuilt by @mmathew23 in #5241
- Studio: Add dataset upload dropzone and update preserve think copy by @Imagineer99 in #5253
- Add Qwen3.6 support by @rolandtannous in #5257
- Studio: Chat thread autosave persistence by @Imagineer99 in #5256
- Studio: Enable deleting fine-tuned chat models by @Imagineer99 in #5234
- Studio: Add checkpoint resume for stopped training runs by @Imagineer99 in #5255
- Studio: Polish spacing and profile input radius by @Imagineer99 in #5222
- Fix check for libcurl headers in install.sh by @LFd3v in #5251
- Default Studio host to 127.0.0.1 and prompt before auto-start by @rolandtannous in #5267
- Studio: forward llama-server args from `unsloth studio run`, activate `unsloth run`, and allow passing model:quant to load models by @rolandtannous in #5271
- Studio: Always show API usage examples and docs links by @Imagineer99 in #5270
- Studio: Change API Keys settings to API Access by @Imagineer99 in #5268
- unsloth run: add --enable-tools/--disable-tools server-side tool policy by @rolandtannous in #5277
- fix: use % 8 instead of // 8 in FP8 weight shape check by @Ricardo-M-L in #5243
- Pin Studio GGUF export to llama.cpp's local convert script by @mmathew23 in #5275
- fix KVCache estimates for gemma4 style sliding window models by @Datta0 in #5225
- Update VRAM estimator to cater to broader model configs by @Datta0 in #5175
- Fix FastSentenceTransformer loading with newer sentence-transformers by @Etherll in #5259
- Studio: Preserve chat history during autosave by @Imagineer99 in #5278
What's Changed in Unsloth-Zoo
- Fix fused CE grad scaling under DDP by @danielhanchen in unslothai/unsloth-zoo#434
- Fused CE backward: guard scaling=0, drop tensor path, use out-of-place mul by @mmathew23 in unslothai/unsloth-zoo#610
- Fix/gemma4moefix by @mmathew23 in unslothai/unsloth-zoo#612
- MROPE for VLM GRPO by @Datta0 in unslothai/unsloth-zoo#614
- Double-buffer GPU activations for overlapping H2D copy with backward compute by @ruixiang63 in unslothai/unsloth-zoo#534
- fix(temporary_patches/utils): add missing comma in all (raise_error / Unpack) by @Anai-Guo in unslothai/unsloth-zoo#617
- Fix qwen lora extractor for diff peft versions by @Datta0 in unslothai/unsloth-zoo#618
- fix: use backend device type in GGUF merge path by @andomeder in unslothai/unsloth-zoo#615
- Add unsloth_compiled_cache to gitignore by @Datta0 in unslothai/unsloth-zoo#622
- Allow local convert_hf_to_gguf.py via UNSLOTH_LLAMA_CPP_SCRIPTS_DIR by @mmathew23 in unslothai/unsloth-zoo#621
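As one concrete example of that last item, here is a minimal sketch of the `UNSLOTH_LLAMA_CPP_SCRIPTS_DIR` override; the path is a placeholder for your own local llama.cpp checkout:

```python
import os

# Placeholder path: point this at a local llama.cpp checkout so GGUF
# export uses its convert_hf_to_gguf.py rather than a fetched copy.
# Set this before running the Unsloth export step.
os.environ["UNSLOTH_LLAMA_CPP_SCRIPTS_DIR"] = "/path/to/llama.cpp"
```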
Full Changelog: v0.1.37-beta...v0.1.38-beta