koboldcpp-1.111
- Gemma 4 models are now supported. Note that Gemma 4 is very format-sensitive; using the wrong format will likely cause bad outputs. If you want to ensure the correct format is used in chat completions mode, use `--jinja`. Otherwise, the default AutoGuess template will be non-thinking by default. Vision is supported.
- Recommended variants: gemma-4-E4B for smaller devices, or gemma-4-26B-A4B for larger devices. Vision mmprojs can be found here.
- If running inside KoboldAI Lite, the `Separate end tags` toggle is recommended, or use Jinja. Also, it really only works well in Instruct mode.
- NEW: Qwen3 TTS CustomVoice and VoiceDesign are now supported! This allows creating narration with instructions describing the voices.
- Download Q3TTS VoiceDesign and the wav tokenizer.
- When creating your TTS prompt, add instructions at the start in square brackets, e.g. `[A depressed woman is crying and sobbing] I want to go home!`
- Alternatively, use the included musicui at http://localhost:5001/musicui and select the TTS tab.
- Implementation referenced from @paoletto's qwen3tts fork.
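The bracketed-instruction format above can be sketched as a small helper that builds the request text. This is a minimal illustration; the payload field names are assumptions (check `--help` and the wiki for the actual TTS API parameters of your build):

```python
# Sketch: composing a Qwen3 TTS VoiceDesign prompt for koboldcpp.
# The "input" field name below is an assumption, not the confirmed API shape.

def build_tts_prompt(instruction: str, text: str) -> str:
    """Prefix the narration text with a bracketed voice-design instruction."""
    return f"[{instruction}] {text}"

payload = {
    "input": build_tts_prompt("A depressed woman is crying and sobbing",
                              "I want to go home!"),
}
print(payload["input"])
```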
- NEW: Added basic `/v1/responses` and `/v1/messages` compatibility API support.
- Added fixes for Jinja-based tool calling, which should work for more models now. Enable with `--jinjatools`. If not enabled, universal tool calling is used instead.
- Added support for BF16 KV type, select it with `--quantkv 3` or in the GUI launcher.
- Added a non-thinking AutoGuess template
- Added config overwriting for admin mode: you can now specify two config files on admin API reload (one base and one target), and koboldcpp will combine them before switching.
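The base + target combination described above can be pictured as a config merge. A minimal sketch, assuming a simple shallow merge where the target file's keys override the base file's keys (the real merge semantics and the field names used here are assumptions):

```python
# Sketch of combining a base and a target config before switching.
# Assumes target keys win on conflict; field names are hypothetical.

def combine_configs(base: dict, target: dict) -> dict:
    merged = dict(base)    # start from the base config
    merged.update(target)  # target keys override base keys
    return merged

base_cfg = {"model": "base.gguf", "contextsize": 4096, "port": 5001}
target_cfg = {"model": "other.gguf", "contextsize": 8192}
combined = combine_configs(base_cfg, target_cfg)
```

Keys present only in the base (here, the port) carry over unchanged.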
- Fixed jinja prefills for chat completions
- Added support for Jinja chat template kwargs like in llama.cpp: use `--jinja-kwargs` or `--chat-template-kwargs`, e.g. `--chat-template-kwargs '{"enable_thinking":false}'`
- When using pipeline parallel, the logical batch size is doubled (the physical batch size is unchanged); this will improve performance on multi-GPU setups.
- Added a section for popular community models in the help button menu. If you wish to contribute a suggestion, please prepare a `.kcppt` template and submit it to the Discussions page.
- Added `--autoswap` functionality: when running multi-feature configs, e.g. Text+Images+Music, this allows swapping the currently loaded feature on and off for each request type, saving VRAM. Requires router mode enabled. (credits: @esolithe)
- Credentials can now be optionally supplied by the environment variables `KOBOLDCPP_ADMINPASSWORD` and `KOBOLDCPP_PASSWORD` during launch from the command line (thanks @shoaib42)
- Image Gen: Added `--sdmaingpu`, allowing image models to be independently placed on any GPU
- Image Gen: ESRGAN passthrough added; upscale-only mode can be done with img2img and denoise 0.0 with 1 step
- Image Gen: Return metadata and upstream updates by @wbruna
- Music Gen: Fixed stop tokens by @dysangel
- Music Gen: Added planner mode that uses your main LLM to generate better lyrics instead; toggle it in the musicUI advanced settings.
- Music Gen: Added API key support
- TTS Gen: Allow embedded music UI to do both music and TTS generations (2 tabs)
- Fixes for Colab
- Fixed incorrect CPU selection in OldCPU mode.
- WSL socket timeout fix, thanks @scottf007
- Router mode can now auto-wake a few other endpoints if put to sleep by auto-unload
- Increased the max vision image limit
- Increased the GUI launcher's max context size slider limit
- Breaking Change: Detected thinking content is now sent via `reasoning_content` instead of `content` over the chat completions API, to align with most other providers. To disable this behavior, set `encapsulate_thinking` to `false` in your request.
- Updated Kobold Lite, multiple fixes and improvements
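Clients consuming the chat completions API will want to read the new field with a fallback for older server versions. A minimal sketch with a mocked response body (the example message text is invented for illustration):

```python
# Sketch: extracting thinking vs. answer text after the reasoning_content
# change. The sample response below is a mock, not real server output.

import json

raw = json.dumps({"choices": [{"message": {
    "role": "assistant",
    "reasoning_content": "Checking the format first...",
    "content": "The answer is 42.",
}}]})

msg = json.loads(raw)["choices"][0]["message"]
thinking = msg.get("reasoning_content", "")  # empty on older servers
answer = msg["content"]                      # no longer contains thinking
```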
- Merged fixes, new model support, and improvements from upstream
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (CUDA 11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can download our rolling ROCm binary here if you use Linux.
If you're on a modern Mac with Apple Silicon (M-series), you can use the koboldcpp-mac-arm64 macOS binary.
Click here for .gguf conversion and quantization tools
Newer rolling experimental builds can be found at https://github.com/LostRuins/koboldcpp/releases/tag/rolling; these are updated automatically and can be unstable.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help flag. You can also refer to the readme and the wiki.