koboldcpp-1.111
- Gemma 4 models are now supported. Note that Gemma 4 is very format-sensitive; using the wrong format will likely cause bad outputs. If you want to ensure the correct format is used in chat completions mode, use `--jinja`. Otherwise, the default AutoGuess template will be non-thinking by default. Vision is supported.
- Recommended variants: gemma-4-E4B for smaller devices, or gemma-4-26B-A4B for larger devices. Vision mmprojs can be found here.
- If running inside KoboldAI Lite, the `Separate end tags` toggle is recommended, or use Jinja. Also, it really only works well in Instruct mode.
- NEW: Qwen3 TTS CustomVoice and VoiceDesign are now supported! This allows creating narration with instructions describing the voices.
- Download Q3TTS VoiceDesign and the wav tokenizer.
- When creating your TTS prompt, add instructions at the start in square brackets, e.g. `[A depressed woman is crying and sobbing] I want to go home!`
- Alternatively, use the included musicui at http://localhost:5001/musicui and select the TTS tab.
- Implementation referenced from @paoletto's qwen3tts fork.
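The bracketed-instruction format above can be sketched as a small helper that builds the request text. This is a minimal illustration; the payload field names are assumptions (check `--help` and the wiki for the actual TTS API parameters of your build):

```python
# Sketch: composing a Qwen3 TTS VoiceDesign prompt for koboldcpp.
# The "input" field name below is an assumption, not the confirmed API shape.

def build_tts_prompt(instruction: str, text: str) -> str:
    """Prefix the narration text with a bracketed voice-design instruction."""
    return f"[{instruction}] {text}"

payload = {
    "input": build_tts_prompt("A depressed woman is crying and sobbing",
                              "I want to go home!"),
}
print(payload["input"])
```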
- NEW: Added basic `/v1/responses` and `/v1/messages` compatibility API support.
- Added fixes for Jinja-based tool calling, which should work for more models now. Enable with `--jinjatools`. If not enabled, universal tool calling is used instead.
- Added support for BF16 KV type, select it with `--quantkv 3` or in the GUI launcher.
- Added a non-thinking AutoGuess template
- Added config overwriting for admin mode: you can now specify two config files on admin API reload (one base and one target), and koboldcpp will combine them before switching.
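The base + target combination described above can be pictured as a config merge. A minimal sketch, assuming a simple shallow merge where the target file's keys override the base file's keys (the real merge semantics and the field names used here are assumptions):

```python
# Sketch of combining a base and a target config before switching.
# Assumes target keys win on conflict; field names are hypothetical.

def combine_configs(base: dict, target: dict) -> dict:
    merged = dict(base)    # start from the base config
    merged.update(target)  # target keys override base keys
    return merged

base_cfg = {"model": "base.gguf", "contextsize": 4096, "port": 5001}
target_cfg = {"model": "other.gguf", "contextsize": 8192}
combined = combine_configs(base_cfg, target_cfg)
```

Keys present only in the base (here, the port) carry over unchanged.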
- Fixed jinja prefills for chat completions
- Added support for Jinja chat template kwargs like in llama.cpp: use `--jinja-kwargs` or `--chat-template-kwargs`, e.g. `--chat-template-kwargs '{"enable_thinking":false}'`
- When using pipeline parallel, the logical batch size is doubled (the physical batch size is unchanged); this will improve performance on multi-GPU setups.
- Added a section for popular community models in the help button menu. If you wish to contribute a suggestion, please prepare a `.kcppt` template and submit it to the Discussions page.
- Added `--autoswap` functionality: when running multi-feature configs, e.g. Text+Images+Music, this allows swapping the currently loaded feature on and off for each request type, saving VRAM. Requires router mode enabled. (credits: @esolithe)
- Credentials can now be optionally supplied by the environment variables `KOBOLDCPP_ADMINPASSWORD` and `KOBOLDCPP_PASSWORD` during launch from the command line (thanks @shoaib42)
- Image Gen: Added `--sdmaingpu`, allowing image models to be independently placed on any GPU
- Image Gen: ESRGAN passthrough added; upscale-only mode can be done with img2img and denoise 0.0 with 1 step
- Image Gen: Return metadata and upstream updates by @wbruna
- Music Gen: Fixed stop tokens by @dysangel
- Music Gen: Added planner mode that uses your main LLM to generate better lyrics instead; toggle it in the musicUI advanced settings.
- Music Gen: Added API key support
- TTS Gen: Allow embedded music UI to do both music and TTS generations (2 tabs)
- Fixes for Colab
- Fixed incorrect CPU selection in OldCPU mode.
- WSL socket timeout fix, thanks @scottf007
- Router mode can now auto-wake a few other endpoints if put to sleep by auto-unload
- Increased the max vision image limit
- Increased the GUI launcher's max context size slider limit
- Breaking Change: Detected thinking content is now sent via `reasoning_content` instead of `content` over the chat completions API, to align with most other providers. To disable this behavior, set `encapsulate_thinking` to `false` in your request.
- Updated Kobold Lite, multiple fixes and improvements
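Clients consuming the chat completions API will want to read the new field with a fallback for older server versions. A minimal sketch with a mocked response body (the example message text is invented for illustration):

```python
# Sketch: extracting thinking vs. answer text after the reasoning_content
# change. The sample response below is a mock, not real server output.

import json

raw = json.dumps({"choices": [{"message": {
    "role": "assistant",
    "reasoning_content": "Checking the format first...",
    "content": "The answer is 42.",
}}]})

msg = json.loads(raw)["choices"][0]["message"]
thinking = msg.get("reasoning_content", "")  # empty on older servers
answer = msg["content"]                      # no longer contains thinking
```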
- Merged fixes, new model support, and improvements from upstream
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (CUDA 11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can download our rolling ROCm binary here if you use Linux.
If you're on a modern Mac with Apple Silicon (M-series), you can use the koboldcpp-mac-arm64 macOS binary.
Click here for .gguf conversion and quantization tools
Newer rolling experimental builds can be found at https://github.com/LostRuins/koboldcpp/releases/tag/rolling; these are updated automatically and can be unstable.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help flag. You can also refer to the readme and the wiki.