koboldcpp-1.112
Finally made it edition
KoboldCpp is now the top fork in the original list of linked llama.cpp forks! We have finally crossed 10k stars and overtaken alpaca.cpp (it only took us 3 years to catch up).
- NEW: Added support for AceStep XL models.
  - AceStep XL uses the same AceStep LM, Embedder and VAE as AceStep 1.5, which you can get here (implementation referenced from @ServeurpersoCom).
  - Also fixed a bug affecting music quality on Vulkan, and further reduced the memory footprint of `--musiclowvrammode`.
- NEW: Added support for reasoning budget/reasoning effort - This is now supported when generating with a thinking model over the API. Pass the field `reasoning_effort` to set the budget; supported values are `high`, `medium`, `low`, `minimal` and `none`. If unspecified, no reasoning effort budget is enforced.
  - In KoboldAI Lite, this control can be found in Settings > Tokens > Thinking > Reasoning Effort.
  - If you're using a third-party frontend, it should be settable from their settings, as `reasoning_effort` is a known payload field name.
  - Otherwise, you can also set it manually from KoboldCpp by passing it from the launcher as a Default Param, e.g. `--gendefaults '{"reasoning_effort":"minimal"}'`, or by simply setting Default Params to `{"reasoning_effort":"minimal"}` in the GUI launcher.
- NEW: Added the `--swapadding` parameter: Do you want to use SWA but find the SWA window too small? This allows you to extend it while still keeping a relatively small KV memory footprint. It extends the SWA context by the specified number of tokens.
- NEW: Added support for q5_1 KV cache (Breaking Change) - You should now specify `--quantkv` with the cache type instead, e.g. `--quantkv q5_1`. Valid values are `f16`/`bf16`/`q8_0`/`q5_1`/`q4_0`. The old single-digit values are considered deprecated; avoid using them.
- NEW: Streaming now works along with Jinja tool calling when using `--jinjatools`.
- Fixed a potential incoherent state when attempting to rewind too far while SWA is enabled. If you had weird outputs with both FastForward and SWA enabled, this might fix it. If not, disable one of them or increase the SWA padding.
- Added `--baseconfig`, allowing a base config to be pre-loaded on every model swap. The config will be merged with the config you are attempting to load. This can be overridden by passing a `baseconfig` parameter over the `/api/admin/reload_config` API.
- Added `--image-min-tokens` and `--image-max-tokens` flags to allow setting min/max vision tokens for gemma4, similar to llama.cpp, credits @pi6am.
- Gemma4 E4B and E2B now support audio inputs.
- Added `--jinjatemplate`/`--chat-template-file` - This allows you to replace the Jinja template in your model with a custom template.
- Increased the multiuser default from 7 to 10.
- Autoswap: fixed some edge conditions
- The post-generate summary now includes the processed token count.
- Fix for `/api/extra/tokencount`; it now also allows input as OpenAI messages instead of a raw prompt, and will return the compiled prompt.
- Improvements to image handling in chat completions.
- Fixed a crash with a very large `--preloadstory`.
- Multiple Jinja tool calling fixes and improvements.
- Jinja tool calling improved for GPT-OSS, Qwen3, Qwen3.5, GLM models and Gemma4 models.
- If you notice any tool call parsing issues with a model, please report them.
- Reminder: Use `--jinjatools` to enable the Jinja template for tool calling (better quality). With that, tool calling should work optimally.
- Adjusted gemma4 fallback handling; new gemma4 templates are now handled.
- Updated image generation from upstream, fixes to sampler handling by @wbruna
- Chunk qwen3tts inputs longer than 1024 frames into multiple batches. This should allow for longer Qwen3TTS generation lengths.
- Updated Kobold Lite, multiple fixes and improvements, especially with thinking rendering.
- Merged fixes, new model support, and improvements from upstream
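As a quick illustration of the new reasoning budget control, here is a minimal sketch of an OpenAI-style chat completion request body that sets `reasoning_effort` (the model name and message content are placeholders, not part of this release):

```python
import json

# Sketch of a chat completion request body using the new field.
# Only "reasoning_effort" is the feature described above; the rest
# is the usual OpenAI-compatible payload shape.
payload = {
    "model": "kcpp",  # placeholder model name (assumption)
    "messages": [
        {"role": "user", "content": "Explain SWA in one sentence."}
    ],
    # one of: "high", "medium", "low", "minimal", "none"
    "reasoning_effort": "low",
}

body = json.dumps(payload)
```

If the field is omitted entirely, no reasoning effort budget is enforced, matching the default described above.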
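For the `--baseconfig` change, a sketch of overriding the base config per-swap via `/api/admin/reload_config` might look like this (the `filename` field name here is an assumption for illustration; only `baseconfig` is the override described above):

```python
import json

# Hypothetical admin reload payload: "baseconfig" overrides the
# --baseconfig launcher default for this swap only.
payload = {
    "filename": "mymodel.gguf",      # assumed field name, illustrative
    "baseconfig": "myconfig.kcpps",  # per-request base config override
}

body = json.dumps(payload)
```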
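The `/api/extra/tokencount` fix means the endpoint can now take either input shape. A sketch of the two request bodies, assuming the messages form reuses the standard OpenAI `messages` field name:

```python
import json

# Existing raw-prompt form.
raw_request = {"prompt": "Hello there, how may I help you today?"}

# New OpenAI-messages form (field name assumed); the server compiles
# these into a prompt and returns the compiled prompt with the count.
messages_request = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello there!"},
    ]
}

raw_body = json.dumps(raw_request)
msg_body = json.dumps(messages_request)
```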
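For the Jinja tool calling items, a sketch of an OpenAI-style tool definition as sent in a chat completion request; with `--jinjatools` enabled, the model's Jinja template formats these for the model. The tool name and schema are purely illustrative:

```python
import json

# Hypothetical tool definition in the standard OpenAI "tools" shape.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not from this release
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
}

body = json.dumps(request)
```

Streaming now works with this path too, so tool call deltas arrive incrementally rather than only in the final response.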
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (CUDA 11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can download our rolling ROCm binary here if you use Linux.
If you're on modern macOS (M-series), you can use the koboldcpp-mac-arm64 macOS binary.
Click here for .gguf conversion and quantization tools
Newer rolling experimental builds can be found here, these are auto-updated and may be unstable.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help flag. You can also refer to the readme and the wiki.