koboldcpp-1.112
Finally made it edition
KoboldCpp is now the top fork in the original list of linked llama.cpp forks! We have finally crossed 10k stars and overtaken alpaca.cpp (it only took us 3 years to catch up).
- NEW: Added support for AceStep XL models.
  - AceStep XL uses the same AceStep LM, Embedder and VAE as AceStep 1.5, which you can get here (implementation referenced from @ServeurpersoCom).
  - Also fixed a bug affecting music quality on Vulkan, and further reduced the memory footprint of `--musiclowvrammode`.
- NEW: Added support for reasoning budget/reasoning effort - This is now supported when generating with a thinking model over the API. Pass the field `reasoning_effort` to set the budget; supported values are `high`, `medium`, `low`, `minimal` and `none`. If unspecified, no reasoning effort budget is enforced.
  - In KoboldAI Lite, this control can be found in Settings > Tokens > Thinking > Reasoning Effort.
  - If you're using a third-party frontend, it should be settable from their settings, as `reasoning_effort` is a known payload field name.
  - Otherwise, you can also set it manually from KoboldCpp by passing it from the launcher as a Default Param, e.g. `--gendefaults '{"reasoning_effort":"minimal"}'`, or by simply setting Default Params to `{"reasoning_effort":"minimal"}` in the GUI launcher.
- NEW: Added the `--swapadding` parameter: Do you want to use SWA but find the SWA window too small? This allows you to extend it while still keeping a relatively small KV memory footprint. It extends the SWA context by the specified number of tokens.
- NEW: Added support for q5_1 KV cache (Breaking Change) - You should now specify `--quantkv` with the cache type instead, e.g. `--quantkv q5_1`. Valid values are `f16`/`bf16`/`q8_0`/`q5_1`/`q4_0`. The old single-digit values are considered deprecated; avoid using them.
- NEW: Streaming now works along with Jinja tool calling when using `--jinjatools`.
- Fixed a potential incoherent state when attempting to rewind too far while SWA is enabled. If you had weird outputs with both FastForward and SWA enabled, this might fix it. If not, disable one of them or increase the SWA padding.
- Added `--baseconfig`, allowing a base config to be pre-loaded on every model swap. The config will be merged with the config you are attempting to load. This can be overridden by passing a `baseconfig` parameter over the `/api/admin/reload_config` API.
- Added `--image-min-tokens` and `--image-max-tokens` flags to allow setting min/max vision tokens for gemma4, similar to llama.cpp, credits @pi6am.
- Gemma4 E4B and E2B now support audio inputs.
- Added `--jinjatemplate`/`--chat-template-file` - This allows you to replace the Jinja template in your model with a custom template.
- Increased the multiuser default from 7 to 10.
- Autoswap: fixed some edge conditions
- The post-generate summary now includes the processed token count.
- Fix for `/api/extra/tokencount`; it now also allows input as OpenAI messages instead of a raw prompt, and will return the compiled prompt.
- Improvements to image handling in chat completions.
- Fixed a crash with a very large `--preloadstory`.
- Multiple Jinja tool calling fixes and improvements.
- Jinja tool calling improved for GPT-OSS, Qwen3, Qwen3.5, GLM models and Gemma4 models.
- If you notice any tool call parsing issues with a model, please report them.
- Reminder: Use `--jinjatools` to enable the Jinja template for tool calling (better quality). With that, tool calling should work optimally.
- Adjusted gemma4 fallback handling; new gemma4 templates are now handled.
- Updated image generation from upstream, fixes to sampler handling by @wbruna
- Chunk qwen3tts inputs longer than 1024 frames into multiple batches. This should allow for longer Qwen3TTS generation lengths.
- Updated Kobold Lite, multiple fixes and improvements, especially with thinking rendering.
- Merged fixes, new model support, and improvements from upstream
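As a quick illustration of the new reasoning budget control, here is a minimal sketch of an OpenAI-style chat completion request body that sets `reasoning_effort` (the model name and message content are placeholders, not part of this release):

```python
import json

# Sketch of a chat completion request body using the new field.
# Only "reasoning_effort" is the feature described above; the rest
# is the usual OpenAI-compatible payload shape.
payload = {
    "model": "kcpp",  # placeholder model name (assumption)
    "messages": [
        {"role": "user", "content": "Explain SWA in one sentence."}
    ],
    # one of: "high", "medium", "low", "minimal", "none"
    "reasoning_effort": "low",
}

body = json.dumps(payload)
```

If the field is omitted entirely, no reasoning effort budget is enforced, matching the default described above.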
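For the `--baseconfig` change, a sketch of overriding the base config per-swap via `/api/admin/reload_config` might look like this (the `filename` field name here is an assumption for illustration; only `baseconfig` is the override described above):

```python
import json

# Hypothetical admin reload payload: "baseconfig" overrides the
# --baseconfig launcher default for this swap only.
payload = {
    "filename": "mymodel.gguf",      # assumed field name, illustrative
    "baseconfig": "myconfig.kcpps",  # per-request base config override
}

body = json.dumps(payload)
```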
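The `/api/extra/tokencount` fix means the endpoint can now take either input shape. A sketch of the two request bodies, assuming the messages form reuses the standard OpenAI `messages` field name:

```python
import json

# Existing raw-prompt form.
raw_request = {"prompt": "Hello there, how may I help you today?"}

# New OpenAI-messages form (field name assumed); the server compiles
# these into a prompt and returns the compiled prompt with the count.
messages_request = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello there!"},
    ]
}

raw_body = json.dumps(raw_request)
msg_body = json.dumps(messages_request)
```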
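For the Jinja tool calling items, a sketch of an OpenAI-style tool definition as sent in a chat completion request; with `--jinjatools` enabled, the model's Jinja template formats these for the model. The tool name and schema are purely illustrative:

```python
import json

# Hypothetical tool definition in the standard OpenAI "tools" shape.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not from this release
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

request = {
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": tools,
}

body = json.dumps(request)
```

Streaming now works with this path too, so tool call deltas arrive incrementally rather than only in the final response.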
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (CUDA 11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can download our rolling ROCm binary here if you use Linux.
If you're on modern macOS (M-series), you can use the koboldcpp-mac-arm64 macOS binary.
Click here for .gguf conversion and quantization tools
Newer rolling experimental builds can be found here, these are auto-updated and may be unstable.
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from command line with the --help flag. You can also refer to the readme and the wiki.