koboldcpp-1.90
Qwen of the line edition
- NEW: Android Termux Auto-Installer - You can now set up KoboldCpp via Termux on Android with a single command, which triggers an automated installation script. Check it out here. Install Termux from F-Droid, then run the command with internet access, and everything will be set up, downloaded, compiled, and configured for instant use with a Gemma3-1B model.
- Merged support for Qwen3. This now also triggers `--nobostoken` automatically if the model metadata explicitly indicates no_bos_token; it can still be enabled manually for other models.
- Fixes for THUDM GLM-4. Note that this model requires `--blasbatchsize 16` or smaller in order to get coherent output.
- Merged overhaul of the Qwen2.5VL projector. Both the old (HimariO version) and new (ngxson version) mmprojs should work, retaining backwards compatibility. However, you should update to the new projectors.
- Merged functioning Pixtral support. Note that Pixtral is very token heavy (about 4000 tokens for a 1024px image); you can try increasing `--contextsize` or lowering `--visionmaxres`.
- Added support for OpenAI Structured Outputs in the chat completions API; the schema is also accepted when sent as a stringified JSON object in the "grammar" field. You can use this to enforce JSON outputs with a specific schema (see the request sketch after the changelog).
- `--blasbatchsize -1` now exclusively uses a batch size of 1 when processing the prompt. `--blasbatchsize 16` is also permitted, which replicates the old behavior (a batch of 16 does not trigger GEMM).
- The KCPP API server now correctly handles explicitly set nulled fields.
- Fixed Zenity/YAD detection not working correctly in the previous version.
- Improved input sanitization when launching and passing a URL as a model param. Also, for better security, `--onready` shell commands can still be used as a CLI parameter, but cannot be embedded into a .kcppt or .kcpps file.
- More robust checks for system glslc when building Vulkan shaders.
- Improved automatic GPU layer selection when loading multi-part GGUF models (on a single GPU); also slightly tightened memory estimation, and now accounts for quantized KV cache when guessing layers.
- Added new flag `--mmprojcpu` that allows you to load and run the projector on the CPU while keeping the main model on the GPU.
- Noscript mode now randomizes generated image names to prevent browser caching.
- Updated Kobold Lite, multiple fixes and improvements
- Increased default tokens generated and slider limits (can be overridden)
- ChatGLM-4 and Qwen3 (ChatML thinking/no-thinking) presets added. You can disable thinking in Qwen3 by swapping between ChatML (No Thinking) and normal ChatML.
- Added toggle to disable LaTeX while leaving markdown enabled
- Merged fixes and improvements from upstream
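
For example, a Structured Outputs request could look roughly like the following sketch. It assumes KoboldCpp is running on its default port, and that the endpoint path, `response_format` wrapper, and response shape follow the OpenAI chat completions convention; the schema and field values here are illustrative, not taken from the KoboldCpp docs.

```python
# Sketch of enforcing a JSON schema via OpenAI-style Structured Outputs.
# Endpoint path and payload/response shapes are assumptions based on the
# OpenAI chat completions convention, not quoted from KoboldCpp docs.
import requests

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

payload = {
    "messages": [{"role": "user", "content": "Invent a character and return JSON."}],
    "max_tokens": 200,
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "character", "schema": schema},
    },
    # Per the release notes, the schema is also accepted when sent as a
    # stringified JSON object in the "grammar" field instead.
}

r = requests.post("http://localhost:5001/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])
```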
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
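
As a rough illustration, launching with some of the flags mentioned above might look like the following Python sketch; the binary name, model filename, and values are placeholders, and only the flag names come from this release.

```python
# Minimal launch sketch; "./koboldcpp" and the GGUF filename are placeholders.
import subprocess

subprocess.run([
    "./koboldcpp",
    "--model", "GLM-4-9B-Q4_K_M.gguf",  # hypothetical local GGUF file
    "--contextsize", "8192",
    "--blasbatchsize", "16",            # GLM-4 currently needs 16 or smaller
])
```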
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
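
As a quick sanity check once the server is up, a plain generation request might look like this sketch; the `/api/v1/generate` path and response fields follow the KoboldAI-compatible API and are assumptions here, not quoted from this release.

```python
# Sketch of a basic generation request; endpoint path and response shape are
# assumed to follow the KoboldAI-compatible API that KoboldCpp exposes.
import requests

resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json={"prompt": "Once upon a time", "max_length": 80},
)
print(resp.json()["results"][0]["text"])
```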
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.