koboldcpp-1.90
Qwen of the line edition
- NEW: Android Termux Auto-Installer - You can now set up KoboldCpp via Termux on Android with a single command, which triggers an automated installation script. Check it out here. Install Termux from F-Droid, then run the command with internet access, and everything will be set up, downloaded, compiled, and configured for instant use with a Gemma3-1B model.
- Merged support for Qwen3. This now also triggers `--nobostoken` automatically if the model metadata explicitly indicates no_bos_token; it can still be enabled manually for other models.
- Fixes for THUDM GLM-4. Note that this model requires `--blasbatchsize 16` or smaller in order to get coherent output.
- Merged overhaul of the Qwen2.5VL projector. Both the old (HimariO version) and new (ngxson version) mmprojs should work, retaining backwards compatibility. However, you should update to the new projectors.
- Merged functioning Pixtral support. Note that Pixtral is very token heavy (about 4000 tokens for a 1024px image); you can try increasing `--contextsize` or lowering `--visionmaxres`.
- Added support for OpenAI Structured Outputs in the chat completions API; the schema is also accepted when sent as a stringified JSON object in the "grammar" field. You can use this to enforce JSON outputs with a specific schema (see the request sketch after the changelog).
- `--blasbatchsize -1` now exclusively uses a batch size of 1 when processing the prompt. `--blasbatchsize 16` is also permitted, which replicates the old behavior (a batch of 16 does not trigger GEMM).
- The KCPP API server now correctly handles explicitly set nulled fields.
- Fixed Zenity/YAD detection not working correctly in the previous version.
- Improved input sanitization when launching and passing a URL as a model param. Also, for better security, `--onready` shell commands can still be used as a CLI parameter, but cannot be embedded into a .kcppt or .kcpps file.
- More robust checks for system glslc when building Vulkan shaders.
- Improved automatic GPU layer selection when loading multi-part GGUF models (on a single GPU); also slightly tightened memory estimation, and now accounts for quantized KV cache when guessing layers.
- Added new flag `--mmprojcpu` that allows you to load and run the projector on the CPU while keeping the main model on the GPU.
- Noscript mode now randomizes generated image names to prevent browser caching.
- Updated Kobold Lite, multiple fixes and improvements
- Increased default tokens generated and slider limits (can be overridden)
- ChatGLM-4 and Qwen3 (ChatML thinking/no-thinking) presets added. You can disable thinking in Qwen3 by swapping between ChatML (No Thinking) and normal ChatML.
- Added toggle to disable LaTeX while leaving markdown enabled
- Merged fixes and improvements from upstream
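
For example, a Structured Outputs request could look roughly like the following sketch. It assumes KoboldCpp is running on its default port, and that the endpoint path, `response_format` wrapper, and response shape follow the OpenAI chat completions convention; the schema and field values here are illustrative, not taken from the KoboldCpp docs.

```python
# Sketch of enforcing a JSON schema via OpenAI-style Structured Outputs.
# Endpoint path and payload/response shapes are assumptions based on the
# OpenAI chat completions convention, not quoted from KoboldCpp docs.
import requests

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

payload = {
    "messages": [{"role": "user", "content": "Invent a character and return JSON."}],
    "max_tokens": 200,
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "character", "schema": schema},
    },
    # Per the release notes, the schema is also accepted when sent as a
    # stringified JSON object in the "grammar" field instead.
}

r = requests.post("http://localhost:5001/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])
```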
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
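
As a rough illustration, launching with some of the flags mentioned above might look like the following Python sketch; the binary name, model filename, and values are placeholders, and only the flag names come from this release.

```python
# Minimal launch sketch; "./koboldcpp" and the GGUF filename are placeholders.
import subprocess

subprocess.run([
    "./koboldcpp",
    "--model", "GLM-4-9B-Q4_K_M.gguf",  # hypothetical local GGUF file
    "--contextsize", "8192",
    "--blasbatchsize", "16",            # GLM-4 currently needs 16 or smaller
])
```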
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
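
As a quick sanity check once the server is up, a plain generation request might look like this sketch; the `/api/v1/generate` path and response fields follow the KoboldAI-compatible API and are assumptions here, not quoted from this release.

```python
# Sketch of a basic generation request; endpoint path and response shape are
# assumed to follow the KoboldAI-compatible API that KoboldCpp exposes.
import requests

resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json={"prompt": "Once upon a time", "max_length": 80},
)
print(resp.json()["results"][0]["text"])
```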
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.