koboldcpp-1.99.1
a darker shade of blue edition

- NEW: The bundled KoboldAI Lite UI has received a substantial design overhaul in an effort to make it look more modern and polished. The default color scheme has been changed; however, the old color scheme is still available (set the 'Nostalgia' color scheme in Advanced Settings). A few extra custom color schemes have also been added (thanks Lakius, TwistedShadows, toastypigeon, @PeterPeet). Please report any UI bugs you encounter.
- QOL Change: Added aliases for llama.cpp command-line flags. To reduce the learning curve for llama.cpp users, the following llama.cpp compatibility flags have been added: `-m`, `-t`, `--ctx-size`, `-c`, `--gpu-layers`, `--n-gpu-layers`, `-ngl`, `--tensor-split`, `-ts`, `--main-gpu`, `-mg`, `--batch-size`, `-b`, `--threads-batch`, `--no-context-shift`, `--mlock`, `-p`, `--no-mmproj-offload`, `--model-draft`, `-md`, `--draft-max`, `--draft-n`, `--gpu-layers-draft`, `--n-gpu-layers-draft`, `-ngld`, `--flash-attn`, `-fa`, `--n-cpu-moe`, `-ncmoe`, `--override-kv`, `--override-tensor`, `-ot`, `--no-mmap`. They should behave as you'd expect from llama.cpp (see the example launch command after the changelog).
- Renamed `--promptlimit` to `--genlimit`; it now applies to API requests as well and can be set in the UI launcher.
- Added a new parameter `--ratelimit` that applies per-IP rate limiting (to help prevent abuse of public instances).
- Fixed automatic VRAM detection for ROCm and Vulkan backends on AMD systems (thanks @lone-cloud)
- Hide API info display if running in CLI mode.
- Flash attention is now checked by default when using the GUI launcher. (Reverted in 1.99.1 by popular demand)
- Try to fix some embedding models using too much memory.
- Standardize model file download locations to the koboldcpp executable's directory. This should help solve issues with non-writable system paths when launching from a different working directory. If you prefer the old behavior, please send some feedback, but I think standardizing it is better than adding special exceptions for some directory paths.
- Add psutil to conda environment. Please report if this breaks any setups.
- Added `/v1/audio/voices` endpoint, fixed wrong Dia voice mapping.
- Updated Kobold Lite, multiple fixes and improvements:
- UI design rework, as mentioned above
- Fixes for markdown renderer
- Added a popup to allow enabling TTS or image generation if it's disabled but available.
- Added new scenario "Aletheia"
- Increased default context size and amount generated
- Fix for GPT-OSS instruct format.
- Smarter automatic detection for "Enter Sends" default based on platform. Toggle moved into advanced settings.
- Fix for Palemoon browser compatibility
- Reworked the best-practices recommendation for think tags: it now provides Think/NoThink instruct tags for each instruct sequence. You are now recommended to explicitly select the correct Think/NoThink instruct tag instead of using the `<think>` forced/prevented prefill. This will provide better results for preventing reasoning than simply injecting a blank `<think></think>`, since some models require specialized reasoning trace formats.
- For example, to prevent thinking in GLM-Air, you're simply recommended to set the instruct tag to `GLM-4.5 Non-Thinking` and leave "Insert Thinking" as "Normal" instead of manually messing with the tag injections. This ensures the correct postfix tags for each format are used.
- By default, the KoboldCpp Automatic template permits thinking in models that use it.
- Merged new model support, fixes and improvements from upstream
Hotfix 1.99.1 - Fix for Chroma, reverted the flash attention default back to off, reverted ggml-org#16056, fixed ROCm compile issues.
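As a rough illustration of the new aliases and parameters above (the model path and numeric values are placeholders, and the exact argument format for `--ratelimit` is an assumption; check `--help` for the authoritative list):

```sh
# Hypothetical launch command mixing llama.cpp-style aliases with the new flags.
# -m, -c, and -ngl are the new aliases for the model path, context size, and GPU layers;
# --genlimit is the renamed --promptlimit; --ratelimit enables per-IP rate limiting.
./koboldcpp-linux-x64 -m ./models/example-model.Q4_K_M.gguf -c 8192 -ngl 99 \
    --genlimit 512 --ratelimit 30
```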
Download and run the koboldcpp.exe (Windows) or koboldcpp-linux-x64 (Linux), which is a one-file pyinstaller for NVIDIA GPU users.
If you have an older CPU or older NVIDIA GPU and koboldcpp does not work, try the oldpc version instead (CUDA 11 + AVX1).
If you don't have an NVIDIA GPU, or do not need CUDA, you can use the nocuda version which is smaller.
If you're using AMD, we recommend trying the Vulkan option in the nocuda build first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here if you are a Windows user or download our rolling ROCm binary here if you use Linux.
If you're on a modern MacOS (M-Series) you can use the koboldcpp-mac-arm64 MacOS binary.
Click here for .gguf conversion and quantization tools
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI, and once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag. You can also refer to the readme and the wiki.
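As a minimal sketch of talking to the server once a model is loaded (assuming the default port 5001; the prompt and parameters are placeholders):

```sh
# List available TTS voices via the OpenAI-compatible endpoint added in this release
curl http://localhost:5001/v1/audio/voices

# Request a short completion from the native KoboldAI generate endpoint
curl -X POST http://localhost:5001/api/v1/generate \
     -H "Content-Type: application/json" \
     -d '{"prompt": "Hello, KoboldCpp!", "max_length": 64}'
```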