koboldcpp-1.80
End of the year edition
- NEW: Added support for image Multimodal with Qwen2-VL! You can grab the quantized mmproj here for the 2B and 7B models, and then grab the 2B or 7B Instruct models from Bartowski.
- Note: Qwen2-VL vision is not working on Vulkan currently. The model will load and generate text fine, but it's unable to recognize anything. Works fine on CUDA and CPU. Follow ggerganov#10843
- For a quick start, here's a working template you can use
- NEW: Vulkan now has coopmat1 support, making it significantly faster on modern Nvidia cards (credits @0cc4m)
- Added a few new QoL flags:
  - --moeexperts: Overwrite the number of experts to use in MoE models
  - --failsafe: A proper way to set failsafe mode, which disables all CPU intrinsics and GPU usage.
  - --draftgpulayers: Set number of layers to offload for speculative decoding draft model
  - --draftgpusplit: GPU layer distribution ratio for draft model (default=same as main). Only works if using multi-GPUs.
- Fixes for buggy tkinter GUI launcher window in Linux (thanks @henk717)
- Restored support for ARM quants in Kobold (e.g. Q4_0_4_4), but you should consider switching to q4_0 eventually.
- Fixed a bug that caused context corruption when aborting a generation while halfway processing a prompt
- Added new field suppress_non_speech to Whisper, which allows banning "noise annotation" logits (e.g. Barking, Doorbell, Chime, Muzak)
- Improved compile flags on ARM, self-compiled builds now use correct native flags and should be significantly faster (tested on Pi and Termux). Simply run make for native ARM builds, or make LLAMA_PORTABLE=1 for a slower portable build.
- trim_stop now defaults to true (output will no longer contain the stop sequence by default)
- Debugmode shows drafted tokens, and allows incompatible vocabs for speculative decoding when enabled (not recommended)
- Handle more generation parameters in ollama API emulation
- Handle pyinstaller temp paths for chat adapters when saving a kcpps config file
- Default image gen sampler set to Euler
- MMQ is now the default for CLI as well. Use the nommq flag to disable it (e.g. --usecublas all nommq). Old flags still work.
- Upgrade build to use C++17
- Always use PCI Bus ID order for CUDA GPU listing consistency (match nvidia-smi)
- Updated Kobold Lite, multiple fixes and improvements
  - NEW: Added LaTeX rendering together with markdown. Uses standard \[...\], \(...\) and $$...$$ syntax.
  - You can now manually upload an audio file to transcribe in settings.
  - Better regex to trigger image generation
  - Aesthetic UI fixes
  - Added q as an alias to query for direct URL querying (e.g. http://localhost:5001?q=what+is+love)
  - Added support for AllTalk v2 API. AllTalk v1 is still supported automatically (credits @erew123)
  - Added support for Mantella XTTS (XTTS fork)
  - Toggle to disable "non-speech" whisper output (see above)
  - Consolidated Instruct templates (Mistral V3 merged to V7)
- Merged fixes and improvements from upstream
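As a rough illustration of how the new launch flags in this release combine, here is a minimal Python sketch that assembles an argument list. The helper `build_launch_args`, the file names, the numeric values, and the `koboldcpp` program name are all hypothetical. The `--model` and `--draftmodel` flags are not mentioned in these notes and are assumed here; check `--help` for the exact flags and value formats on your build.

```python
import shlex


def build_launch_args(model_path, moe_experts=None, draft_model=None,
                      draft_gpu_layers=None, use_cuda=False, mmq=True,
                      failsafe=False):
    """Build a koboldcpp argument list (hypothetical helper, not part of koboldcpp)."""
    # --model is assumed; it is not named in these release notes.
    args = ["--model", model_path]
    if moe_experts is not None:
        # Overwrite the number of experts used by a MoE model.
        args += ["--moeexperts", str(moe_experts)]
    if draft_model is not None:
        # --draftmodel is assumed; the draft layer flag only makes sense with it.
        args += ["--draftmodel", draft_model]
        if draft_gpu_layers is not None:
            args += ["--draftgpulayers", str(draft_gpu_layers)]
    if use_cuda:
        args += ["--usecublas", "all"]
        if not mmq:
            # MMQ is on by default now; nommq switches it off.
            args.append("nommq")
    if failsafe:
        args.append("--failsafe")
    return args


example_args = build_launch_args(
    "model.gguf",              # placeholder file name
    moe_experts=2,             # placeholder value
    draft_model="draft.gguf",  # placeholder file name
    draft_gpu_layers=8,        # placeholder value
    use_cuda=True,
    mmq=False,
)
command_line = shlex.join(["koboldcpp"] + example_args)
print(command_line)
```

The resulting list can be passed to `subprocess.run` together with the path to whichever binary you downloaded.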
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here
Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
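For scripted access, the sketch below shows two things from this release: building a direct query URL with the new q alias, and preparing a generation request that opts out of the new trim_stop default. It only constructs the URL and JSON body and sends nothing. The helper names are hypothetical, and the `/api/v1/generate` path and the `prompt` and `stop_sequence` field names are assumptions not stated in these notes, so verify them against the API documentation before relying on them.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:5001"  # default address shown above


def direct_query_url(question, base_url=BASE_URL):
    """Build a Kobold Lite direct-query URL using the q alias."""
    # urlencode encodes spaces as '+', matching the example in the notes.
    return f"{base_url}?{urlencode({'q': question})}"


def generate_request(prompt, stop_sequences, keep_stop=False, base_url=BASE_URL):
    """Return (url, json_body) for a generation call (assumed endpoint and fields)."""
    payload = {"prompt": prompt, "stop_sequence": list(stop_sequences)}
    if keep_stop:
        # trim_stop now defaults to true; set it to false to keep the
        # stop sequence in the returned text.
        payload["trim_stop"] = False
    return f"{base_url}/api/v1/generate", json.dumps(payload)


query_url = direct_query_url("what is love")
request_url, request_body = generate_request("Hello", ["\nUser:"], keep_stop=True)
print(query_url)
print(request_url, request_body)
```

The body can then be sent with any HTTP client as a POST with a JSON content type.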
For more information, be sure to run the program from command line with the --help flag. You can also refer to the readme and the wiki.