koboldcpp-1.30.3
A.K.A The "Back from the dead" edition.
KoboldCpp Changes:
- Added full OpenCL / CLBlast support for K-Quants, both prompt processing and GPU offloading for all K-quant formats (credits: @0cc4m)
- Added RWKV Sequence Mode enhancements for over 3X FASTER prompt processing in RWKV (credits: @LoganDark)
- Added support for the RWKV World Tokenizer and associated RWKV-World models; these will be automatically detected and selected as necessary.
- Added a true SSE-streaming endpoint (Agnaistic compatible) that can stream tokens in realtime while generating. Integrators can find it at `/api/extra/generate/stream` (a client sketch follows this list). (Credits: @SammCheese)
- Added an enhanced polled-streaming endpoint to fetch in-progress results without disrupting generation, which is now the default for Kobold Lite when using streaming in KoboldCpp. Integrators can find it at `/api/extra/generate/check` (see the second sketch after this list). The old 8-token chunked streaming can still be enabled by setting the parameter `streamamount=8` in the URL. The original KoboldAI United compatible `/api/v1/generate` endpoint also remains available.
- Added a new abort endpoint at `/api/extra/abort`, which aborts any in-progress generation without stopping the server. It has been integrated into Lite via the "Abort" button below the Submit button.
- Added support for lora base, now accepted as an optional second parameter, e.g. `--lora [lora_file] [base_model]`
- Updated to latest Kobold Lite (required for new endpoints).
- Pulled various other enhancements from upstream, plus a few RWKV bugfixes.
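
A minimal sketch of consuming the new SSE endpoint from Python, assuming a standard generate payload and that each `data:` frame carries a JSON object with a `token` field (the exact frame shape is my assumption, not something specified above):

```python
# Hypothetical SSE consumer for /api/extra/generate/stream.
# The frame shape (data: {"token": "..."}) is assumed -- verify against
# the actual server output before relying on it.
import json
import requests

payload = {"prompt": "Once upon a time,", "max_length": 80}

with requests.post(
    "http://localhost:5001/api/extra/generate/stream",
    json=payload,
    stream=True,  # keep the connection open and read tokens as they arrive
) as resp:
    for raw in resp.iter_lines():
        if raw.startswith(b"data:"):
            event = json.loads(raw[len(b"data:"):])
            print(event.get("token", ""), end="", flush=True)
print()
```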
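Similarly, a sketch of the polled-streaming flow together with the abort endpoint, assuming `/api/extra/generate/check` answers a plain GET with the usual `results[0].text` response shape (also an assumption):

```python
# Hypothetical polled-streaming client: run the blocking generate call in a
# background thread, poll /api/extra/generate/check for the partial text,
# and demonstrate /api/extra/abort by cancelling after a few seconds.
import threading
import time
import requests

BASE = "http://localhost:5001"

def generate():
    # The classic KoboldAI United endpoint; blocks until done or aborted.
    requests.post(f"{BASE}/api/v1/generate",
                  json={"prompt": "Once upon a time,", "max_length": 200})

worker = threading.Thread(target=generate)
worker.start()

deadline = time.time() + 5  # abort after ~5 seconds, purely for demonstration
while worker.is_alive():
    time.sleep(1)
    check = requests.get(f"{BASE}/api/extra/generate/check").json()
    partial = check.get("results", [{}])[0].get("text", "")
    print(f"partial text so far: {partial!r}")
    if time.time() > deadline:
        requests.post(f"{BASE}/api/extra/abort")  # stops generation, not the server
worker.join()
```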
1.30.2 Hotfix - Added a fix for RWKV crashing in seq mode, pulled upstream bugfixes, and rebuilt the CUDA version. For those wondering why a CUDA exe is not always included: apart from the size, the dependencies, and it supporting only Nvidia GPUs, it's also partially because it's a pain for me to build, since it can only be done in a dev environment with the CUDA Toolkit and Visual Studio on Windows.
1.30.3 Hotfix - Disabled RWKV seq mode for now, due to multiple complaints about speed and memory issues with bigger quantized models. I will keep a copy of 1.30.2 here in case anyone still wants it.
CUDA Bonus
Bonus: An alternative CUDA build has also been provided for this version, capable of running all the latest formats, including K-Quants. To use it, download and run the koboldcpp_CUDA_only.exe, which is a one-file pyinstaller.
Extra Bonus: CUDA now also supports the older ggjtv2 models, as support has been backported in! Note that CUDA builds will still not be generated by default, and support for them will be limited.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
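
For a quick smoke test without the browser, a minimal request against the classic endpoint could look like this (only `prompt` and `max_length` shown; all other sampler fields are optional):

```python
# Minimal synchronous generation request against a running koboldcpp server.
import requests

resp = requests.post(
    "http://localhost:5001/api/v1/generate",
    json={"prompt": "Hello, my name is", "max_length": 50},
)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```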
For more information, be sure to run the program with the `--help` flag.