koboldcpp-1.35
Note: This build includes significant CUDA changes and may be less stable than usual - please report any performance regressions or bugs you encounter. If it is slower for you, please use the previous version for now.
- Enabled the CUDA 8-bit MMV mode (see ggerganov#2067), now that it seems stable enough and works correctly. This approach uses quantized dot products instead of the traditional DMMV approach for the `q4_0`, `q4_1`, `q5_0` and `q5_1` formats. If you're able to do a full GPU offload, CUDA for such models will likely be significantly faster than before. K-quants and CL are not affected.
- Exposed performance information through the API (prompt processing and generation timing); access it at `/api/extra/perf` (see the example request after this list).
- Added support for linear RoPE as an alternative to NTK-Aware RoPE (similar to 1.33, but using 2048 as a base). This is triggered by the launcher parameter `--linearrope`. The RoPE scale is determined by the `--contextsize` parameter, so for best results on SuperHOT models you should launch with `--linearrope --contextsize 8192`, which provides a 0.25 linear scale as the SuperHOT finetune suggests (see the scale sketch after this list). If `--linearrope` is not specified, NTK-aware RoPE is used by default.
- Added a Save and Load settings option to the GUI launcher.
- Added the ability to select "All Devices" in the GUI for CUDA. Selecting a specific device is still recommended, as splitting across GPUs is usually slower.
- Displays a warning if poor sampler orders are used, as the default configuration will give much better results.
- Updated Kobold Lite, pulled other upstream fixes and optimizations.
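
As a rough illustration of the new performance endpoint, the minimal sketch below queries it over HTTP, assuming the server is running on the default port 5001; the exact response fields are not listed in these notes, so it simply prints whatever comes back.

```python
# Minimal sketch: query the new /api/extra/perf endpoint (assumes the
# default http://localhost:5001 address; response fields may vary).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:5001/api/extra/perf") as resp:
    perf = json.load(resp)

# Prompt processing and generation timings are reported here.
print(json.dumps(perf, indent=2))
```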
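
As a quick sanity check on the linear RoPE numbers above, this sketch shows how the scale follows from the 2048-token base and the chosen context size (the function name is illustrative, not a koboldcpp internal):

```python
# Illustrative only: linear RoPE scale relative to the 2048-token base.
TRAINED_CONTEXT = 2048

def linear_rope_scale(context_size: int) -> float:
    """Scale applied when --linearrope is used with a given --contextsize."""
    return TRAINED_CONTEXT / context_size

# --contextsize 8192 yields the 0.25 scale suggested for SuperHOT models.
print(linear_rope_scale(8192))  # 0.25
```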
1.35.H Henk-Cuda Hotfix: This is an alternative version from Henk that you can try if you encounter speed reductions. Please let me know if it's better for you.
Henk may have newer versions at https://github.com/henk717/koboldcpp/releases/tag/1.35; please check there for now. I will only be able to upstream any fixes in a few days.
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once the model is loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
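
If you prefer to connect programmatically instead of through the browser, a minimal sketch along these lines should work, assuming the standard KoboldAI-compatible `/api/v1/generate` endpoint on the default port; the payload shows only a couple of common parameters.

```python
# Minimal sketch: send a generation request to a running koboldcpp instance
# (assumes the KoboldAI-compatible /api/v1/generate endpoint on port 5001).
import json
import urllib.request

payload = {"prompt": "Once upon a time,", "max_length": 64}
req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    response = json.load(resp)

# In the KoboldAI API, the generated text appears under the "results" field.
print(json.dumps(response, indent=2))
```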
For more information, be sure to run the program from the command line with the `--help` flag.