koboldcpp-1.48.1
Harder Better Faster Stronger Edition
- NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing (a conceptual sketch of the idea follows this changelog). So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext.
- Note: Context Shifting is enabled by default, and will override SmartContext if both are enabled. Context Shifting still needs more testing. Your outputs may be different with shifting enabled, but both seem equally coherent. To disable Context Shifting, use the flag `--noshift`. If you observe a bug, please report an issue or send a PR fix.
- 'Tensor Core' Changes: KoboldCpp now handles MMQ/Tensor Cores differently from upstream. Here's a breakdown (also summarized as decision logic in the sketch after this changelog):
  - Old approach (everybody): if MMQ is enabled, just use MMQ. If CuBLAS is enabled, just use CuBLAS. MMQ dimensions set to "FAVOR BIG".
  - New approach (upstream llama.cpp): you cannot toggle MMQ anymore; it is always enabled. MMQ dimensions set to "FAVOR SMALL". CuBLAS always kicks in if batch > 32.
  - New approach (koboldcpp): you CAN toggle MMQ. It is always enabled until batch > 32, at which point CuBLAS kicks in only if the MMQ flag is false; otherwise MMQ is still used for all batches. MMQ dimensions set to "FAVOR BIG".
- Added GPU Info Display and Auto GPU Layer Selection For Newbies - Uses a combination of `clinfo` and `nvidia-smi` queries to automatically determine and display the user's GPU name in the GUI, and suggests the number of GPU layers to use when first choosing a model, based on available VRAM and model file size. Not optimal, but it should give usable defaults and be even more newbie friendly now. You can thereafter edit the actual GPU layers to use. (Credit: Original concept adapted from @YellowRoseCx)
- Added Min-P sampler - It is now available over the API (see the example at the end of this page), and can also be set in Lite from the Advanced settings tab. (Credit: @kalomaze)
- Added `--remotetunnel` flag, which downloads and creates a TryCloudFlare remote tunnel, allowing you to access koboldcpp remotely over the internet even behind a firewall. Note: This downloads a tool called `Cloudflared` to the same directory.
- Added a new build target for Windows exe users, `koboldcpp_clblast_noavx2`, now providing a "CLBlast NoAVX2 (Old CPU)" option that finally supports CLBlast acceleration for Windows devices without AVX2 intrinsics.
- Include `Content-Length` header in responses.
- Fixed some crashes with other uncommon models in CUDA mode.
- Retained support for GGUFv1, but you're encouraged to update as upstream has removed support.
- Minor tweaks and optimizations to streaming timings. Fixed a segfault that occurred when streaming in multiuser mode and aborting the connection halfway.
- `freq_base_train` is now taken into account when setting the automatic RoPE scale; that should handle CodeLlama correctly now.
- Updated Kobold Lite, added support for selecting Min-P and Sampler Seeds (for proper deterministic generation).
- Improved KoboldCpp Colab, now with prebuilt CUDA binaries. Time to load after launch is less than a minute, excluding model downloads. Added a few more default model options, you can also use any custom GGUF model URL. (Try it here!)
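To make the Context Shifting entry above more concrete, here is a purely conceptual Python sketch (not koboldcpp's actual code): when the new context is the old one with some leading tokens trimmed and new tokens appended, the KV cache entries for the shared portion can be kept (with their positions shifted) and only the appended tail needs a forward pass.

```python
def tokens_to_evaluate(old_ctx: list[int], new_ctx: list[int]) -> list[int]:
    """Conceptual illustration only: return the tokens that still need to be
    processed, assuming the rest of the KV cache can be shifted and reused."""
    for trim in range(len(old_ctx) + 1):
        kept = old_ctx[trim:]               # tokens whose cache entries survive the shift
        if new_ctx[:len(kept)] == kept:     # new context starts with the kept tokens
            return new_ctx[len(kept):]      # only the freshly appended tail is evaluated
    return new_ctx                          # no overlap at all: full reprocess

# Example: 2 old tokens dropped, 2 new tokens appended -> only [8, 9] need processing.
print(tokens_to_evaluate([1, 2, 3, 4, 5], [3, 4, 5, 8, 9]))  # [8, 9]
```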
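The 'Tensor Core' breakdown above can also be read as decision logic. The sketch below is illustrative only (the real dispatch lives in the CUDA backend); `mmq_flag` stands for the user's MMQ/Tensor Cores toggle and `batch` for the current batch size.

```python
def old_approach(mmq_flag: bool, batch: int) -> str:
    # Previous behaviour (everybody): the toggle alone decides the kernel.
    return "MMQ" if mmq_flag else "cuBLAS"

def upstream_llamacpp(mmq_flag: bool, batch: int) -> str:
    # Upstream llama.cpp: no toggle; cuBLAS always takes over above batch 32.
    return "cuBLAS" if batch > 32 else "MMQ"

def koboldcpp(mmq_flag: bool, batch: int) -> str:
    # KoboldCpp: MMQ stays on; cuBLAS only kicks in for large batches
    # when the MMQ flag is turned off.
    return "cuBLAS" if batch > 32 and not mmq_flag else "MMQ"
```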
Hotfix 1.48.1 - Fixed issues with Multi-GPU setups. GUI defaults to CuBLAS if available. Other minor fixes
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once the model is loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.
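As a minimal sketch of connecting to the server programmatically instead of through the browser, the snippet below sends a generate request to the default endpoint and includes the new Min-P sampler value. It assumes the KoboldAI-compatible `/api/v1/generate` route and the `min_p` field name; check your build's API documentation if these differ.

```python
import json
import urllib.request

# Assumes a koboldcpp instance listening on the default port with a model loaded.
url = "http://localhost:5001/api/v1/generate"
payload = {
    "prompt": "Once upon a time,",
    "max_length": 80,       # number of tokens to generate
    "temperature": 0.7,
    "min_p": 0.1,           # the newly added Min-P sampler (assumed field name)
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

# The KoboldAI-style API typically returns generated text under results[0].text.
print(result["results"][0]["text"])
```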