koboldcpp-1.36

  • Reverted an upstream change to sched_yield() that caused slowdowns for certain systems. This should fix speed regressions in 1.35. If you're still experiencing poorer speeds compared to earlier versions, please raise an issue with details.
  • Reworked the command line args for RoPE extended context to match upstream. The --linearrope flag has been removed; instead, you can now use --ropeconfig to customize both the RoPE frequency scale (linear) and the RoPE frequency base (NTK-aware), e.g. --ropeconfig 0.5 10000 for a 2x linear scale (see the example commands after this list). By default, long-context NTK-aware RoPE is configured automatically based on your --contextsize parameter, as in previous versions. If you're using LLAMA2 at 4K context, you'll probably want --ropeconfig 1.0 10000 to take advantage of the native 4K tuning without scaling. For ease of use, this can also be set in the GUI.
  • Exposed additional token counter information through the /api/extra/perf API endpoint (an example query is shown below).
  • The warning for poor sampler orders is now shown only once per session and excludes mirostat. Some people have reported issues with it, so please let me know if it's still causing problems; it's only a text warning and should not affect actual operation.
  • Replaced the model busy flag with a thread lock (credits @ycros).
  • Tweaked scratch and KV buffer allocation sizes for extended context.
  • Updated Kobold Lite, with better whitespace trim support and a new toggle for partial chat responses.
  • Pulled other upstream fixes and optimizations.
  • Downgraded the Windows CUDA libraries to 11.4 for smaller exe filesizes (the same version previously tried by @henk717). Please do report any issues or regressions encountered with this version.
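
For example, a launch using a 2x linear RoPE scale for extended context, or the native LLAMA2 4K tuning without scaling, might look like the following. These commands are illustrative only: the --model flag and the model filename are placeholders, so check --help for the exact arguments in your build.

    koboldcpp.exe --model model.bin --contextsize 8192 --ropeconfig 0.5 10000
    koboldcpp.exe --model model.bin --contextsize 4096 --ropeconfig 1.0 10000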

To use, download and run koboldcpp.exe, which is a single-file PyInstaller build.
If you don't need CUDA, you can use koboldcpp_nocuda.exe, which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once the model is loaded, you can connect at the address below (or use the full KoboldAI client):
http://localhost:5001
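
Once the server is running, the /api/extra/perf endpoint can be queried for the additional token counter and performance information. A minimal sketch in Python, assuming the requests package is installed (the exact response fields are not listed here):

    import requests

    # Query the local koboldcpp server for performance / token counter stats.
    # The default port is 5001; adjust the URL if you launched on a different port.
    resp = requests.get("http://localhost:5001/api/extra/perf")
    resp.raise_for_status()
    print(resp.json())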

For more information, be sure to run the program from the command line with the --help flag.
