LostRuins/koboldcpp v1.12 on GitHub

koboldcpp-1.12
This is a bugfix release.

Fixed a few more scenarios where GPT2/GPTJ/GPTNeoX would run out of memory when using BLAS. Also, the max BLAS batch size for non-llama models is currently capped at 256.
Minor CLBlast optimizations should slightly increase speed.

To use, download and run koboldcpp.exe, which is a one-file PyInstaller executable.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.

Once the model is loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001
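
If it is unclear whether the server has finished starting, a short script can probe that address before you open it in a browser. This is only a minimal sketch: the helper name `is_server_up` and the timeout value are illustrative assumptions, and it checks only that something answers over HTTP at the URL, not that a model is loaded.

```python
import urllib.error
import urllib.request


def is_server_up(url="http://localhost:5001", timeout=2.0):
    """Return True if an HTTP server answers at `url`, False otherwise.

    Hypothetical helper for illustration; not part of koboldcpp.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except urllib.error.HTTPError:
        # The server replied with an error status, so it is reachable.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or host not found.
        return False


if __name__ == "__main__":
    if is_server_up():
        print("Server is reachable at http://localhost:5001")
    else:
        print("No response yet; the model may still be loading.")
```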

For more information, be sure to run the program with the --help flag.

Alternative Options:
Non-AVX2 version now included in the same .exe file; enable it with the --noavx2 flag
Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency
Run with your GPU using CLBlast, with the --useclblast flag for a speedup
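
The flags above can be combined when launching from a script instead of by drag and drop. The sketch below only assembles an argument list; `build_command` is a hypothetical helper, and passing the model path as the first positional argument is an assumption based on the drag-and-drop behaviour described earlier. `--useclblast` is left out because these notes do not say what arguments it takes; run the program with `--help` to check.

```python
def build_command(exe="koboldcpp.exe", model=None,
                  noavx2=False, smartcontext=False):
    """Assemble a koboldcpp command line as a list of arguments.

    Hypothetical helper for illustration. Only flags named in these
    release notes are used.
    """
    args = [exe]
    if model:
        # Assumption: the model path is accepted as a positional argument,
        # as happens when a file is dropped onto the .exe.
        args.append(model)
    if noavx2:
        args.append("--noavx2")
    if smartcontext:
        args.append("--smartcontext")
    return args


if __name__ == "__main__":
    # "model.bin" is a placeholder file name.
    command = build_command(model="model.bin", smartcontext=True)
    print(" ".join(command))
    # To actually launch it: subprocess.run(command, check=True)
```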

Disclaimer: This version has Cloudflare Insights in the Kobold Lite UI, which was subsequently removed in v1.17
