koboldcpp-1.9
This was such a good update that I had to make a new version, so there are 2 new releases today.
- Stopping sequences are now fully supported in the API! They are implemented in a way that is similar to and compatible with my United PR one-some/KoboldAI-united#5, and should shortly be usable in online Lite as well as (eventually) the main Kobold client once that gets merged. In practice, this means the AI can finish a response early even when not all of the requested response tokens have been consumed, saving time by sending the reply instead of generating excess unneeded tokens. This integrates automatically with the latest version of Kobold Lite (also updated here), which sets the correct stop sequences for Chat and Instruct mode.
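As a sketch of how a client might use this feature (the `/api/v1/generate` endpoint path and the `stop_sequence` field name are assumptions based on the KoboldAI United API; check the API docs for your version):

```python
import json
import urllib.request

# Hypothetical payload for a KoboldAI-compatible generate endpoint.
# "stop_sequence" is assumed to be the field carrying stopping sequences.
payload = {
    "prompt": "You: Hello!\nBot:",
    "max_length": 80,
    # Stop as soon as the model begins a new "You:" turn, rather than
    # always generating the full 80 tokens.
    "stop_sequence": ["You:"],
}

def generate(url="http://localhost:5001/api/v1/generate"):
    # Requires a running koboldcpp instance; not executed here.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a stop sequence set, the server can return the reply as soon as the sequence appears, instead of consuming the whole token budget.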
- GPT-J and GPT-2 models now support BLAS mode! They use a smaller batch size than llama models, but prompt processing should still be noticeably faster!
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once the model has loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program with the --help flag.
Alternative Options:
A non-AVX2 version is now included in the same .exe file; enable it with the --noavx2 flag
Big context too slow? Try the --smartcontext flag to reduce prompt processing frequency
Run with your GPU using CLBlast, with the --useclblast flag, for a speedup! (Credits to Occam)
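The flag usage above might look like this (the model filename and the platform/device arguments to --useclblast are placeholders; run with --help to confirm the exact syntax for your version):

```shell
# CPU without AVX2 support
koboldcpp.exe --noavx2 model.bin

# Reduce how often the full prompt is reprocessed
koboldcpp.exe --smartcontext model.bin

# GPU acceleration via CLBlast (arguments assumed to be platform/device IDs)
koboldcpp.exe --useclblast 0 0 model.bin
```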