koboldcpp-1.24
A.K.A The "He can't keep getting away with it!" edition.
KoboldCpp Changes:
- Added support for the new GGJT v3 (q4_0, q4_1 and q8_0) quantization format changes.
- Still retains backwards compatibility with every single historical GGML format (GGML, GGMF, GGJT v1, v2 and v3, plus all other formats from supported architectures). A sketch of how this format detection works follows this list.
- Fixed F16 format detection in NeoX, including a fix for use_parallel_residual.
- Various small fixes and improvements, sync to upstream and updated Kobold Lite.
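Distinguishing the historical formats comes down to reading the magic and version fields at the start of the file. As a rough illustration (not KoboldCpp's actual loader), here is a minimal Python sketch using the magic values published in llama.cpp; the helper name is made up for this example:

```python
import struct

# Magic values from llama.cpp's file loader (little-endian uint32).
MAGIC_GGML = 0x67676D6C  # original unversioned format
MAGIC_GGMF = 0x67676D66  # versioned successor
MAGIC_GGJT = 0x67676A74  # versioned, mmap-able; v3 carries the new q4_0/q4_1/q8_0 layouts

def identify_ggml_format(path):  # hypothetical helper, not KoboldCpp's API
    """Return (format_name, version) read from a ggml model file header."""
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
        if magic == MAGIC_GGML:
            return "GGML", None  # the oldest format has no version field
        if magic in (MAGIC_GGMF, MAGIC_GGJT):
            (version,) = struct.unpack("<I", f.read(4))
            return ("GGMF" if magic == MAGIC_GGMF else "GGJT"), version
        raise ValueError(f"unrecognized magic: {magic:#010x}")
```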
Embedded Kobold Lite has also been updated, with the following changes:
- Improved the spinning circle waiting animation to use less processing power.
- Fixed a bug with stopping sequences when in streaming mode.
- Added a toggle to avoid inserting newlines in Instruct mode (good for Pygmalion and OpenAssistant based instruct models).
- Added a toggle to enable basic markdown in instruct mode (off by default).
To use, download and run koboldcpp.exe, which is a one-file pyinstaller build.
Alternatively, drag and drop a compatible ggml model on top of the .exe, or run it and manually select the model in the popup dialog.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
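Besides connecting from a browser, the server can also be queried from a script. Below is a minimal sketch assuming the default port and the KoboldAI-compatible /api/v1/generate route; the payload fields may vary between versions, so treat them as illustrative:

```python
import json
import urllib.request

# Hypothetical example request against a locally running koboldcpp instance.
payload = {"prompt": "Once upon a time,", "max_length": 80}
req = urllib.request.Request(
    "http://localhost:5001/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)
print(result["results"][0]["text"])  # generated continuation of the prompt
```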
For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file pyinstaller.
EDIT: An alternative CUDA build has been provided by Henky for this version, to allow access to the latest quantizations for CUDA users. Do note that it only supports the latest version of LLaMA-based models. CUDA builds will still not be generated by default, and support for them will be limited.