koboldcpp-1.23.1

A.K.A The "Is Pepsi Okay?" edition.

Changes:

  • Integrated support for the new quantization formats for GPT-2, GPT-J and GPT-NeoX
  • Integrated Experimental OpenCL GPU Offloading via CLBlast (Credits to @0cc4m)
    • GPU offloading is only active in combination with --useclblast; combine it with --gpulayers to pick the number of layers to offload (see the example command after this list)
    • Currently works for new quantization formats of LLAMA models only
    • Should work on all GPUs
  • Still supports all older GGML models, though they will not benefit from the new features.
  • Updated Lite, integrated various fixes and improvements from upstream.
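
For reference, a hypothetical launch command combining the two flags might look like the following; the model filename, the two CLBlast platform/device indices, and the layer count are placeholders, so run the program with --help to confirm the exact argument form on your version:

  koboldcpp.exe --useclblast 0 0 --gpulayers 14 yourmodel.ggml.bin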

1.23.1 Edit:

  • Pulled Occam's fix for the q8 dequant kernels, so now q8 formats can enjoy GPU offloading as well.
  • Disabled fp16 prompt processing as it appears to be slower. Please compare!

To use, download and run koboldcpp.exe, a one-file PyInstaller build.
Alternatively, drag and drop a compatible GGML model onto the .exe, or run it and select the model manually in the popup dialog.

Once the model has loaded, you can connect in your browser (or use the full KoboldAI client) at:
http://localhost:5001
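
If you would rather script against the server than use the browser UI, koboldcpp emulates the KoboldAI API; as a minimal sketch (assuming the /api/v1/generate endpoint and these parameter names are available on your version), a request could look like:

  curl http://localhost:5001/api/v1/generate -H 'Content-Type: application/json' -d '{"prompt": "Once upon a time", "max_length": 50}'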

For more information, be sure to run the program with the --help flag.
This release also includes a zip file containing the libraries and the koboldcpp.py script, for those who prefer not to use the one-file PyInstaller build.

Please share your performance benchmarks for CLBlast GPU offloading, or report issues, here: #179 . Do include whether your GPU supports F16.
