koboldcpp-1.64.1
- Added fixes for Llama 3 tokenization: Support updated Llama 3 GGUFs with pre-tokenizations.
- Note: In order to benefit from the tokenizer fix, the GGUF models need to be reconverted after this commit. A warning will be displayed if the model was created before this fix.
- Automatically support and apply both EOS and EOT tokens. EOT tokens are also correctly biased when EOS is banned.
- `finish_reason` is now correctly communicated in both sync and SSE streamed mode responses when token generation is stopped by EOS/EOT. Also, Kobold Lite no longer trims sentences if an EOS/EOT is detected as the stop reason in instruct mode.
- Added proper support for `trim_stop` in SSE streaming modes. Stop sequences will no longer be exposed even during streaming when `trim_stop` is enabled. Additionally, using the Chat Completions endpoint automatically applies trim stop to the instruct tag format used. This allows better out-of-the-box compatibility with third-party clients like LibreChat.
- The `--bantokens` flag has been removed. Instead, you can now submit `banned_tokens` dynamically via the generate API for each specific generation, and all matching tokens will be banned for that generation.
- Added `render_special` to the generate API, which allows you to enable rendering of special tokens like `<|start_header_id|>` or `<|eot_id|>`.
- Added a new experimental flag `--flashattention` to enable Flash Attention for compatible models.
- Added support for resizing the GUI launcher; all GUI elements will auto-scale to fit. This can be useful for high-DPI screens.
- Improved speed of rep pen sampler.
- Added additional debug information in `--debugmode`.
- Added a button for starting the benchmark feature in GUI launcher mode.
- Fixed slow CLIP processing speed on Colab
- Fixed quantization tool compilation again
- Updated Kobold Lite:
- Improved stop sequence and EOS handling
- Fixed instruct tag dropdown
- Added token filter feature
- Added enhanced regex replacement (now also allowed for submitted text)
- Support custom `{{placeholder}}` tags
- Better max context handling when used in Kcpp
- Support for Inverted world info secondary keys (triggers when NOT present)
- Language customization for XTTS
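The per-request generate API options above can be exercised with a small JSON payload. Here is a minimal sketch: the field names `banned_tokens` and `render_special` come from these notes, while the endpoint path `/api/v1/generate` and the other fields are assumptions based on KoboldCpp's usual generate parameters.

```python
import json

# Build a payload for KoboldCpp's generate endpoint (assumed: /api/v1/generate).
# "banned_tokens" and "render_special" are the per-request options from this
# release; the remaining fields are illustrative common generate parameters.
def build_generate_payload(prompt, banned_tokens=None, render_special=False):
    payload = {
        "prompt": prompt,
        "max_length": 64,
        # When True, special tokens like <|eot_id|> are rendered in the output.
        "render_special": render_special,
    }
    if banned_tokens:
        # All matching tokens are banned for this generation only.
        payload["banned_tokens"] = list(banned_tokens)
    return payload

payload = build_generate_payload("Hello", banned_tokens=["badword"], render_special=True)
print(json.dumps(payload))
```

Because the ban list travels with each request, different generations can use different bans without restarting the server, which the removed `--bantokens` flag could not do.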
Hotfix 1.64.1: Fixed LLAVA being incoherent from the second generation onwards. Also, the GUI launcher has been tidied up: lowvram is now removed from the quick launch tab and appears only in the hardware tab. `--benchmark` now includes the version and gives clearer exit instructions in the console output. Fixed some tkinter error outputs on quit.
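For clients consuming the SSE streamed mode, the `finish_reason` mentioned in the notes above arrives inside `data:` events. The sketch below parses one such event; the exact event schema is an assumption for illustration, as the notes only state that `finish_reason` is communicated when generation stops on EOS/EOT.

```python
import json

# Parse one SSE line from a streamed response. Lines that are not "data:"
# events (comments, keep-alives) are ignored. The JSON shape shown in the
# example call is an assumption, not the documented schema.
def parse_sse_event(line):
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())

event = parse_sse_event('data: {"token": "", "finish_reason": "stop"}')
if event and event.get("finish_reason") == "stop":
    print("generation stopped by EOS/EOT")
```

A streaming client can use this to stop reading as soon as a stop reason appears, rather than waiting for the connection to close.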
To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're using AMD, you can try koboldcpp_rocm from YellowRoseCx's fork.
Run it from the command line with the desired launch parameters (see `--help`), or manually select the model in the GUI.
Once loaded, you can connect like this (or use the full KoboldAI client):
http://localhost:5001
For more information, be sure to run the program from the command line with the `--help` flag.