github LostRuins/koboldcpp v1.44.2
koboldcpp-1.44.2


A.K.A The "Mom: we have SillyTavern at home edition"

  • Added multi-user mode with --multiuser, which allows up to 5 concurrent incoming /generate requests from multiple clients to be queued up and processed in sequence, instead of rejecting other requests while busy. Note that the /check and /abort endpoints are inactive while multiple requests are in-queue; this prevents one user from accidentally reading or cancelling a different user's request.
  • Added a new launcher argument --onready which allows you to pass a terminal command (e.g. start a python script) to be executed after Koboldcpp has finished loading. This runs as a subprocess, and can be useful for starting cloudflare tunnels, displaying URLs etc.
  • Added Grammar Sampling for all architectures, which can be accessed via the web API (also in Lite). Older models are also supported.
  • Added a new API endpoint /api/extra/true_max_context_length which allows fetching the true max context limit, separate from the horde-friendly value.
  • Added support for selecting from a 4th GPU from the UI and command line (was max 3 before).
  • Tweaked automatic RoPE scaling.
  • Pulled other fixes and improvements from upstream.
  • Note: Using --usecublas with the prebuilt Windows executables here is only intended for Nvidia devices. For AMD users, please check out @YellowRoseCx's koboldcpp-rocm fork instead.
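As an illustration of the new endpoint, the true max context limit can be fetched from a running instance. A minimal sketch, assuming the default port 5001 and that the endpoint returns JSON of the form {"value": <int>} like the other /api/extra endpoints:

```python
# Sketch: read the true max context length from a running koboldcpp
# server. The base URL (default port 5001) and the {"value": ...}
# response shape are assumptions based on the other extra endpoints.
import json
import urllib.request

BASE_URL = "http://localhost:5001"

def true_max_ctx_url(base=BASE_URL):
    # Endpoint added in this release, separate from the horde-friendly value.
    return base + "/api/extra/true_max_context_length"

def fetch_true_max_ctx(base=BASE_URL):
    # Requires a live server; returns the true max context length as an int.
    with urllib.request.urlopen(true_max_ctx_url(base)) as resp:
        return json.loads(resp.read())["value"]
```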

Major Update for Kobold Lite:


  • Kobold Lite has undergone a massive overhaul, with elements renamed and rearranged for a cleaner UI.
  • Added Aesthetic UI for chat mode, which is now automatically selected when importing Tavern cards. You can easily switch between the different UIs for chat and instruct modes from the settings panel.
  • Added Mirostat UI configs to settings panel.
  • Allowed Idle Responses in all modes; it is now a global setting. Also fixed an idle-response detection bug.
  • Smarter group chats: mentioning a specific character's name inside a group chat will cause that character to respond, instead of a random one.
  • Added support for automagically increasing the max context size slider limit, if a larger context is detected.
  • Added scenario for importing characters from Chub.Ai
  • Added a settings checkbox to enable streaming whenever applicable, without having to edit URLs. Streaming can now be easily toggled from the settings UI, similar to EOS unbanning, although the --stream flag is still kept for compatibility.
  • Added a few Instruct Tag Presets in a dropdown.
  • Supports instruct placeholders, allowing easy switching between instruct formats without rewriting the text. Added a toggle to use "Raw Instruct Tags" (the old method) as an alternative to placeholder tags like {{[INPUT]}} and {{[OUTPUT]}}.
  • Added a toggle for "Newline After Memory" which can be set in the memory panel.
  • Added a toggle for "Show Rename Save File" which shows a popup the user can use to rename the json save file before saving.
  • You can specify a GBNF grammar string in settings to use when generating; this controls grammar sampling.
  • Various minor bugfixes. Also fixed stop_sequences still appearing in the AI outputs; they should now be correctly truncated.
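For illustration, grammar strings use the GBNF format from llama.cpp's grammar sampler. A minimal sketch that constrains the model's output to a yes/no answer (assuming the same syntax applies here):

```
root ::= "yes" | "no"
```

More elaborate grammars can define multiple rules to force structured output such as lists or JSON.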

v1.44.1 update - added queue number to perf endpoint, and updated lite to fix a few formatting bugs.
v1.44.2 update - fixed a speed regression from sched_yield again.

To use, download and run koboldcpp.exe, which is a one-file pyinstaller build.
If you don't need CUDA, you can use koboldcpp_nocuda.exe, which is much smaller.
If you're using AMD, you can try the koboldcpp_rocm build from YellowRoseCx's fork.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect via your browser (or use the full KoboldAI client) at:
http://localhost:5001
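For instance, a generation request can be sent to the KoboldAI-compatible API. A minimal sketch, assuming a server on the default port and the standard /api/v1/generate request/response schema:

```python
# Sketch: send a prompt to a running koboldcpp server via the
# KoboldAI-compatible /api/v1/generate endpoint. The port and the
# results[0]["text"] response shape follow the standard KoboldAI API.
import json
import urllib.request

def build_generate_request(prompt, max_length=80, base="http://localhost:5001"):
    # Construct the POST request; extra sampler fields could be added here.
    payload = json.dumps({"prompt": prompt, "max_length": max_length}).encode()
    return urllib.request.Request(
        base + "/api/v1/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def generate(prompt, **kwargs):
    # Requires a live server; returns the generated text.
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["results"][0]["text"]
```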

For more information, be sure to run the program from command line with the --help flag.
