
Changes
- Estimate the VRAM for GGUF models using a statistical model and automatically set `gpu-layers` on NVIDIA GPUs (#6980).
  - When you select a GGUF model in the UI, you will see an estimate for its VRAM usage, and the number of layers will be set based on the available (free, not total) VRAM on your system.
  - If you change `ctx-size` or `cache-type` in the UI, the number of layers will be recalculated and updated in real time.
  - If you load a model through the command line with e.g. `--model model.gguf --ctx-size 32768 --cache-type q4_0`, the number of GPU layers will also be calculated automatically, without the need to set `--gpu-layers` (see the launch sketch after this list).
  - It works even with multipart GGUF models and on systems with multiple GPUs.
- Greatly simplify the Model tab by splitting settings between "Main options" and "Other options", where "Other options" is in a closed accordion by default.
- Tools support for the OpenAI-compatible API (#6827); see the request sketch after this list. Thanks, @jkrauss82.
- Dynamic Chat Message UI update speed (#6952). This is a major UI optimization in Chat mode that renders `max_updates_second` obsolete. Thanks, @mamei16 for the very clever idea.
- Optimize the Chat tab JavaScript, reducing its CPU usage (#6948).
- Add the `top_n_sigma` sampler to the llama.cpp loader.
- Streamline the UI in portable builds: hide things that do not work, such as training; only show the llama.cpp loader; and do not include extensions that do not work. The latter should reduce the build sizes.
- Invert user/assistant message colors in instruct mode to make assistant messages darker and more readable.
- Improve the light theme colors.
- Add a minimum height to the streaming reply to prevent constant scrolling during chat streaming, similar to how ChatGPT and Claude work.
- Show the list of files if the user tries to download an entire GGUF repository instead of a specific file.
- llama.cpp: Handle short arguments in `--extra-flags`, like `ot`.
- Save the chat history right after sending a message and periodically during streaming to prevent losing messages.
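
As referenced above, here is a usage sketch of the automatic GPU-layer calculation when loading from the command line. It assumes a standard install launched through `server.py`, and `model.gguf` is a placeholder file name.

```sh
# Load a GGUF model with a 32768-token context and a q4_0 KV cache.
# --gpu-layers is intentionally omitted: the number of GPU layers is now
# calculated automatically from the free VRAM on the NVIDIA GPU(s).
python server.py --model model.gguf --ctx-size 32768 --cache-type q4_0
```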
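A minimal sketch of a request exercising the new tools support (#6827) through the OpenAI-compatible API follows. The address and port (`http://127.0.0.1:5000`) are assumed defaults, and the `get_weather` tool definition is purely hypothetical.

```sh
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```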
Bug fixes
- API: Fix llama.cpp continuing to generate in the background after cancelling the request, improve disconnect detection, fix deadlock on simultaneous requests.
- Fix `typical_p` in the llama.cpp sampler priority.
- Fix manual random seeds in llama.cpp.
- Add a retry mechanism when using the `/internal/logits` API endpoint with the llama.cpp loader to fix random failures.
- Ensure environment isolation in portable builds to avoid conflicts.
- docker: Fix app UID typo in the Docker Compose files (#6957 and #6958). Thanks, @enovikov11.
- Docker fix for NVIDIA (#6964). Thanks, @phokur.
- SuperboogaV2: Minor update to avoid JSON serialization errors (#6945). Thanks, @alirezagsm.
- Fix model config loading in shared.py for Python 3.13 (#6961). Thanks, @Downtown-Case.
Backend updates
- llama.cpp: Update to ggml-org/llama.cpp@c6a2c9e.
- ExLlamaV3: Update to turboderp-org/exllamav3@a905cff.
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Choosing the right build:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder in the new version with the one from your existing install, as sketched below. All your settings and models will be carried over.
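
A minimal sketch of that update procedure on Linux/macOS; the archive and folder names below are placeholders for whichever build you downloaded.

```sh
# Unzip the new release next to the existing install (names are placeholders).
unzip textgen-portable-new.zip -d textgen-portable-new

# Replace the new user_data folder with the one from the existing install
# so that settings and models are carried over.
rm -rf textgen-portable-new/user_data
cp -r textgen-portable-old/user_data textgen-portable-new/
```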