✨ Changes
- Add speculative decoding to the llama.cpp loader.
  - In tests with `google_gemma-3-27b-it-Q8_0.gguf` using `google_gemma-3-1b-it-Q4_K_M.gguf` as the draft model (both fully offloaded to the GPU), text generation speed went from 24.17 to 45.61 tokens/second (+88.7%).
  - Speed improvements vary by setup and prompt. Previous tests of mine showed increases of +64% and +34% in tokens/second for different combinations of models.
  - I highly recommend trying this feature (see the launch example after this list).
- Add speculative decoding to the non-HF ExLlamaV2 loader (#6899).
- Prevent llama.cpp defaults from locking up consumer hardware (#6870). This change should provide a slight increase in text generation speed in most cases when using llama.cpp. Thanks, @Matthew-Jenkins.
- llama.cpp: Add a `--extra-flags` parameter for passing additional flags to `llama-server`, such as `override-tensor=exps=CPU`, which is useful for MoE models (see the example after this list).
- llama.cpp: Add StreamingLLM (`--streaming-llm`). This prevents complete prompt reprocessing when the context length is filled, making it especially useful for role-playing scenarios.
  - This is called `--cache-reuse` in llama.cpp. You can learn more about it here: ggml-org/llama.cpp#9866
- llama.cpp: Add prompt processing progress messages.
- ExLlamaV3: Add KV cache quantization (#6903).
- Add Vulkan portable builds (see below). These should work on AMD and Intel Arc cards on both Windows and Linux.
- UI:
  - Add a collapsible thinking block to messages with `<think>` steps.
  - Make 'instruct' the default chat mode.
  - Add a greeting when the web UI launches in instruct mode with an empty chat history.
  - Make the model menu display only part 00001 of multipart GGUF files.
- Make `llama-cpp-binaries` wheels compatible with any Python >= 3.7 (useful for manually installing the requirements under `requirements/portable/`).
- Add a universal `--ctx-size` flag to specify the context size across all loaders (see the example after this list).
- Implement host header validation when using the UI / API on localhost (which is the default).
  - This is an important security improvement. It is recommended that you update your local install to the latest version.
  - Credits to security researcher Laurian Duma for discovering this issue and reaching out by email.
- Restructure the project to keep all user data under `text-generation-webui/user_data`, including models, characters, presets, and saved settings.
  - This was done to make it possible to update portable installs in the future by just moving the `user_data` folder.
  - It has the additional benefit of making the repository more organized.
  - This is a breaking change. You will need to manually move your models from `models` to `user_data/models`, your presets from `presets` to `user_data/presets`, etc., after this update (see the migration example after this list).
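
As a launch example for speculative decoding, here is a minimal command-line sketch for a source or portable install started through `server.py`. The `--model-draft` and `--gpu-layers` flag names are assumptions on my part; check `python server.py --help` for the exact options your version exposes.

```bash
# Load a large GGUF model together with a small same-family draft model for
# speculative decoding, with both fully offloaded to the GPU (as in the benchmark above).
# NOTE: --model-draft and --gpu-layers are assumed flag names; verify them with --help.
python server.py \
  --model google_gemma-3-27b-it-Q8_0.gguf \
  --model-draft google_gemma-3-1b-it-Q4_K_M.gguf \
  --gpu-layers 99
```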
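
The new llama.cpp options can be combined in one command. The model name below is a placeholder, and the exact value syntax accepted by `--extra-flags` (quoting, passing multiple flags) is an assumption, so consult the built-in help:

```bash
# Forward override-tensor to llama-server (useful for MoE models) and enable
# StreamingLLM so the prompt is not fully reprocessed when the context fills up.
python server.py \
  --model my-moe-model.gguf \
  --extra-flags "override-tensor=exps=CPU" \
  --streaming-llm
```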
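
And the universal `--ctx-size` flag now sets the context length the same way for every loader (again with a placeholder model name):

```bash
# One flag for context size across llama.cpp, ExLlamaV2/V3, and the other loaders.
python server.py --model my-model.gguf --ctx-size 16384
```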
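
For the `user_data` migration, a minimal sketch on Linux/macOS, run from the `text-generation-webui` root (on Windows you can simply move the folders in Explorer); repeat for any other folders you use, such as `characters`:

```bash
# Move existing user files into the new user_data layout.
mv models/*  user_data/models/
mv presets/* user_data/presets/
```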
🔧 Bug fixes
- Fix an issue where portable installations ignored the `CMD_FLAGS.txt` file.
- extensions/superboogav2: Fix the existing embedding check (#6898). Thanks, @ZiyaCu.
- ExLlamaV2_HF: Add another `torch.cuda.synchronize()` call to prevent errors during text generation.
- Fix the Notebook tab not loading its default prompt.
🔄 Backend updates
- llama.cpp: Update to ggml-org/llama.cpp@295354e
- ExLlamaV3: Update to turboderp-org/exllamav3@de83084.
- ExLlamaV2: Update to version 0.2.9.
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation. Just download the right version for your system, unzip, and run.
Choosing the right build:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.