✨ Changes
- Add speculative decoding to the llama.cpp loader.
  - In tests with `google_gemma-3-27b-it-Q8_0.gguf` using `google_gemma-3-1b-it-Q4_K_M.gguf` as the draft model (both fully offloaded to the GPU), text generation speed went from 24.17 to 45.61 tokens/second (+88.7%).
  - Speed improvements vary by setup and prompt. Previous tests of mine showed increases of +64% and +34% in tokens/second for different combinations of models.
  - I highly recommend trying this feature (see the launch example after this list).
- Add speculative decoding to the non-HF ExLlamaV2 loader (#6899).
- Prevent llama.cpp defaults from locking up consumer hardware (#6870). This change should provide a slight increase in text generation speed in most cases when using llama.cpp. Thanks, @Matthew-Jenkins.
- llama.cpp: Add a `--extra-flags` parameter for passing additional flags to `llama-server`, such as `override-tensor=exps=CPU`, which is useful for MoE models (see the example after this list).
- llama.cpp: Add StreamingLLM (`--streaming-llm`). This prevents complete prompt reprocessing when the context length is filled, making it especially useful for role-playing scenarios.
  - This is called `--cache-reuse` in llama.cpp. You can learn more about it here: ggml-org/llama.cpp#9866
- llama.cpp: Add prompt processing progress messages.
- ExLlamaV3: Add KV cache quantization (#6903).
- Add Vulkan portable builds (see below). These should work on AMD and Intel Arc cards on both Windows and Linux.
- UI:
  - Add a collapsible thinking block to messages with `<think>` steps.
  - Make 'instruct' the default chat mode.
  - Add a greeting when the web UI launches in instruct mode with an empty chat history.
  - Make the model menu display only part 00001 of multipart GGUF files.
- Make `llama-cpp-binaries` wheels compatible with any Python >= 3.7 (useful for manually installing the requirements under `requirements/portable/`).
- Add a universal `--ctx-size` flag to specify the context size across all loaders (see the example after this list).
- Implement host header validation when using the UI / API on localhost (which is the default).
  - This is an important security improvement. It is recommended that you update your local install to the latest version.
  - Credits to security researcher Laurian Duma for discovering this issue and reaching out by email.
- Restructure the project to keep all user data under `text-generation-webui/user_data`, including models, characters, presets, and saved settings.
  - This was done to make it possible to update portable installs in the future by just moving the `user_data` folder.
  - It has the additional benefit of making the repository more organized.
  - This is a breaking change. You will need to manually move your models from `models` to `user_data/models`, your presets from `presets` to `user_data/presets`, etc., after this update (see the migration example after this list).
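
As a launch example for speculative decoding, here is a minimal command-line sketch for a source or portable install started through `server.py`. The `--model-draft` and `--gpu-layers` flag names are assumptions on my part; check `python server.py --help` for the exact options your version exposes.

```bash
# Load a large GGUF model together with a small same-family draft model for
# speculative decoding, with both fully offloaded to the GPU (as in the benchmark above).
# NOTE: --model-draft and --gpu-layers are assumed flag names; verify them with --help.
python server.py \
  --model google_gemma-3-27b-it-Q8_0.gguf \
  --model-draft google_gemma-3-1b-it-Q4_K_M.gguf \
  --gpu-layers 99
```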
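
The new llama.cpp options can be combined in one command. The model name below is a placeholder, and the exact value syntax accepted by `--extra-flags` (quoting, passing multiple flags) is an assumption, so consult the built-in help:

```bash
# Forward override-tensor to llama-server (useful for MoE models) and enable
# StreamingLLM so the prompt is not fully reprocessed when the context fills up.
python server.py \
  --model my-moe-model.gguf \
  --extra-flags "override-tensor=exps=CPU" \
  --streaming-llm
```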
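
And the universal `--ctx-size` flag now sets the context length the same way for every loader (again with a placeholder model name):

```bash
# One flag for context size across llama.cpp, ExLlamaV2/V3, and the other loaders.
python server.py --model my-model.gguf --ctx-size 16384
```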
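
For the `user_data` migration, a minimal sketch on Linux/macOS, run from the `text-generation-webui` root (on Windows you can simply move the folders in Explorer); repeat for any other folders you use, such as `characters`:

```bash
# Move existing user files into the new user_data layout.
mv models/*  user_data/models/
mv presets/* user_data/presets/
```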
🔧 Bug fixes
- Fix an issue where portable installations ignored the `CMD_FLAGS.txt` file.
- extensions/superboogav2: Fix the existing embedding check (#6898). Thanks, @ZiyaCu.
- ExLlamaV2_HF: Add another `torch.cuda.synchronize()` call to prevent errors during text generation.
- Fix the Notebook tab not loading its default prompt.
🔄 Backend updates
- llama.cpp: Update to ggml-org/llama.cpp@295354e
- ExLlamaV3: Update to turboderp-org/exllamav3@de83084.
- ExLlamaV2: Update to version 0.2.9.
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation. Just download the right version for your system, unzip, and run.
Choosing the right build:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.