github oobabooga/text-generation-webui v3.1


✨ Changes

  • Add speculative decoding to the llama.cpp loader.
    • In tests with google_gemma-3-27b-it-Q8_0.gguf using google_gemma-3-1b-it-Q4_K_M.gguf as the draft model (both fully offloaded to GPU), the text generation speed went from 24.17 to 45.61 tokens/second (+88.7%).
    • Speed improvements vary by setup and prompt. Previous tests of mine showed increases of +64% and +34% in tokens/second for different combinations of models.
    • I highly recommend trying this feature (see the example launch command after this list).
  • Add speculative decoding to the non-HF ExLlamaV2 loader (#6899).
  • Prevent llama.cpp defaults from locking up consumer hardware (#6870). This change should provide a slight increase in text generation speed in most cases when using llama.cpp. Thanks, @Matthew-Jenkins.
  • llama.cpp: Add a --extra-flags parameter for passing additional flags to llama-server, such as override-tensor=exps=CPU, which is useful for MoE models (a combined example of the new llama.cpp flags follows this list).
  • llama.cpp: Add StreamingLLM (--streaming-llm). This prevents complete prompt reprocessing when the context length is filled, making it especially useful for role-playing scenarios.
  • llama.cpp: Add prompt processing progress messages.
  • ExLlamaV3: Add KV cache quantization (#6903).
  • Add Vulkan portable builds (see below). These should work on AMD and Intel Arc cards on both Windows and Linux.
  • UI:
    • Add a collapsible thinking block to messages with <think> steps.
    • Make 'instruct' the default chat mode.
    • Add a greeting when the web UI launches in instruct mode with an empty chat history.
    • Make the model menu display only part 00001 of multipart GGUF files.
  • Make llama-cpp-binaries wheels compatible with any Python >= 3.7 (useful for manually installing the requirements under requirements/portable/).
  • Add a universal --ctx-size flag to specify the context size across all loaders.
  • Implement host header validation when using the UI / API on localhost (which is the default).
    • This is an important security improvement. It is recommended that you update your local install to the latest version.
    • Credits to security researcher Laurian Duma for discovering this issue and reaching out by email.
  • Restructure the project so that all user data lives under text-generation-webui/user_data, including models, characters, presets, and saved settings.
    • This was done to make it possible to update portable installs in the future by just moving the user_data folder.
    • It has the additional benefit of making the repository more organized.
    • This is a breaking change: after updating, you will need to manually move your models from models to user_data/models, your presets from presets to user_data/presets, and so on (see the migration sketch after this list).
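
As a rough sketch of what the speculative decoding setup above can look like from the command line: --model is an existing flag, but --model-draft and --draft-max below are assumed names for the new draft-model options, so confirm them with `python server.py --help` or simply select the draft model in the Model tab of the UI.

```bash
# Minimal sketch: speculative decoding with the llama.cpp loader.
# NOTE: --model-draft and --draft-max are ASSUMED flag names for the new
# draft-model options; verify them with `python server.py --help`.
python server.py \
  --model google_gemma-3-27b-it-Q8_0.gguf \
  --model-draft google_gemma-3-1b-it-Q4_K_M.gguf \
  --draft-max 16
```

The speedup depends on how often the draft model's guesses are accepted, which is why the measured gains above range from +34% to +88.7%.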
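
The other new llama.cpp-related flags (--ctx-size, --streaming-llm, --extra-flags) are named in this release; the model and the 32768 context value below are just placeholders.

```bash
# Sketch combining the new llama.cpp-related flags from this release:
#   --ctx-size       sets the context size (now works across all loaders)
#   --streaming-llm  avoids full prompt reprocessing when the context fills up
#   --extra-flags    passes extra flags straight to llama-server;
#                    override-tensor=exps=CPU is the MoE example from the notes
python server.py \
  --model google_gemma-3-27b-it-Q8_0.gguf \
  --ctx-size 32768 \
  --streaming-llm \
  --extra-flags "override-tensor=exps=CPU"
```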
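
For the user_data migration, the commands below are a sketch: models, characters, and presets come from the notes above, while loras and instruction-templates are assumptions based on folders earlier versions shipped. Move whatever you have actually customized.

```bash
# Run from the text-generation-webui directory after updating.
# Adjust the folder list to match what exists in your install.
for dir in models presets characters loras instruction-templates; do
  if [ -d "$dir" ]; then
    mkdir -p "user_data/$dir"
    mv "$dir"/* "user_data/$dir"/ 2>/dev/null
  fi
done
```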

🔧 Bug fixes

  • Fix an issue where portable installations ignored the CMD_FLAGS.txt file.
  • extensions/superboogav2: Fix a bug in the existing-embedding check (#6898). Thanks, @ZiyaCu.
  • ExLlamaV2_HF: Add another torch.cuda.synchronize() call to prevent errors during text generation.
  • Fix the Notebook tab not loading its default prompt.

🔄 Backend updates


Portable builds

Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation. Just download the right version for your system, unzip, and run.

Choosing the right build:

  • Windows/Linux:

    • NVIDIA GPU: Use cuda12.4 for newer GPUs or cuda11.7 for older GPUs and systems with older drivers.
    • AMD/Intel GPU: Use vulkan builds.
    • CPU only: Use cpu builds.
  • Mac:

    • Apple Silicon: Use macos-arm64.
    • Intel CPU: Use macos-x86_64.
