oobabooga/text-generation-webui v3.2


✨ Changes

  • Add an option to enable/disable thinking for Qwen3 models (and all future models with this feature). You can find it as a checkbox under Parameters > enable_thinking.
    • By default, thinking is enabled.
    • This works directly through the model's Jinja2 chat template (see the sketch after this list).
  • Make <think> UI blocks closed by default.
  • Set max_updates_second to 12 by default. This prevents the UI from becoming a CPU bottleneck when reasoning models stream extremely long replies at high speed (e.g., 50 tokens/second); see the throttling sketch after this list.
  • Find a new API port automatically if the default one is taken (a port-probing sketch follows this list).
  • Make --verbose print the llama-server launch command to the console.
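
The enable_thinking checkbox maps onto the model's own chat template rather than any custom parsing. As a minimal sketch of the underlying mechanism, here is how the same flag behaves with the Hugging Face tokenizer for Qwen3 (the model name is just an example, and this shows the template-level behavior the checkbox toggles, not the web UI's internal code):

```python
# Sketch: how enable_thinking feeds the Jinja2 chat template (Qwen3).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
messages = [{"role": "user", "content": "What is 2 + 2?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # True by default, matching the web UI checkbox
)
# With thinking disabled, the template inserts an empty <think></think>
# block at the start of the assistant turn, so the model skips reasoning.
print(prompt)
```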
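
The max_updates_second setting is plain rate limiting: generated text keeps accumulating at full speed, but the UI only redraws a bounded number of times per second. A minimal, self-contained sketch of the technique (the generate_tokens() stream and the print() redraw below are stand-ins, not the web UI's actual code):

```python
import time

def generate_tokens():
    """Stand-in token stream at roughly 50 tokens/second."""
    for i in range(200):
        time.sleep(0.02)
        yield f"token{i} "

max_updates_second = 12
min_interval = 1.0 / max_updates_second

text, last_render = "", 0.0
for token in generate_tokens():
    text += token  # always accumulate at full generation speed
    now = time.monotonic()
    if now - last_render >= min_interval:  # redraw at most 12 times/second
        last_render = now
        print(f"\r{len(text)} chars rendered", end="", flush=True)
print()
```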
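
Automatic port fallback generally comes down to probing bind() until a port is free. A minimal sketch of that general approach, starting from the API's default port 5000 (the function below is illustrative, not the web UI's actual implementation):

```python
import socket

def find_free_port(start: int = 5000, attempts: int = 100) -> int:
    """Return the first port at or above `start` that accepts a bind."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))
                return port
            except OSError:
                continue  # port in use; try the next one
    raise RuntimeError(f"no free port in {start}-{start + attempts - 1}")

print(find_free_port())  # 5000 if free, otherwise 5001, 5002, ...
```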

🔧 Bug fixes

  • Fix ExLlamaV3_HF leaking memory, especially for long prompts/conversations.
  • Fix the streaming_llm UI checkbox not being interactive.
  • Fix the max_updates_second UI parameter not working.
  • Fix getting the llama.cpp token probabilities for Qwen3-30B-A3B through the API.
  • Fix CFG with ExLlamaV2_HF.

🔄 Backend updates


Portable builds

Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.

Choosing the right build:

  • Windows/Linux:
    • NVIDIA GPU: Use cuda12.4 for newer GPUs or cuda11.7 for older GPUs and systems with older drivers.
    • AMD/Intel GPU: Use vulkan builds.
    • CPU only: Use cpu builds.
  • Mac:
    • Apple Silicon: Use macos-arm64.
    • Intel CPU: Use macos-x86_64.

Updating a portable install:

  1. Download and unzip the latest version.
  2. Replace the user_data folder in the new version with the one from your existing install. All your settings and models will carry over (a scripted version of this step follows).
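
If you prefer to script step 2, here is a minimal sketch using Python's standard library (the folder names are hypothetical; substitute your actual install paths):

```python
import shutil
from pathlib import Path

old_install = Path("text-generation-webui-old")  # hypothetical: your existing install
new_install = Path("text-generation-webui-new")  # hypothetical: the freshly unzipped build

# Drop the default user_data shipped with the new build, then carry over yours.
shutil.rmtree(new_install / "user_data")
shutil.copytree(old_install / "user_data", new_install / "user_data")
```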
