✨ Changes
- Add an option to enable/disable thinking for Qwen3 models (and any future models with this feature). You can find it as a checkbox under Parameters > enable_thinking.
  - Thinking is enabled by default.
  - The option works directly with the model's Jinja2 chat template (see the sketch after this list).
- Make `<think>` UI blocks closed by default.
- Set `max_updates_second` to 12 by default. This prevents CPU bottlenecks when reasoning models generate extremely long replies at 50 tokens/second.
- Find a new API port automatically if the default one is taken (see the sketch after this list).
- Make `--verbose` print the `llama-server` launch command to the console.
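
For reference on how the enable_thinking option reaches the model: Qwen3's Jinja2 chat template checks an `enable_thinking` variable when the prompt is rendered. Below is a minimal sketch using Hugging Face transformers to show the template-level effect; the model name is only an example and this is not the webui's internal code:

```python
from transformers import AutoTokenizer

# Example Qwen3 checkpoint; any model whose chat template supports enable_thinking works.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Extra keyword arguments are forwarded to the Jinja2 chat template,
# which is where Qwen3 decides whether to leave room for a <think> block.
with_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
without_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(without_thinking)  # ends with a pre-filled empty <think></think> block
```

With `enable_thinking=False`, Qwen3's template pre-fills an empty think block so the model answers directly instead of reasoning first.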
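
The automatic port fallback can be pictured with a short sketch (an illustration under assumptions, not the actual implementation; the default port of 5000 is just an example):

```python
import socket

def find_free_port(preferred: int = 5000, max_tries: int = 100) -> int:
    """Return the preferred port if it is free, otherwise the next free port above it."""
    for port in range(preferred, preferred + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("0.0.0.0", port))
                return port  # bind succeeded, so the port is available
            except OSError:
                continue  # port already in use, try the next one
    raise RuntimeError("no free port found in range")

print(find_free_port())
```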
🔧 Bug fixes
- Fix ExLlamaV3_HF leaking memory, especially for long prompts/conversations.
- Fix the `streaming_llm` UI checkbox not being interactive.
- Fix the `max_updates_second` UI parameter not working.
- Fix getting the llama.cpp token probabilities for `Qwen3-30B-A3B` through the API.
- Fix CFG with ExLlamaV2_HF.
🔄 Backend updates
- llama.cpp: Update to ggml-org/llama.cpp@3e168be
- ExLlamaV3: Update to turboderp-org/exllamav3@4724b86.
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Choosing the right build:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one from your existing install. All your settings and models will carry over.
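
If you prefer to script the user_data step, here is a simple sketch (the folder names below are hypothetical placeholders for your old and new unzipped installs):

```python
import shutil
from pathlib import Path

old_install = Path("text-generation-webui-old")  # existing portable install (placeholder path)
new_install = Path("text-generation-webui-new")  # freshly unzipped build (placeholder path)

# Replace the fresh user_data folder with the one from the existing install,
# carrying over settings and downloaded models.
shutil.rmtree(new_install / "user_data")
shutil.copytree(old_install / "user_data", new_install / "user_data")
```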