## Changes
- Tool call confirmation: Add inline approve/reject/always-approve buttons that appear before each tool call is executed. Enable via the new "Confirm tool calls" checkbox in the Chat tab.
- Stdio MCP server support: In addition to HTTP MCP servers, you can now configure local subprocess-based MCP servers via `user_data/mcp.json`, using the same format as Claude Desktop and Cursor. [Tutorial]
- `preserve_thinking` chat template parameter: New UI checkbox and `--preserve-thinking` CLI flag to control whether thinking blocks from prior turns are kept in the context.
- UI: Sidebars overhaul: Sidebars now toggle independently and persist their state on page refresh. Default visibility adapts to viewport width.
- llama.cpp: Pass `--draft-min 48` by default for draftless speculative decoding.
- Only show the "Reasoning effort" and "Enable thinking" controls for models whose chat template actually uses them.
- Cache MCP tool discovery to avoid re-querying servers on each generation.
- Add model download branch handling in download_model_wrapper (#7506). Thanks, @Th-Underscore.
- UI: Improve border colors in light theme, fix code block copy button colors and centering, fix code block scrollbar flash during page load, improve past chats menu spacing.
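For the stdio MCP servers mentioned above, a minimal `user_data/mcp.json` might look like the following. This is a sketch in the `mcpServers` format used by Claude Desktop and Cursor; the server name, command, and path are illustrative, not values taken from this project.

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"]
    }
  }
}
```

Each entry under `mcpServers` names one local server; `command` and `args` describe the subprocess to spawn.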
## Security
- Fix SSRF vulnerabilities in URL fetching: add backslash and userinfo rejection, validate every redirect hop.
## Bug fixes
- Fix Gemma 4 thinking tags not hidden after tool calls (#7509).
- Fix GPT-OSS channel tokens leaking in UI after tool calls.
## Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@6217b49
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@286ce32
- Update ExLlamaV3 to 0.0.30
## Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
> [!NOTE]
> **NVIDIA GPU**: If `nvidia-smi` reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.
>
> ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
### Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (766 MB) | Download (1.1 GB) |
| NVIDIA (CUDA 13.1) | Download (686 MB) | Download (1.19 GB) |
| AMD/Intel (Vulkan) | Download (196 MB) | — |
| AMD (ROCm 7.2) | Download (499 MB) | — |
| CPU only | Download (178 MB) | Download (194 MB) |
### Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (747 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (696 MB) | Download (1.21 GB) |
| AMD/Intel (Vulkan) | Download (208 MB) | — |
| AMD (ROCm 7.2) | Download (307 MB) | — |
| CPU only | Download (190 MB) | Download (217 MB) |
### macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (156 MB) |
| Intel (x86_64) | Download (162 MB) |
**Updating a portable install:**
- Download and extract the latest version.
- Replace the `user_data` folder in the new version with the one from your existing install. All your settings and models will carry over.
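The two steps above can be sketched as the shell session below. It uses throwaway directories so it is safe to run anywhere; the folder names are illustrative stand-ins for your real old and new install paths, not names prescribed by the project.

```shell
# Stand-ins: your existing install and the freshly extracted new version.
mkdir -p text-generation-webui-old/user_data/models
mkdir -p text-generation-webui-new/user_data

# Drop the fresh, empty user_data that ships with the new version...
rm -rf text-generation-webui-new/user_data

# ...and move your existing user_data (settings, models) into its place.
mv text-generation-webui-old/user_data text-generation-webui-new/
```

After the move, the new install picks up your settings and models on first launch.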
Starting with 4.0, you can also move `user_data` one folder up, next to the install folder. It will be detected automatically, making updates easier:
```
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/               <-- shared by both installs
```