## Changes
- Tool call confirmation: Add inline approve/reject/always-approve buttons that appear before each tool call is executed. Enable via the new "Confirm tool calls" checkbox in the Chat tab.
- Stdio MCP server support: In addition to HTTP MCP servers, you can now configure local subprocess-based MCP servers via `user_data/mcp.json`, using the same format as Claude Desktop and Cursor. [Tutorial]
- `preserve_thinking` chat template parameter: New UI checkbox and `--preserve-thinking` CLI flag to control whether thinking blocks from prior turns are kept in the context.
- UI: Sidebars overhaul: Sidebars now toggle independently and persist their state on page refresh. Default visibility adapts to viewport width.
- llama.cpp: Pass `--draft-min 48` by default for draftless speculative decoding.
- Only show the "Reasoning effort" and "Enable thinking" controls for models whose chat template actually uses them.
- Cache MCP tool discovery to avoid re-querying servers on each generation.
- Add model download branch handling in download_model_wrapper (#7506). Thanks, @Th-Underscore.
- UI: Improve border colors in light theme, fix code block copy button colors and centering, fix code block scrollbar flash during page load, improve past chats menu spacing.
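For the stdio MCP servers mentioned above, a minimal `user_data/mcp.json` might look like the following. This is a sketch in the `mcpServers` format used by Claude Desktop and Cursor; the server name, command, and path are illustrative, not values taken from this project.

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"]
    }
  }
}
```

Each entry under `mcpServers` names one local server; `command` and `args` describe the subprocess to spawn.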
## Security
- Fix SSRF vulnerabilities in URL fetching: add backslash and userinfo rejection, validate every redirect hop.
## Bug fixes
- Fix Gemma 4 thinking tags not hidden after tool calls (#7509).
- Fix GPT-OSS channel tokens leaking in UI after tool calls.
## Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@6217b49
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@286ce32
- Update ExLlamaV3 to 0.0.30
## Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
> [!NOTE]
> **NVIDIA GPU**: If `nvidia-smi` reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.
>
> ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
### Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (766 MB) | Download (1.1 GB) |
| NVIDIA (CUDA 13.1) | Download (686 MB) | Download (1.19 GB) |
| AMD/Intel (Vulkan) | Download (196 MB) | — |
| AMD (ROCm 7.2) | Download (499 MB) | — |
| CPU only | Download (178 MB) | Download (194 MB) |
### Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (747 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (696 MB) | Download (1.21 GB) |
| AMD/Intel (Vulkan) | Download (208 MB) | — |
| AMD (ROCm 7.2) | Download (307 MB) | — |
| CPU only | Download (190 MB) | Download (217 MB) |
### macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (156 MB) |
| Intel (x86_64) | Download (162 MB) |
**Updating a portable install:**
- Download and extract the latest version.
- Replace the `user_data` folder in the new version with the one from your existing install. All your settings and models will carry over.
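The two steps above can be sketched as the shell session below. It uses throwaway directories so it is safe to run anywhere; the folder names are illustrative stand-ins for your real old and new install paths, not names prescribed by the project.

```shell
# Stand-ins: your existing install and the freshly extracted new version.
mkdir -p text-generation-webui-old/user_data/models
mkdir -p text-generation-webui-new/user_data

# Drop the fresh, empty user_data that ships with the new version...
rm -rf text-generation-webui-new/user_data

# ...and move your existing user_data (settings, models) into its place.
mv text-generation-webui-old/user_data text-generation-webui-new/
```

After the move, the new install picks up your settings and models on first launch.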
Starting with 4.0, you can also move `user_data` one folder up, next to the install folder. It will be detected automatically, making updates easier:
```
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/               <-- shared by both installs
```