Changes
- Tool-calling in the UI! Models can now call custom functions during chat. Each tool is a single `.py` file in `user_data/tools`, and five examples are provided: `web_search`, `fetch_webpage`, `calculate`, `get_datetime`, and `roll_dice`. During streaming, each tool call appears as a collapsible accordion similar to the existing thinking blocks, showing the called function, the arguments chosen by the LLM, and the output. [Tutorial]
- Replace `html2text` with `trafilatura` for extracting text from web pages, significantly reducing boilerplate like navigation bars and saving tokens in agentic tool-calling loops.
- OpenAI API improvements:
  - Rewrite `logprobs` support for full spec compliance across the llama.cpp, ExLlamaV3, and Transformers backends. Both streaming and non-streaming responses now return token-by-token logprobs.
  - Add a `reasoning_content` field for thinking blocks in both streaming and non-streaming chat completions. Thinking blocks now go exclusively in this field, and `content` contains only the post-thinking reply, even when tool calls are present.
  - Add `tool_choice` support and fix the `tool_calls` response format for strict spec compliance.
  - Put mid-conversation system messages in their correct positions in the prompt instead of collapsing all system messages at the top.
  - Add support for the `developer` role, which is mapped to `system`.
  - Add `max_completion_tokens` as an alias for `max_tokens`.
  - Include `/v1` in the API URL printed to the terminal, since that's what most clients expect.
  - Make the `/v1/models` endpoint show only the currently loaded model.
  - Add `stream_options` support with `include_usage` for streaming responses.
  - Return `finish_reason: tool_calls` when tool calls are detected.
  - Several other spec compliance improvements after careful auditing.
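The new response fields can be consumed like this; a minimal sketch in which the response dict is a hand-written example in the OpenAI spec's shape (not captured server output), showing how `reasoning_content`, `content`, and `finish_reason: tool_calls` fit together:

```python
# Sketch: separating the new reasoning_content field from the final reply
# in a non-streaming chat completion. The "example" dict below is hand-written
# in the OpenAI spec's shape, not real captured output.

def split_reply(response: dict) -> tuple[str, str, str]:
    """Return (thinking, reply, finish_reason) from a chat completion."""
    choice = response["choices"][0]
    message = choice["message"]
    thinking = message.get("reasoning_content") or ""  # thinking blocks only
    reply = message.get("content") or ""               # post-thinking reply only
    return thinking, reply, choice["finish_reason"]

example = {
    "choices": [{
        "finish_reason": "tool_calls",  # returned when tool calls are detected
        "message": {
            "role": "assistant",
            "reasoning_content": "The user wants the date; call get_datetime.",
            "content": "",
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_datetime", "arguments": "{}"},
            }],
        },
    }],
}

thinking, reply, reason = split_reply(example)
print(reason)  # tool_calls
```

The point of the split is that clients no longer need to strip `<think>` tags out of `content` themselves.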
- llama.cpp
  - Set `ctx-size` to `0` (auto) by default. Note: this only works when `--gpu-layers` is also set to `-1`, which is the default value. When using other loaders, 0 maps to 8192.
  - Reduce the `--fit-target` default from 1024 MiB to 512 MiB.
  - Use `--fit-ctx 8192` to set 8192 as the minimum acceptable ctx size for `--fit on` (llama.cpp uses 4096 by default).
  - Make `logit_bias` and `logprobs` functional in API calls.
  - Add the missing `custom_token_bans` parameter in the UI.
- ExLlamaV3
  - Add native `logit_bias` and `logprobs` support.
  - Load the vision model and the draft model before the main model so that memory auto-splitting accounts for them.
- New default preset: "Top-P" (`top_p: 0.95`), following recommendations for several SOTA open-weights models. The old "Qwen3 - Thinking", "Qwen3 - No Thinking", "min_p", and "Instruct" presets have been removed.
- Refactor reasoning/thinking extraction into a standalone module supporting multiple model formats (Qwen, GPT-OSS, Solar, seed:think, and others). Also detect when a chat template appends `<think>` to the prompt and prepend it to the reply, so the thinking block appears immediately during streaming.
- Incognito chat: this option has been added next to the existing "New chat" button. Incognito chats are temporary: they live in RAM and are never saved to disk.
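The `<think>` prepend idea can be illustrated with a small sketch (a hypothetical helper, not the project's actual reasoning module): when the chat template already ends the prompt with `<think>`, the model's reply starts mid-thought, so the tag is restored before extraction and the thinking block parses from the first streamed token.

```python
# Sketch of the <think> prepend-and-extract idea (hypothetical helper,
# not the project's actual reasoning module).

def extract_thinking(prompt: str, reply: str) -> tuple[str, str]:
    """Split a reply into (thinking, answer) for Qwen-style <think> tags."""
    # If the template already appended <think>, the reply starts mid-thought;
    # prepend the tag so the thinking block is recognized immediately.
    if prompt.rstrip().endswith("<think>") and not reply.lstrip().startswith("<think>"):
        reply = "<think>" + reply
    if reply.lstrip().startswith("<think>"):
        inner = reply.lstrip()[len("<think>"):]
        thinking, sep, answer = inner.partition("</think>")
        # While streaming, </think> may not have arrived yet: answer stays empty.
        return thinking.strip(), answer.strip() if sep else ""
    return "", reply.strip()

thinking, answer = extract_thinking(
    "User: 2+2?\nAssistant: <think>",
    " Simple arithmetic. </think> 4",
)
print((thinking, answer))  # ('Simple arithmetic.', '4')
```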
- Optimize chat streaming performance by updating the DOM only once per animation frame.
- Increase the `ctx-size` slider maximum to 1M tokens in the UI, with a step of 1024.
- Add a new drag-and-drop UI component for reordering "Sampler priority" items.
- Make all chat styles consistent with instruct style in spacings, line heights, etc., improving the quality and consistency of those styles.
- Remove the gradio import in `--nowebui` mode, saving some 0.5-0.8 seconds on startup.
- Force-exit the web UI on repeated Ctrl+C.
- Improve the `--multi-user` warning to make the known limitations transparent.
- Remove the rope scaling parameters (`alpha_value`, `rope_freq_base`, `compress_pos_emb`). Models now ship with 128k+ context, and those parameters date from the 4096-context era; they can still be passed to llama.cpp through `--extra-flags` if needed.
- Optimize wheel downloads in the one-click installer to only download wheels that actually changed between updates. Previously, all wheels were re-downloaded if at least one of them had changed.
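The wheel-download optimization amounts to diffing pinned wheel entries between the old and new requirements; a simplified sketch of the idea (a hypothetical helper, not the installer's actual code):

```python
# Sketch: only fetch wheels whose pinned entries changed between updates
# (hypothetical helper, not the one-click installer's actual code).

def wheels_to_download(old_requirements: str, new_requirements: str) -> list[str]:
    """Return wheel URLs present in the new pins but absent from the old ones."""
    def wheel_lines(text: str) -> set[str]:
        return {line.strip() for line in text.splitlines()
                if line.strip().endswith(".whl")}
    old, new = wheel_lines(old_requirements), wheel_lines(new_requirements)
    return sorted(new - old)  # unchanged wheels are skipped entirely

old = "numpy==2.1\nhttps://example.com/llama_cpp-0.1-cp311-linux.whl\n"
new = "numpy==2.2\nhttps://example.com/llama_cpp-0.2-cp311-linux.whl\n"
print(wheels_to_download(old, new))
# ['https://example.com/llama_cpp-0.2-cp311-linux.whl']
```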
- Update the Intel Arc PyTorch installation command in the one-click installer, removing the dependency on Intel oneAPI conda packages.
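As an illustration of the tool format mentioned at the top of this list: each tool is a single `.py` file in `user_data/tools`. The exact contract the loader expects isn't documented here, so the file name, function signature, and return type below are assumptions; a `calculate`-style tool might look roughly like this:

```python
# user_data/tools/calculate.py -- rough sketch of a calculate-style tool.
# The loader contract (function name, signature, return type) is assumed
# here for illustration; check the bundled examples for the real format.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def _eval(node):
    # Walk the AST and only allow numeric literals and basic arithmetic,
    # so the LLM cannot execute arbitrary Python through this tool.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")

def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result."""
    return str(_eval(ast.parse(expression, mode="eval").body))

print(calculate("2 + 3 * 4"))  # 14
```

Whatever the real contract is, the arguments shown in the accordion are the ones the LLM chose, and the function's return value is what gets displayed as the tool output.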
Bug fixes
- Fix pip accidentally installing to the system Miniconda on Windows instead of the project environment.
- Fix crash on non-UTF-8 Windows locales (e.g. Chinese GBK).
- Fix passing `adaptive-p` to llama-server.
- Fix `truncation_length` not propagating correctly when `ctx_size` is set to auto (0).
- Fix the dark theme using light-theme syntax highlighting.
- Fix word breaks in tables. Tables now scroll horizontally instead of breaking words.
- Fix the OpenAI API server not respecting `--listen-host`.
- Fix a crash when loading the MiniMax-M2.5 jinja template.
- Fix `reasoning_effort` not appearing in the UI for ExLlamaV3.
- Fix the ExLlamaV3 draft cache size to match the main cache.
- Fix ExLlamaV3 EOS handling for models with multiple end-of-sequence tokens.
- Fix ExLlamaV3 perplexity evaluation giving incorrect values for sequences longer than 2048 tokens.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/67a2209fabe2e3498d458561933d5380655085d2
- Update ExLlamaV3 to 0.0.25
- Update diffusers to 0.37
- Update AMD ROCm from 6.4 to 7.2
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one from your existing install. All your settings and models will carry over.
Starting with 4.0, you can also move `user_data` one folder up, next to the install folder. It will be detected automatically, making updates easier:

```
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/   <-- shared by both installs
```