Changes
- Tool-calling in the UI! Models can now call custom functions during chat. Each tool is a single `.py` file in `user_data/tools`, and five examples are provided: `web_search`, `fetch_webpage`, `calculate`, `get_datetime`, and `roll_dice`. During streaming, each tool call appears as a collapsible accordion similar to the existing thinking blocks, showing the called function, the arguments chosen by the LLM, and the output. [Tutorial]
- Replace `html2text` with `trafilatura` for extracting text from web pages, significantly reducing boilerplate like navigation bars and saving tokens in agentic tool-calling loops.
- OpenAI API improvements:
  - Rewrite `logprobs` support for full spec compliance across the llama.cpp, ExLlamaV3, and Transformers backends. Both streaming and non-streaming responses now return token-by-token logprobs.
  - Add a `reasoning_content` field for thinking blocks in both streaming and non-streaming chat completions. Thinking blocks now go exclusively in this field, and `content` contains only the post-thinking reply, even when tool calls are present.
  - Add `tool_choice` support and fix the `tool_calls` response format for strict spec compliance.
  - Put mid-conversation system messages in their correct positions in the prompt instead of collapsing all system messages at the top.
  - Add support for the `developer` role, which is mapped to `system`.
  - Add `max_completion_tokens` as an alias for `max_tokens`.
  - Include `/v1` in the API URL printed to the terminal, since that's what most clients expect.
  - Make the `/v1/models` endpoint show only the currently loaded model.
  - Add `stream_options` support with `include_usage` for streaming responses.
  - Return `finish_reason: tool_calls` when tool calls are detected.
  - Several other spec compliance improvements after careful auditing.
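The new response fields can be consumed like this; a minimal sketch in which the response dict is a hand-written example in the OpenAI spec's shape (not captured server output), showing how `reasoning_content`, `content`, and `finish_reason: tool_calls` fit together:

```python
# Sketch: separating the new reasoning_content field from the final reply
# in a non-streaming chat completion. The "example" dict below is hand-written
# in the OpenAI spec's shape, not real captured output.

def split_reply(response: dict) -> tuple[str, str, str]:
    """Return (thinking, reply, finish_reason) from a chat completion."""
    choice = response["choices"][0]
    message = choice["message"]
    thinking = message.get("reasoning_content") or ""  # thinking blocks only
    reply = message.get("content") or ""               # post-thinking reply only
    return thinking, reply, choice["finish_reason"]

example = {
    "choices": [{
        "finish_reason": "tool_calls",  # returned when tool calls are detected
        "message": {
            "role": "assistant",
            "reasoning_content": "The user wants the date; call get_datetime.",
            "content": "",
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {"name": "get_datetime", "arguments": "{}"},
            }],
        },
    }],
}

thinking, reply, reason = split_reply(example)
print(reason)  # tool_calls
```

The point of the split is that clients no longer need to strip `<think>` tags out of `content` themselves.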
- llama.cpp
  - Set `ctx-size` to `0` (auto) by default. Note: this only works when `--gpu-layers` is also set to `-1`, which is the default value. When using other loaders, 0 maps to 8192.
  - Reduce the `--fit-target` default from 1024 MiB to 512 MiB.
  - Use `--fit-ctx 8192` to set 8192 as the minimum acceptable ctx size for `--fit on` (llama.cpp uses 4096 by default).
  - Make `logit_bias` and `logprobs` functional in API calls.
  - Add the missing `custom_token_bans` parameter in the UI.
- ExLlamaV3
  - Add native `logit_bias` and `logprobs` support.
  - Load the vision model and the draft model before the main model so that memory auto-splitting accounts for them.
- New default preset: "Top-P" (`top_p: 0.95`), following recommendations for several SOTA open-weights models. The old "Qwen3 - Thinking", "Qwen3 - No Thinking", "min_p", and "Instruct" presets have been removed.
- Refactor reasoning/thinking extraction into a standalone module supporting multiple model formats (Qwen, GPT-OSS, Solar, seed:think, and others). Also detect when a chat template appends `<think>` to the prompt and prepend it to the reply, so the thinking block appears immediately during streaming.
- Incognito chat: this option has been added next to the existing "New chat" button. Incognito chats are temporary: they live in RAM and are never saved to disk.
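The `<think>` prepend idea can be illustrated with a small sketch (a hypothetical helper, not the project's actual reasoning module): when the chat template already ends the prompt with `<think>`, the model's reply starts mid-thought, so the tag is restored before extraction and the thinking block parses from the first streamed token.

```python
# Sketch of the <think> prepend-and-extract idea (hypothetical helper,
# not the project's actual reasoning module).

def extract_thinking(prompt: str, reply: str) -> tuple[str, str]:
    """Split a reply into (thinking, answer) for Qwen-style <think> tags."""
    # If the template already appended <think>, the reply starts mid-thought;
    # prepend the tag so the thinking block is recognized immediately.
    if prompt.rstrip().endswith("<think>") and not reply.lstrip().startswith("<think>"):
        reply = "<think>" + reply
    if reply.lstrip().startswith("<think>"):
        inner = reply.lstrip()[len("<think>"):]
        thinking, sep, answer = inner.partition("</think>")
        # While streaming, </think> may not have arrived yet: answer stays empty.
        return thinking.strip(), answer.strip() if sep else ""
    return "", reply.strip()

thinking, answer = extract_thinking(
    "User: 2+2?\nAssistant: <think>",
    " Simple arithmetic. </think> 4",
)
print((thinking, answer))  # ('Simple arithmetic.', '4')
```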
- Optimize chat streaming performance by updating the DOM only once per animation frame.
- Increase the `ctx-size` slider maximum to 1M tokens in the UI, with a step of 1024.
- Add a new drag-and-drop UI component for reordering "Sampler priority" items.
- Make all chat styles consistent with instruct style in spacings, line heights, etc., improving the quality and consistency of those styles.
- Remove the gradio import in `--nowebui` mode, saving some 0.5-0.8 seconds on startup.
- Force-exit the web UI on repeated Ctrl+C.
- Improve the `--multi-user` warning to make the known limitations transparent.
- Remove the rope scaling parameters (`alpha_value`, `rope_freq_base`, `compress_pos_emb`). Models now ship with 128k+ context, and those parameters date from the 4096-context era; they can still be passed to llama.cpp through `--extra-flags` if needed.
- Optimize wheel downloads in the one-click installer to only download wheels that actually changed between updates. Previously, all wheels were re-downloaded if at least one of them had changed.
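The wheel-download optimization amounts to diffing pinned wheel entries between the old and new requirements; a simplified sketch of the idea (a hypothetical helper, not the installer's actual code):

```python
# Sketch: only fetch wheels whose pinned entries changed between updates
# (hypothetical helper, not the one-click installer's actual code).

def wheels_to_download(old_requirements: str, new_requirements: str) -> list[str]:
    """Return wheel URLs present in the new pins but absent from the old ones."""
    def wheel_lines(text: str) -> set[str]:
        return {line.strip() for line in text.splitlines()
                if line.strip().endswith(".whl")}
    old, new = wheel_lines(old_requirements), wheel_lines(new_requirements)
    return sorted(new - old)  # unchanged wheels are skipped entirely

old = "numpy==2.1\nhttps://example.com/llama_cpp-0.1-cp311-linux.whl\n"
new = "numpy==2.2\nhttps://example.com/llama_cpp-0.2-cp311-linux.whl\n"
print(wheels_to_download(old, new))
# ['https://example.com/llama_cpp-0.2-cp311-linux.whl']
```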
- Update the Intel Arc PyTorch installation command in the one-click installer, removing the dependency on Intel oneAPI conda packages.
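As an illustration of the tool format mentioned at the top of this list: each tool is a single `.py` file in `user_data/tools`. The exact contract the loader expects isn't documented here, so the file name, function signature, and return type below are assumptions; a `calculate`-style tool might look roughly like this:

```python
# user_data/tools/calculate.py -- rough sketch of a calculate-style tool.
# The loader contract (function name, signature, return type) is assumed
# here for illustration; check the bundled examples for the real format.
import ast
import operator

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def _eval(node):
    # Walk the AST and only allow numeric literals and basic arithmetic,
    # so the LLM cannot execute arbitrary Python through this tool.
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.operand))
    raise ValueError("unsupported expression")

def calculate(expression: str) -> str:
    """Evaluate a basic arithmetic expression and return the result."""
    return str(_eval(ast.parse(expression, mode="eval").body))

print(calculate("2 + 3 * 4"))  # 14
```

Whatever the real contract is, the arguments shown in the accordion are the ones the LLM chose, and the function's return value is what gets displayed as the tool output.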
Bug fixes
- Fix pip accidentally installing to the system Miniconda on Windows instead of the project environment.
- Fix crash on non-UTF-8 Windows locales (e.g. Chinese GBK).
- Fix passing `adaptive-p` to llama-server.
- Fix `truncation_length` not propagating correctly when `ctx_size` is set to auto (0).
- Fix the dark theme using light-theme syntax highlighting.
- Fix word breaks in tables. Tables now scroll horizontally instead of breaking words.
- Fix the OpenAI API server not respecting `--listen-host`.
- Fix a crash when loading the MiniMax-M2.5 jinja template.
- Fix `reasoning_effort` not appearing in the UI for ExLlamaV3.
- Fix the ExLlamaV3 draft cache size to match the main cache.
- Fix ExLlamaV3 EOS handling for models with multiple end-of-sequence tokens.
- Fix ExLlamaV3 perplexity evaluation giving incorrect values for sequences longer than 2048 tokens.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/67a2209fabe2e3498d458561933d5380655085d2
- Update ExLlamaV3 to 0.0.25
- Update diffusers to 0.37
- Update AMD ROCm from 6.4 to 7.2
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one from your existing install. All your settings and models will carry over.
Starting with 4.0, you can also move `user_data` one folder up, next to the install folder. It will be detected automatically, making updates easier:

```
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/   <-- shared by both installs
```