github oobabooga/text-generation-webui v4.1


Changes

  • Tool-calling in the UI! Models can now call custom functions during chat. Each tool is a single .py file in user_data/tools, and five examples are provided: web_search, fetch_webpage, calculate, get_datetime, and roll_dice. During streaming, each tool call appears as a collapsible accordion, similar to the existing thinking blocks, showing the function called, the arguments chosen by the LLM, and the output. [Tutorial]
  • Replace html2text with trafilatura for extracting text from web pages, significantly reducing boilerplate such as navigation bars and saving tokens in agentic tool-calling loops.
  • OpenAI API improvements:
    • Rewrite logprobs support for full spec compliance across llama.cpp, ExLlamaV3, and Transformers backends. Both streaming and non-streaming responses now return token-by-token logprobs.
    • Add a reasoning_content field for thinking blocks in both streaming and non-streaming chat completions. Now thinking blocks go exclusively in this field, and content only shows the post-thinking reply, even when tool calls are present.
    • Add tool_choice support and fix the tool_calls response format for strict spec compliance.
    • Put mid-conversation system messages in the correct positions in the prompt instead of collapsing all system messages at the top.
    • Add support for the developer role, which is mapped to system.
    • Add max_completion_tokens as an alias for max_tokens.
    • Include /v1 in the API URL printed to the terminal since that's what most clients expect.
    • Make the /v1/models endpoint show only the currently loaded model.
    • Add stream_options support with include_usage for streaming responses.
    • Return finish_reason: tool_calls when tool calls are detected.
    • Several other spec compliance improvements after careful auditing.
  • llama.cpp
    • Set ctx-size to 0 (auto) by default. Note: this only works when --gpu-layers is also set to -1, which is the default value. When using other loaders, 0 maps to 8192.
    • Reduce the --fit-target default from 1024 MiB to 512 MiB.
    • Pass --fit-ctx 8192 so that 8192 is the minimum acceptable ctx size when --fit is enabled (llama.cpp uses 4096 by default).
    • Make logit_bias and logprobs functional in API calls.
    • Add missing custom_token_bans parameter in the UI.
  • ExLlamaV3
    • Add native logit_bias and logprobs support.
    • Load the vision model and the draft model before the main model so memory auto-splitting accounts for them.
  • New default preset: "Top-P" (top_p: 0.95), following recommendations for several SOTA open-weights models. The old "Qwen3 - Thinking", "Qwen3 - No Thinking", "min_p", and "Instruct" presets have been removed.
  • Refactor reasoning/thinking extraction into a standalone module supporting multiple model formats (Qwen, GPT-OSS, Solar, seed:think, and others). Also detect when a chat template appends <think> to the prompt and prepend it to the reply, so the thinking block appears immediately during streaming.
  • Incognito chat: this option has been added next to the existing "New chat" button. Incognito chats are temporary: they live in RAM and are never saved to disk.
  • Optimize chat streaming performance by updating the DOM only once per animation frame.
  • Increase the ctx-size slider maximum to 1M tokens in the UI, with a step size of 1024.
  • Add a new drag-and-drop UI component for reordering "Sampler priority" items.
  • Make all chat styles consistent with the instruct style in spacing, line height, etc., improving the quality and consistency of those styles.
  • Remove the gradio import in --nowebui mode, saving some 0.5-0.8 seconds on startup.
  • Force-exit the webui on repeated Ctrl+C.
  • Improve the --multi-user warning to make the known limitations transparent.
  • Remove the rope scaling parameters (alpha_value, rope_freq_base, compress_pos_emb). Models now ship with 128k+ context, and those parameters date from the 4096-context era; they can still be passed to llama.cpp through --extra-flags if needed.
  • Optimize wheel downloads in the one-click installer to download only the wheels that actually changed between updates. Previously, all wheels were re-downloaded if at least one of them had changed.
  • Update the Intel Arc PyTorch installation command in the one-click installer, removing the dependency on Intel oneAPI conda packages.
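The OpenAI API changes above can be exercised with an ordinary chat-completions payload. A minimal sketch, assuming an OpenAI-compatible client pointed at the webui's /v1 endpoint; the model name and tool schema here are illustrative placeholders, not values taken from the release:

```python
import json

# Illustrative chat-completions request body exercising the new fields.
payload = {
    "model": "current-model",  # placeholder; /v1/models reports the loaded model
    "messages": [
        # The "developer" role is now accepted and mapped to "system".
        {"role": "developer", "content": "You are a concise assistant."},
        {"role": "user", "content": "What time is it in UTC?"},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_datetime",  # one of the bundled example tools
            "description": "Return the current date and time.",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
    "tool_choice": "auto",                 # newly supported
    "max_completion_tokens": 256,          # new alias for max_tokens
    "logprobs": True, "top_logprobs": 5,   # token-by-token logprobs
    "stream": True,
    "stream_options": {"include_usage": True},  # usage stats in the final chunk
}

body = json.dumps(payload)
```

With stream set to true, the server replies with SSE chunks; when include_usage is set the final chunk carries a usage object, and finish_reason is tool_calls when the model invoked a tool.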
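Because thinking tokens now arrive exclusively in reasoning_content, a streaming consumer can route them separately from the visible reply. A sketch of such a consumer, assuming delta objects shaped like OpenAI's streaming spec; the sample chunks below are fabricated for illustration:

```python
import json

# Fabricated SSE data payloads, shaped like OpenAI streaming deltas.
chunks = [
    '{"choices": [{"delta": {"reasoning_content": "Let me think... "}, "finish_reason": null}]}',
    '{"choices": [{"delta": {"content": "The answer is 4."}, "finish_reason": null}]}',
    '{"choices": [{"delta": {}, "finish_reason": "stop"}]}',
]

reasoning, reply = [], []
finish_reason = None
for raw in chunks:
    choice = json.loads(raw)["choices"][0]
    delta = choice["delta"]
    # Thinking tokens arrive only in reasoning_content; the visible
    # post-thinking reply arrives only in content.
    if delta.get("reasoning_content"):
        reasoning.append(delta["reasoning_content"])
    if delta.get("content"):
        reply.append(delta["content"])
    finish_reason = choice["finish_reason"] or finish_reason

print("".join(reasoning))  # the thinking block
print("".join(reply))      # the post-thinking reply
```

A client rendering a collapsible thinking block, as the webui's own UI does, only needs to watch which of the two fields each delta populates.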
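The reasoning-extraction behavior, including the case where the chat template appends <think> to the prompt so the reply begins mid-think, can be approximated in a few lines. A minimal sketch assuming Qwen-style <think>...</think> delimiters; the real module supports more formats and edge cases:

```python
import re

def split_thinking(text):
    """Split a reply into (thinking, answer). If the reply starts mid-think
    because the template appended <think> to the prompt, prepend the opening
    tag so the thinking block is still recovered (illustrative only)."""
    if "</think>" in text and "<think>" not in text:
        text = "<think>" + text
    m = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return "", text.strip()

# A reply that starts mid-think, as happens when the template ends in <think>.
thinking, answer = split_thinking("2+2 is trivial.</think>It is 4.")
```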

Bug fixes

  • Fix pip accidentally installing to the system Miniconda on Windows instead of the project environment.
  • Fix crash on non-UTF-8 Windows locales (e.g. Chinese GBK).
  • Fix passing adaptive-p to llama-server.
  • Fix truncation_length not propagating correctly when ctx_size is set to auto (0).
  • Fix dark theme using light theme syntax highlighting.
  • Fix word breaks in tables. Tables now scroll horizontally instead of breaking words.
  • Fix the OpenAI API server not respecting --listen-host.
  • Fix a crash loading the MiniMax-M2.5 jinja template.
  • Fix reasoning_effort not appearing in the UI for ExLlamaV3.
  • Fix ExLlamaV3 draft cache size to match main cache.
  • Fix ExLlamaV3 EOS handling for models with multiple end-of-sequence tokens.
  • Fix ExLlamaV3 perplexity evaluation giving incorrect values for sequences longer than 2048 tokens.

Dependency updates


Portable builds

Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.

Which version to download:

  • Windows/Linux:

    • NVIDIA GPU: Use cuda13.1, or cuda12.4 if you have older drivers.
    • AMD/Intel GPU: Use vulkan builds.
    • AMD GPU (ROCm): Use rocm builds.
    • CPU only: Use cpu builds.
  • Mac:

    • Apple Silicon: Use macos-arm64.
    • Intel: Use macos-x86_64.

Updating a portable install:

  1. Download and extract the latest version.
  2. Replace the user_data folder in the new version with the one from your existing install. All your settings and models will be carried over.

Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:

text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/                    <-- shared by both installs
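The shared-folder detection amounts to a simple path preference: use ../user_data next to the install folder if it exists, otherwise fall back to the local one. An illustrative sketch of that lookup (not the webui's actual code):

```python
from pathlib import Path

def resolve_user_data(install_dir: Path) -> Path:
    """Prefer a user_data folder shared one level up, next to the install
    directory; otherwise fall back to the one inside it. Illustrative only."""
    shared = install_dir.parent / "user_data"
    return shared if shared.is_dir() else install_dir / "user_data"
```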
