## Changes
- Custom Gradio fork: Gradio has been replaced with a custom fork at oobabooga/gradio containing major performance optimizations. The UI now does far less redundant work on every update, startup is faster, SSE message delivery is instant instead of polling every 50 ms, and a new zero-rendering `gr.Headless` component reduces overhead during chat streaming. Analytics, unused dependencies, and unused assets have also been removed from the wheel.
- Tool-calling overhaul: Tool-calling now actually works for Qwen 3.5, Devstral 2, GPT-OSS, DeepSeek V3.2, GLM 5, MiniMax M2.5, Kimi K2/K2.5, and Llama 4 models. Several improvements have been made for strict OpenAI format compliance, and extensive testing has been done to make sure tool-calling works reliably for the supported models. [Documentation]
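For orientation, here is a minimal sketch of what a tool-calling request body looks like in the strict OpenAI chat-completions format the overhaul targets. The `get_weather` tool and the `"model"` value are placeholders for illustration, not names from this project:

```python
import json

# Hypothetical request body for an OpenAI-compatible /v1/chat/completions
# endpoint; the "tools" schema follows the OpenAI function-calling format.
payload = {
    "model": "local-model",  # placeholder; the local server maps or ignores this
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical example tool
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

body = json.dumps(payload)
```

A compliant model replies with a `tool_calls` entry in the assistant message rather than plain text when it decides to invoke the tool.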
- Parallel API requests: For the llama.cpp, ExLlamaV3, and TensorRT-LLM loaders, it is now possible to make concurrent API requests for maximum throughput. For llama.cpp, it is necessary to use the `--parallel N` option and multiply the context length by `N`. [Documentation]
- Training overhaul (documentation): The training code has been completely rewritten. It is now fully in line with axolotl for both raw-text training and chat training.
  - For chat training, datasets in OpenAI `messages` format or ShareGPT `conversations` format are now used. Multi-turn chats are supported, with correct masking of user inputs so that training only happens on assistant messages. See `user_data/training/example_messages.json` and `user_data/training/example_sharegpt.json` for examples.
  - For raw text training, JSONL files are used, with correct BOS and EOS addition for each sub-document. See `user_data/training/example_text.json` for an example input.
  - Chat training now uses Jinja2 templates for formatting prompts. You can use either the model's built-in template (if it has one) or a custom user-provided template.
  - New "Target all linear layers" checkbox that applies LoRA to every `nn.Linear` layer except `lm_head`. It works for any model architecture.
  - Checkpoint resumption: HF Trainer checkpoint directories are detected automatically, and training resumes with full optimizer/scheduler state.
  - All training input parameters now have good, reviewed default values.
  - Conversations exceeding the cutoff length are now dropped instead of silently truncated (configurable).
  - Dynamic padding (chat datasets): batches are now padded to the longest sequence in each batch instead of always padding to `cutoff_len`, reducing wasted computation.
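To make the two chat dataset formats concrete, here is a sketch of one conversation in each, plus a conversion between them. The `"human"`/`"gpt"` role names are the conventional ShareGPT ones; the converter is an illustration, not the project's loader code:

```python
# One training example in OpenAI "messages" format.
messages_example = {
    "messages": [
        {"role": "user", "content": "Hi!"},
        {"role": "assistant", "content": "Hello! How can I help?"},
    ]
}

# The same conversation in ShareGPT "conversations" format
# ("human"/"gpt" are the conventional ShareGPT role names).
sharegpt_example = {
    "conversations": [
        {"from": "human", "value": "Hi!"},
        {"from": "gpt", "value": "Hello! How can I help?"},
    ]
}

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(entry):
    """Convert a ShareGPT entry to the OpenAI messages format."""
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in entry["conversations"]
        ]
    }
```

During chat training, only the assistant turns contribute to the loss; user (and system) turns are masked out.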
- llama.cpp:
  - `--fit` support: GPU layers now default to `-1` (auto), letting llama.cpp determine the optimal number of layers and GPU split automatically. The new `--fit-target` parameter controls how much VRAM headroom to leave per GPU (default: 1024 MiB). Context size can also be set to `0` to let llama.cpp determine that automatically as well.
  - Integrate N-gram speculative decoding support for faster generation without the need for a draft model, through the `--spec-type`, `--spec-ngram-size-n`, `--spec-ngram-size-m`, and `--spec-ngram-min-hits` parameters. Good defaults are provided; just change `--spec-type` to `ngram-mod` to activate.
  - Binaries now work for any CPU instruction set (AVX, AVX2, AVX-512) by autodetecting at runtime, replacing the old separate AVX/AVX2 builds.
  - Add ROCm portable builds for Windows.
  - Add CUDA 13.1 portable builds.
  - Add back macOS x86_64 (Intel) portable builds.
  - Smaller CUDA binaries after improving compilation flags.
  - Compilation workflows at oobabooga/llama-cpp-binaries have been fully audited and aligned with upstream.
  - Handle SIGTERM to properly stop llama-server on pkill.
  - llama-server is now spawned on port 5005 by default instead of a random port.
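As a toy illustration of the idea behind the N-gram speculative decoding mentioned above (not llama.cpp's actual implementation): instead of running a draft model, the decoder looks for the most recent earlier occurrence of the last few generated tokens and proposes the tokens that followed it as a cheap draft, which the main model then verifies in one batch:

```python
def draft_ngram(tokens, n=3, k=4):
    """Toy n-gram drafting: find the most recent earlier occurrence of the
    last n tokens and propose the k tokens that followed it as a draft.
    This only sketches the idea; llama.cpp's ngram-mod decoder is more
    elaborate (e.g. it tracks hit counts, cf. --spec-ngram-min-hits)."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan backwards for a previous occurrence of the tail n-gram.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]
    return []  # no match: fall back to normal decoding
```

Repetitive text (code, lists, quoted context) produces frequent n-gram hits, which is why this speeds up generation without any draft model.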
- Adaptive-p sampler for llama.cpp, Transformers, ExLlamaV3, and ExLlamaV3_HF loaders. This sampler reshapes the logit distribution to favor tokens near a target probability.
- New CLI flags to set default API generation parameters: `--temperature`, `--min-p`, `--top-k`, `--repetition-penalty`, etc., and also `--enable-thinking`, `--reasoning-effort`, and `--chat-template-file`. The last parameter accepts `.jinja` or `.yaml` files.
- Chat completion requests are now ~85 ms faster after optimizations.
- SSE separator for streaming over the API changed from `\r\n` to `\n` to match OpenAI.
- Migrate TensorRT-LLM from the old ModelRunner API to the new LLM API, which can take any Transformers model as input and has more sampling parameters.
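The separator change matters to clients that split the SSE stream on blank lines, as OpenAI client code typically does. A minimal sketch of parsing such a stream (the sample chunks are illustrative, not actual server output):

```python
import json

def parse_sse(raw: str):
    """Parse an OpenAI-style SSE stream: events are separated by a blank
    line, each data line is prefixed with 'data: ', and the stream ends
    with a [DONE] sentinel."""
    chunks = []
    for event in raw.split("\n\n"):
        for line in event.splitlines():
            if line.startswith("data: "):
                data = line[len("data: "):]
                if data == "[DONE]":
                    return chunks
                chunks.append(json.loads(data))
    return chunks

stream = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    'data: [DONE]\n\n'
)
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse(stream))
```

With `\r\n` separators, a client splitting strictly on `"\n\n"` would see the whole stream as one event; emitting plain `\n` avoids that mismatch.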
- Security:
  - Prevent path traversal on file save/delete operations for characters, users, and uploaded files.
  - Restrict model loading over the API to block the `extra_flags` and `trust_remote_code` parameters.
  - Restrict file writes to the `user_data_dir`.
- New `--user-data-dir` flag to customize the user data directory location. The program now also auto-detects a `../user_data` folder in portable mode if present, making updates easier.
- User persona support: A new dropdown in the Character settings tab lets you save and load user profiles (name, bio, profile picture), so you can switch between different personas without re-entering your details (#7367). Thanks, @q5sys.
- Replace PyPDF2 with pymupdf for much more accurate conversion of PDF inputs to text.
- Markdown rendering improvements, all by @mamei16.
- Add Qwen 3.5 thinking block support to the UI.
- Add Solar Open thinking block support to the UI.
- Update the entire documentation to match the current code.
- Update all dockerfiles. [Documentation]
- Update the Google Colab notebook.
- Remove the ExLlamaV2 loader, which has been archived. EXL2 users should migrate to EXL3, which has much better quantization accuracy.
- Remove the Training_PRO extension, which has become obsolete after the Training tab updates.
- Remove obsolete DeepSpeed inference code from 2023.
- Remove unused colorama and psutil dependencies.
- Update outdated GitHub Actions versions (#7384). Thanks, @pgoslatara.
## Bug fixes
- Fix `temperature_last` having no effect in the llama.cpp server sampler order.
- Fix code block copy button not working over HTTP (Clipboard API fallback) (#7358). Thanks, @jakubartur.
- Fix message copy buttons not working over HTTP (extend Clipboard API fallback).
- Fix ExLlamaV3 CFG cache initialization and speculative decoding parameter handling.
- Fix blank prompt dropdown in Notebook/Default tabs on first startup.
- Use absolute Python path in Windows batch scripts to fix some rare edge cases.
- Bump sentence-transformers from 2.2.2 to 3.3.1 in superbooga (#7406). Thanks, @OiPunk.
- Fix installer state being saved before requirements were fully installed.
- Fix ExLlamaV3 race condition that could cause AssertionError or hangs during generation.
- Fix API server continuing to generate tokens after client disconnects for non-streaming requests.
## Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/6fce5c6a7dba6a3e1df0aad1574b78d1a1970621
- Update Transformers to 5.3
- Update ExLlamaV3 to 0.0.23
- Update TensorRT-LLM to 1.1.0
- Update PyTorch to 2.9.1
- Update Python to 3.13
- Update ROCm wheels to ROCm 6.4
## Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one from your existing install. All your settings and models will be carried over.
Starting with 4.0, you can also move `user_data` one folder up, next to the install folder. It will be detected automatically, making updates easier:

    text-generation-webui-4.0/
    text-generation-webui-4.1/
    user_data/            <-- shared by both installs