## Changes
- Custom Gradio fork: Gradio has been replaced with a custom fork at oobabooga/gradio containing major performance optimizations. The UI now does far less redundant work on every update, startup is faster, SSE message delivery is instant instead of polling every 50 ms, and a new zero-rendering `gr.Headless` component reduces overhead during chat streaming. Analytics, unused dependencies, and unused assets have also been removed from the wheel.
- Tool-calling overhaul: Tool-calling now actually works for Qwen 3.5, Devstral 2, GPT-OSS, DeepSeek V3.2, GLM 5, MiniMax M2.5, Kimi K2/K2.5, and Llama 4 models. Several improvements have been made for strict OpenAI format compliance, and extensive testing has been done to make sure tool-calling works reliably for the supported models. [Documentation]
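For orientation, here is a minimal sketch of what a tool-calling request body looks like in the strict OpenAI chat-completions format the overhaul targets. The `get_weather` tool and the `"model"` value are placeholders for illustration, not names from this project:

```python
import json

# Hypothetical request body for an OpenAI-compatible /v1/chat/completions
# endpoint; the "tools" schema follows the OpenAI function-calling format.
payload = {
    "model": "local-model",  # placeholder; the local server maps or ignores this
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical example tool
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

body = json.dumps(payload)
```

A compliant model replies with a `tool_calls` entry in the assistant message rather than plain text when it decides to invoke the tool.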
- Parallel API requests: For the llama.cpp, ExLlamaV3, and TensorRT-LLM loaders, it is now possible to make concurrent API requests for maximum throughput. For llama.cpp, it is necessary to use the `--parallel N` option and multiply the context length by `N`. [Documentation]
- Training overhaul (documentation): The training code has been completely rewritten. It is now fully in line with axolotl for both raw-text training and chat training.
  - For chat training, datasets in OpenAI `messages` format or ShareGPT `conversations` format are now used. Multi-turn chats are supported, with correct masking of user inputs so that training only happens on assistant messages. See `user_data/training/example_messages.json` and `user_data/training/example_sharegpt.json` for examples.
  - For raw text training, JSONL files are used, with correct BOS and EOS addition for each sub-document. See `user_data/training/example_text.json` for an example input.
  - Chat training now uses Jinja2 templates for formatting prompts. You can use either the model's built-in template (if it has one) or a custom user-provided template.
  - New "Target all linear layers" checkbox that applies LoRA to every `nn.Linear` layer except `lm_head`. It works for any model architecture.
  - Checkpoint resumption: HF Trainer checkpoint directories are detected automatically, and training resumes with full optimizer/scheduler state.
  - All training input parameters now have good, reviewed default values.
  - Conversations exceeding the cutoff length are now dropped instead of silently truncated (configurable).
  - Dynamic padding (chat datasets): batches are now padded to the longest sequence in each batch instead of always padding to `cutoff_len`, reducing wasted computation.
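To make the two chat dataset formats concrete, here is a sketch of one conversation in each, plus a conversion between them. The `"human"`/`"gpt"` role names are the conventional ShareGPT ones; the converter is an illustration, not the project's loader code:

```python
# One training example in OpenAI "messages" format.
messages_example = {
    "messages": [
        {"role": "user", "content": "Hi!"},
        {"role": "assistant", "content": "Hello! How can I help?"},
    ]
}

# The same conversation in ShareGPT "conversations" format
# ("human"/"gpt" are the conventional ShareGPT role names).
sharegpt_example = {
    "conversations": [
        {"from": "human", "value": "Hi!"},
        {"from": "gpt", "value": "Hello! How can I help?"},
    ]
}

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(entry):
    """Convert a ShareGPT entry to the OpenAI messages format."""
    return {
        "messages": [
            {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
            for turn in entry["conversations"]
        ]
    }
```

During chat training, only the assistant turns contribute to the loss; user (and system) turns are masked out.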
- llama.cpp:
  - `--fit` support: GPU layers now default to `-1` (auto), letting llama.cpp determine the optimal number of layers and GPU split automatically. The new `--fit-target` parameter controls how much VRAM headroom to leave per GPU (default: 1024 MiB). Context size can also be set to `0` to let llama.cpp determine that automatically as well.
  - Integrate N-gram speculative decoding support for faster generation without the need for a draft model, through the `--spec-type`, `--spec-ngram-size-n`, `--spec-ngram-size-m`, and `--spec-ngram-min-hits` parameters. Good defaults are provided; just change `--spec-type` to `ngram-mod` to activate.
  - Binaries now work for any CPU instruction set (AVX, AVX2, AVX-512) by autodetecting at runtime, replacing the old separate AVX/AVX2 builds.
  - Add ROCm portable builds for Windows.
  - Add CUDA 13.1 portable builds.
  - Add back macOS x86_64 (Intel) portable builds.
  - Smaller CUDA binaries after improving compilation flags.
  - Compilation workflows at oobabooga/llama-cpp-binaries have been fully audited and aligned with upstream.
  - Handle SIGTERM to properly stop llama-server on pkill.
  - llama-server is now spawned on port 5005 by default instead of a random port.
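As a toy illustration of the idea behind the N-gram speculative decoding mentioned above (not llama.cpp's actual implementation): instead of running a draft model, the decoder looks for the most recent earlier occurrence of the last few generated tokens and proposes the tokens that followed it as a cheap draft, which the main model then verifies in one batch:

```python
def draft_ngram(tokens, n=3, k=4):
    """Toy n-gram drafting: find the most recent earlier occurrence of the
    last n tokens and propose the k tokens that followed it as a draft.
    This only sketches the idea; llama.cpp's ngram-mod decoder is more
    elaborate (e.g. it tracks hit counts, cf. --spec-ngram-min-hits)."""
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan backwards for a previous occurrence of the tail n-gram.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            return tokens[i + n:i + n + k]
    return []  # no match: fall back to normal decoding
```

Repetitive text (code, lists, quoted context) produces frequent n-gram hits, which is why this speeds up generation without any draft model.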
- Adaptive-p sampler for llama.cpp, Transformers, ExLlamaV3, and ExLlamaV3_HF loaders. This sampler reshapes the logit distribution to favor tokens near a target probability.
- New CLI flags to set default API generation parameters: `--temperature`, `--min-p`, `--top-k`, `--repetition-penalty`, etc., and also `--enable-thinking`, `--reasoning-effort`, and `--chat-template-file`. The last parameter accepts `.jinja` or `.yaml` files.
- Chat completion requests are now ~85 ms faster after optimizations.
- SSE separator for streaming over the API changed from `\r\n` to `\n` to match OpenAI.
- Migrate TensorRT-LLM from the old ModelRunner API to the new LLM API, which can take any Transformers model as input and has more sampling parameters.
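The separator change matters to clients that split the SSE stream on blank lines, as OpenAI client code typically does. A minimal sketch of parsing such a stream (the sample chunks are illustrative, not actual server output):

```python
import json

def parse_sse(raw: str):
    """Parse an OpenAI-style SSE stream: events are separated by a blank
    line, each data line is prefixed with 'data: ', and the stream ends
    with a [DONE] sentinel."""
    chunks = []
    for event in raw.split("\n\n"):
        for line in event.splitlines():
            if line.startswith("data: "):
                data = line[len("data: "):]
                if data == "[DONE]":
                    return chunks
                chunks.append(json.loads(data))
    return chunks

stream = (
    'data: {"choices": [{"delta": {"content": "Hel"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": "lo"}}]}\n\n'
    'data: [DONE]\n\n'
)
text = "".join(c["choices"][0]["delta"]["content"] for c in parse_sse(stream))
```

With `\r\n` separators, a client splitting strictly on `"\n\n"` would see the whole stream as one event; emitting plain `\n` avoids that mismatch.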
- Security:
  - Prevent path traversal on file save/delete operations for characters, users, and uploaded files.
  - Restrict model loading over the API to block the `extra_flags` and `trust_remote_code` parameters.
  - Restrict file writes to the `user_data_dir`.
- New `--user-data-dir` flag to customize the user data directory location. The program now also auto-detects a `../user_data` folder in portable mode if present, making updates easier.
- User persona support: A new dropdown in the Character settings tab lets you save and load user profiles (name, bio, profile picture), so you can switch between different personas without re-entering your details (#7367). Thanks, @q5sys.
- Replace PyPDF2 with pymupdf for much more accurate conversion of PDF inputs to text.
- Markdown rendering improvements, all by @mamei16.
- Add Qwen 3.5 thinking block support to the UI.
- Add Solar Open thinking block support to the UI.
- Update the entire documentation to match the current code.
- Update all dockerfiles. [Documentation]
- Update the Google Colab notebook.
- Remove the ExLlamaV2 loader, which has been archived. EXL2 users should migrate to EXL3, which has much better quantization accuracy.
- Remove the Training_PRO extension, which has become obsolete after the Training tab updates.
- Remove obsolete DeepSpeed inference code from 2023.
- Remove unused colorama and psutil dependencies.
- Update outdated GitHub Actions versions (#7384). Thanks, @pgoslatara.
## Bug fixes
- Fix `temperature_last` having no effect in the llama.cpp server sampler order.
- Fix code block copy button not working over HTTP (Clipboard API fallback) (#7358). Thanks, @jakubartur.
- Fix message copy buttons not working over HTTP (extend Clipboard API fallback).
- Fix ExLlamaV3 CFG cache initialization and speculative decoding parameter handling.
- Fix blank prompt dropdown in Notebook/Default tabs on first startup.
- Use absolute Python path in Windows batch scripts to fix some rare edge cases.
- Bump sentence-transformers from 2.2.2 to 3.3.1 in superbooga (#7406). Thanks, @OiPunk.
- Fix installer state being saved before requirements were fully installed.
- Fix ExLlamaV3 race condition that could cause AssertionError or hangs during generation.
- Fix API server continuing to generate tokens after client disconnects for non-streaming requests.
## Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/6fce5c6a7dba6a3e1df0aad1574b78d1a1970621
- Update Transformers to 5.3
- Update ExLlamaV3 to 0.0.23
- Update TensorRT-LLM to 1.1.0
- Update PyTorch to 2.9.1
- Update Python to 3.13
- Update ROCm wheels to ROCm 6.4
## Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one from your existing install. All your settings and models will be carried over.
Starting with 4.0, you can also move `user_data` one folder up, next to the install folder. It will be detected automatically, making updates easier:

    text-generation-webui-4.0/
    text-generation-webui-4.1/
    user_data/            <-- shared by both installs