oobabooga/textgen v4.9 on GitHub

Changes

MTP speculative decoding support: Add draft-mtp as a new --spec-type option. Auto-enabled when loading MTP GGUFs (e.g. Qwen 3.6 MoE MTP builds).
Web search improvements:
- Add snippet support to the web_search tool: results now include a short text excerpt that often answers the query directly, eliminating the need for a follow-up fetch_webpage call (#7548).
- Drop link URLs from fetch_webpage output (links now appear as plain text instead of [text](url) markdown), significantly reducing tokens used per page.
- Prettier rendering of web_search results in the chat, with a spinner during the call.
- Add an info message to the "Activate web search" checkbox.
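
As an illustration of the fetch_webpage change above, turning [text](url) links into plain text can be sketched with a single regular expression. This is a minimal sketch, not the project's implementation: the helper name strip_link_urls and the choice to leave image links untouched are assumptions.

```python
import re

# Matches inline markdown links such as [text](https://example.com "title").
# The negative lookbehind skips image links of the form ![alt](src).
_LINK_RE = re.compile(r'(?<!!)\[([^\]]+)\]\(([^)\s]+)(?:\s+"[^"]*")?\)')


def strip_link_urls(markdown: str) -> str:
    """Replace [text](url) with just text (hypothetical helper)."""
    return _LINK_RE.sub(r"\1", markdown)


if __name__ == "__main__":
    page = "See [the docs](https://example.com/docs) for details."
    print(strip_link_urls(page))
```

On link-heavy pages the URLs can take up more characters than the visible text, which is where the token savings come from.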
Show live generation speed (tokens/s) and context size while generating (#7563).
DGX Spark support: Add Linux aarch64 portable builds.
Electron
- Add "Check for updates" button in the Session tab.
- Add a folder picker for the models directory.
- Add right-click context menu for copying text.
- Add a spellcheck toggle in the Session tab (#7550).
- Store app data in user_data/cache/electron instead of the OS default location.
- Disable DNS-over-HTTPS probes.
One-click installer: Track the latest release tag instead of bleeding-edge main.
Auto-detect and auto-select sibling mmproj files when loading a model (#7564).
Detect mmproj-*.gguf files in the main models folder: They appear in the mmproj dropdown and are hidden from the regular model dropdown.
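
The mmproj detection described above amounts to partitioning the GGUF files in the models folder by filename. The sketch below shows one way to do that; the function name split_model_files and the case-insensitive matching are assumptions, and the real dropdown logic may differ.

```python
from fnmatch import fnmatchcase


def split_model_files(filenames):
    """Partition GGUF filenames into (models, mmproj) lists.

    Hypothetical helper: files matching mmproj-*.gguf go to the mmproj
    list, every other .gguf file goes to the regular model list.
    """
    models, mmproj = [], []
    for name in sorted(filenames):
        lowered = name.lower()  # assumption: match case-insensitively
        if not lowered.endswith(".gguf"):
            continue  # ignore non-GGUF files entirely
        if fnmatchcase(lowered, "mmproj-*.gguf"):
            mmproj.append(name)
        else:
            models.append(name)
    return models, mmproj


if __name__ == "__main__":
    print(split_model_files(["mmproj-F16.gguf", "Qwen-Q4_K_M.gguf", "notes.txt"]))
```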
Project icon: Add an icon, courtesy of LMLocalizer on Reddit.
Treat negative --ctx-size values as auto (0).
UI
- Add drag-and-drop file upload support to the chat input (Gradio fork).
- Reorganize the right sidebar with Mode/Character/Chat style on top.
- Hide reasoning and tools controls in chat mode (only shown in instruct / chat-instruct).
- Fade in new messages, fix scroll-up jump on send.
- Rename "Send dummy message/reply" to "Insert user/assistant message".
- Polish character dropdown in chat tab.
- Tighten spacing between dropdowns and refresh buttons.
- Improve the looks of the Session tab.

Security

Restrict CORS to localhost by default to prevent drive-by API access. --listen and --public-api opt into network exposure.
Sanitize character name in load_character to prevent path traversal.
Prevent path traversal in load_template_by_name (#7562). Thanks, @Allen930311.
UI: Improve web search security by rejecting non-HTTP links.
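
The path traversal and link filtering fixes above follow two common patterns, sketched below. This is not the code from the release: safe_child_path, is_http_url, and the .yaml suffix are hypothetical, and treating both http and https as acceptable is an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse


def safe_child_path(base_dir: str, name: str, suffix: str = ".yaml") -> Path:
    """Resolve base_dir/name+suffix and refuse anything outside base_dir."""
    base = Path(base_dir).resolve()
    candidate = (base / f"{name}{suffix}").resolve()
    # After resolving ".." segments and absolute names, the result must
    # still live under the base directory (Path.is_relative_to: Python 3.9+).
    if not candidate.is_relative_to(base):
        raise ValueError(f"unsafe name: {name!r}")
    return candidate


def is_http_url(url: str) -> bool:
    """Accept only http(s) URLs that include a host."""
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    print(is_http_url("https://example.com/page"))  # accepted
    print(is_http_url("file:///etc/passwd"))        # rejected
```

Resolving the path first and then checking containment handles both ../ sequences and absolute names, which simple string filtering tends to miss.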

Bug fixes

Fix llama-server not being killed when the parent process exits on Windows, e.g. when closing the console window or killing python.exe (#7574).
Fix streaming output leaking across chats when switching mid-stream (#7555).
Fix continue-mode regressions across template families.
Fix incorrect prompts generated with continue mode. Thanks, @MeemeeLab.
Fix thinking channel being lost across tool-call turns (#7578).
Fix API model load silently dropping hyphenated arg keys (#7577).
Fix chat deletion failing when user_data/logs is a symlink (#7579).
Fix token count not being set in non-streaming mode.
Keep web search blocks closed when the user closes them mid-stream.
Windows: Set PYTHONUTF8 for compatibility with non-ASCII locales (#7560). Thanks, @jerry78424.
Set TORCH_VERSION to 2.9.0 to match xformers 0.0.33's torch pin (#7581). Thanks, @AJ-Gazin.
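
For the hyphenated arg keys fix above, one plausible approach is to normalize keys before matching them against known settings. The sketch below is an assumption about the general technique, not a description of what #7577 changed; normalize_arg_keys is a hypothetical name.

```python
def normalize_arg_keys(args: dict) -> dict:
    """Map keys like 'ctx-size' to 'ctx_size' (hypothetical helper).

    Raises instead of silently dropping a value when two keys collapse
    to the same normalized name.
    """
    normalized = {}
    for key, value in args.items():
        new_key = key.replace("-", "_")
        if new_key in normalized:
            raise ValueError(f"duplicate argument after normalization: {new_key}")
        normalized[new_key] = value
    return normalized


if __name__ == "__main__":
    print(normalize_arg_keys({"ctx-size": 8192, "spec-type": "draft-mtp"}))
```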

Dependency updates

Update llama.cpp to ggml-org/llama.cpp@e947228
Update ik_llama.cpp to ikawrakow/ik_llama.cpp@40254a5
Update ExLlamaV3 to 0.0.34

Portable builds

TextGen is now a desktop app for local LLMs. Download, unzip, double-click.

Note

NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.

ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
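
The CUDA rule in the note above can be expressed as a small check against the "CUDA Version: X.Y" field that nvidia-smi prints in its header. This is a sketch under that assumption; pick_cuda_build is a hypothetical helper and is not shipped with TextGen.

```python
import re


def pick_cuda_build(nvidia_smi_output: str) -> str:
    """Return 'cuda13.1' or 'cuda12.4' based on the reported CUDA version."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        raise ValueError("no CUDA Version found in nvidia-smi output")
    # Compare as an integer tuple so that 13.10 sorts above 13.2.
    version = (int(match.group(1)), int(match.group(2)))
    return "cuda13.1" if version >= (13, 1) else "cuda12.4"


if __name__ == "__main__":
    header = "| NVIDIA-SMI 550.54.15  Driver Version: 550.54.15  CUDA Version: 12.4 |"
    print(pick_cuda_build(header))
```

To use it with live output, pass the text returned by running nvidia-smi, for example via subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout.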

Windows

GPU/Platform	llama.cpp	ik_llama.cpp
NVIDIA (CUDA 12.4)	Download (936 MB)	Download (1.24 GB)
NVIDIA (CUDA 13.1)	Download (840 MB)	Download (1.33 GB)
AMD/Intel (Vulkan)	Download (336 MB)	—
AMD (ROCm 7.2)	Download (617 MB)	—
CPU only	Download (319 MB)	Download (335 MB)

Linux

GPU/Platform	llama.cpp	ik_llama.cpp
NVIDIA (CUDA 12.4)	Download (893 MB)	Download (1.21 GB)
NVIDIA (CUDA 13.1)	Download (826 MB)	Download (1.33 GB)
NVIDIA ARM64 (CUDA 13.1)	Download (910 MB)	—
AMD/Intel (Vulkan)	Download (324 MB)	—
AMD (ROCm 7.2)	Download (409 MB)	—
CPU only	Download (307 MB)	Download (338 MB)

macOS

macOS note: You need to run xattr -cr /path/to/your/textgen-folder on the extracted folder before launching. See #7558.

Architecture	llama.cpp
Apple Silicon (arm64)	Download (272 MB)
Intel (x86_64)	Download (284 MB)

Updating a portable install:

Download and extract the latest version.
Replace the user_data folder in the new version with the one from your existing install. All your settings and models come with it.

Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:

textgen-4.6/
textgen-4.7/
user_data/    <-- shared by both installs
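
Moving user_data up one level, as in the layout above, can be scripted. The following is a minimal sketch: share_user_data is a hypothetical helper, and it deliberately refuses to overwrite a shared user_data folder that already exists.

```python
import shutil
from pathlib import Path


def share_user_data(install_dir: str) -> Path:
    """Move install_dir/user_data one level up, next to the install folder."""
    install = Path(install_dir).resolve()
    source = install / "user_data"
    target = install.parent / "user_data"
    if target.exists():
        # Never merge or overwrite automatically; leave that to the user.
        raise FileExistsError(f"{target} already exists; merge manually")
    if not source.is_dir():
        raise FileNotFoundError(f"{source} not found")
    shutil.move(str(source), str(target))
    return target
```

For example, share_user_data("/path/to/textgen-4.9") would leave user_data next to the textgen-4.9 folder, where later versions extracted alongside it can pick it up.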