Changes
- MTP speculative decoding support: Add `draft-mtp` as a new `--spec-type` option. Auto-enabled when loading MTP GGUFs (e.g. Qwen 3.6 MoE MTP builds).
- Web search improvements:
  - Add snippet support to the `web_search` tool: results now include a short text excerpt that often answers the query directly, eliminating the need for a follow-up `fetch_webpage` call (#7548).
  - Drop link URLs from `fetch_webpage` output (links now appear as plain text instead of `[text](url)` markdown), significantly reducing tokens used per page.
  - Prettier rendering of `web_search` results in the chat, with a spinner during the call.
  - Add an info message to the "Activate web search" checkbox.
- Show live generation speed (tokens/s) and context size while generating (#7563).
- DGX Spark support: Add Linux aarch64 portable builds.
- Electron
  - Add "Check for updates" button in the Session tab.
  - Add a folder picker for the models directory.
  - Add right-click context menu for copying text.
  - Add a spellcheck toggle in the Session tab (#7550).
  - Store app data in `user_data/cache/electron` instead of the OS default location.
  - Disable DNS-over-HTTPS probes.
- One-click installer: Track the latest release tag instead of bleeding-edge `main`.
- Auto-detect and auto-select sibling mmproj files when loading a model (#7564).
- Detect `mmproj-*.gguf` files in the main models folder: They appear in the mmproj dropdown and are hidden from the regular model dropdown.
- Project icon: Add an icon, courtesy of LMLocalizer on Reddit.
- Treat negative `--ctx-size` values as auto (0).
- UI
  - Add drag-and-drop file upload support to the chat input (Gradio fork).
  - Reorganize the right sidebar with Mode/Character/Chat style on top.
  - Hide reasoning and tools controls in chat mode (only shown in instruct / chat-instruct).
  - Fade in new messages, fix scroll-up jump on send.
  - Rename "Send dummy message/reply" to "Insert user/assistant message".
  - Polish character dropdown in chat tab.
  - Tighten spacing between dropdowns and refresh buttons.
  - Improve the looks of the Session tab.
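The token saving from dropping link URLs in `fetch_webpage` output (see the web search items above) can be illustrated with a short sketch. This is not the project's actual code: `strip_link_urls` is a hypothetical helper, and the regex only handles simple inline `[text](url)` links.

```python
import re

# Matches simple inline markdown links: [text](url).
# The negative lookbehind leaves image syntax (![alt](src)) untouched.
_INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\([^)\s]+\)")


def strip_link_urls(markdown: str) -> str:
    """Replace each [text](url) with just its text (hypothetical helper)."""
    return _INLINE_LINK.sub(r"\1", markdown)


page = "See the [install guide](https://example.com/docs/install?ref=home) first."
print(strip_link_urls(page))
```

Long URLs with query strings tend to cost many tokens while adding little for the model, which is why keeping only the link text shrinks a page noticeably.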
Security
- Restrict CORS to localhost by default to prevent drive-by API access. `--listen` and `--public-api` opt into network exposure.
- Sanitize character name in `load_character` to prevent path traversal.
load_characterto prevent path traversal. - fix: prevent path traversal in load_template_by_name (#7562). Thanks, @Allen930311.
- UI: Improve web search security by rejecting non-HTTP links.
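As a rough illustration of the two kinds of checks mentioned in this list (path traversal and non-HTTP links), here is a minimal sketch. It is not the project's implementation: `safe_child_path` and `is_http_url` are hypothetical names, and the `.yaml` suffix and `characters` folder are assumptions made for the example.

```python
from pathlib import Path
from urllib.parse import urlsplit


def safe_child_path(base_dir: str, name: str, suffix: str = ".yaml") -> Path:
    """Resolve base_dir/name+suffix and reject anything that escapes base_dir.

    Hypothetical helper; requires Python 3.9+ for Path.is_relative_to.
    """
    base = Path(base_dir).resolve()
    candidate = (base / f"{name}{suffix}").resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"refusing path outside {base}: {name!r}")
    return candidate


def is_http_url(url: str) -> bool:
    """Accept only http:// and https:// links (hypothetical helper)."""
    return urlsplit(url).scheme in ("http", "https")


print(safe_child_path("characters", "Assistant").name)
print(is_http_url("file:///etc/passwd"))
```

Resolving the path first and then comparing it against the base directory catches both `../` sequences and absolute paths, which a simple string check on the name can miss.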
Bug fixes
- Fix llama-server not being killed when the parent process exits on Windows, e.g. when closing the console window or killing python.exe (#7574).
- Fix streaming output leaking across chats when switching mid-stream (#7555).
- Fix continue-mode regressions across template families.
- Fix incorrect prompts generated with continue mode. Thanks, @MeemeeLab.
- Fix thinking channel being lost across tool-call turns (#7578).
- Fix API model load silently dropping hyphenated arg keys (#7577).
- Fix chat deletion failing when `user_data/logs` is a symlink (#7579).
- Fix token count not being set in non-streaming mode.
- Keep web search blocks closed when the user closes them mid-stream.
- fix(win): set PYTHONUTF8 for non-ASCII locale Windows compatibility (#7560). Thanks, @jerry78424.
- Set TORCH_VERSION to 2.9.0 to match xformers 0.0.33's torch pin (#7581). Thanks, @AJ-Gazin.
Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@e947228
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@40254a5
- Update ExLlamaV3 to 0.0.34
Portable builds
TextGen is now a desktop app for local LLMs. Download, unzip, double-click.
Note
NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.
ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
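The CUDA rule in the note above can be expressed as a small helper. This is only a sketch: `pick_cuda_build` is a hypothetical function, and it assumes the `nvidia-smi` header contains a `CUDA Version: X.Y` field, as it does on typical driver installs.

```python
import re
from typing import Optional


def pick_cuda_build(nvidia_smi_output: str) -> Optional[str]:
    """Return the build name suggested by the note, or None if no version is found.

    Hypothetical helper: parses the 'CUDA Version: X.Y' field from nvidia-smi output.
    """
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    version = (int(match.group(1)), int(match.group(2)))
    # Tuple comparison: (13, 1) and above selects the CUDA 13.1 build.
    return "cuda13.1" if version >= (13, 1) else "cuda12.4"


sample = "| NVIDIA-SMI 550.54.14   Driver Version: 550.54.14   CUDA Version: 12.4 |"
print(pick_cuda_build(sample))
```

To use it on a real machine, the text could be obtained with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`.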
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (936 MB) | Download (1.24 GB) |
| NVIDIA (CUDA 13.1) | Download (840 MB) | Download (1.33 GB) |
| AMD/Intel (Vulkan) | Download (336 MB) | — |
| AMD (ROCm 7.2) | Download (617 MB) | — |
| CPU only | Download (319 MB) | Download (335 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (893 MB) | Download (1.21 GB) |
| NVIDIA (CUDA 13.1) | Download (826 MB) | Download (1.33 GB) |
| NVIDIA ARM64 (CUDA 13.1) | Download (910 MB) | — |
| AMD/Intel (Vulkan) | Download (324 MB) | — |
| AMD (ROCm 7.2) | Download (409 MB) | — |
| CPU only | Download (307 MB) | Download (338 MB) |
macOS
macOS note: You need to run xattr -cr /path/to/your/textgen-folder on the extracted folder before launching. See #7558.
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (272 MB) |
| Intel (x86_64) | Download (284 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
textgen-4.6/
textgen-4.7/
user_data/ <-- shared by both installs
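The shared layout above suggests a lookup along the following lines. This is a guess at the behavior, not the project's code: `find_user_data` is a hypothetical name, and the assumption that a sibling `user_data` folder takes priority over the one inside the install folder is not stated in these notes.

```python
from pathlib import Path


def find_user_data(install_dir: str) -> Path:
    """Return the user_data folder an install would use (hypothetical helper).

    Assumption: a user_data folder next to the install folder wins over
    the one inside it.
    """
    install = Path(install_dir)
    shared = install.parent / "user_data"
    if shared.is_dir():
        return shared
    return install / "user_data"


print(find_user_data("textgen-4.7"))
```

With this layout, a new version only needs to be extracted beside the old one, and no folder has to be copied between installs.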