github oobabooga/textgen v4.9

7 hours ago

Changes

  • MTP speculative decoding support: Add draft-mtp as a new --spec-type option. Auto-enabled when loading MTP GGUFs (e.g. Qwen 3.6 MoE MTP builds).
  • Web search improvements:
    • Add snippet support to the web_search tool: results now include a short text excerpt that often answers the query directly, eliminating the need for a follow-up fetch_webpage call (#7548).
    • Drop link URLs from fetch_webpage output (links now appear as plain text instead of [text](url) markdown), significantly reducing tokens used per page.
    • Prettier rendering of web_search results in the chat, with a spinner during the call.
    • Add an info message to the "Activate web search" checkbox.
  • Show live generation speed (tokens/s) and context size while generating (#7563).
  • DGX Spark support: Add Linux aarch64 portable builds.
  • Electron
    • Add "Check for updates" button in the Session tab.
    • Add a folder picker for the models directory.
    • Add right-click context menu for copying text.
    • Add a spellcheck toggle in the Session tab (#7550).
    • Store app data in user_data/cache/electron instead of the OS default location.
    • Disable DNS-over-HTTPS probes.
  • One-click installer: Track the latest release tag instead of bleeding-edge main.
  • Auto-detect and auto-select sibling mmproj files when loading a model (#7564).
  • Detect mmproj-*.gguf files in the main models folder: They appear in the mmproj dropdown and are hidden from the regular model dropdown.
  • Project icon: Add an icon, courtesy of LMLocalizer on Reddit.
  • Treat negative --ctx-size values as auto (0).
  • UI
    • Add drag-and-drop file upload support to the chat input (Gradio fork).
    • Reorganize the right sidebar with Mode/Character/Chat style on top.
    • Hide reasoning and tools controls in chat mode (only shown in instruct / chat-instruct).
    • Fade in new messages and fix the scroll-up jump on send.
    • Rename "Send dummy message/reply" to "Insert user/assistant message".
    • Polish character dropdown in chat tab.
    • Tighten spacing between dropdowns and refresh buttons.
    • Improve the looks of the Session tab.
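
As an illustration of the fetch_webpage change above (link URLs dropped, link text kept), here is a minimal sketch of that kind of transformation. The function name strip_link_urls and the regular expression are assumptions made for this example; the project's actual implementation is not shown in these notes and may differ.

```python
import re

# Matches inline markdown links such as [text](https://example.com).
# The negative lookbehind skips image syntax, i.e. ![alt](src).
_MD_LINK = re.compile(r'(?<!!)\[([^\]]+)\]\(([^)\s]+)(?:\s+"[^"]*")?\)')


def strip_link_urls(markdown: str) -> str:
    """Replace [text](url) with just text (hypothetical helper)."""
    return _MD_LINK.sub(r"\1", markdown)


if __name__ == "__main__":
    page = "See [the docs](https://example.com/docs) for details."
    print(strip_link_urls(page))
```

On link-heavy pages, URLs often take up more tokens than the visible text, which is where the savings come from.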

Security

  • Restrict CORS to localhost by default to prevent drive-by API access. The --listen and --public-api flags opt into network exposure.
  • Sanitize character name in load_character to prevent path traversal.
  • Prevent path traversal in load_template_by_name (#7562). Thanks, @Allen930311.
  • UI: Improve web search security by rejecting non-HTTP links.
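
For readers unfamiliar with the path traversal fixes above, the following is a minimal sketch of one common defense: resolve the requested file and check that it still lives inside the expected directory. The helper name resolve_inside and the .yaml suffix are assumptions for illustration only; this is not the code used in load_character or load_template_by_name.

```python
from pathlib import Path


def resolve_inside(base_dir: str, name: str, suffix: str = ".yaml") -> Path:
    """Return base_dir/name+suffix, refusing names that escape base_dir.

    Hypothetical helper; requires Python 3.9+ for Path.is_relative_to.
    """
    base = Path(base_dir).resolve()
    candidate = (base / f"{name}{suffix}").resolve()
    # After resolving ".." segments and symlinks, the result must still
    # be located under the base directory.
    if not candidate.is_relative_to(base):
        raise ValueError(f"unsafe name: {name!r}")
    return candidate
```

With this check, a name such as ../../etc/passwd raises ValueError instead of reaching a file outside the base directory.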

Bug fixes

  • Fix llama-server not being killed when the parent process exits on Windows, e.g. when closing the console window or killing python.exe (#7574).
  • Fix streaming output leaking across chats when switching mid-stream (#7555).
  • Fix continue-mode regressions across template families.
  • Fix incorrect prompts generated with continue mode. Thanks, @MeemeeLab.
  • Fix thinking channel being lost across tool-call turns (#7578).
  • Fix API model load silently dropping hyphenated arg keys (#7577).
  • Fix chat deletion failing when user_data/logs is a symlink (#7579).
  • Fix token count not being set in non-streaming mode.
  • Keep web search blocks closed when the user closes them mid-stream.
  • Windows: Set PYTHONUTF8 for compatibility with non-ASCII locales (#7560). Thanks, @jerry78424.
  • Set TORCH_VERSION to 2.9.0 to match xformers 0.0.33's torch pin (#7581). Thanks, @AJ-Gazin.
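
The hyphenated arg keys fix (#7577) concerns keys such as ctx-size being dropped when a model is loaded through the API. One plausible way to avoid that class of bug is to normalize keys before matching them against underscore-style names. The sketch below assumes that approach; the function name normalize_arg_keys is hypothetical and the actual fix may work differently.

```python
def normalize_arg_keys(args: dict) -> dict:
    """Map keys like 'ctx-size' to 'ctx_size' (hypothetical helper).

    Values are passed through unchanged; only the key spelling is altered.
    """
    normalized = {}
    for key, value in args.items():
        normalized[key.replace("-", "_")] = value
    return normalized


if __name__ == "__main__":
    print(normalize_arg_keys({"ctx-size": 8192, "gpu_layers": 32}))
```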

Dependency updates

Portable builds

TextGen is now a desktop app for local LLMs. Download, unzip, double-click.

Note

NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.

ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
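
The NVIDIA rule in the note above can be expressed as a small check on the nvidia-smi header, which prints a field of the form "CUDA Version: X.Y". This is a minimal sketch; the function name pick_cuda_build is an assumption for illustration and is not part of TextGen.

```python
import re


def pick_cuda_build(nvidia_smi_output: str) -> str:
    """Return 'cuda13.1' or 'cuda12.4' based on nvidia-smi header text."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        raise ValueError("no 'CUDA Version' field found")
    version = (int(match.group(1)), int(match.group(2)))
    # Tuples compare element by element, so (13, 0) < (13, 1) <= (13, 2).
    return "cuda13.1" if version >= (13, 1) else "cuda12.4"
```

To use it, pass in the text printed by running nvidia-smi in a terminal.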

Windows

GPU/Platform              llama.cpp          ik_llama.cpp
NVIDIA (CUDA 12.4)        Download (936 MB)  Download (1.24 GB)
NVIDIA (CUDA 13.1)        Download (840 MB)  Download (1.33 GB)
AMD/Intel (Vulkan)        Download (336 MB)  -
AMD (ROCm 7.2)            Download (617 MB)  -
CPU only                  Download (319 MB)  Download (335 MB)

Linux

GPU/Platform              llama.cpp          ik_llama.cpp
NVIDIA (CUDA 12.4)        Download (893 MB)  Download (1.21 GB)
NVIDIA (CUDA 13.1)        Download (826 MB)  Download (1.33 GB)
NVIDIA ARM64 (CUDA 13.1)  Download (910 MB)  -
AMD/Intel (Vulkan)        Download (324 MB)  -
AMD (ROCm 7.2)            Download (409 MB)  -
CPU only                  Download (307 MB)  Download (338 MB)

macOS

macOS note: You need to run xattr -cr /path/to/your/textgen-folder on the extracted folder before launching. See #7558.

Architecture llama.cpp
Apple Silicon (arm64) Download (272 MB)
Intel (x86_64) Download (284 MB)

Updating a portable install:

  1. Download and extract the latest version.
  2. Replace the user_data folder in the new version with the one from your existing install. All your settings and models will be carried over.

Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:

textgen-4.6/
textgen-4.7/
user_data/    <-- shared by both installs
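
To make the shared layout above concrete, here is a minimal sketch of how a lookup for user_data could work: check next to the install folder, then inside it. The function name find_user_data and the order of the two checks are assumptions for illustration; these notes only state that the shared folder is detected automatically, not how.

```python
from pathlib import Path
from typing import Optional


def find_user_data(install_dir: str) -> Optional[Path]:
    """Locate a user_data folder for an install (hypothetical helper).

    Assumed order: the shared folder one level up, then the folder
    inside the install itself. Returns None if neither exists.
    """
    install = Path(install_dir).resolve()
    for candidate in (install.parent / "user_data", install / "user_data"):
        if candidate.is_dir():
            return candidate
    return None
```

With the layout shown above, both textgen-4.6/ and textgen-4.7/ would resolve to the same sibling user_data/ folder.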
