✨ Changes
- Add an option to enable/disable thinking for Qwen3 models (and any future models with this feature). You can find it as a checkbox under Parameters > enable_thinking.
  - Thinking is enabled by default.
  - The option works directly with the model's Jinja2 chat template (see the sketch after this list).
- Make `<think>` UI blocks closed by default.
- Set `max_updates_second` to 12 by default. This prevents CPU bottlenecks when reasoning models generate extremely long replies at 50 tokens/second.
- Find a new API port automatically if the default one is taken (see the sketch after this list).
- Make `--verbose` print the `llama-server` launch command to the console.
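
For reference on how the enable_thinking option reaches the model: Qwen3's Jinja2 chat template checks an `enable_thinking` variable when the prompt is rendered. Below is a minimal sketch using Hugging Face transformers to show the template-level effect; the model name is only an example and this is not the webui's internal code:

```python
from transformers import AutoTokenizer

# Example Qwen3 checkpoint; any model whose chat template supports enable_thinking works.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Extra keyword arguments are forwarded to the Jinja2 chat template,
# which is where Qwen3 decides whether to leave room for a <think> block.
with_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
without_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(without_thinking)  # ends with a pre-filled empty <think></think> block
```

With `enable_thinking=False`, Qwen3's template pre-fills an empty think block so the model answers directly instead of reasoning first.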
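
The automatic port fallback can be pictured with a short sketch (an illustration under assumptions, not the actual implementation; the default port of 5000 is just an example):

```python
import socket

def find_free_port(preferred: int = 5000, max_tries: int = 100) -> int:
    """Return the preferred port if it is free, otherwise the next free port above it."""
    for port in range(preferred, preferred + max_tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("0.0.0.0", port))
                return port  # bind succeeded, so the port is available
            except OSError:
                continue  # port already in use, try the next one
    raise RuntimeError("no free port found in range")

print(find_free_port())
```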
🔧 Bug fixes
- Fix ExLlamaV3_HF leaking memory, especially for long prompts/conversations.
- Fix the `streaming_llm` UI checkbox not being interactive.
- Fix the `max_updates_second` UI parameter not working.
- Fix getting the llama.cpp token probabilities for `Qwen3-30B-A3B` through the API.
- Fix CFG with ExLlamaV2_HF.
🔄 Backend updates
- llama.cpp: Update to ggml-org/llama.cpp@3e168be
- ExLlamaV3: Update to turboderp-org/exllamav3@4724b86.
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Choosing the right build:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one from your existing install. All your settings and models will carry over.
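
If you prefer to script the user_data step, here is a simple sketch (the folder names below are hypothetical placeholders for your old and new unzipped installs):

```python
import shutil
from pathlib import Path

old_install = Path("text-generation-webui-old")  # existing portable install (placeholder path)
new_install = Path("text-generation-webui-new")  # freshly unzipped build (placeholder path)

# Replace the fresh user_data folder with the one from the existing install,
# carrying over settings and downloaded models.
shutil.rmtree(new_install / "user_data")
shutil.copytree(old_install / "user_data", new_install / "user_data")
```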