
Changes
- Estimate the VRAM for GGUF models using a statistical model and automatically set `gpu-layers` on NVIDIA GPUs (#6980).
  - When you select a GGUF model in the UI, you will see an estimate for its VRAM usage, and the number of layers will be set based on the available (free, not total) VRAM on your system.
  - If you change `ctx-size` or `cache-type` in the UI, the number of layers will be recalculated and updated in real time.
  - If you load a model through the command line with e.g. `--model model.gguf --ctx-size 32768 --cache-type q4_0`, the number of GPU layers will also be calculated automatically, without the need to set `--gpu-layers` (see the launch sketch after this list).
  - It works even with multipart GGUF models and on systems with multiple GPUs.
- Greatly simplify the Model tab by splitting settings between "Main options" and "Other options", where "Other options" is in a closed accordion by default.
- Tools support for the OpenAI-compatible API (#6827); see the request sketch after this list. Thanks, @jkrauss82.
- Dynamic Chat Message UI update speed (#6952). This is a major UI optimization in Chat mode that renders `max_updates_second` obsolete. Thanks, @mamei16 for the very clever idea.
- Optimize the Chat tab JavaScript, reducing its CPU usage (#6948).
- Add the `top_n_sigma` sampler to the llama.cpp loader.
- Streamline the UI in portable builds: hide things that do not work, such as training; only show the llama.cpp loader; and do not include extensions that do not work. The latter should reduce the build sizes.
- Invert user/assistant message colors in instruct mode to make assistant messages darker and more readable.
- Improve the light theme colors.
- Add a minimum height to the streaming reply to prevent constant scrolling during chat streaming, similar to how ChatGPT and Claude work.
- Show the list of files if the user tries to download an entire GGUF repository instead of a specific file.
- llama.cpp: Handle short arguments in `--extra-flags`, like `ot`.
- Save the chat history right after sending a message and periodically during streaming to prevent losing messages.
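
As referenced above, here is a usage sketch of the automatic GPU-layer calculation when loading from the command line. It assumes a standard install launched through `server.py`, and `model.gguf` is a placeholder file name.

```sh
# Load a GGUF model with a 32768-token context and a q4_0 KV cache.
# --gpu-layers is intentionally omitted: the number of GPU layers is now
# calculated automatically from the free VRAM on the NVIDIA GPU(s).
python server.py --model model.gguf --ctx-size 32768 --cache-type q4_0
```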
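A minimal sketch of a request exercising the new tools support (#6827) through the OpenAI-compatible API follows. The address and port (`http://127.0.0.1:5000`) are assumed defaults, and the `get_weather` tool definition is purely hypothetical.

```sh
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```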
Bug fixes
- API: Fix llama.cpp continuing to generate in the background after cancelling the request, improve disconnect detection, fix deadlock on simultaneous requests.
- Fix `typical_p` in the llama.cpp sampler priority.
- Fix manual random seeds in llama.cpp.
- Add a retry mechanism when using the `/internal/logits` API endpoint with the llama.cpp loader to fix random failures.
- Ensure environment isolation in portable builds to avoid conflicts.
- docker: Fix app UID typo in the Docker Compose files (#6957 and #6958). Thanks, @enovikov11.
- Docker fix for NVIDIA (#6964). Thanks, @phokur.
- SuperboogaV2: Minor update to avoid JSON serialization errors (#6945). Thanks, @alirezagsm.
- Fix model config loading in shared.py for Python 3.13 (#6961). Thanks, @Downtown-Case.
Backend updates
- llama.cpp: Update to ggml-org/llama.cpp@c6a2c9e.
- ExLlamaV3: Update to turboderp-org/exllamav3@a905cff.
Portable builds
Below you can find portable builds: self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Choosing the right build:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4` for newer GPUs or `cuda11.7` for older GPUs and systems with older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel CPU: Use `macos-x86_64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder in the new version with the one from your existing install, as sketched below. All your settings and models will be carried over.
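
A minimal sketch of that update procedure on Linux/macOS; the archive and folder names below are placeholders for whichever build you downloaded.

```sh
# Unzip the new release next to the existing install (names are placeholders).
unzip textgen-portable-new.zip -d textgen-portable-new

# Replace the new user_data folder with the one from the existing install
# so that settings and models are carried over.
rm -rf textgen-portable-new/user_data
cp -r textgen-portable-old/user_data textgen-portable-new/
```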