✨ Changes
- New llama.cpp loader (#6846). A brand new, lightweight llama.cpp loader based on `llama-server` has been added, replacing `llama-cpp-python`. With that:
  - New sampling parameters are now available in the llama.cpp loader, including `xtc`, `dry`, and `dynatemp` (see the sketch after this list).
  - llama.cpp has been updated to the latest version, adding support for the new Llama-4-Scout-17B-16E-Instruct model.
  - The installation size for the project has been reduced.
  - llama.cpp performance should be slightly faster.
  - llamacpp_HF had to be removed :( There is just 1 llama.cpp loader from now on.
  - llama.cpp updates will be much more frequent from now on.
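  Since the new loader is built on `llama-server`, these samplers correspond to `llama-server`'s own request parameters. The snippet below is a minimal sketch, not part of the release: it sends the new sampling options straight to a locally running `llama-server`. The address, port, and exact parameter names (`dry_multiplier`, `xtc_probability`, `dynatemp_range`, etc.) follow `llama-server`'s `/completion` API and may differ in your build.

  ```python
  # Minimal sketch: call a local llama-server directly with the new samplers.
  # Assumes llama-server is running on its default address (127.0.0.1:8080).
  import json
  import urllib.request

  payload = {
      "prompt": "Write a haiku about spring rain.",
      "n_predict": 64,
      "temperature": 1.0,
      # DRY ("don't repeat yourself") repetition penalty
      "dry_multiplier": 0.8,
      "dry_base": 1.75,
      # XTC ("exclude top choices") sampling
      "xtc_probability": 0.5,
      "xtc_threshold": 0.1,
      # Dynamic temperature: temperature is allowed to vary within +/- this range
      "dynatemp_range": 0.5,
  }

  req = urllib.request.Request(
      "http://127.0.0.1:8080/completion",  # assumed default llama-server endpoint
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.loads(resp.read())["content"])
  ```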
- Smoother chat streaming in the UI. Words now appear one at a time in the Chat tab instead of in chunks, which makes streaming feel nicer.
- Allow for model subfolder organization for GGUF files (#6686). Thanks, @Googolplexed0.
  - With that, llama.cpp models can be placed in subfolders inside `text-generation-webui/models` for better organization (or for importing files from LM Studio). An illustrative layout is shown below.
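  As an illustration of the subfolder support, here is one possible layout; the folder and file names are made up:

  ```
  text-generation-webui/
  └── models/
      ├── llama-4/
      │   └── Llama-4-Scout-17B-16E-Instruct-Q4_K_M.gguf
      └── lm-studio-imports/
          └── some-other-model-Q8_0.gguf
  ```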
- Remove some obsolete command-line flags to clean up the repository.
🔧 Bug fixes
- Fix an overflow bug in ExLlamaV2_HF introduced after recent updates.
- Fix GPTQ models being loaded through Transformers instead of ExLlamaV2_HF.
🔄 Backend updates
- llama.cpp: Bump to commit `b9154ecff93ff54dc554411eb844a2a654be49f2` from April 18th, 2025.
- ExLlamaV3: Bump to commit `c44e56c73b2c67eee087c7195c9093520494d3bf` from April 18th, 2025.