## Changes
- UI:
  - Add a new "Branch chat" option to the chat tab.
  - Add a new "Search chats" menu to the chat tab.
  - Improve handling of markdown lists (#6626). This greatly improves the rendering of lists and nested lists in the UI. Thanks, @mamei16.
  - Reduce the size of the HTML and CSS sent to the UI during streaming. This improves performance and reduces CPU usage.
  - Optimize the JavaScript to reduce CPU usage during streaming.
  - Add a horizontal scrollbar to code blocks that are wider than the chat area.
- Make responses start faster by removing unnecessary cleanup calls (#6625). This removes a 0.2 second delay for llama.cpp and ExLlamaV2 while also increasing the reported tokens/second.
- Add a `--torch-compile` flag for transformers (improves performance).
- Add a "Static KV cache" option for transformers (improves performance). A brief sketch of both options follows this list.
- Connect XTC, DRY, smoothing_factor, and dynatemp to the ExLlamaV2 loader (non-HF).
- Remove the AutoGPTQ loader (#6641). The project was discontinued, and no wheels had been available for a while. GPTQ models can still be loaded through ExLlamaV2.
- Streamline the one-click installer by asking one question to NVIDIA users instead of two.
- Add a `--exclude-pattern` flag to the `download-model.py` script (#6542). Thanks, @JackCloudman. An illustration of the idea follows this list.
- Add IPv6 support to the API (#6559). Thanks, @BPplays.
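
For reference, a minimal sketch of what the two new transformers performance options correspond to at the library level. The model name and prompt are placeholders, and the exact wiring inside the transformers loader may differ:

```python
# Rough sketch of the two options at the transformers/PyTorch level; the model
# name is a placeholder and the real loader code may wire this up differently.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# --torch-compile: compile the forward pass so repeated decoding steps run faster
model.forward = torch.compile(model.forward, mode="reduce-overhead")

# "Static KV cache": pre-allocate a fixed-size key/value cache, which avoids
# re-allocations during decoding and lets torch.compile optimize more aggressively
inputs = tokenizer("Hello, my name is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```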
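
And a minimal illustration of the idea behind `--exclude-pattern`. The file list and pattern below are made up, and the actual matching logic in `download-model.py` may differ:

```python
# Illustrative only: filter a repository file list the way an exclude pattern
# might, using a regular expression. Not the script's actual implementation.
import re

files = ["model-00001-of-00002.safetensors", "model.gguf", "tokenizer.json"]
exclude_pattern = r"\.gguf$"  # hypothetical pattern: skip GGUF files

kept = [f for f in files if not re.search(exclude_pattern, f)]
print(kept)  # ['model-00001-of-00002.safetensors', 'tokenizer.json']
```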
## Bug fixes
- Fix an `orjson.JSONDecodeError` on page reload.
- Fix the font size of lists in chat mode.
- Fix CUDA error on MPS backend during API request (#6572). Thanks, @skywinder.
- Add a `UnicodeDecodeError` workaround for `modules/llamacpp_model.py` (#6040). Thanks, @nclok1405.
- Training_PRO fix: add an `if 'quantization_config' in shared.model.config.to_dict()` check (#6640). Thanks, @FartyPants. A sketch of this guard follows the list.
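
The Training_PRO change boils down to checking for the key before reading it. A standalone sketch of that guard (the helper name is hypothetical; `shared.model` is the model loaded by the web UI):

```python
# Illustrative version of the guard added in Training_PRO; this standalone
# helper mirrors the idea rather than the extension's actual code.
def get_quantization_config(model):
    config_dict = model.config.to_dict()
    if 'quantization_config' in config_dict:
        # Only quantized models carry this key; reading it unconditionally
        # raises a KeyError on full-precision models.
        return config_dict['quantization_config']
    return None  # full-precision model: no quantization-specific handling
```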
## Backend updates
- llama-cpp-python: bump to 0.3.6 (llama.cpp commit `f7cd13301c2a88f97073fd119072b4cc92c08df1`, January 8, 2025).