- Support for Phi-3 Mini 4k has been introduced
- A bug causing GPU module crashes on some systems has been resolved
- Support for Command-R Plus has now been vetted with proper 64-bit indexing
- We now support more AMD GPU architectures thanks to better detection of offload archs (#368)
- We now ship prebuilt NVIDIA and ROCm modules for both Windows and Linux users. They link tinyBLAS, a libre math library that only depends on the graphics driver being installed. Since tinyBLAS is slower, llamafile will automatically build a native module for your system if the CUDA or ROCm SDK is installed. You can control this behavior using `--nocompile` or `--recompile` (see the example after this list). Yes, our LLaVA llamafile still manages to squeak under the Windows 4GB file size limit!
- An assertion error has been fixed that happened when using `llamafile-quantize` to create K quants from an F32 GGUF file
- A new `llamafile-tokenize` command line tool has been introduced. For example, if you want to count how many "tokens" are in a text file, you can say `cat file.txt | llamafile-tokenize -m model.llamafile | wc -l` since it prints each token on a single line.
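
As a rough sketch of how the GPU module flags might be used (the `-m` flag mirrors the tokenizer example above; the model filename here is hypothetical):

```sh
# Force a rebuild of the native GPU module using the locally installed
# CUDA or ROCm SDK, even if a module was already compiled and cached.
llamafile --recompile -m model.llamafile

# Skip native compilation and use the prebuilt tinyBLAS module, which
# only requires the graphics driver to be installed.
llamafile --nocompile -m model.llamafile
```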