llamafile lets you distribute and run LLMs with a single file
This release improves the performance and accuracy of both CPU and GPU computations, and also addresses a security issue.
- tinyBLAS now gives outputs consistent with cuBLAS, thanks to Kahan summation on matvec ops (a sketch of the technique appears after this list). This is good news for Windows users, because llamafile releases bundle tinyBLAS DLLs for driver-only GPU support. That support is now faster and more accurate than before, reducing the need to install the CUDA / ROCm SDKs yourself.
- Prompt evaluation now goes much faster on CPU. For example, f16 weights on Raspberry Pi 5 are now 8x faster. These new optimizations mostly apply to `F16`, `BF16`, `Q8_0`, `Q4_0`, and `F32` weights. Depending on the hardware and weights being used, we've observed llamafile-0.7 running anywhere from 30% to 500% faster than llama.cpp upstream.
- Support for the bf16 data type, which is the Google Brain floating point format, has been introduced for CPU only (see the conversion sketch after this list).
- Support for AVX512 has been introduced. Owners of CPUs like Zen4 can expect to see 10x faster prompt eval times (see the AVX-512 sketch after this list).
- If you want to run `llamafile-0.7 [...] --recompile --gpu amd` for AMD GPU support on Windows, this release requires that you use version 5.7+ of the ROCm HIP SDK, which may be downloaded here.
- This release includes a security fix for CVE-2024-23496 (see #294).
- This release is synced with llama.cpp 2024-03-22 upstream.
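
For readers curious what Kahan summation looks like, here is a minimal sketch of a compensated dot product, the kind of accumulation a matvec kernel performs. The function name and signature are illustrative, not tinyBLAS's actual API.

```cpp
#include <cstddef>

// Illustrative sketch: Kahan (compensated) summation applied to the inner
// dot product of a matrix-vector multiply. Not the actual tinyBLAS code.
float kahan_dot(const float *a, const float *b, size_t n) {
    float sum = 0.0f;  // running total
    float c = 0.0f;    // compensation for lost low-order bits
    for (size_t i = 0; i < n; ++i) {
        float y = a[i] * b[i] - c;  // subtract the error carried from the last step
        float t = sum + y;          // low-order bits of y may be lost here...
        c = (t - sum) - y;          // ...but are recovered algebraically into c
        sum = t;
    }
    return sum;  // note: -ffast-math can optimize the compensation away
}

// A matvec would then call this once per output row, e.g.
//   out[row] = kahan_dot(&A[row * n], x, n);
```

The compensation term recovers the rounding error of each partial sum, which is why the tinyBLAS outputs now track cuBLAS much more closely.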
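As a point of reference on the bf16 format: it keeps the sign bit and 8-bit exponent of float32 and truncates the mantissa to 7 bits, so a bf16 value is essentially the upper 16 bits of an IEEE-754 float. The helpers below are a sketch of that layout, not llamafile's internal conversion routines.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative bf16 <-> float32 conversion (NaN edge cases omitted for brevity).

// float32 -> bf16, rounding to nearest even.
static uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    uint32_t rounding = 0x7FFF + ((bits >> 16) & 1);  // round to nearest even
    return (uint16_t)((bits + rounding) >> 16);
}

// bf16 -> float32 is just a shift back into the high half.
static float bf16_to_float(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```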
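And for a sense of why AVX512 helps prompt eval, here is a sketch of a 16-wide f32 dot product using fused multiply-add intrinsics. It illustrates the style of kernel that benefits from the new instructions; it is not llamafile's actual implementation, and it assumes `n` is a multiple of 16.

```cpp
#include <immintrin.h>
#include <cstddef>

// Illustrative AVX-512 dot product: 16 floats per FMA. Compile with -mavx512f.
float dot_avx512(const float *a, const float *b, size_t n) {
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        acc = _mm512_fmadd_ps(va, vb, acc);  // acc += va * vb, one instruction
    }
    return _mm512_reduce_add_ps(acc);  // horizontal sum of the 16 lanes
}
```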