llamafile v0.4


llamafile lets you distribute and run LLMs with a single file

[line drawing of a llama head in front of a slightly open manila folder filled with files]

This release features Mixtral support, and support has also been added for Qwen
models. New flags such as --chatml and --samplers are now available; an example
invocation follows the list below.

  • 820d42d Synchronize with llama.cpp upstream
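
As a sketch, a Qwen model could be run in ChatML mode like this (the weights
filename is hypothetical; --chatml and -p are the upstream llama.cpp flags):

    ./llamafile -m qwen-7b-chat.Q4_K_M.gguf --chatml -p "Write a haiku about llamas"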

GPU support now works out of the box on Windows. You still need to pass the
-ngl 35 flag, but you're no longer required to install CUDA or MSVC; see the
example after the list below.

  • a7de00b Make tinyBLAS go 95% as fast as cuBLAS for token generation (#97)
  • 9d85a72 Improve GEMM performance by nearly 2x (#93)
  • 72e1c72 Support CUDA without cuBLAS (#82)
  • 2849b08 Make it possible for CUDA to extract prebuilt DSOs
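
For example, on Windows you might launch a llamafile with GPU offloading like
this (the model filename is hypothetical; -ngl 35 is the flag mentioned above,
which offloads 35 layers to the GPU):

    llamafile.exe -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 35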

Additional fixes and improvements:

  • c236a71 Improve markdown and syntax highlighting in server (#88)
  • 69ec1e4 Update the llamafile manual
  • 782c81c Add SD ops, kernels
  • 93178c9 Polyfill $HOME on some Windows systems
  • fcc727a Write log to /dev/null when main.log fails to open
  • 77cecbe Fix handling of characters that span multiple tokens when streaming

Our .llamafiles on Hugging Face have been updated to incorporate these new
release binaries, so you can redownload them there.
