llamafile v0.5


llamafile lets you distribute and run LLMs with a single file

[line drawing of llama animal head in front of slightly open manila folder filled with files]

The llamafile-server command has been unified into llamafile, so you
no longer need to upload your llamafiles to Hugging Face twice. The
command also ships with rich man page documentation, which can be
viewed with pagination on all platforms via the llamafile --help flag
(see the example after this list).

  • b86dcb7 Unify llamafile-server command into llamafile
  • 156f0a6 Embed man page into --help flag of each program
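
For instance, the embedded manual can now be paged through on any
supported platform straight from the binary. A minimal sketch, assuming
a llamafile named llava-v1.5-7b-q4.llamafile downloaded from Hugging
Face:

    # make the binary executable on macOS/Linux/BSD, then view its man page
    chmod +x llava-v1.5-7b-q4.llamafile
    ./llava-v1.5-7b-q4.llamafile --help

    # one binary now covers both roles: running it with no arguments starts
    # the web server, while --cli forces command-line completion mode
    ./llava-v1.5-7b-q4.llamafile
    ./llava-v1.5-7b-q4.llamafile --cli -p 'Why is the sky blue?'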

This release introduces support for AMD graphics cards on Windows. Our
release binaries include a prebuilt tinyBLAS DLL which, like our Nvidia
DLL, works on stock installs and depends only on the graphics driver.
GPU inference on Windows is also much faster out of the box, thanks to
improvements we've made to our tinyBLAS kernels. A usage sketch for the
new --gpu flag follows the list below.

  • 1f1c53f Get AMD GPU support working on Windows
  • 1d9fa85 Add 2D blocking to tinyBLAS GemmEx (#153)
  • c0589f0 Apply 2D blocking to all kernels (#156)
  • c2bc6e6 Separate kernel for GemmStridedBatchedEx (#163)
  • f6ee33c Read and write column-major matrices better (#164)
  • d7cbaf7 Reduce BM/BN/BK from 64/32/64 to 48/12/48
  • 04d6e93 Introduce --gpu flag
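
As a sketch of how the new --gpu flag is meant to be used (the weights
filename here is hypothetical; the accepted values are documented in
the --help manual):

    # prefer the AMD tinyBLAS path on Windows; -ngl offloads model layers
    llamafile -m mistral-7b.Q4_K_M.gguf -ngl 35 --gpu AMD

    # force CPU-only inference
    llamafile -m mistral-7b.Q4_K_M.gguf --gpu DISABLE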

Apple Metal users should expect to see LLaVA image summarization run
roughly 33% faster. Complete support for Microsoft's new Phi-2 model is
now available, and it works great on Raspberry Pi. FreeBSD ARM64 users
can now enjoy this project too. Shell scriptability is improved, and
we've introduced a llamafile-convert command that makes it easier to
create your own llamafiles (see the sketch after the list below).

  • 922c4f1 Add GPU acceleration of LLaVA image processing on MacOS
  • 6423228 Add Phi-2 architecture support
  • ce4aac6 Support FreeBSD ARM64
  • 1dcf274 Add llamafile-convert command (#112)
  • 50bdf69 7d23bc9 Make --log-disable work better
  • 7843183 Make default thread count capped at 12 maximum
  • 2e276a1 Sync with llama.cpp upstream
  • dd4c9d7 Make JSON server crashes more informative
  • 8762f13 474b44f Introduce --nocompile flag
  • 5cf6e76 Introduce --cli flag
  • f0e86e1 Don't schlep weights into CPU when using GPU
  • f1410a1 Fix repeat_last_n in OpenAI server
  • 3119f09 Increase server max payload size
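
Here is a rough sketch of the llamafile-convert workflow, assuming a
hypothetical GGUF weights file; it should write a matching .llamafile
next to the input:

    # wrap the weights into a self-contained, executable llamafile
    llamafile-convert mistral-7b.Q4_K_M.gguf

    # the result is shell-scriptable: --cli forces CLI mode, and
    # --log-disable keeps log noise out of the output
    ./mistral-7b.Q4_K_M.llamafile --cli --log-disable -p 'Why is the sky blue?'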

Known Issues

  • Multiple GPUs aren't supported yet.
  • CLIP only supports GPU acceleration on Apple Silicon.

Example llamafiles

Our llamafiles on Hugging Face are updated shortly after a release goes live.

Flagship models

Supreme models (highest-end consumer hardware)

Tiny models (small enough to use on Raspberry Pi)

Other models

If you have a slow Internet connection and want to update your llamafiles
without redownloading everything, see the instructions here: #24 (comment)
