llamafile v0.5


llamafile lets you distribute and run LLMs with a single file

[line drawing of llama animal head in front of slightly open manila folder filled with files]

The llamafile-server command has been unified into llamafile, so you
no longer need to upload your llamafiles to Hugging Face twice. The
command also ships with rich man page documentation, which can be
viewed with pagination on all platforms via the llamafile --help flag
(see the example after this list).

  • b86dcb7 Unify llamafile-server command into llamafile
  • 156f0a6 Embed man page into --help flag of each program
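
For instance, the embedded manual can now be paged through on any
supported platform straight from the binary. A minimal sketch, assuming
a llamafile named llava-v1.5-7b-q4.llamafile downloaded from Hugging
Face:

    # make the binary executable on macOS/Linux/BSD, then view its man page
    chmod +x llava-v1.5-7b-q4.llamafile
    ./llava-v1.5-7b-q4.llamafile --help

    # one binary now covers both roles: running it with no arguments starts
    # the web server, while --cli forces command-line completion mode
    ./llava-v1.5-7b-q4.llamafile
    ./llava-v1.5-7b-q4.llamafile --cli -p 'Why is the sky blue?'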

This release introduces support for AMD graphics cards on Windows. Our
release binaries include a prebuilt tinyBLAS DLL which, like our Nvidia
DLL, works on stock installs and depends only on the graphics driver.
GPU inference on Windows is also much faster out of the box, thanks to
improvements we've made to our tinyBLAS kernels. A usage sketch for the
new --gpu flag follows the list below.

  • 1f1c53f Get AMD GPU support working on Windows
  • 1d9fa85 Add 2D blocking to tinyBLAS GemmEx (#153)
  • c0589f0 Apply 2D blocking to all kernels (#156)
  • c2bc6e6 Separate kernel for GemmStridedBatchedEx (#163)
  • f6ee33c Read and write column-major matrices better (#164)
  • d7cbaf7 Reduce BM/BN/BK from 64/32/64 to 48/12/48
  • 04d6e93 Introduce --gpu flag
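
As a sketch of how the new --gpu flag is meant to be used (the weights
filename here is hypothetical; the accepted values are documented in
the --help manual):

    # prefer the AMD tinyBLAS path on Windows; -ngl offloads model layers
    llamafile -m mistral-7b.Q4_K_M.gguf -ngl 35 --gpu AMD

    # force CPU-only inference
    llamafile -m mistral-7b.Q4_K_M.gguf --gpu DISABLE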

Apple Metal users should expect to see LLaVA image summarization run
roughly 33% faster. Complete support for Microsoft's new Phi-2 model is
now available, and it works great on Raspberry Pi. FreeBSD ARM64 users
can now enjoy this project too. Shell scriptability is improved, and
we've introduced a llamafile-convert command that makes it easier to
create your own llamafiles (see the sketch after the list below).

  • 922c4f1 Add GPU acceleration of LLaVA image processing on MacOS
  • 6423228 Add Phi-2 architecture support
  • ce4aac6 Support FreeBSD ARM64
  • 1dcf274 Add llamafile-convert command (#112)
  • 50bdf69 7d23bc9 Make --log-disable work better
  • 7843183 Make default thread count capped at 12 maximum
  • 2e276a1 Sync with llama.cpp upstream
  • dd4c9d7 Make JSON server crashes more informative
  • 8762f13 474b44f Introduce --nocompile flag
  • 5cf6e76 Introduce --cli flag
  • f0e86e1 Don't schlep weights into CPU when using GPU
  • f1410a1 Fix repeat_last_n in OpenAI server
  • 3119f09 Increase server max payload size
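
Here is a rough sketch of the llamafile-convert workflow, assuming a
hypothetical GGUF weights file; it should write a matching .llamafile
next to the input:

    # wrap the weights into a self-contained, executable llamafile
    llamafile-convert mistral-7b.Q4_K_M.gguf

    # the result is shell-scriptable: --cli forces CLI mode, and
    # --log-disable keeps log noise out of the output
    ./mistral-7b.Q4_K_M.llamafile --cli --log-disable -p 'Why is the sky blue?'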

Known Issues

  • Multiple GPUs aren't supported yet.
  • CLIP only supports GPU acceleration on Apple Silicon.

Example llamafiles

Our llamafiles on Hugging Face are updated shortly after a release goes live.

Flagship models

Supreme models (highest-end consumer hardware)

Tiny models (small enough to use on Raspberry Pi)

Other models

If you have a slow Internet connection and want to update your llamafiles
without redownloading everything, see the instructions here: #24 (comment)
