llamafile v0.6.1

llamafile lets you distribute and run LLMs with a single file

[line drawing of a llama head in front of a slightly open manila folder filled with files]

This release fixes a crash that can happen on Apple Metal GPUs.

  • 9c85d9c Fix free() related crash in ggml-metal.m

Windows users will see better performance with tinyBLAS. Please note we
still recommend installing the CUDA SDK (NVIDIA) or the HIP/ROCm SDK (AMD)
for maximum performance and accuracy if your hardware is supported by them.

  • df0b3ff Use thread-local register file for matmul speedups (#205)
  • 4892494 Change BM/BN/BK to template parameters (#203)
  • ed05ba9 Reduce server memory use on Windows
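
As a rough illustration of what those BM/BN/BK commits refer to, here is a
minimal sketch of a blocked matmul whose tile sizes are template parameters.
This is not the actual tinyBLAS source; the function and variable names are
illustrative only.

    // Illustrative only: the tile sizes BM/BN/BK are compile-time template
    // parameters, so each tile's accumulator is a fixed-size local array the
    // compiler can unroll over and keep in registers (the "thread-local
    // register file" idea).
    #include <cstdio>
    #include <vector>

    template <int BM, int BN, int BK>
    void gemm(const float *A, const float *B, float *C, int M, int N, int K) {
        // Assumes M % BM == 0, N % BN == 0, K % BK == 0 to keep the sketch short.
        for (int i0 = 0; i0 < M; i0 += BM)
            for (int j0 = 0; j0 < N; j0 += BN) {
                float acc[BM][BN] = {};  // per-tile accumulator
                for (int k0 = 0; k0 < K; k0 += BK)
                    for (int k = k0; k < k0 + BK; ++k)
                        for (int i = 0; i < BM; ++i)
                            for (int j = 0; j < BN; ++j)
                                acc[i][j] += A[(i0 + i) * K + k] *
                                             B[k * N + (j0 + j)];
                for (int i = 0; i < BM; ++i)
                    for (int j = 0; j < BN; ++j)
                        C[(i0 + i) * N + (j0 + j)] = acc[i][j];
            }
    }

    int main() {
        constexpr int M = 4, N = 4, K = 4;
        std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f), C(M * N, 0.0f);
        gemm<2, 2, 2>(A.data(), B.data(), C.data(), M, N, K);
        std::printf("C[0] = %g\n", C[0]);  // 4, the length of each dot product
    }

Because the tile sizes are compile-time constants, the compiler can fully
unroll the inner loops and register-allocate the accumulator, which is the
flavor of speedup #205 and #203 are after.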

This release also synchronizes with llama.cpp upstream (as of January 9th)
and includes other improvements.

  • 133b05e Sync with llama.cpp upstream
  • 67d97b5 Use hipcc on $PATH if it exists
  • 15e2339 Do better job reporting AMD hipBLAS errors
  • c617679 Don't crash when --image argument is invalid
  • 3e8aa78 Clarify install/gpu docs/behavior per feedback
  • eb4989a Fix typo in OpenAI API

Example llamafiles

Our llamafiles on Hugging Face are updated shortly after a release goes live.

Flagship models

Supreme models (highest-end consumer hardware)

Tiny models (small enough to run on a Raspberry Pi)

Other models

If you have a slow Internet connection and want to update your llamafiles
without redownloading the weights, see the instructions in #24 (comment).
You can also download llamafile-0.6.1 and run ./llamafile-0.6.1 -m old.llamafile
to reuse your old weights.
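
For example, a minimal upgrade flow might look like this (the download URL
pattern is an assumption; check the GitHub releases page for the actual
asset name):

    # Hypothetical upgrade flow; verify the URL against the releases page.
    curl -LO https://github.com/Mozilla-Ocho/llamafile/releases/download/0.6.1/llamafile-0.6.1
    chmod +x llamafile-0.6.1
    # Point the new runtime at the weights inside an existing llamafile:
    ./llamafile-0.6.1 -m old.llamafile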
