llamafile lets you distribute and run LLMs with a single file
This release synchronizes with llama.cpp upstream and polishes GPU
auto-configuration. Support for splitting a model onto multiple NVIDIA
GPUs has been restored.
- dfd3335 Synchronize with llama.cpp 2024-01-27
- c008e43 Synchronize with llama.cpp 2024-01-26
- e34b35c Make GPU auto configuration more resilient
- 79b88f8 Sanitize -ngl flag on Apple Metal
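With multi-GPU NVIDIA splitting restored, a run like the following should spread offloaded layers across all visible GPUs. This is a minimal sketch: the model filename and the -ngl value are placeholders, and CUDA_VISIBLE_DEVICES is the standard NVIDIA variable for limiting which GPUs are visible.

```sh
# Offload layers to the GPU(s); with more than one NVIDIA GPU visible,
# the offloaded layers are split across them. Filenames and values are examples.
./llamafile-0.6.2 -m mixtral-8x7b-instruct.llamafile -ngl 999

# Optionally restrict which GPUs llamafile can see.
CUDA_VISIBLE_DEVICES=0,1 ./llamafile-0.6.2 -m mixtral-8x7b-instruct.llamafile -ngl 999
```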
There's a known issue with splitting a model onto multiple AMD GPUs, which currently doesn't work. This is an upstream issue we're working to solve. The workaround is to `export HIP_VISIBLE_DEVICES=0` in your environment when running llamafile, so it only sees the first GPU.
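Concretely, the workaround looks like this (the model filename and -ngl value are placeholders):

```sh
# Hide all but the first AMD GPU so llamafile won't try to split the model.
export HIP_VISIBLE_DEVICES=0
./llamafile-0.6.2 -m mixtral-8x7b-instruct.llamafile -ngl 999
```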
Example llamafiles
Our llamafiles on Hugging Face are updated shortly after a release goes live.
Flagship models
Supreme models (highest-end consumer hardware)
- https://hf.co/jartine/Mixtral-8x7B-Instruct-v0.1-llamafile
- https://hf.co/jartine/WizardCoder-Python-34B-V1.0-llamafile
Tiny models (small enough to use on a Raspberry Pi)
- https://hf.co/jartine/phi-2-llamafile
- https://hf.co/jartine/rocket-3B-llamafile
- https://hf.co/jartine/TinyLlama-1.1B-Chat-v1.0-GGUF
Other models
- https://hf.co/jartine/wizardcoder-13b-python
- https://hf.co/jartine/Nous-Hermes-Llama2-llamafile
- https://hf.co/jartine/dolphin-2.5-mixtral-8x7b-llamafile
If you have a slow Internet connection and want to update your llamafiles without redownloading the weights, see the instructions at #24 (comment). You can also download llamafile-0.6.2 and simply run `./llamafile-0.6.2 -m old.llamafile` to use your old weights.
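For instance, assuming the 0.6.2 release binary is already downloaded and an older llamafile is on disk (filenames are placeholders):

```sh
# Make the new runtime executable, then point it at the previously
# downloaded weights instead of fetching them again.
chmod +x llamafile-0.6.2
./llamafile-0.6.2 -m old.llamafile
```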