llamafile lets you distribute and run LLMs with a single file
This release features Mixtral support. Support has also been added for Qwen
models, along with new flags such as --chatml and --samplers.
- 820d42d Synchronize with llama.cpp upstream
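For example (a minimal sketch: the executable and model file names below are placeholders, not files shipped with this release), a ChatML-tuned Qwen model could be run with the new flag like this:

```
# Placeholder executable and model names; any ChatML-tuned GGUF (e.g. a Qwen chat model) would work here.
# The new --samplers flag can additionally reorder the sampling chain; see --help for its exact syntax.
./llamafile -m qwen-7b-chat.Q4_K_M.gguf --chatml -p "Why is the sky blue?"
```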
GPU now works out of the box on Windows. You still need to pass the
-ngl 35 flag, but you're no longer required to install CUDA or MSVC
(see the example after the list below).
- a7de00b Make tinyBLAS go 95% as fast as cuBLAS for token generation (#97)
- 9d85a72 Improve GEMM performance by nearly 2x (#93)
- 72e1c72 Support CUDA without cuBLAS (#82)
- 2849b08 Make it possible for CUDA to extract prebuilt DSOs
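As a rough sketch of the Windows workflow (all file names below are placeholders), passing -ngl 35 offloads layers to the GPU with no prior CUDA or MSVC installation step:

```
rem Placeholder names: substitute your own llamafile executable and GGUF weights.
rem -ngl 35 offloads 35 layers to the GPU; no CUDA/MSVC install is required beforehand.
.\llamafile.exe -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -ngl 35
```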
Additional fixes and improvements:
- c236a71 Improve markdown and syntax highlighting in server (#88)
- 69ec1e4 Update the llamafile manual
- 782c81c Add SD ops, kernels
- 93178c9 Polyfill $HOME on some Windows systems
- fcc727a Write log to /dev/null when main.log fails to open
- 77cecbe Fix handling of characters that span multiple tokens when streaming
Our .llamafiles on Hugging Face have been updated to incorporate these
new release binaries, so you can redownload them there.