llamafile lets you distribute and run LLMs with a single file
llamafile is a local LLM inference tool introduced by Mozilla Ocho in November 2023. It offers superior performance plus binary portability: the same file runs on stock installs of six operating systems, with no installation required. It combines the best of llama.cpp and Cosmopolitan Libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface which together put you in control of artificial intelligence.
This release includes a build of the new llamafile server rewrite we've
been promising, which we're calling llamafiler. It has matured enough to
recommend for serving embeddings, and it's the fastest way to do so: with
all-MiniLM-L6-v2.Q6_K.gguf on a Threadripper it serves JSON /embedding
requests at 800 req/sec, whereas the old llama.cpp server could only do
100 req/sec. So you can fill up your RAG databases very quickly if you
productionize this.
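To make that concrete, here is a minimal sketch of filling a RAG store from the /embedding endpoint. The listen address, the `content` request field, and the `embedding` response field are assumptions modeled on llama.cpp's server API rather than confirmed llamafiler schema, so check the documentation before relying on them.

```python
# Hypothetical client for llamafiler's /embedding endpoint. Assumes a
# server is already running with an embedding model such as
# all-MiniLM-L6-v2.Q6_K.gguf. Field names below are assumptions.
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumed listen address

def embed(text: str) -> list[float]:
    """POST one string to /embedding and return its vector."""
    req = urllib.request.Request(
        f"{BASE_URL}/embedding",
        data=json.dumps({"content": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]  # assumed response field

# Fill a toy in-memory "RAG database" with document vectors.
docs = ["llamafile runs LLMs locally", "cosmopolitan libc is portable"]
index = [(doc, embed(doc)) for doc in docs]
print(f"indexed {len(index)} documents, dim {len(index[0][1])}")
```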
The old llama.cpp server came from a folder named "examples" and was
never intended to be production worthy. This server is designed to be
sturdy and uncrashable. It also has /completion and /tokenize endpoints;
/tokenize serves 3.7 million requests per second on Threadripper, thanks
to Cosmopolitan Libc improvements.
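As a sketch of those other endpoints, a /tokenize call could look like the following. The `content` field and the `tokens` response key are again assumptions modeled on llama.cpp's server API, not confirmed llamafiler details.

```python
# Hypothetical /tokenize client; field names are assumptions.
import json
import urllib.request

req = urllib.request.Request(
    "http://127.0.0.1:8080/tokenize",  # assumed listen address
    data=json.dumps({"content": "hello world"}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["tokens"])  # assumed key: a list of token ids
```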
See the llamafiler documentation for further details.
- 73b1836 Write documentation for new server
- b3930aa Make GGML asynchronously cancelable
- 8604e9a Fix POSIX undefined cancelation behavior
- 323f50a Let SIGQUIT produce per-thread backtraces
- 15d7fba Use semaphore to limit GGML worker threads
- d7c8e33 Add support for JSON parameters to new server
- 7f099cd Make stack overflows recoverable in new server
- fb3421c Add barebones /completion endpoint to new server
This release restores support for non-AVX x86 microprocessors. We had to
drop that support at the beginning of the year, but our CPUID dispatching
has advanced considerably since then, and we're now able to offer top
speeds on modern hardware without leaving old hardware behind.
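The idea behind that dispatching is to probe the host CPU once at startup and route hot kernels to the best implementation it supports. Here is an illustrative sketch of the pattern, not llamafile's actual code: it probes /proc/cpuinfo (so it's Linux-only), and the kernel names are made up for the example.

```python
# Illustrative runtime-dispatch pattern, not llamafile's real code.
# llamafile does this in native code via CPUID; this Linux-only sketch
# reads /proc/cpuinfo instead.
def cpu_has(feature: str) -> bool:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return feature in line.split()
    return False

def dot_avx(a, b):      # stand-in for a vectorized kernel
    return sum(x * y for x, y in zip(a, b))

def dot_generic(a, b):  # portable fallback for pre-AVX CPUs
    return sum(x * y for x, y in zip(a, b))

# Pick the best implementation once, at startup.
dot = dot_avx if cpu_has("avx") else dot_generic
print(dot([1.0, 2.0], [3.0, 4.0]))
```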
Here are the remaining improvements included in this release: