llamafile v0.8.2


[Line drawing of a llama's head in front of a slightly open manila folder filled with files]

llamafile lets you distribute and run LLMs with a single file

llamafile is a local LLM inference tool introduced by Mozilla Ocho in November 2023. It offers superior performance and binary portability across the stock installs of six OSes, without needing to be installed. It combines the best of llama.cpp and Cosmopolitan Libc while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. llamafile gives you a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a shell-scriptable CLI interface, which together put you in control of artificial intelligence.
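
For orientation, here's a minimal sketch of what that looks like in practice. The file name below is a placeholder for whichever llamafile you've downloaded, and the curl request assumes the default server address of http://localhost:8080:

```sh
# Make the downloaded llamafile executable and launch it; the web GUI
# and the API server come up together on http://localhost:8080.
chmod +x Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile
./Meta-Llama-3-8B-Instruct.Q5_K_M.llamafile

# The same server speaks the OpenAI chat completions protocol:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'
```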

  • This release introduces faster AVX2 prompt processing for K-quants and IQ4_XS (#394). This was contributed to llamafile by @ikawrakow, who originally invented K-quants last year: ggerganov/llama.cpp@99009e7. In prior releases we recommended the legacy Q4_0 quant since it was the simplest and most intuitive to get working with recent matmul optimizations. Thanks to Iwan Kawrakow's efforts, the best quants (e.g. Q5_K_M) now run fastest on modern x86 systems (see the first example after this list).

  • Text generation (i.e. prediction) should now be slightly faster too, thanks to development work on the matmul kernels and enhancements to thread synchronization (see 89c189e), which should be most noticeable on CPUs with many cores running smaller models. macOS ARM users who are running on CPU rather than Metal can expect the biggest boost, now that llamafile knows how to utilize all cores (see 6c45e3e).

  • Bugs in the server's /embedding endpoint have been fixed (see 0e2845a and 7900294). You can also now pass `llamafile --embedding -m model -p prompt` to have embeddings printed to standard output (see 42bd9b8). An example is shown after this list.

  • This release synchronizes with the upstream llama.cpp project as of May 7th in 94d0940, which improves tokenization for Command-R, Refact, Olmo, and StarCoder. There's a new flash attention op that may be enabled for many models by passing the `-fa` flag. We haven't been able to include it in our prebuilt CUDA/ROCm binaries yet, so you may need to pass the `--recompile` flag to use it on GPU (see the example after this list).

  • This release introduces the `--precise`, `--fast`, and `--trap` flags, which control how math is executed. The `--precise` flag can slightly enhance the thinking of LLMs at the cost of some performance (see 2af3b88 and 9540b43). The `--fast` flag is included since it's otherwise unspecified which mode llamafile will use in any given situation (see bbae0f6 and b749326). The `--trap` flag can help you pinpoint the exact moment any NaNs appear (on CPUs that support this, e.g. most x86 chips), which is useful for troubleshooting; usage of all three is sketched after this list. Additionally, a new vectorized expf() function has been introduced that enables llamafile to compute the exponential function faster and at full quality (see e2b3cb2). This matters because it's the function that powers SiLU and SoftMax, which are used by most of today's premier public models.

  • Most of the CPU code in the GGML library now has optimal performance across different hardware architectures, thanks to new build system techniques. Features, options, or models that underperformed before may do better now (see 0bdea60 and c9d7393).
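
For the K-quant speedup, nothing changes about how llamafile is invoked; the point is that picking a K-quant such as Q5_K_M no longer costs you prompt-processing speed on AVX2 machines. A minimal sketch, where the GGUF file name is just a placeholder:

```sh
# Prompt processing for K-quants (and IQ4_XS) now takes the faster AVX2 path
# automatically; just point llamafile at K-quant weights as usual.
llamafile -m mistral-7b-instruct-v0.2.Q5_K_M.gguf \
  -p 'Summarize the plot of Hamlet in two sentences.'
```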
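
Here is a sketch of the two embedding paths mentioned above. The CLI form comes straight from this release note; the HTTP form is an assumption based on the upstream llama.cpp server API (a server started with --embedding enabled, queried at its /embedding endpoint), and the model file name is a placeholder:

```sh
# Print embeddings for a prompt directly to standard output.
llamafile --embedding -m all-MiniLM-L6-v2.Q5_K_M.gguf -p 'king'

# Or query the running server's /embedding endpoint over HTTP.
curl http://localhost:8080/embedding \
  -H "Content-Type: application/json" \
  -d '{"content": "king"}'
```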
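
To try the new flash attention op, pass -fa; on GPU you may also need --recompile as noted above. A sketch with placeholder model weights:

```sh
# CPU (or Metal): enable the flash attention op.
llamafile -m model.Q5_K_M.gguf -fa -p 'Hello there.'

# NVIDIA/AMD: rebuild the GPU module locally so it includes the new op,
# then offload as many layers as possible.
llamafile -m model.Q5_K_M.gguf --recompile -ngl 999 -fa -p 'Hello there.'
```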
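
And here is how the three new math flags look in practice, again with a placeholder model file:

```sh
# Favor numerical accuracy over speed.
llamafile --precise -m model.Q5_K_M.gguf -p 'What is 17 * 23?'

# Explicitly opt into the faster, less precise math mode.
llamafile --fast -m model.Q5_K_M.gguf -p 'What is 17 * 23?'

# Trap the first NaN that appears (on CPUs that support trapping, e.g. x86).
llamafile --trap -m model.Q5_K_M.gguf -p 'What is 17 * 23?'
```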

Additional fixes:

  • a2d159e Fix server multimodal statistics (#392)
  • aa8c01a Revert moondream vision language model support
  • eecbf89 More conservative strong/em markdown matcher (#352)
  • 38311f2 CUDA: CUDART < 11.7 workaround for __hmax, __hmax2
  • 58d2ca0 Use qsort and set linkage to static for internal functions used for offload-arch-fix (#375)
  • 4ee1e39 The PDF documentation in llamafile-0.8.2.zip is now fixed
  • 4ee1e39 Remove warnings from cuda build

Additional notes:

  • We're experiencing some instability with our Windows AMD GPU support. If you encounter crashes when using the -ngl 999 flag on Windows, try the previous 0.8.1 release. Please also consider filing an issue to report if it doesn't work, or better yet, file an issue if it does work, since we otherwise have no way of knowing (llamafile doesn't have telemetry, because maximally respecting the user's privacy on their local machine is one of the project's stated goals). You can also share details about your experience with us on the Mozilla AI Discord server.

See these instructions for how to put the latest llamafile software into your old weights without having to redownload them: #24 (comment). A rough sketch of the idea follows below.
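
As a hedged illustration only (the linked comment is authoritative, and the file names here are placeholders): because every llamafile is also a valid zip archive, the weights inside an older llamafile can be extracted with ordinary zip tools and fed to the new engine, or repacked with the zipalign tool that ships in the release zip:

```sh
# Pull the GGUF weights out of an older llamafile (llamafiles are zip archives).
unzip old-model.llamafile '*.gguf'

# Run the new 0.8.2 engine directly against those weights.
./llamafile-0.8.2 -m old-model.Q5_K_M.gguf

# Optionally, repack the weights into a new self-contained llamafile;
# check the linked comment for the exact zipalign invocation.
cp llamafile-0.8.2 new-model.llamafile
zipalign -j0 new-model.llamafile old-model.Q5_K_M.gguf
```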
