llamafile v0.8.15

The --chat interface now supports syntax highlighting for 41 programming languages: ada, asm, basic, c, c#, c++, cobol, css, d, forth, fortran, go, haskell, html, java, javascript, json, kotlin, ld, lisp, lua, m4, make, markdown, matlab, pascal, perl, php, python, r, ruby, rust, scala, shell, sql, swift, tcl, tex, txt, typescript, and zig.

The chatbot now supports more commands:

  • /undo may be used to have the LLM forget the last thing you said. This is useful when you get a poor response and want to try asking your question a different way, without needing to start the conversation over from scratch.
  • /push and /pop work similarly, in that they let you rewind a conversation to a previous state. In this case, they do so by creating save points within your context window. Additionally, /stack may be used to view the current stack (see the example after this list).
  • /clear may be used to reset the context window to the system prompt, effectively starting your conversation over.
  • /manual may be used to put the chat interface in "manual mode," which lets you (1) inject system prompts, and (2) speak as the LLM. This can be useful when you want the LLM to believe it said something that it actually didn't.
  • /dump may be used to print out the raw conversation history, including special tokens (that may be model specific). You can also say /dump filename.txt to save the raw conversation to a file.
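For example, a session using save points might look like the following (the prompts and annotations here are illustrative, not the chatbot's literal output):

    >>> /push                      # save the current conversation state
    >>> try answering in the style of a pirate
    ... (responses you may decide not to keep) ...
    >>> /stack                     # view the save points on the stack
    >>> /pop                       # rewind the context window to the save point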

We identified an issue with Google's Gemma models, where the chatbot wasn't actually inserting the system prompt. That's now fixed, so you can instruct Gemma to do roleplaying by passing flags like llamafile -m gemma.gguf -p "you are role playing as foo" --chat.

You can now type CTRL-J to create multi-line prompts in the terminal chatbot. It works like shift-enter in the browser and can be a quicker alternative to the chatbot's triple-quote syntax, i.e. """multi-line / message""".
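For example, with the triple-quote syntax a two-line prompt looks like this; CTRL-J lets you get the same line break without the quotes (illustrative transcript):

    >>> """write a haiku about
    context windows"""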

Bugs in the new chatbot have been fixed. For example, we now do a better job of making sure special tokens like BOS, EOS, and EOT get inserted into the conversation history when appropriate. This should improve fidelity when using the terminal chatbot interface.

The --threads and --threads-batch flags may now be used separately to tune how many threads are used for prediction and prefill.
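For example, to use four threads for prediction and eight for prefill (the model filename is a placeholder):

    llamafile -m model.gguf --chat --threads 4 --threads-batch 8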

The llamafile-bench command now supports benchmarking GPUs (see #581 from @cjpais).

Both servers now support configuring a URL prefix, thanks to @vlasky (see #597 and #604).
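For example, to serve the API under a /lm prefix, something like this should work (we're assuming the option added by those PRs is spelled --url-prefix; check your build's --help output):

    llamafile --server -m model.gguf --url-prefix /lm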

Support for the IQ quantization formats is being removed from our CUDA module to save on build times. If you want to use IQ quants with your NVIDIA hardware, you need to pass the --iq --recompile flags to llamafile once, to build a ggml-cuda module for your system that includes them.
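For example (the model filename is a placeholder; the recompile only needs to happen once per machine):

    llamafile -m model.gguf --iq --recompile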

Finally, we have an alpha release of a new /v1/chat/completions endpoint for the new llamafiler server. We're planning to build a new web interface based on it soon, so you're encouraged to test this endpoint, since llamafiler will eventually replace the old server too. File an issue if there are any features you need.
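A quick way to try it is an OpenAI-style request via curl. This is a minimal sketch that assumes llamafiler is listening on localhost port 8080 and accepts the conventional chat-completions request body; adjust the host, port, and fields for your setup:

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Say hello."}]}'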
