github ggml-org/llama.cpp b9109


spec : parallel drafting support (#22838)

  • spec : refactor

  • spec : drop support for incompatible vocabs

  • spec : update common_speculative_init()

  • cont : pass seq_id

  • cont : dedup ctx_seq_rm_type

  • server : sketch the ctx_dft decode loop

  • server : draft prompt cache and checkpoints

  • server : improve ctx names

  • server, spec : transition to unified spec context

  • cont : sync main and drft contexts

  • cont : async drft eval when possible

  • cont : handle non-ckpt models

  • cont : pass correct n_past for drafting

  • cont : process images through the draft context

  • spec : handle draft running out of context

  • server : fix mtmd draft processing

  • server : fix URL for draft model

  • server : add comment

  • server : clean-up + dry

  • speculative-simple : update

  • spec : fix n_past type

  • server : fix slot ctx_drft ptr

  • tools : update readme

  • naming : improve consistency

  • spec : refactor for multi-sequence speculative context

  • cont : prepare params

  • cont : prepare params

  • spec : support parallel drafts

  • server : support parallel drafting

  • llama : reuse device buffers when possible

  • server, spec : clean-up

  • cont : clean-up

  • cont : minor

  • spec : reset drafting flag at the end

  • spec : introduce common_speculative_process()

  • spec : allow for multiple spec types (chain of speculators)

  • replace old type field of type common_speculative_type in the
    common_params_speculative struct with a vector to allow multiple
    types to be specified

  • introduce common_get_enabled_speculative_impls(const std::vector<common_speculative_type>)
    to figure out which implementations the user has enabled

  • introduce common_speculative_type_from_names(const std::vector<std::string> & names)
    to parse the user-provided spec types

  • all speculators run sequentially, best one wins (we verify its drafted tokens)

  • maximize the expected number of accepted tokens for the current round by
    computing the product of the probability of accepting the current token
    (n_acc_tokens / n_gen_drafts) and the draft's length


Co-authored-by: Petros Sideris <petros.sideris@nokia.com>

