github ggml-org/llama.cpp b9109


spec : parallel drafting support (#22838)

  • spec : refactor

  • spec : drop support for incompatible vocabs

  • spec : update common_speculative_init()

  • cont : pass seq_id

  • cont : dedup ctx_seq_rm_type

  • server : sketch the ctx_dft decode loop

  • server : draft prompt cache and checkpoints

  • server : improve ctx names

  • server, spec : transition to unified spec context

  • cont : sync main and drft contexts

  • cont : async drft eval when possible

  • cont : handle non-ckpt models

  • cont : pass correct n_past for drafting

  • cont : process images through the draft context

  • spec : handle draft running out of context

  • server : fix mtmd draft processing

  • server : fix URL for draft model

  • server : add comment

  • server : clean-up + dry

  • speculative-simple : update

  • spec : fix n_past type

  • server : fix slot ctx_drft ptr

  • tools : update readme

  • naming : improve consistency

  • spec : refactor for multi-sequence speculative context

  • cont : prepare params

  • cont : prepare params

  • spec : support parallel drafts

  • server : support parallel drafting

  • llama : reuse device buffers when possible

  • server, spec : clean-up

  • cont : clean-up

  • cont : minor

  • spec : reset drafting flag at the end

  • spec : introduce common_speculative_process()

  • spec : allow for multiple spec types (chain of speculators)

  • replace old type field of type common_speculative_type in the
    common_params_speculative struct with a vector to allow multiple
    types to be specified

  • introduce common_get_enabled_speculative_impls(const std::vector<common_speculative_type>)
    to figure out which implementations the user has enabled

  • introduce common_speculative_type_from_names(const std::vector<std::string> & names)
    to parse the user-provided spec types

  • all speculators run sequentially, best one wins (we verify its drafted tokens)

  • maximize the expected number of accepted tokens for the current round by
    computing the product of the probability of accepting the current token
    (n_acc_tokens / n_gen_drafts) and the draft's length


Co-authored-by: Petros Sideris <petros.sideris@nokia.com>

