spec : parallel drafting support (#22838)
- spec : refactor
- spec : drop support for incompatible vocabs
- spec : update common_speculative_init()
- cont : pass seq_id
- cont : dedup ctx_seq_rm_type
- server : sketch the ctx_dft decode loop
- server : draft prompt cache and checkpoints
- server : improve ctx names
- server, spec : transition to unified spec context
- cont : sync main and drft contexts
- cont : async drft eval when possible
- cont : handle non-ckpt models
- cont : pass correct n_past for drafting
- cont : process images through the draft context
- spec : handle draft running out of context
- server : fix mtmd draft processing
- server : fix URL for draft model
- server : add comment
- server : clean-up + dry
- speculative-simple : update
- spec : fix n_past type
- server : fix slot ctx_drft ptr
- tools : update readme
- naming : improve consistency
- spec : refactor for multi-sequence speculative context
- cont : prepare params
- cont : prepare params
- spec : support parallel drafts
- server : support parallel drafting
- llama : reuse device buffers when possible
- server, spec : clean-up
- cont : clean-up
- cont : minor
- spec : reset `drafting` flag at the end
- spec : introduce `common_speculative_process()`
- spec : allow for multiple spec types (chain of speculators)
  - replace the old type field of type common_speculative_type in the common_params_speculative struct with a vector, so that multiple types can be specified
  - introduce common_get_enabled_speculative_impls(const std::vector) to determine which implementations the user has enabled
  - introduce common_speculative_type_from_names(const std::vector<std::string> & names) to parse the spec types already provided by the user
  - all speculators run sequentially and the best one wins (its drafted tokens are the ones that get verified)
  - the best draft is the one that maximizes the expected number of accepted tokens for the current round, computed as the product of the probability of accepting the current token (n_acc_tokens / n_gen_drafts) and the draft's length
Co-authored-by: Petros Sideris <petros.sideris@nokia.com>
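
The "best one wins" rule in the last items of the commit message can be illustrated with a small standalone sketch. This is not the llama.cpp implementation: `spec_candidate`, `spec_score()` and `spec_pick_best()` are hypothetical names, and the handling of a speculator with no history (treated as ratio 1.0) is an assumption made only so the example is complete. Only the score itself, (n_acc_tokens / n_gen_drafts) times the draft length, comes from the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-speculator bookkeeping (illustrative, not the llama.cpp API).
struct spec_candidate {
    std::vector<int32_t> draft;        // tokens drafted in the current round
    size_t               n_acc_tokens = 0; // numerator of the acceptance ratio
    size_t               n_gen_drafts = 0; // denominator of the acceptance ratio
};

// Score from the commit text: (n_acc_tokens / n_gen_drafts) * draft length.
static double spec_score(const spec_candidate & c) {
    // Assumption: with no history yet, use a ratio of 1.0 so a new speculator can be tried.
    const double p_acc = c.n_gen_drafts > 0
        ? (double) c.n_acc_tokens / (double) c.n_gen_drafts
        : 1.0;
    return p_acc * (double) c.draft.size();
}

// Returns the index of the highest-scoring candidate with a non-empty draft,
// or -1 if no candidate produced any tokens. Ties keep the earlier candidate.
static int spec_pick_best(const std::vector<spec_candidate> & cands) {
    int    best       = -1;
    double best_score = 0.0;
    for (size_t i = 0; i < cands.size(); ++i) {
        if (cands[i].draft.empty()) {
            continue;
        }
        const double s = spec_score(cands[i]);
        if (best < 0 || s > best_score) {
            best       = (int) i;
            best_score = s;
        }
    }
    return best;
}
```

Under these assumptions, a speculator with an 8-token draft and a ratio of 10/40 scores 2.0, while one with a 4-token draft and a ratio of 30/40 scores 3.0, so the shorter but more reliable draft would be the one sent for verification.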
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled)
- macOS Intel (x64)
- iOS XCFramework
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
- Ubuntu x64 (SYCL FP32)
- Ubuntu x64 (SYCL FP16)
Android:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: