llama + spec: MTP Support (#22673)
- spec: support MTP
- fix batch size
- rename files
- cont : simplify (#7)
- MTP: clean-up (#9)
- MTP: clean-up
- review: use llama_context_type instead of llama_graph_type
- review: remove llama_model_has_mtp
- review: fix convert issues
- convert: fix pycheck
- review: formatting
- use mtp- for identifying mtp models
- convert: fix mtp conversion
- mtp -> draft-mtp
- remove unused llama_arch
- add need_embd in speculative
- llama: allow partial seq_rm for GDN models for speculative decoding
  Currently, when some draft tokens are not accepted, speculative
  decoding has to restart from a checkpoint, which wastes work by
  running the target again. This PR adds the ability to roll back up to
  draft_max tokens by storing the GDN intermediates.
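The idea behind partial rollback can be sketched with a toy recurrent state. This is only an illustration of the general technique, not the llama.cpp implementation: the scalar state, the `step` update, and the names `run_with_snapshots` and `rollback` are all hypothetical stand-ins.

```python
# Toy illustration only: a scalar stands in for the GDN recurrent state.
# step, run_with_snapshots and rollback are hypothetical names.

def step(state, token):
    """One recurrent update; a stand-in for the real gated delta net update."""
    return 0.5 * state + token


def run_with_snapshots(state, tokens):
    """Process tokens and keep the state after each one.

    snapshots[i] is the state after consuming tokens[: i + 1].
    """
    snapshots = []
    for tok in tokens:
        state = step(state, tok)
        snapshots.append(state)
    return snapshots


def rollback(initial, snapshots, n_accepted):
    """Return the state after the first n_accepted tokens, with no recomputation."""
    if not 0 <= n_accepted <= len(snapshots):
        raise ValueError("n_accepted out of range")
    return initial if n_accepted == 0 else snapshots[n_accepted - 1]


initial = 1.0
draft = [3, 1, 4, 1]          # four drafted tokens
snaps = run_with_snapshots(initial, draft)

# Suppose only the first two draft tokens are accepted by the target.
state_after_2 = rollback(initial, snaps, 2)
```

Without the stored intermediates, the only way to obtain `state_after_2` would be to restore an earlier checkpoint and run the accepted tokens again.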
- fix pending state
- vulkan: add GDN partial rollback
- meta: extend check to axis 1
- metal: add GDN partial rollback
  Extend the gated delta net kernel to store intermediate states for
  partial rollback support on the Metal backend.
  - Add K (snapshot slot count) as a function constant
  - Read input state from slot 0 of the 3D state tensor
  - Write intermediate states to different slots during token loop
  - For K=1, maintain backward-compatible single-slot behavior
  Ref: 8c05923
  Assisted-by: llama.cpp:local pi
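The slot scheme described above can be sketched roughly as follows. The commit message does not say which intermediate state goes to which slot, so the mapping used here (slot i holds the state i tokens before the end) is an assumption chosen so that K=1 reduces to the single-slot case. `gdn_loop` and `update` are hypothetical names, and a scalar stands in for each state.

```python
# Sketch only: the slot assignment is an ASSUMPTION, not the kernel's actual layout.

def update(state, token):
    """Stand-in for the real per-token gated delta net update."""
    return 0.5 * state + token


def gdn_loop(slots, tokens):
    """slots is a list of K states; slot 0 holds the input state.

    Assumed mapping: after the loop, slots[i] holds the state i tokens
    before the end, for every i that was reached. Mutates and returns slots.
    """
    k = len(slots)
    n = len(tokens)
    state = slots[0]                  # read input state from slot 0
    for t, tok in enumerate(tokens):
        state = update(state, tok)
        back = n - 1 - t              # distance of this state from the final one
        if back < k:
            slots[back] = state       # write intermediate state to its slot
    return slots


# K = 3 snapshot slots, four tokens processed in one call.
multi = gdn_loop([1.0, 0.0, 0.0], [3, 1, 4, 1])

# K = 1: only the final state is kept, matching single-slot behavior.
single = gdn_loop([1.0], [3, 1, 4, 1])
```

Under this assumed layout, rolling back by j tokens (with j < K) amounts to selecting slot j instead of slot 0.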
- delta_net_base: use ggml_pad instead of new_tensor
- review: add need_rs_seq
- review: rename part_bounded to n_rs
- review: deslop comments
- review: rename, add asserts
- server : adjust checkpoint logic (#11)
- server : adjust checkpoint logic
- cont : rm asserts
- server-context: fix early exit
- spec : fix compatibility with n-gram and add TODOs (#13)
- metal : cleanup
- llama : fix faulty bitwise check in recurrent memory
- server : disable RS-based MTP in combination with other spec types
- spec : add TODOs
- cont : fix comment
- cont : update comment
- common : fix logic for ngram + mtp compat
- llama-memory: enable checkpointing with partial rollback
- cont: add test-case for loading into a dirty ctx
- llama-memory-recurrent: clear rs_idx in clear
- download: fix mtp path
- llama-arch: fix enorm op
- docs: update docs
- conversion: fix type annotations
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
macOS/iOS:
- macOS Apple Silicon (arm64)
- macOS Apple Silicon (arm64, KleidiAI enabled)
- macOS Intel (x64)
- iOS XCFramework
Linux:
- Ubuntu x64 (CPU)
- Ubuntu arm64 (CPU)
- Ubuntu s390x (CPU)
- Ubuntu x64 (Vulkan)
- Ubuntu arm64 (Vulkan)
- Ubuntu x64 (ROCm 7.2)
- Ubuntu x64 (OpenVINO)
- Ubuntu x64 (SYCL FP32)
- Ubuntu x64 (SYCL FP16)
Android:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: