spec : add self‑speculative decoding (no draft model required) + refactor (#18471)

- server: introduce self-speculative decoding
- server: moved self-call into speculative.cpp
- can_speculate() includes self-speculation
  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
- server: can_speculate() tests self-spec
- server: replace can_speculate() with slot.can_speculate()
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
- common: use %zu format specifier for size_t in logging
  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
- server: can_speculate() requires a task instance
- common: ngram map, config self-speculative decoding
- common: add enum common_speculative_type
- common: add vector of speculative states
- common: add option --spec-draftless
- server: cleanup (remove slot.batch_spec, rename)
- common: moved self-spec impl to ngram-map
- common: cleanup (use common_speculative_state_draft)
- spec : refactor
- cont : naming
- spec: remove --spec-config
- doc: (draftless) speculative decoding
- common: print performance in spec decoding
- minor : cleanup
- common : better names
- minor : cleanup + fix build
- minor: comments
- CODEOWNERS: add common/ngram-map.* (#18471)
- common : rename speculative.draftless_type -> speculative.type
- ngram-map : fix uninitialized values
- ngram-map : take into account the input can become shorter
- ngram-map : revert len check for now
- arg : change --spec-draftless -> --spec-type
- spec : add common_speculative_state::accept()
- spec : refactor + add common_speculative_begin()
- spec : fix begin() call with mtmd
- spec : additional refactor + remove common_speculative_params

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: