server : [easy] fix per round speculative decode logging (#18211)
Currently the per-round log always reports 0 drafted tokens, because
slot.drafted is cleared before the log line is written.
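The ordering problem described above can be illustrated with a minimal sketch. This is not the actual llama.cpp code: the `Slot` struct and both `log_round_*` functions are hypothetical, and only the field name `drafted` is taken from the commit message. It assumes the fix amounts to reading the draft count before the buffer is cleared.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a server slot; only `drafted` mirrors a name
// from the commit message.
struct Slot {
    std::vector<int> drafted; // tokens proposed by the draft model this round
};

// Hypothetical helper that formats the relevant part of the log line.
static std::string format_round(std::size_t n_accepted, std::size_t n_draft) {
    return "accepted " + std::to_string(n_accepted) + "/" +
           std::to_string(n_draft) + " draft tokens";
}

// Buggy order: the buffer is cleared first, so the denominator is always 0.
std::string log_round_buggy(Slot & slot, std::size_t n_accepted) {
    slot.drafted.clear();
    return format_round(n_accepted, slot.drafted.size());
}

// Fixed order: capture the count, then clear, then log the captured value.
std::string log_round_fixed(Slot & slot, std::size_t n_accepted) {
    const std::size_t n_draft = slot.drafted.size();
    assert(n_accepted <= n_draft); // cannot accept more tokens than were drafted
    slot.drafted.clear();
    return format_round(n_accepted, n_draft);
}
```

With eight drafted tokens and seven accepted, the buggy variant yields "accepted 7/0 draft tokens" while the fixed variant yields "accepted 7/8 draft tokens", matching the shape of the logs shown in this message.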
To reproduce:
Run llama-server with devstral-2 as the main model and devstral-2-small as
the draft model (-md), with verbose logging enabled:
% ./build/bin/llama-server -v \
-m ~/llms/Devstral-2-123B-Instruct-2512-UD-Q6_K_XL-00001-of-00003.gguf \
-md ~/llms/Devstral-Small-2-24B-Instruct-2512-UD-Q2_K_XL.gguf \
-c 8192 2> /tmp/llama.cpp.debug
Check the log:
slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 741
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 746
slot update_slots: id 3 | task 0 | accepted 16/0 draft tokens, new n_tokens = 763
slot update_slots: id 3 | task 0 | accepted 11/0 draft tokens, new n_tokens = 775
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 778
slot update_slots: id 3 | task 0 | accepted 4/0 draft tokens, new n_tokens = 783
slot update_slots: id 3 | task 0 | accepted 8/0 draft tokens, new n_tokens = 792
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 795
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 797
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 799
slot update_slots: id 3 | task 0 | accepted 0/0 draft tokens, new n_tokens = 800
slot update_slots: id 3 | task 0 | accepted 2/0 draft tokens, new n_tokens = 803
slot update_slots: id 3 | task 0 | accepted 1/0 draft tokens, new n_tokens = 805
slot update_slots: id 3 | task 0 | accepted 6/0 draft tokens, new n_tokens = 812
slot update_slots: id 3 | task 0 | accepted 3/0 draft tokens, new n_tokens = 816
After the fix, the per-round logging is correct:
slot update_slots: id 3 | task 0 | accepted 7/8 draft tokens, new n_tokens = 654
slot update_slots: id 3 | task 0 | accepted 1/2 draft tokens, new n_tokens = 656
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 659
slot update_slots: id 3 | task 0 | accepted 1/16 draft tokens, new n_tokens = 661
slot update_slots: id 3 | task 0 | accepted 2/16 draft tokens, new n_tokens = 664
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 681
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 698
slot update_slots: id 3 | task 0 | accepted 3/4 draft tokens, new n_tokens = 702
slot update_slots: id 3 | task 0 | accepted 5/12 draft tokens, new n_tokens = 708
slot update_slots: id 3 | task 0 | accepted 16/16 draft tokens, new n_tokens = 725
slot update_slots: id 3 | task 0 | accepted 1/1 draft tokens, new n_tokens = 727
slot update_slots: id 3 | task 0 | accepted 8/16 draft tokens, new n_tokens = 736
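
With the denominator now populated, the per-round lines can be aggregated into an overall acceptance rate. The following is a rough sketch of one way to do that from the debug log text. `DraftStats` and `summarize_draft_log` are hypothetical names, not part of llama.cpp, and the pattern assumes the "accepted N/M draft tokens" wording shown in the logs here.

```cpp
#include <regex>
#include <string>

// Hypothetical accumulator for per-round speculative decoding results.
struct DraftStats {
    long accepted = 0; // sum of accepted draft tokens over all rounds
    long drafted  = 0; // sum of drafted tokens over all rounds
    int  rounds   = 0; // number of matching log lines

    // Overall acceptance rate; 0.0 when nothing was drafted.
    double rate() const {
        return drafted > 0 ? static_cast<double>(accepted) / drafted : 0.0;
    }
};

// Scan log text for "accepted N/M draft tokens" and sum N and M.
// Assumption: the wording matches the log lines quoted in this message.
DraftStats summarize_draft_log(const std::string & log) {
    static const std::regex re(R"(accepted (\d+)/(\d+) draft tokens)");
    DraftStats stats;
    for (std::sregex_iterator it(log.begin(), log.end(), re), end; it != end; ++it) {
        stats.accepted += std::stol((*it)[1].str());
        stats.drafted  += std::stol((*it)[2].str());
        stats.rounds   += 1;
    }
    return stats;
}
```

For the twelve rounds listed above, the totals are 78 accepted out of 139 drafted tokens, an acceptance rate of roughly 56%. Before the fix this computation was not possible from the log, since every round reported 0 drafted tokens.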
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: