v0.5.12.post1 is a stability patch on top of v0.5.12. It cherry-picks 12 fixes, primarily for DeepSeek V4, onto the release branch.
Bug Fixes
DeepSeek V4
- DSV4-Pro emits garbled text during single-token decode on B200/B300 (fixed in the `deep_gemm` UE8M0 scale-packing path by ceiling activation scales before packing): #25733
- DSV4 + EAGLE/MTP in disaggregation decode crashes after around 2000 requests with an SWA allocator assertion (recycled KV pages kept stale sliding-window mappings): #25805
- DSV4 NSA prefill context-parallel (`--enable-nsa-prefill-context-parallel --nsa-prefill-cp-mode round-robin-split`) in `--disaggregation-mode prefill`: scheduler crash at startup: #25396
- DSV4 HiSparse + `SGLANG_OPT_USE_COMPRESSOR_V2=1`: GSM8K accuracy restored from 0.825 to 0.960: #25646
- DSV4 PD disaggregation now works with pipeline parallelism > 1 (removed stale `pp_size=1` assertion): #25771
- DSV4-Flash with `--load-format dummy` + FlashInfer mxfp4 hits a CUDA illegal memory access during CUDA-graph capture (the integer `HashTopK.tid2eid` lookup table was left uninitialized by dummy load): #25892
- DSV4 HiCache + `SGLANG_OPT_CACHE_SWA_TRANSLATION=1` returns stale translation indices after a cache rebuild, causing OOB writes / wrong outputs: #25889
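The #25733 fix ceils activation scales before packing them as UE8M0. The sketch below illustrates why rounding up matters, assuming FP8 E4M3 activations (largest finite value 448) and UE8M0 as an unsigned 8-bit exponent with bias 127 and no mantissa bits. `pack_ue8m0` and `max_quantized_magnitude` are hypothetical helpers, not DeepGEMM's API. Because UE8M0 can only hold a power of two, rounding the ideal scale down can push `amax / scale` past the FP8 range, while rounding up always keeps it in range.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def pack_ue8m0(amax: float, round_up: bool = True) -> tuple[int, float]:
    """Hypothetical helper: map a block's absolute max to a power-of-two
    scale and its UE8M0 byte (8 exponent bits, bias 127, no mantissa)."""
    ideal = max(amax, 1e-30) / FP8_E4M3_MAX  # exact scale, usually not a power of two
    m, e = math.frexp(ideal)                 # ideal == m * 2**e with 0.5 <= m < 1
    if round_up:
        exp = e - 1 if m == 0.5 else e       # smallest power of two >= ideal
    else:
        exp = e - 1                          # largest power of two <= ideal
    byte = exp + 127
    assert 0 <= byte <= 254, "exponent outside the UE8M0 range (255 encodes NaN)"
    return byte, 2.0 ** exp


def max_quantized_magnitude(amax: float, round_up: bool) -> float:
    """Largest |x / scale| in the block; must stay <= FP8_E4M3_MAX."""
    _, scale = pack_ue8m0(amax, round_up)
    return amax / scale


for amax in (3.0, 500.0, 1000.0):
    ceil_mag = max_quantized_magnitude(amax, round_up=True)
    floor_mag = max_quantized_magnitude(amax, round_up=False)
    print(f"amax={amax}: ceil -> {ceil_mag}, floor -> {floor_mag}")
```

With rounding down, an amax of 500 gets scale 1 and a quantized magnitude of 500, above the E4M3 maximum; rounding up picks scale 2 and keeps it at 250.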
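The #25805 entry describes recycled KV pages that kept stale sliding-window mappings. A minimal sketch of the invariant involved, using a hypothetical `PagePool` class (not SGLang's allocator) that tracks a full-KV-page to SWA-page map: releasing a page drops its mapping first, so the next owner starts clean and the assertion in `alloc` holds.

```python
class PagePool:
    """Hypothetical KV page allocator with a full-attention -> SWA page map."""

    def __init__(self, num_pages: int):
        self.free_pages = list(range(num_pages))
        self.swa_map: dict[int, int] = {}  # full KV page id -> SWA page id

    def alloc(self) -> int:
        page = self.free_pages.pop()
        # A freshly handed-out page must not carry a previous owner's mapping.
        assert page not in self.swa_map, f"stale SWA mapping on page {page}"
        return page

    def map_swa(self, page: int, swa_page: int) -> None:
        self.swa_map[page] = swa_page

    def release(self, page: int) -> None:
        # Drop the sliding-window mapping before the page becomes reusable;
        # skipping this step is the stale-mapping bug pattern.
        self.swa_map.pop(page, None)
        self.free_pages.append(page)


pool = PagePool(num_pages=2)
p = pool.alloc()
pool.map_swa(p, swa_page=7)
pool.release(p)
q = pool.alloc()  # the same page id comes back, without the old mapping
```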
Disaggregation
- [PD][NIXL] Always send aux on `is_last`; only expect state when truthy: #25699
Other
- Fix missing `group` arg in `get_dp_buffer`: #25585
Performance
- DSV4: warm MHC token-count buckets at startup (gated on `SGLANG_OPT_DEEPGEMM_HC_PRENORM=1` + `SGLANG_OPT_USE_TILELANG_MHC_PRE=1` + hybrid SWA) to eliminate 20–40s cold-bucket forward stalls: #25810
- DSV4-Pro: precompile a DeepGEMM branch for `_dispatch_bf16_fp32_backend` to cut runtime JIT compile cost: #25860
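Both performance entries move compilation off the request path. A minimal sketch of the general warm-up idea, assuming a hypothetical `BucketedKernelCache` that compiles one kernel variant per token-count bucket on first use (this is not SGLang's or DeepGEMM's implementation): calling `warmup()` at startup pays every compile once, so later forward passes only hit cached variants.

```python
import bisect
from typing import Callable, Dict, List


class BucketedKernelCache:
    """Hypothetical cache: one compiled kernel variant per token-count bucket."""

    def __init__(self, buckets: List[int], compile_fn: Callable[[int], object]):
        self.buckets = sorted(buckets)
        self.compile_fn = compile_fn          # slow step, e.g. a JIT compile
        self.compiled: Dict[int, object] = {}

    def bucket_for(self, num_tokens: int) -> int:
        """Smallest bucket that fits num_tokens."""
        i = bisect.bisect_left(self.buckets, num_tokens)
        if i == len(self.buckets):
            raise ValueError(f"{num_tokens} tokens exceeds the largest bucket")
        return self.buckets[i]

    def get(self, num_tokens: int) -> object:
        b = self.bucket_for(num_tokens)
        if b not in self.compiled:            # cold bucket: compile on the request path
            self.compiled[b] = self.compile_fn(b)
        return self.compiled[b]

    def warmup(self) -> None:
        """Compile every bucket up front, e.g. during server startup."""
        for b in self.buckets:
            self.get(b)


compile_calls: List[int] = []


def fake_compile(bucket: int) -> str:
    compile_calls.append(bucket)  # record each (expensive) compile
    return f"kernel_{bucket}"


cache = BucketedKernelCache([1, 2, 4, 8, 16], fake_compile)
cache.warmup()
kernel = cache.get(5)  # served from the cache; no new compile
```

The trade-off is a longer startup in exchange for predictable per-request latency.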
Dependencies
- Use `[cu13]` extra for `nvidia-cutlass-dsl` (default to CUDA 13; required for sm_103 / B300): #25576
All PRs included in this release: v0.5.12...v0.5.12.post1