v0.5.12.post1 is a stability patch on top of v0.5.12. It cherry-picks 12 fixes — primarily for DeepSeek V4 — onto the release branch.

Bug Fixes

DeepSeek V4

Fix garbled DSV4-Pro output during single-token decode on B200/B300 (the deep_gemm UE8M0 scale-packing path now takes the ceiling of activation scales before packing): #25733
Fix a SWA allocator assertion that crashed DSV4 + EAGLE/MTP disaggregation decode after roughly 2000 requests (recycled KV pages kept stale sliding-window mappings): #25805
Fix a scheduler crash at startup when DSV4 NSA prefill context parallelism (--enable-nsa-prefill-context-parallel --nsa-prefill-cp-mode round-robin-split) is combined with --disaggregation-mode prefill: #25396
Restore GSM8K accuracy for DSV4 HiSparse + SGLANG_OPT_USE_COMPRESSOR_V2=1 from 0.825 to 0.960: #25646
DSV4 PD disaggregation now works with pipeline parallelism > 1 (removed stale pp_size=1 assertion): #25771
Fix a CUDA illegal memory access during CUDA-graph capture for DSV4-Flash with --load-format dummy + FlashInfer mxfp4 (dummy load left the integer HashTopK.tid2eid lookup table uninitialized): #25892
Fix stale translation indices returned by DSV4 HiCache + SGLANG_OPT_CACHE_SWA_TRANSLATION=1 after a cache rebuild, which caused out-of-bounds writes / wrong outputs: #25889
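
The UE8M0 fix at the top of this list comes down to rounding direction. As a rough illustration only (this is not the deep_gemm code), the sketch below assumes UE8M0 stores a power-of-two scale as a biased 8-bit exponent with bias 127, and that four scale bytes are packed little-endian into one 32-bit word. The helper names are hypothetical. Rounding the exponent up means the decoded scale is never smaller than the original one, so values divided by it cannot exceed the range the original scale was chosen for.

```python
import math

E8M0_BIAS = 127  # assumed exponent bias for the 8-bit exponent-only format


def ceil_to_ue8m0(scale: float) -> int:
    """Encode a positive scale as a biased exponent, rounding UP to a power of two."""
    if not (scale > 0.0) or math.isinf(scale):
        raise ValueError("scale must be positive and finite")
    # frexp gives scale == mantissa * 2**exponent with 0.5 <= mantissa < 1,
    # which yields an exact ceil(log2(scale)) without floating-point log error.
    mantissa, exponent = math.frexp(scale)
    ceil_log2 = exponent - 1 if mantissa == 0.5 else exponent
    # Clamp to the assumed valid code range; 255 is treated as reserved here.
    return min(max(ceil_log2 + E8M0_BIAS, 0), 254)


def decode_ue8m0(code: int) -> float:
    """Recover the power-of-two scale from its biased exponent."""
    return 2.0 ** (code - E8M0_BIAS)


def pack_ue8m0_x4(codes: list[int]) -> int:
    """Pack four exponent bytes into one 32-bit word, first byte in the low bits."""
    if len(codes) != 4:
        raise ValueError("expected exactly four codes")
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0xFF) << (8 * i)
    return word


if __name__ == "__main__":
    scales = [1.0, 3.0, 0.5, 0.75]
    codes = [ceil_to_ue8m0(s) for s in scales]
    for s, c in zip(scales, codes):
        print(f"scale={s} code={c} decoded={decode_ue8m0(c)}")
    print(hex(pack_ue8m0_x4(codes)))
```

Rounding to nearest or down instead would let some decoded scales fall below the true scale, which is one plausible way quantized activations could overflow and produce corrupted output.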

DSV4: warm MHC token-count buckets at startup (gated on SGLANG_OPT_DEEPGEMM_HC_PRENORM=1 + SGLANG_OPT_USE_TILELANG_MHC_PRE=1 + hybrid SWA) to eliminate 20–40 s cold-bucket forward stalls: #25810
DSV4-Pro: precompile a DeepGEMM branch for _dispatch_bf16_fp32_backend to cut runtime JIT compilation cost: #25860
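
Both items above follow the same general pattern: kernels specialized per shape are compiled on first use, so touching every variant at startup moves that cost out of the serving path. The following is a minimal Python sketch of the pattern, not SGLang's implementation. The bucket sizes, the `KernelCache` class, and the placeholder "compile" step are all hypothetical.

```python
import bisect

# Hypothetical token-count buckets; the real boundaries are not given in these notes.
BUCKETS = [1, 2, 4, 8, 16, 32, 64, 128]


def bucket_for(num_tokens: int) -> int:
    """Return the smallest bucket that can hold num_tokens."""
    if num_tokens < 1 or num_tokens > BUCKETS[-1]:
        raise ValueError(f"num_tokens must be in [1, {BUCKETS[-1]}]")
    return BUCKETS[bisect.bisect_left(BUCKETS, num_tokens)]


class KernelCache:
    """Caches one 'compiled' callable per bucket and counts compilations."""

    def __init__(self) -> None:
        self._kernels = {}
        self.compile_count = 0

    def _compile(self, bucket: int):
        # Stand-in for an expensive JIT compile specialized to this bucket.
        self.compile_count += 1
        return lambda xs: [x * 2 for x in xs]

    def get(self, num_tokens: int):
        bucket = bucket_for(num_tokens)
        if bucket not in self._kernels:  # cold bucket: pay the compile cost now
            self._kernels[bucket] = self._compile(bucket)
        return self._kernels[bucket]

    def warm_up(self) -> None:
        """Compile every bucket once so later requests never hit a cold one."""
        for bucket in BUCKETS:
            self.get(bucket)


if __name__ == "__main__":
    cache = KernelCache()
    cache.warm_up()
    before = cache.compile_count
    cache.get(5)    # falls in the 8-token bucket, already compiled
    cache.get(100)  # falls in the 128-token bucket, already compiled
    print(before, cache.compile_count)
```

The trade-off is a longer startup in exchange for predictable first-request latency, which may be why the warm-up is gated behind specific options rather than always on.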

Use the [cu13] extra for nvidia-cutlass-dsl (defaults to CUDA 13; required for sm_103 / B300): #25576

All PRs included in this release: v0.5.12...v0.5.12.post1

Full Changelog: v0.5.12...v0.5.12.post1