What's Changed
- feat: Integrate SGLang concat_mla_k kernel into FlashInfer by @jiahanc in #2237
- fix: add DeepSeek routing for Bf16xBf16 and MxIntxBf16 TRT-LLM Gen MoE by @nekorobov in #2234
- fix: Fix compilation with GCC 11 by @dbari in #2242
- feat: RMSNorm/Fused RMSNorm + FP8 Quantization kernels by @BLaZeKiLL in #2243
- feat: further optimize top-k and add fused top-k page construction kernels for DSA by @yzh119 in #2215
- test: Fix MNNVL tests to skip when container lacks SYS_PTRACE capability by @bkryu in #2245
- Remove cudaStreamSynchronize from gemm_groupwise_sm120.cuh for CUDA graph compatibility by @Copilot in #2244
- feat: support variable sequence length in decode kernel of trtllm-gen attention by @yaoyaoding in #2125
- feat: Fused RMSNorm + FP4 Quantization Kernels in CuTe-DSL by @bkryu in #2233
- Allreduce auto backend improvements by @nvmbreughe in #2239
New Contributors
- @dbari made their first contribution in #2242
- @BLaZeKiLL made their first contribution in #2243
- @Copilot made their first contribution in #2244
- @yaoyaoding made their first contribution in #2125
Full Changelog: v0.6.0rc1...v0.6.0rc2