What's Changed
- Add to CODEOWNER by @aleozlx in #2875
- fix: int32 overflow in trtllm_fp4_block_scale_moe causing "Unsupported hidden state scale shape" for EP32+ configs by @qiching in #2853
- feat: bump nvidia-cutlass-dsl to >=4.4.2 by @limin2021 in #2833
- fix: add cute dsl moe utils to AOT by @nv-yunzheq in #2872
- fix: fix cute dsl swap_ab tactic failure by @nv-yunzheq in #2870
- [gdn] support non-contiguous state for decoding by @ZJY0516 in #2727
- chore: fix the python dependency override by @yongwww in #2651
- backinteg: nvidia-nvshmem-cu12 3.6.5 seems broken by @aleozlx in #2893
- Yanqinz/gemm cudnn autotune fix by @yanqinz2 in #2863
- feat: Add CuTe-DSL backend for NVFP4 quantization by @bkryu in #2838
- Add cute dsl mla decode op by @limin2021 in #2743
- Support for MXFP4 and NVFP4 group GEMMs on GeForce and Spark by @depaulmillz in #2738
- feat: add pdl support for cute dsl mla decode kernel support by @Observer007 in #2901
- feat: expose swizzled_input_sf parameter for CUTLASS fused MOE by @yzh119 in #2330
- fix: support fp32 logits for fp8_per_tensor and fp8_block by @yweng0828 in #2534
- Fix autotuner crash when input tensor is None by @he-yufeng in #2756
- Support in-place update for trtllm_fp8_block_scale_moe by @wzhao18 in #2739
- [fix] bugfix 2856: Fix pre-allocated out shape check in trtllm_batch_decode_with_kv_cache_mla for q_len_per_req > 1 by @qsang-nv in #2876
- PR auto-labelling by @aleozlx in #2827
- fix test error regarding logits_types by @aleozlx in #2918
- Use 6-hour timeout for flashinfer-jit-cache wheel build (release + nightly) by @yongwww in #2880
- fix: expose trigger_completion_at_end through unified API by @nv-yunzheq in #2894
- fix: clamp enable_pdl=True to False on SM < 90 to prevent PDL PTX on Ampere by @bkryu in #2928
- feat: add Relu2 (squared ReLU) activation support in CUTLASS MoE backend by @askliar in #2926
- docker: upgrade cuDNN to latest version in CI install script by @bkryu in #2930
- [NVIDIA] fix(jit): enable GDC for CUTLASS fused MoE PDL — prevent random crashes on SM12x by @johnnynunez in #2913
- fix: Fix autotuner crash on meta-device tensor in trtllm_fp4_block_scale_routed_moe by @bkryu in #2916
- Yanqinz/dynamic shape unified api by @yanqinz2 in #2910
- doc: add CI triggering guide to CONTRIBUTING.md by @yongwww in #2924
- read real strides for kv and block scale by @sychen52 in #2844
- perf: Optimize CuTe-DSL fp4 and fp8 quantization kernels by @bkryu in #2904
- fix: vectorize get_shuffle_matrix_a_row_indices to eliminate CPU contention by @youkaichao in #2935
- feat: implement deterministic topk by @jiangyinzuo in #2661
- feat(gdn): add BF16 state kernel with MTP support beyond T>4 with intermediate caching. by @ameynaik-hub in #2679
- ci: remove 1gpu label from H100 runner selector by @yongwww in #2946
- perf: Optimize GDN MTP decode kernel (v15) — eliminate ilp=1 fallback… by @ameynaik-hub in #2842
- feat: add MXFP8 GEMM support for SM120 by @samuellees in #2902
- fix: avoid re-downloading BMM export headers when flashinfer-cubin is installed by @yzh119 in #2903
- test: xfail cuDNN FP8 prefill on Blackwell with CUDA <= 12.9 by @dierksen in #2963
- test: skip unsupported mm_mxfp8 configurations on SM12x by @bkryu in #2974
- [Fmha] revert blackwell ultra optimization that causes deadlocks. by @PerkzZheng in #2956
- feat: SM121 (GB10) tile filtering and autotuner robustness by @askliar in #2927
- Mamba SSU: horizontal MTP kernel (+ DSTATE=96 support) by @ishovkun in #2865
- fix: use float instead of double in sampling binary search to avoid FP64 bottleneck on SM103 by @bkryu in #2945
- Refactor the routing part by @ChristinaZ in #2803
- fix: snap weight_scale_vec_size to handle block_scale_interleave padding for SM120 by @samuellees in #2898
- Add filelock to ensure_symlink by @wzhao18 in #2979
- Update NVSHMEM interface to use NVSHMEM4Py instead of custom bindings by @benhg in #2960
- docs: document replay command in CLI reference by @ooooo-create in #2919
- [Chore] add missing MOE code part by @jiahanc in #2998
- enable_pdl_and_bias_for_cudnn_backend by @yanqinz2 in #2948
- bench: Enable microbenchmarking on SM121 by @bkryu in #3002
- fix: tinygemm2 hang issue due to barrier sync by @jimmyzho in #2996
- [Perf] Refactor MoE autotuning to set valid topk ids in routed MoE tuning by @wzhao18 in #2942
- fix: restore SM120 CUTLASS MoE tile candidate removed by #2927 (test_trtllm_cutlass_fused_moe.py) by @samuellees in #2984
- misc: Update gemm/batched gemm cubins from trtllm-gen, gemm header refactor by @jimmyzho in #2740
- fix: use sym_int64 for strides in rmsnorm CuTe DSL kernels to prevent int32 overflow by @bkryu in #3007
- Add SM 103 as one of supported capabilities for mm_M1_16_K7168_N256 by @harrisonlimh in #2991
- feat: add PDL support to rmsnorm_fp4quant and add_rmsnorm_fp4quant CuTe DSL kernels by @bkryu in #3008
- [Fmha] support nvfp4 output keepsMmaAb generation kernels by @PerkzZheng in #2988
- Only swizzle on v block scale; rename kv_block_scales to kv_cache_sf by @sychen52 in #2954
- feat(gdn): state checkpointing in chunk_gated_delta_rule by @feldsherov in #2908
- [chore] Install nvidia-cutlass-dsl[cu13] for cu130+ by @jiahanc in #3017
- Add flashinfer.fused_rmsnorm_silu() with native kernel backend by @kahyunnam in #2965
- feat: Support padding tokens with seqlen=0 for rope+quant+kv cache update fusion kernel by @elvischenv in #2792
- Update README.md: Jetson Thor compute capability by @qiching in #3012
- [fix] bugfix 1044: Auto-inject well-known JIT additional tensor buffers in prefill and decode run() APIs by @qsang-nv in #2855
- Update Docker CI tags to 20260408-4cce866 by @flashinfer-bot in #3018
- perf: Port TRT-LLM SM120/SM121 FP4 CUTLASS GEMM optimizations. Add PDL by @bkryu in #3026
- perf: Optimize CUTLASS MoE helper kernels for small-batch decode workloads by @bkryu in #3014
- [fix] bugfix 541: Make single_prefill/decode compatible with torch.compile CUDA graphs by @qsang-nv in #2857
- Prevent MoE autotuner buffer overflow on large token buckets by @leejnau in #3025
- Fused moe all-reduce routed scaling factor + quant support by @murphymatt in #2966
- fix: check for ptr before calling close_mnnvl_memory by @jdebache in #2892
- Second part of refactoring the routing part by @ChristinaZ in #2993
- feat(comm): add MOE Finalize/Reduction patterns to unified allreduce_fusion API by @samuellees in #2982
- Fix compilation error: add missing header by @he-yufeng in #2772
- [chore] Fix CI pre-commit mypy error by @jiahanc in #3040
- Add support for Relu2 in BF16 fused MoE by @amitz-nv in #2864
- fix: extend moe alltoall top-k specializations by @bobboli in #3021
- Fix MXFP4/MXFP8 failures in SM120 FAST_BUILD and expand all_tiles[] by @askliar in #2994
- [feat] Add blackwell GDN prefill kernel by @jiahanc in #3001
- Fix silent bug with FP8 per tensor non-gated MoE by @danisereb in #2882
- Add @qsang-nv as a code owner for attention by @sricketts in #3055
- [CuTe DSL] Add modular FMHA prefill and MLA decode attention kernels by @pgera in #2805
- bump version to 0.6.8 by @aleozlx in #3042
New Contributors
- @qiching made their first contribution in #2853
- @ZJY0516 made their first contribution in #2727
- @depaulmillz made their first contribution in #2738
- @Observer007 made their first contribution in #2901
- @yweng0828 made their first contribution in #2534
- @he-yufeng made their first contribution in #2756
- @wzhao18 made their first contribution in #2739
- @askliar made their first contribution in #2926
- @benhg made their first contribution in #2960
- @ooooo-create made their first contribution in #2919
- @harrisonlimh made their first contribution in #2991
- @feldsherov made their first contribution in #2908
- @murphymatt made their first contribution in #2966
- @pgera made their first contribution in #2805
Full Changelog: v0.6.7.post3...v0.6.8rc1