What's Changed
- perf(gdn): optimize MTP kernel with ILP rows and SMEM v caching by @ameynaik-hub in #2618
- Feat/gdn decode pooled by @xutizhou in #2521
- fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops by @voipmonitor in #2716
- Support NVFP4 KV cache decode on SM120 by @Tom-Zheng in #2520
- feat: Add TRTLLM fmha_v2 library for SM90 attention with Skip-Softmax by @jimmyzho in #2446
- bump version to 0.6.6 by @aleozlx in #2724
- [benchmark] Add All Reduce benchmark by @jiahanc in #2696
- Revert "fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops" by @aleozlx in #2737
- refactor: refactoring cuda code to cute-dsl (part 1) by @yzh119 in #2428
- Added missing padding by @nvjullin in #2726
- docker: add CUDA 13.1 Dockerfiles with cuda-tile by @yongwww in #2774
- [BugFix] guard against uint32 underflow in multi-CTA TopK chunk calculation by @LopezCastroRoberto in #2592
- fix: guard CUTLASS FMHA against SM12x and fix fmha_v2 SM121a check by @blake-snc in #2560
- fix: fix illegal memory access for NaN input in sampling kernels by @zack041 in #2456
- Add cuda-tile to package dependencies by @yzh119 in #2758
- tests: skip sliding window + fp8 to prevent hang in fmha_v2 unit tests by @jimmyzho in #2781
- feat: Add autotuner config caching, thread safety, and documentation by @bkryu in #2554
- fix: block PR merge when CI is skipped due to pending authorization by @yongwww in #2761
- [feat] Add air top-p algorithm by @qsang-nv in #2752
- [chore] Add jiahanc to moe related code owner by @jiahanc in #2748
- fix: Fix cute dsl moe failure with nvidia-cutlass-dsl >= 4.4.0 by @nv-yunzheq in #2735
- [Spark unit test debugging] Fix for tests/attention/test_trtllm_gen_mla.py by @kahyunnam in #2750
- [Spark unit test debugging] Fix for tests/gemm/test_groupwise_scaled_gemm_fp8.py by @kahyunnam in #2751
- [feat] Add 2048 experts and 32 Top K by @jiahanc in #2744
- perf: Performance tune cute dsl RMSNorm variants by @bkryu in #2777
- feat: Add FP4 KV cache quant/dequant kernels by @samuellees in #2757
- Add cute-dsl backends to mxfp[8,4]_quantization for future refactor by @bkryu in #2443
- feat: FP32 dtype output for BF16 matmuls (CUTLASS & cuDNN) by @raayandhar in #2644
- Create separate cuDNN handle per GPU by @dhiraj113 in #2688
- CuteDSL MoE fix redundant output buffer zeroing by @leejnau in #2811
- Add NVFP4 KV cache quantization support for SM100 by @sychen52 in #2702
- [fix] Bugfix 1367: fix VariableBlockSparseAttention buffer overflow by dynamically resizing kv_lens_buffer by @qsang-nv in #2802
- fix: Workaround org teams perm issue for approval purposes by @aleozlx in #2816
- Implement override shape support for cuDNN GEMM operations by @yanqinz2 in #2790
- feat: Add support for TRTLLM MXFP8 non-gated MoE with ReLU2 by @danisereb in #2707
- Upgrade cutlass 4.2.1 -> 4.4.2 by @kahyunnam in #2798
- chore: cute dsl nvfp4 moe clean up by @nv-yunzheq in #2775
- fix: Add SM120 (RTX Blackwell desktop) support for NVFP4 MoE kernels by @brandonmmusic-max in #2725
- Protect against null clusterUuid in mnnvl.py by @akshaver in #2626
- Deprecation for gated_delta_rule_mtp's intermediate_states_buffer=True by @kahyunnam in #2730
- fix: Autotuner _find_nearest_profile non-power-of-2 num_tokens, create launchers for all supported tileN in trtllm fused MoE by @amitz-nv in #2821
- fix(jit): enable GDC for CUTLASS GEMM PDL — SM100 flag only by @voipmonitor in #2780
- [Fmha] Sparse MLA decode kernel selection heuristics by @PerkzZheng in #2836
- fix: add missing re-exports for rmsnorm quant and fused_add_rmsnorm q… by @DevashishLal-CB in #2783
- Add varlen and speculative decoding support to selective state update by @roikoren755 in #2700
- [feat] trtllm-gen mxfp8 gemm by @IwakuraRein in #2653
- [Spark bug] Fix arch 12.1 -> "sm120a" flag for Spark, CUDA 12.9 by @kahyunnam in #2839
- skip per-pr for draft PRs by @aleozlx in #2831
- feat(gdn): add padding index guard for bf16 decode kernel by @kaixih in #2810
- docker: Add CUDA 13.2 Docker containers by @bkryu in #2843
- [fix] bugfix 1419: Add batch size shape validation in decode and prefill run() APIs by @qsang-nv in #2801
- Update Docker CI tags to 20260322-ff86ea0 by @flashinfer-bot in #2854
- feat: Expose TRT-LLM FMHA style paged KV Cache and page table layout by @DomBrown in #2770
- [Spark unit test] Adjust tolerance for test_xqa, test_logits_processor by @kahyunnam in #2828
- Mamba2 SSD Combined Forward Pass (Blackwell CuTe DSL Kernel) by @ishovkun in #2709
- bump version to 0.6.7 & fix api breaking changes by @aleozlx in #2832
- [Spark unit test debugging] Fix for tests/autotuner/test_autotuner_core.py by @kahyunnam in #2867
- fix: use current CUDA device instead of tp_rank for SymmDeviceMemory allocation by @fzyzcjy in #2662

New Contributors
- @voipmonitor made their first contribution in #2716
- @dhiraj113 made their first contribution in #2688
- @leejnau made their first contribution in #2811
- @sychen52 made their first contribution in #2702
- @yanqinz2 made their first contribution in #2790
- @brandonmmusic-max made their first contribution in #2725
- @akshaver made their first contribution in #2626
- @DevashishLal-CB made their first contribution in #2783
- @roikoren755 made their first contribution in #2700

Full Changelog: v0.6.6...v0.6.7