flashinfer-ai/flashinfer v0.6.7 on GitHub

What's Changed

perf(gdn): optimize MTP kernel with ILP rows and SMEM v caching by @ameynaik-hub in #2618
Feat/gdn decode pooled by @xutizhou in #2521
fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops by @voipmonitor in #2716
Support NVFP4 KV cache decode on SM120 by @Tom-Zheng in #2520
feat: Add TRTLLM fmha_v2 library for SM90 attention with Skip-Softmax by @jimmyzho in #2446
bump version to 0.6.6 by @aleozlx in #2724
[benchmark] Add All Reduce benchmark by @jiahanc in #2696
Revert "fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops" by @aleozlx in #2737
refactor: refactoring cuda code to cute-dsl (part 1) by @yzh119 in #2428
Added missing padding by @nvjullin in #2726
docker: add CUDA 13.1 Dockerfiles with cuda-tile by @yongwww in #2774
[BugFix] guard against uint32 underflow in multi-CTA TopK chunk calculation by @LopezCastroRoberto in #2592
fix: guard CUTLASS FMHA against SM12x and fix fmha_v2 SM121a check by @blake-snc in #2560
fix: fix illegal memory access for NaN input in sampling kernels by @zack041 in #2456
Add cuda-tile to package dependencies by @yzh119 in #2758
tests: skip sliding window + fp8 to prevent hang in fmha_v2 unit tests by @jimmyzho in #2781
feat: Add autotuner config caching, thread safety, and documentation by @bkryu in #2554
fix: block PR merge when CI is skipped due to pending authorization by @yongwww in #2761
[feat] Add air top-p algorithm by @qsang-nv in #2752
[chore] Add jiahanc to moe related code owner by @jiahanc in #2748
fix: Fix cute dsl moe failure with nvidia-cutlass-dsl >= 4.4.0 by @nv-yunzheq in #2735
[Spark unit test debugging] Fix for tests/attention/test_trtllm_gen_mla.py by @kahyunnam in #2750
[Spark unit test debugging] Fix for tests/gemm/test_groupwise_scaled_gemm_fp8.py by @kahyunnam in #2751
[feat] Add 2048 experts and 32 Top K by @jiahanc in #2744
perf: Performance tune cute dsl RMSNorm variants by @bkryu in #2777
feat: Add FP4 KV cache quant/dequant kernels by @samuellees in #2757
Add cute-dsl backends to mxfp[8,4]_quantization for future refactor by @bkryu in #2443
feat: FP32 dtype output for BF16 matmuls (CUTLASS & cuDNN) by @raayandhar in #2644
Create separate cuDNN handle per GPU by @dhiraj113 in #2688
CuteDSL MoE fix redundant output buffer zeroing by @leejnau in #2811
Add NVFP4 KV cache quantization support for SM100 by @sychen52 in #2702
[fix] Bugfix 1367: fix VariableBlockSparseAttention buffer overflow by dynamically resizing kv_lens_buffer by @qsang-nv in #2802
fix: Workaround org teams perm issue for approval purposes by @aleozlx in #2816
Implement override shape support for cuDNN GEMM operations by @yanqinz2 in #2790
feat: Add support for TRTLLM MXFP8 non-gated MoE with ReLU2 by @danisereb in #2707
Upgrade cutlass 4.2.1 -> 4.4.2 by @kahyunnam in #2798
chore: cute dsl nvfp4 moe clean up by @nv-yunzheq in #2775
fix: Add SM120 (RTX Blackwell desktop) support for NVFP4 MoE kernels by @brandonmmusic-max in #2725
Protect against null clusterUuid in mnnvl.py by @akshaver in #2626
Deprecation for gated_delta_rule_mtp's intermediate_states_buffer=True by @kahyunnam in #2730
fix: Autotuner _find_nearest_profile non-power-of-2 num_tokens, create launchers for all supported tileN in trtllm fused MoE by @amitz-nv in #2821
fix(jit): enable GDC for CUTLASS GEMM PDL — SM100 flag only by @voipmonitor in #2780
[Fmha] Sparse MLA decode kernel selection heuristics by @PerkzZheng in #2836
fix: add missing re-exports for rmsnorm quant and fused_add_rmsnorm q… by @DevashishLal-CB in #2783
Add varlen and speculative decoding support to selective state update by @roikoren755 in #2700
[feat] trtllm-gen mxfp8 gemm by @IwakuraRein in #2653
[Spark bug] Fix arch 12.1 -> "sm120a" flag for Spark, CUDA 12.9 by @kahyunnam in #2839
skip per-pr for draft PRs by @aleozlx in #2831
feat(gdn): add padding index guard for bf16 decode kernel by @kaixih in #2810
docker: Add CUDA 13.2 Docker containers by @bkryu in #2843
[fix] bugfix 1419: Add batch size shape validation in decode and prefill run() APIs by @qsang-nv in #2801
Update Docker CI tags to 20260322-ff86ea0 by @flashinfer-bot in #2854
feat: Expose TRT-LLM FMHA style paged KV Cache and page table layout by @DomBrown in #2770
[Spark unit test] Adjust tolerance for test_xqa, test_logits_processor by @kahyunnam in #2828
Mamba2 SSD Combined Forward Pass (Blackwell CuTe DSL Kernel) by @ishovkun in #2709
bump version to 0.6.7 & fix api breaking changes by @aleozlx in #2832
[Spark unit test debugging] Fix for tests/autotuner/test_autotuner_core.py by @kahyunnam in #2867
fix: use current CUDA device instead of tp_rank for SymmDeviceMemory allocation by @fzyzcjy in #2662

New Contributors

@voipmonitor made their first contribution in #2716
@dhiraj113 made their first contribution in #2688
@leejnau made their first contribution in #2811
@sychen52 made their first contribution in #2702
@yanqinz2 made their first contribution in #2790
@brandonmmusic-max made their first contribution in #2725
@akshaver made their first contribution in #2626
@DevashishLal-CB made their first contribution in #2783
@roikoren755 made their first contribution in #2700

Full Changelog: v0.6.6...v0.6.7
