github flashinfer-ai/flashinfer v0.6.7
Release v0.6.7


What's Changed

  • perf(gdn): optimize MTP kernel with ILP rows and SMEM v caching by @ameynaik-hub in #2618
  • Feat/gdn decode pooled by @xutizhou in #2521
  • fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops by @voipmonitor in #2716
  • Support NVFP4 KV cache decode on SM120 by @Tom-Zheng in #2520
  • feat: Add TRTLLM fmha_v2 library for SM90 attention with Skip-Softmax by @jimmyzho in #2446
  • bump version to 0.6.6 by @aleozlx in #2724
  • [benchmark] Add All Reduce benchmark by @jiahanc in #2696
  • Revert "fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops" by @aleozlx in #2737
  • refactor: refactoring cuda code to cute-dsl (part 1) by @yzh119 in #2428
  • Added missing padding by @nvjullin in #2726
  • docker: add CUDA 13.1 Dockerfiles with cuda-tile by @yongwww in #2774
  • [BugFix] guard against uint32 underflow in multi-CTA TopK chunk calculation by @LopezCastroRoberto in #2592
  • fix: guard CUTLASS FMHA against SM12x and fix fmha_v2 SM121a check by @blake-snc in #2560
  • fix: fix illegal memory access for NaN input in sampling kernels by @zack041 in #2456
  • Add cuda-tile to package dependencies by @yzh119 in #2758
  • tests: skip sliding window + fp8 to prevent hang in fmha_v2 unit tests by @jimmyzho in #2781
  • feat: Add autotuner config caching, thread safety, and documentation by @bkryu in #2554
  • fix: block PR merge when CI is skipped due to pending authorization by @yongwww in #2761
  • [feat] Add air top-p algorithm by @qsang-nv in #2752
  • [chore] Add jiahanc to moe related code owner by @jiahanc in #2748
  • fix: Fix cute dsl moe failure with nvidia-cutlass-dsl >= 4.4.0 by @nv-yunzheq in #2735
  • [Spark unit test debugging] Fix for tests/attention/test_trtllm_gen_mla.py by @kahyunnam in #2750
  • [Spark unit test debugging] Fix for tests/gemm/test_groupwise_scaled_gemm_fp8.py by @kahyunnam in #2751
  • [feat] Add 2048 experts and 32 Top K by @jiahanc in #2744
  • perf: Performance tune cute dsl RMSNorm variants by @bkryu in #2777
  • feat: Add FP4 KV cache quant/dequant kernels by @samuellees in #2757
  • Add cute-dsl backends to mxfp[8,4]_quantization for future refactor by @bkryu in #2443
  • feat: FP32 dtype output for BF16 matmuls (CUTLASS & cuDNN) by @raayandhar in #2644
  • Create separate cuDNN handle per GPU by @dhiraj113 in #2688
  • CuteDSL MoE fix redundant output buffer zeroing by @leejnau in #2811
  • Add NVFP4 KV cache quantization support for SM100 by @sychen52 in #2702
  • [fix] Bugfix 1367: fix VariableBlockSparseAttention buffer overflow by dynamically resizing kv_lens_buffer by @qsang-nv in #2802
  • fix: Workaround org teams perm issue for approval purposes by @aleozlx in #2816
  • Implement override shape support for cuDNN GEMM operations by @yanqinz2 in #2790
  • feat: Add support for TRTLLM MXFP8 non-gated MoE with ReLU2 by @danisereb in #2707
  • Upgrade cutlass 4.2.1 -> 4.4.2 by @kahyunnam in #2798
  • chore: cute dsl nvfp4 moe clean up by @nv-yunzheq in #2775
  • fix: Add SM120 (RTX Blackwell desktop) support for NVFP4 MoE kernels by @brandonmmusic-max in #2725
  • Protect against null clusterUuid in mnnvl.py by @akshaver in #2626
  • Deprecation for gated_delta_rule_mtp's intermediate_states_buffer=True by @kahyunnam in #2730
  • fix: Autotuner _find_nearest_profile non-power-of-2 num_tokens, create launchers for all supported tileN in trtllm fused MoE by @amitz-nv in #2821
  • fix(jit): enable GDC for CUTLASS GEMM PDL — SM100 flag only by @voipmonitor in #2780
  • [Fmha] Sparse MLA decode kernel selection heuristics by @PerkzZheng in #2836
  • fix: add missing re-exports for rmsnorm quant and fused_add_rmsnorm q… by @DevashishLal-CB in #2783
  • Add varlen and speculative decoding support to selective state update by @roikoren755 in #2700
  • [feat] trtllm-gen mxfp8 gemm by @IwakuraRein in #2653
  • [Spark bug] Fix arch 12.1 -> "sm120a" flag for Spark, CUDA 12.9 by @kahyunnam in #2839
  • skip per-pr for draft PRs by @aleozlx in #2831
  • feat(gdn): add padding index guard for bf16 decode kernel by @kaixih in #2810
  • docker: Add CUDA 13.2 Docker containers by @bkryu in #2843
  • [fix] bugfix 1419: Add batch size shape validation in decode and prefill run() APIs by @qsang-nv in #2801
  • Update Docker CI tags to 20260322-ff86ea0 by @flashinfer-bot in #2854
  • feat: Expose TRT-LLM FMHA style paged KV Cache and page table layout by @DomBrown in #2770
  • [Spark unit test] Adjust tolerance for test_xqa, test_logits_processor by @kahyunnam in #2828
  • Mamba2 SSD Combined Forward Pass (Blackwell CuTe DSL Kernel) by @ishovkun in #2709
  • bump version to 0.6.7 & fix api breaking changes by @aleozlx in #2832
  • [Spark unit test debugging] Fix for tests/autotuner/test_autotuner_core.py by @kahyunnam in #2867
  • fix: use current CUDA device instead of tp_rank for SymmDeviceMemory allocation by @fzyzcjy in #2662

New Contributors

Full Changelog: v0.6.6...v0.6.7
