github flashinfer-ai/flashinfer v0.6.8rc1
Release v0.6.8rc1

one month ago

What's Changed

  • Add to CODEOWNER by @aleozlx in #2875
  • fix: int32 overflow in trtllm_fp4_block_scale_moe causing "Unsupported hidden state scale shape" for EP32+ configs by @qiching in #2853
  • feat: bump nvidia-cutlass-dsl to >=4.4.2 by @limin2021 in #2833
  • fix: add cute dsl moe utils to AOT by @nv-yunzheq in #2872
  • fix: fix cute dsl swap_ab tactic failure by @nv-yunzheq in #2870
  • [gdn] support non-contiguous state for decoding by @ZJY0516 in #2727
  • chore: fix the python dependency override by @yongwww in #2651
  • backinteg: nvidia-nvshmem-cu12 3.6.5 seems broken by @aleozlx in #2893
  • Yanqinz/gemm cudnn autotune fix by @yanqinz2 in #2863
  • feat: Add CuTe-DSL backend for NVFP4 quantization by @bkryu in #2838
  • Add cute dsl mla decode op by @limin2021 in #2743
  • Support for MXFP4 and NVFP4 group GEMMs on GeForce and Spark by @depaulmillz in #2738
  • feat: add pdl support for cute dsl mla decode kernel support by @Observer007 in #2901
  • feat: expose swizzled_input_sf parameter for CUTLASS fused MOE by @yzh119 in #2330
  • fix: support fp32 logits for fp8_per_tensor and fp8_block by @yweng0828 in #2534
  • Fix autotuner crash when input tensor is None by @he-yufeng in #2756
  • Support in-place update for trtllm_fp8_block_scale_moe by @wzhao18 in #2739
  • [fix] bugfix 2856: Fix pre-allocated out shape check in trtllm_batch_decode_with_kv_cache_mla for q_len_per_req > 1 by @qsang-nv in #2876
  • PR auto-labelling by @aleozlx in #2827
  • fix test error regarding logits_types by @aleozlx in #2918
  • Use 6-hour timeout for flashinfer-jit-cache wheel build (release + nightly) by @yongwww in #2880
  • fix: expose trigger_completion_at_end through unified API by @nv-yunzheq in #2894
  • fix: clamp enable_pdl=True to False on SM < 90 to prevent PDL PTX on Ampere by @bkryu in #2928
  • feat: add Relu2 (squared ReLU) activation support in CUTLASS MoE backend by @askliar in #2926
  • docker: upgrade cuDNN to latest version in CI install script by @bkryu in #2930
  • [NVIDIA] fix(jit): enable GDC for CUTLASS fused MoE PDL — prevent random crashes on SM12x by @johnnynunez in #2913
  • fix: Fix autotuner crash on meta-device tensor in trtllm_fp4_block_scale_routed_moe by @bkryu in #2916
  • Yanqinz/dynamic shape unified api by @yanqinz2 in #2910
  • doc: add CI triggering guide to CONTRIBUTING.md by @yongwww in #2924
  • read real strides for kv and block scale by @sychen52 in #2844
  • perf: Optimize CuTe-DSL fp4 and fp8 quantization kernels by @bkryu in #2904
  • fix: vectorize get_shuffle_matrix_a_row_indices to eliminate CPU contention by @youkaichao in #2935
  • feat: implement deterministic topk by @jiangyinzuo in #2661
  • feat(gdn): add BF16 state kernel with MTP support beyond T>4 with intermediate caching. by @ameynaik-hub in #2679
  • ci: remove 1gpu label from H100 runner selector by @yongwww in #2946
  • perf: Optimize GDN MTP decode kernel (v15) — eliminate ilp=1 fallback… by @ameynaik-hub in #2842
  • feat: add MXFP8 GEMM support for SM120 by @samuellees in #2902
  • fix: avoid re-downloading BMM export headers when flashinfer-cubin is installed by @yzh119 in #2903
  • test: xfail cuDNN FP8 prefill on Blackwell with CUDA <= 12.9 by @dierksen in #2963
  • test: skip unsupported mm_mxfp8 configurations on SM12x by @bkryu in #2974
  • [Fmha] revert blackwell ultra optimization that causes deadlocks. by @PerkzZheng in #2956
  • feat: SM121 (GB10) tile filtering and autotuner robustness by @askliar in #2927
  • Mamba SSU: horizontal MTP kernel (+ DSTATE=96 support) by @ishovkun in #2865
  • fix: use float instead of double in sampling binary search to avoid FP64 bottleneck on SM103 by @bkryu in #2945
  • Refactor the routing part by @ChristinaZ in #2803
  • fix: snap weight_scale_vec_size to handle block_scale_interleave padding for SM120 by @samuellees in #2898
  • Add filelock to ensure_symlink by @wzhao18 in #2979
  • Update NVSHMEM interface to use NVSHMEM4Py instead of custom bindings by @benhg in #2960
  • docs: document replay command in CLI reference by @ooooo-create in #2919
  • [Chore] add missing MOE code part by @jiahanc in #2998
  • enable_pdl_and_bias_for_cudnn_backend by @yanqinz2 in #2948
  • bench: Enable microbenchmarking on SM121 by @bkryu in #3002
  • fix: tinygemm2 hang issue due to barrier sync by @jimmyzho in #2996
  • [Perf] Refactor MoE autotuning to set valid topk ids in routed MoE tuning by @wzhao18 in #2942
  • fix: restore SM120 CUTLASS MoE tile candidate removed by #2927 (test_trtllm_cutlass_fused_moe.py) by @samuellees in #2984
  • misc: Update gemm/batched gemm cubins from trtllm-gen, gemm header refactor by @jimmyzho in #2740
  • fix: use sym_int64 for strides in rmsnorm CuTe DSL kernels to prevent int32 overflow by @bkryu in #3007
  • Add SM 103 as one of supported capabilities for mm_M1_16_K7168_N256 by @harrisonlimh in #2991
  • feat: add PDL support to rmsnorm_fp4quant and add_rmsnorm_fp4quant CuTe DSL kernels by @bkryu in #3008
  • [Fmha] support nvfp4 output keepsMmaAb generation kernels by @PerkzZheng in #2988
  • Only swizzle on v block scale; rename kv_block_scales to kv_cache_sf by @sychen52 in #2954
  • feat(gdn): state checkpointing in chunk_gated_delta_rule by @feldsherov in #2908
  • [chore] Install nvidia-cutlass-dsl[cu13] for cu130+ by @jiahanc in #3017
  • Add flashinfer.fused_rmsnorm_silu() with native kernel backend by @kahyunnam in #2965
  • feat: Support padding tokens with seqlen=0 for rope+quant+kv cache update fusion kernel by @elvischenv in #2792
  • Update README.md: Jetson Thor compute capability by @qiching in #3012
  • [fix] bugfix 1044: Auto-inject well-known JIT additional tensor buffers in prefill and decode run() APIs by @qsang-nv in #2855
  • Update Docker CI tags to 20260408-4cce866 by @flashinfer-bot in #3018
  • perf: Port TRT-LLM SM120/SM121 FP4 CUTLASS GEMM optimizations. Add PDL by @bkryu in #3026
  • perf: Optimize CUTLASS MoE helper kernels for small-batch decode workloads by @bkryu in #3014
  • [fix] bugfix 541: Make single_prefill/decode compatible with torch.compile CUDA graphs by @qsang-nv in #2857
  • Prevent MoE autotuner buffer overflow on large token buckets by @leejnau in #3025
  • Fused moe all-reduce routed scaling factor + quant support by @murphymatt in #2966
  • fix: check for ptr before calling close_mnnvl_memory by @jdebache in #2892
  • Second part of refactoring the routing part by @ChristinaZ in #2993
  • feat(comm): add MOE Finalize/Reduction patterns to unified allreduce_fusion API by @samuellees in #2982
  • Fix compilation error: add missing header by @he-yufeng in #2772
  • [chore] Fix CI pre-commit mypy error by @jiahanc in #3040
  • Add support for Relu2 in BF16 fused MoE by @amitz-nv in #2864
  • fix: extend moe alltoall top-k specializations by @bobboli in #3021
  • Fix MXFP4/MXFP8 failures in SM120 FAST_BUILD and expand all_tiles[] by @askliar in #2994
  • [feat] Add blackwell GDN prefill kernel by @jiahanc in #3001
  • Fix silent bug with FP8 per tensor non-gated MoE by @danisereb in #2882
  • Add @qsang-nv as a code owner for attention by @sricketts in #3055
  • [CuTe DSL] Add modular FMHA prefill and MLA decode attention kernels by @pgera in #2805
  • bump version to 0.6.8 by @aleozlx in #3042

Full Changelog: v0.6.7.post3...v0.6.8rc1
