flashinfer-ai/flashinfer v0.6.8rc1 on GitHub

What's Changed

Add to CODEOWNER by @aleozlx in #2875
fix: int32 overflow in trtllm_fp4_block_scale_moe causing "Unsupported hidden state scale shape" for EP32+ configs by @qiching in #2853
feat: bump nvidia-cutlass-dsl to >=4.4.2 by @limin2021 in #2833
fix: add cute dsl moe utils to AOT by @nv-yunzheq in #2872
fix: fix cute dsl swap_ab tactic failure by @nv-yunzheq in #2870
[gdn] support non-contiguous state for decoding by @ZJY0516 in #2727
chore: fix the python dependency override by @yongwww in #2651
backinteg: nvidia-nvshmem-cu12 3.6.5 seems broken by @aleozlx in #2893
Yanqinz/gemm cudnn autotune fix by @yanqinz2 in #2863
feat: Add CuTe-DSL backend for NVFP4 quantization by @bkryu in #2838
Add cute dsl mla decode op by @limin2021 in #2743
Support for MXFP4 and NVFP4 group GEMMs on GeForce and Spark by @depaulmillz in #2738
feat: add pdl support for cute dsl mla decode kernel support by @Observer007 in #2901
feat: expose swizzled_input_sf parameter for CUTLASS fused MOE by @yzh119 in #2330
fix: support fp32 logits for fp8_per_tensor and fp8_block by @yweng0828 in #2534
Fix autotuner crash when input tensor is None by @he-yufeng in #2756
Support in-place update for trtllm_fp8_block_scale_moe by @wzhao18 in #2739
[fix] bugfix 2856: Fix pre-allocated out shape check in trtllm_batch_decode_with_kv_cache_mla for q_len_per_req > 1 by @qsang-nv in #2876
PR auto-labelling by @aleozlx in #2827
fix test error regarding logits_types by @aleozlx in #2918
Use 6-hour timeout for flashinfer-jit-cache wheel build (release + nightly) by @yongwww in #2880
fix: expose trigger_completion_at_end through unified API by @nv-yunzheq in #2894
fix: clamp enable_pdl=True to False on SM < 90 to prevent PDL PTX on Ampere by @bkryu in #2928
feat: add Relu2 (squared ReLU) activation support in CUTLASS MoE backend by @askliar in #2926
docker: upgrade cuDNN to latest version in CI install script by @bkryu in #2930
[NVIDIA] fix(jit): enable GDC for CUTLASS fused MoE PDL — prevent random crashes on SM12x by @johnnynunez in #2913
fix: Fix autotuner crash on meta-device tensor in trtllm_fp4_block_scale_routed_moe by @bkryu in #2916
Yanqinz/dynamic shape unified api by @yanqinz2 in #2910
doc: add CI triggering guide to CONTRIBUTING.md by @yongwww in #2924
read real strides for kv and block scale by @sychen52 in #2844
perf: Optimize CuTe-DSL fp4 and fp8 quantization kernels by @bkryu in #2904
fix: vectorize get_shuffle_matrix_a_row_indices to eliminate CPU contention by @youkaichao in #2935
feat: implement deterministic topk by @jiangyinzuo in #2661
feat(gdn): add BF16 state kernel with MTP support beyond T>4 with intermediate caching. by @ameynaik-hub in #2679
ci: remove 1gpu label from H100 runner selector by @yongwww in #2946
perf: Optimize GDN MTP decode kernel (v15) — eliminate ilp=1 fallback… by @ameynaik-hub in #2842
feat: add MXFP8 GEMM support for SM120 by @samuellees in #2902
fix: avoid re-downloading BMM export headers when flashinfer-cubin is installed by @yzh119 in #2903
test: xfail cuDNN FP8 prefill on Blackwell with CUDA <= 12.9 by @dierksen in #2963
test: skip unsupported mm_mxfp8 configurations on SM12x by @bkryu in #2974
[Fmha] revert blackwell ultra optimization that causes deadlocks. by @PerkzZheng in #2956
feat: SM121 (GB10) tile filtering and autotuner robustness by @askliar in #2927
Mamba SSU: horizontal MTP kernel (+ DSTATE=96 support) by @ishovkun in #2865
fix: use float instead of double in sampling binary search to avoid FP64 bottleneck on SM103 by @bkryu in #2945
Refactor the routing part by @ChristinaZ in #2803
fix: snap weight_scale_vec_size to handle block_scale_interleave padding for SM120 by @samuellees in #2898
Add filelock to ensure_symlink by @wzhao18 in #2979
Update NVSHMEM interface to use NVSHMEM4Py instead of custom bindings by @benhg in #2960
docs: document replay command in CLI reference by @ooooo-create in #2919
[Chore] add missing MOE code part by @jiahanc in #2998
enable_pdl_and_bias_for_cudnn_backend by @yanqinz2 in #2948
bench: Enable microbenchmarking on SM121 by @bkryu in #3002
fix: tinygemm2 hang issue due to barrier sync by @jimmyzho in #2996
[Perf] Refactor MoE autotuning to set valid topk ids in routed MoE tuning by @wzhao18 in #2942
fix: restore SM120 CUTLASS MoE tile candidate removed by #2927 (test_trtllm_cutlass_fused_moe.py) by @samuellees in #2984
misc: Update gemm/batched gemm cubins from trtllm-gen, gemm header refactor by @jimmyzho in #2740
fix: use sym_int64 for strides in rmsnorm CuTe DSL kernels to prevent int32 overflow by @bkryu in #3007
Add SM 103 as one of supported capabilities for mm_M1_16_K7168_N256 by @harrisonlimh in #2991
feat: add PDL support to rmsnorm_fp4quant and add_rmsnorm_fp4quant CuTe DSL kernels by @bkryu in #3008
[Fmha] support nvfp4 output keepsMmaAb generation kernels by @PerkzZheng in #2988
Only swizzle on v block scale; rename kv_block_scales to kv_cache_sf by @sychen52 in #2954
feat(gdn): state checkpointing in chunk_gated_delta_rule by @feldsherov in #2908
[chore] Install nvidia-cutlass-dsl[cu13] for cu130+ by @jiahanc in #3017
Add flashinfer.fused_rmsnorm_silu() with native kernel backend by @kahyunnam in #2965
feat: Support padding tokens with seqlen=0 for rope+quant+kv cache update fusion kernel by @elvischenv in #2792
Update README.md: Jetson Thor compute capability by @qiching in #3012
[fix] bugfix 1044: Auto-inject well-known JIT additional tensor buffers in prefill and decode run() APIs by @qsang-nv in #2855
Update Docker CI tags to 20260408-4cce866 by @flashinfer-bot in #3018
perf: Port TRT-LLM SM120/SM121 FP4 CUTLASS GEMM optimizations. Add PDL by @bkryu in #3026
perf: Optimize CUTLASS MoE helper kernels for small-batch decode workloads by @bkryu in #3014
[fix] bugfix 541: Make single_prefill/decode compatible with torch.compile CUDA graphs by @qsang-nv in #2857
Prevent MoE autotuner buffer overflow on large token buckets by @leejnau in #3025
Fused moe all-reduce routed scaling factor + quant support by @murphymatt in #2966
fix: check for ptr before calling close_mnnvl_memory by @jdebache in #2892
Second part of refactoring the routing part by @ChristinaZ in #2993
feat(comm): add MOE Finalize/Reduction patterns to unified allreduce_fusion API by @samuellees in #2982
Fix compilation error: add missing header by @he-yufeng in #2772
[chore] Fix CI pre-commit mypy error by @jiahanc in #3040
Add support for Relu2 in BF16 fused MoE by @amitz-nv in #2864
fix: extend moe alltoall top-k specializations by @bobboli in #3021
Fix MXFP4/MXFP8 failures in SM120 FAST_BUILD and expand all_tiles[] by @askliar in #2994
[feat] Add blackwell GDN prefill kernel by @jiahanc in #3001
Fix silent bug with FP8 per tensor non-gated MoE by @danisereb in #2882
Add @qsang-nv as a code owner for attention by @sricketts in #3055
[CuTe DSL] Add modular FMHA prefill and MLA decode attention kernels by @pgera in #2805
bump version to 0.6.8 by @aleozlx in #3042
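
Two of the fixes listed above (#2853 and #3007) address int32 overflow in size or stride arithmetic. The sketch below is not taken from either PR. It only illustrates the general failure mode: a product of two in-range dimensions wraps negative when stored in a signed 32-bit integer. The helper `wrap_int32` and the sizes `num_tokens` and `hidden_size` are hypothetical, chosen so the product exceeds 2**31 - 1.

```python
def wrap_int32(value: int) -> int:
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


# Hypothetical sizes (not from the PRs); each fits in int32 on its own.
num_tokens = 65536
hidden_size = 40960

# Python ints are arbitrary precision, so this is the exact product.
exact = num_tokens * hidden_size

# What a signed 32-bit variable would hold after the same multiplication.
wrapped = wrap_int32(exact)

print(f"exact product : {exact}")
print(f"as int32      : {wrapped}")
print(f"fits in int32 : {exact <= 2**31 - 1}")
```

The usual remedy in kernel code is to carry such offsets and strides in a 64-bit type, which is what the title of #3007 describes for the rmsnorm CuTe DSL kernels.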

New Contributors

@qiching made their first contribution in #2853
@ZJY0516 made their first contribution in #2727
@depaulmillz made their first contribution in #2738
@Observer007 made their first contribution in #2901
@yweng0828 made their first contribution in #2534
@he-yufeng made their first contribution in #2756
@wzhao18 made their first contribution in #2739
@askliar made their first contribution in #2926
@benhg made their first contribution in #2960
@ooooo-create made their first contribution in #2919
@harrisonlimh made their first contribution in #2991
@feldsherov made their first contribution in #2908
@murphymatt made their first contribution in #2966
@pgera made their first contribution in #2805

Full Changelog: v0.6.7.post3...v0.6.8rc1
