What's Changed
- Try a one-character fix for the main branch by @aleozlx in #3213
- Add git submodule update to build_backend.py by @kahyunnam in #3190
- fix(cute_dsl/moe): correct tile_size=256 gemm2 tactic enumeration by @leejnau in #3171
- Fix trace-bmm-fp8 test: B should be K-major for subword types by @xrq-phys in #3184
- feat: Add DiT-oriented kernels where the QK (BMM1) type can be reinterpreted as Int8 or BFloat16 by @xrq-phys in #2711
- [fmha-v2] Support HND and NHD paged KV cache layouts with conditional stride handling by @zhou-yuxin in #2799
- [feat] Trtllm-gen per-token NVFP4 MoE by @IwakuraRein in #3027
- feat: Add cuBLASLt backend for `mm_bf16` and enable multi-tactic autotuning for FP8/MXFP8 runners by @vadiklyutiy in #2914
- trtllm non-causal support by @saltyminty in #3020
- feat: DiT layer norm fusions for WAN: flashinfer.diffusion_ops by @kahyunnam in #3157
- Refactor Part 3: Add block-per-token feature in the customized routing method by @ChristinaZ in #3166
- fix(cute_dsl/moe): correct off-by-one in get_max_num_tiles to match TRT-LLM by @leejnau in #3198
- Fix cuDNN NaN on SM120 by @yanqinz2 in #3192
- bump version to 0.6.10 by @aleozlx in #3179
- fix: align is_sm120f_supported with SM12x family semantics by @leonardHONG in #3175
- fix: add sm_121 to TMEM column fallback map by @leonardHONG in #3173
- Include TinyGEMM into BF16 autotuner by @askliar in #3203
- fix(dcp_alltoall): require MNNVL workspace, drop broken plain-memory path by @davidjpyu in #3210
- Integrate CUTLASS Small Tile N Blockscaled GEMMs/Grouped GEMMs for SM120 and SM121 by @depaulmillz in #3152
- Fix bf16 cudnn override-shape test call signature by @Vinnie6167 in #3215
- Ameyn/wide vec t1 by @ameynaik-hub in #3147
- [Perf] Add FMHAv2 to flashinfer_benchmark.py and eliminate unnecessary H2D by @jimmyzho in #2841
- Fix multiple instances using the same random seed by @guyuankan in #3102
- Add grouped_mm operation directory by @yanqinz2 in #3052
- Support allreduce + norm + per-token group FP8 quant fusion by @wzhao18 in #3059 (the quantization step is sketched below the list)
- [Bugfix] Fix fused MoE autotuning correctness issues by filtering clusterDimZ by @wzhao18 in #3227
- fix: add jitter to cubin download backoff by @pluh-nv in #3169
- cute_dsl/moe: drop redundant Python-side moe_sort buffer init by @leejnau in #3226
- Support sigmoid (sigmoid+topk) routing function by @EdalatiAli in #2869 (see the routing sketch below the list)
- cute-dsl fmha prefill (cubin integration): remove front-padding, add attention_sink and PDL support by @limin2021 in #3181
- fix(mla): widen page index to int64_t to avoid 32-bit overflow by @Tracin in #3136 (the overflow class is illustrated below the list)
- bump version to 0.6.11 by @aleozlx in #3245
- fix(cute_dsl/moe): make autotuner bucket configuration adapt to runtime input by @leejnau in #3216
- Fix: skip git submodule update when submodules are already populated by @kahyunnam in #3248
- Fix 10 bugs in BF16 XQA MLA kernel for SM120/SM121 by @blake-snc in #2689
- Tweak grouped_mm API to make backend-specific arguments keyword-only by @yanqinz2 in #3253
- perf(moe): optimize SM120 b12x MoE short decode by @lukealonso in #3193
- feat: Enable FP8 (E4M3/E5M2) in concat_mla_k to optimize long-context prefill performance, and refactor type dispatch for BF16/FP16 by @qiching in #3129
- fix hang in allreduce comms in SGL by @b8zhong in #3247
- fix(sm12x): fix micro-kernel workspace sizing when routed_rows > num_local_experts by @meena-at-work in #3191
- Issue #3047: Handle empty KV in MLA chunked-prefill by @saltyminty in #3251
- Bump CUTLASS DSL to 4.5 by @kahyunnam in #3246
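A few of the entries above name self-contained techniques that are worth a quick illustration. First, the allreduce + norm + per-token group FP8 quant fusion in #3059: the sketch below isolates the per-token group FP8 (E4M3) quantization step. It is a minimal PyTorch sketch under assumed conventions (a group size of 128, one dequantization scale per group), not the fused kernel itself.

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def per_token_group_quant_fp8(x: torch.Tensor, group_size: int = 128):
    """Quantize each token's row in groups of `group_size`, one scale per
    group, chosen so the group's max magnitude maps onto the E4M3 range.
    Minimal sketch only; the fused kernel in #3059 also performs the
    allreduce and norm steps, which are omitted here."""
    num_tokens, hidden = x.shape
    assert hidden % group_size == 0
    groups = x.view(num_tokens, hidden // group_size, group_size)
    amax = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / E4M3_MAX                        # dequantization scale per group
    q = (groups / scale).to(torch.float8_e4m3fn)   # scaled values fit in E4M3
    return q.view(num_tokens, hidden), scale.squeeze(-1)

x = torch.randn(4, 256)                   # 4 tokens, hidden size 256
q, scales = per_token_group_quant_fp8(x)  # scales: [4, 2], one per group
```

Keeping one scale per small group, rather than one per tensor, bounds the quantization error contributed by outlier activations to their own group.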
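Second, the sigmoid (sigmoid+topk) routing function added in #2869. The sketch below shows the general scheme: each expert is scored independently with a sigmoid rather than a softmax over experts, the top-k experts survive per token, and their weights are renormalized. The renormalization is a common convention assumed here, not something the entry specifies.

```python
import torch

def sigmoid_topk_routing(router_logits: torch.Tensor, top_k: int):
    """Score experts independently with a sigmoid, keep the top-k per
    token, and renormalize the surviving weights to sum to one.
    router_logits: [num_tokens, num_experts]; returns two
    [num_tokens, top_k] tensors (weights, expert indices)."""
    scores = torch.sigmoid(router_logits)          # per-expert, not softmax
    weights, indices = torch.topk(scores, top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # assumed renormalization
    return weights, indices

logits = torch.randn(4, 8)                         # 4 tokens, 8 experts
w, idx = sigmoid_topk_routing(logits, top_k=2)
```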
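Finally, the 32-bit page-index overflow fixed in #3136 is a well-known bug class; the snippet below reproduces it with NumPy's fixed-width integers. The page id and stride values are hypothetical, chosen only to push the product past INT32_MAX.

```python
import numpy as np

page_id = np.int32(100_000)     # hypothetical page id in a large KV cache
page_stride = np.int32(65_536)  # hypothetical elements per KV page

# 100_000 * 65_536 = 6_553_600_000 > 2**31 - 1, so the 32-bit product
# wraps to a negative offset (NumPy also emits a RuntimeWarning here).
bad = page_id * page_stride

# Widening to int64 before multiplying, as the fix does, keeps the
# flat index correct.
good = np.int64(page_id) * np.int64(page_stride)

print(bad)   # negative wrapped value
print(good)  # 6553600000
```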
New Contributors
- @xrq-phys made their first contribution in #3184
- @zhou-yuxin made their first contribution in #2799
- @leonardHONG made their first contribution in #3175
- @guyuankan made their first contribution in #3102
- @pluh-nv made their first contribution in #3169
- @EdalatiAli made their first contribution in #2869
- @Tracin made their first contribution in #3136
- @lukealonso made their first contribution in #3193
Full Changelog: v0.6.10rc1...v0.6.11