Release v0.6.11

Pre-release

What's Changed

  • Apply a one-character fix for the main branch by @aleozlx in #3213
  • Add git submodule update to build_backend.py by @kahyunnam in #3190
  • fix(cute_dsl/moe): correct tile_size=256 gemm2 tactic enumeration by @leejnau in #3171
  • Fix trace-bmm-fp8 test: B should be K-major for subword types by @xrq-phys in #3184
  • feat: Add DiT-oriented kernels where Qk (Bmm1) type can be reinterpreted into Int8 or BFloat16 by @xrq-phys in #2711
  • [fmha-v2] Support HND and NHD paged KV cache layouts with conditional stride handling by @zhou-yuxin in #2799
  • [feat] Trtllm-gen Per-token Nvfp4 MoE by @IwakuraRein in #3027
  • feat: Add cuBLASLt backend for mm_bf16 and enable multi-tactic autotuning for FP8/MXFP8 runners by @vadiklyutiy in #2914
  • trtllm non-causal support by @saltyminty in #3020
  • feat: DiT layer norm fusions for WAN: flashinfer.diffusion_ops by @kahyunnam in #3157
  • Refactor Part 3- Add block-per-token feature in the customized routing method by @ChristinaZ in #3166
  • fix(cute_dsl/moe): correct off-by-one in get_max_num_tiles to match TRT-LLM by @leejnau in #3198
  • Fix cuDNN SM120 NaN issue by @yanqinz2 in #3192
  • bump version to 0.6.10 by @aleozlx in #3179
  • fix: align is_sm120f_supported with SM12x family semantics by @leonardHONG in #3175
  • fix: add sm_121 to TMEM column fallback map by @leonardHONG in #3173
  • Include TinyGEMM into BF16 autotuner by @askliar in #3203
  • fix(dcp_alltoall): require MNNVL workspace, drop broken plain-memory path by @davidjpyu in #3210
  • Integrate CUTLASS Small Tile N Blockscaled GEMMs/Grouped GEMMs for SM120 and SM121 by @depaulmillz in #3152
  • Fix bf16 cudnn override-shape test call signature by @Vinnie6167 in #3215
  • Ameyn/wide vec t1 by @ameynaik-hub in #3147
  • [Perf] Add FMHAv2 to flashinfer_benchmark.py and eliminate unnecessary H2D by @jimmyzho in #2841
  • Fix multi-instances using same random seed by @guyuankan in #3102
  • Add grouped_mm operation directory by @yanqinz2 in #3052
  • Support Allreduce + Norm + Per-token Group Fp8 Quant Fusion by @wzhao18 in #3059
  • [Bugfix] Fix fused MoE autotuning correctness issues by filtering clusterDimZ by @wzhao18 in #3227
  • fix: add jitter to cubin download backoff by @pluh-nv in #3169
  • cute_dsl/moe: drop redundant Python-side moe_sort buffer init by @leejnau in #3226
  • Support Sigmoid (sigmoid+topk) routing function by @EdalatiAli in #2869
  • cute-dsl fmha prefill (cubin integration): remove front-padding, add attention_sink, and pdl support by @limin2021 in #3181
  • fix(mla): widen page index to int64_t to avoid 32-bit overflow by @Tracin in #3136
  • bump version to 0.6.11 by @aleozlx in #3245
  • fix(cute_dsl/moe): make autotuner bucket configuration adapt to runtime input by @leejnau in #3216
  • Fix: skip git submodule update when submodules are already populated by @kahyunnam in #3248
  • Fix 10 bugs in BF16 XQA MLA kernel for SM120/SM121 by @blake-snc in #2689
  • Tweak grouped_mm api to make backend specific argument keyword-only by @yanqinz2 in #3253
  • perf(moe): optimize SM120 b12x MoE short decode by @lukealonso in #3193
  • feat: Enable FP8 (E4M3/E5M2) in concat_mla_k to optimize long-context prefill performance and refactor type dispatch for BF16/FP16 by @qiching in #3129
  • fix hang in allreduce comms in SGL by @b8zhong in #3247
  • fix(sm12x): fix micro-kernel workspace sizing when routed_rows > num_local_experts by @meena-at-work in #3191
  • Issue #3047: Handle empty KV in MLA chunked-prefill by @saltyminty in #3251
  • Bump CUTLASS DSL to 4.5 by @kahyunnam in #3246

New Contributors

Full Changelog: v0.6.10rc1...v0.6.11
