What's Changed
- Try a one-character fix for the main branch by @aleozlx in #3213
- Add git submodule update to build_backend.py by @kahyunnam in #3190
- fix(cute_dsl/moe): correct tile_size=256 gemm2 tactic enumeration by @leejnau in #3171
- Fix trace-bmm-fp8 test: B should be K-major for subword types by @xrq-phys in #3184
- feat: Add DiT-oriented kernels where the QK (BMM1) type can be reinterpreted as Int8 or BFloat16 by @xrq-phys in #2711
- [fmha-v2] Support HND and NHD paged KV cache layouts with conditional stride handling by @zhou-yuxin in #2799
- [feat] Trtllm-gen per-token NVFP4 MoE by @IwakuraRein in #3027
- feat: Add cuBLASLt backend for `mm_bf16` and enable multi-tactic autotuning for FP8/MXFP8 runners by @vadiklyutiy in #2914
- trtllm non-causal support by @saltyminty in #3020
- feat: DiT layer norm fusions for WAN: flashinfer.diffusion_ops by @kahyunnam in #3157
- Refactor Part 3: Add block-per-token feature in the customized routing method by @ChristinaZ in #3166
- fix(cute_dsl/moe): correct off-by-one in get_max_num_tiles to match TRT-LLM by @leejnau in #3198
- Fix cuDNN NaN on SM120 by @yanqinz2 in #3192
- bump version to 0.6.10 by @aleozlx in #3179
- fix: align is_sm120f_supported with SM12x family semantics by @leonardHONG in #3175
- fix: add sm_121 to TMEM column fallback map by @leonardHONG in #3173
- Include TinyGEMM into BF16 autotuner by @askliar in #3203
- fix(dcp_alltoall): require MNNVL workspace, drop broken plain-memory path by @davidjpyu in #3210
- Integrate CUTLASS Small Tile N Blockscaled GEMMs/Grouped GEMMs for SM120 and SM121 by @depaulmillz in #3152
- Fix bf16 cudnn override-shape test call signature by @Vinnie6167 in #3215
- Ameyn/wide vec t1 by @ameynaik-hub in #3147
- [Perf] Add FMHAv2 to flashinfer_benchmark.py and eliminate unnecessary H2D by @jimmyzho in #2841
- Fix multiple instances using the same random seed by @guyuankan in #3102
- Add grouped_mm operation directory by @yanqinz2 in #3052
- Support allreduce + norm + per-token group FP8 quant fusion by @wzhao18 in #3059 (the quantization step is sketched below the list)
- [Bugfix] Fix fused MoE autotuning correctness issues by filtering clusterDimZ by @wzhao18 in #3227
- fix: add jitter to cubin download backoff by @pluh-nv in #3169
- cute_dsl/moe: drop redundant Python-side moe_sort buffer init by @leejnau in #3226
- Support sigmoid (sigmoid+topk) routing function by @EdalatiAli in #2869 (see the routing sketch below the list)
- cute-dsl fmha prefill (cubin integration): remove front-padding, add attention_sink and PDL support by @limin2021 in #3181
- fix(mla): widen page index to int64_t to avoid 32-bit overflow by @Tracin in #3136 (the overflow class is illustrated below the list)
- bump version to 0.6.11 by @aleozlx in #3245
- fix(cute_dsl/moe): make autotuner bucket configuration adapt to runtime input by @leejnau in #3216
- Fix: skip git submodule update when submodules are already populated by @kahyunnam in #3248
- Fix 10 bugs in BF16 XQA MLA kernel for SM120/SM121 by @blake-snc in #2689
- Tweak grouped_mm API to make backend-specific arguments keyword-only by @yanqinz2 in #3253
- perf(moe): optimize SM120 b12x MoE short decode by @lukealonso in #3193
- feat: Enable FP8 (E4M3/E5M2) in concat_mla_k to optimize long-context prefill performance, and refactor type dispatch for BF16/FP16 by @qiching in #3129
- fix hang in allreduce comms in SGL by @b8zhong in #3247
- fix(sm12x): fix micro-kernel workspace sizing when routed_rows > num_local_experts by @meena-at-work in #3191
- Issue #3047: Handle empty KV in MLA chunked-prefill by @saltyminty in #3251
- Bump CUTLASS DSL to 4.5 by @kahyunnam in #3246
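A few of the entries above name self-contained techniques that are worth a quick illustration. First, the allreduce + norm + per-token group FP8 quant fusion in #3059: the sketch below isolates the per-token group FP8 (E4M3) quantization step. It is a minimal PyTorch sketch under assumed conventions (a group size of 128, one dequantization scale per group), not the fused kernel itself.

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def per_token_group_quant_fp8(x: torch.Tensor, group_size: int = 128):
    """Quantize each token's row in groups of `group_size`, one scale per
    group, chosen so the group's max magnitude maps onto the E4M3 range.
    Minimal sketch only; the fused kernel in #3059 also performs the
    allreduce and norm steps, which are omitted here."""
    num_tokens, hidden = x.shape
    assert hidden % group_size == 0
    groups = x.view(num_tokens, hidden // group_size, group_size)
    amax = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / E4M3_MAX                        # dequantization scale per group
    q = (groups / scale).to(torch.float8_e4m3fn)   # scaled values fit in E4M3
    return q.view(num_tokens, hidden), scale.squeeze(-1)

x = torch.randn(4, 256)                   # 4 tokens, hidden size 256
q, scales = per_token_group_quant_fp8(x)  # scales: [4, 2], one per group
```

Keeping one scale per small group, rather than one per tensor, bounds the quantization error contributed by outlier activations to their own group.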
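Second, the sigmoid (sigmoid+topk) routing function added in #2869. The sketch below shows the general scheme: each expert is scored independently with a sigmoid rather than a softmax over experts, the top-k experts survive per token, and their weights are renormalized. The renormalization is a common convention assumed here, not something the entry specifies.

```python
import torch

def sigmoid_topk_routing(router_logits: torch.Tensor, top_k: int):
    """Score experts independently with a sigmoid, keep the top-k per
    token, and renormalize the surviving weights to sum to one.
    router_logits: [num_tokens, num_experts]; returns two
    [num_tokens, top_k] tensors (weights, expert indices)."""
    scores = torch.sigmoid(router_logits)          # per-expert, not softmax
    weights, indices = torch.topk(scores, top_k, dim=-1)
    weights = weights / weights.sum(dim=-1, keepdim=True)  # assumed renormalization
    return weights, indices

logits = torch.randn(4, 8)                         # 4 tokens, 8 experts
w, idx = sigmoid_topk_routing(logits, top_k=2)
```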
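Finally, the 32-bit page-index overflow fixed in #3136 is a well-known bug class; the snippet below reproduces it with NumPy's fixed-width integers. The page id and stride values are hypothetical, chosen only to push the product past INT32_MAX.

```python
import numpy as np

page_id = np.int32(100_000)     # hypothetical page id in a large KV cache
page_stride = np.int32(65_536)  # hypothetical elements per KV page

# 100_000 * 65_536 = 6_553_600_000 > 2**31 - 1, so the 32-bit product
# wraps to a negative offset (NumPy also emits a RuntimeWarning here).
bad = page_id * page_stride

# Widening to int64 before multiplying, as the fix does, keeps the
# flat index correct.
good = np.int64(page_id) * np.int64(page_stride)

print(bad)   # negative wrapped value
print(good)  # 6553600000
```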
New Contributors
- @xrq-phys made their first contribution in #3184
- @zhou-yuxin made their first contribution in #2799
- @leonardHONG made their first contribution in #3175
- @guyuankan made their first contribution in #3102
- @pluh-nv made their first contribution in #3169
- @EdalatiAli made their first contribution in #2869
- @Tracin made their first contribution in #3136
- @lukealonso made their first contribution in #3193
Full Changelog: v0.6.10rc1...v0.6.11