What's Changed
- feat: Add backend="b12x" for mm_fp4 on SM120 by @bkryu in #3051
- docs: document MAX_JOBS env var and its interaction with FLASHINFER_N… by @aleozlx in #3060
- PR #2772 might have introduced a device side compilation regression by @aleozlx in #3056
- [feat] Add routing_replay_out support to MoE kernels and Python API by @TomerBN-Nvidia in #3024
- fused_moe: pre-filter SM89 tactics with zero occupancy on SM120 Blackwell (fix review feedback on #2764) by @aniskumar-nv in #3032
- feat: Add b12x CuTe DSL fused MoE for SM120 by @bkryu in #3066
- CuTe DSL FP4 GEMM Heuristic by @Vinnie6167 in #2940
- Support lse in trtllm paged attn kernels by @murphymatt in #3058
- Revert "Support lse in trtllm paged attn kernels" by @aleozlx in #3079
- docs(gdn): document -1 padding index semantics for pool+indices path by @kaixih in #3019
- feat(gdn): separate input and output pool indices by @feldsherov in #2905
- [CICD fix] Adjust CICD MAX_JOBS to fix OOM on H100 tests by @kahyunnam in #3078
- Add qiching as code owner for autotuner files by @sricketts in #3104
- Route the missing parameter for trtllm_fp8_per_tensor_scale_moe_op by @pavanimajety in #3094
- Fix: Extend b12x FP4 GEMM support to SM121 (GB10/DGX Spark) by @meena-at-work in #3113
- Add parallel attention by @xueweilnvidia in #2630
- [feat] Faster topk algorithm by @Aalanli in #3009
- feat: Add b12x_fused_moe / B12xMoEWrapper SM120 APIs with micro kernel and ReLU2 by @bkryu in #3080
- [fmhav2] skip fp8 tests and add warning by @jimmyzho in #3050
- feat: implement configurable tie_break for filtered topk by @zianglih in #3095
- Add custom tuning buckets and rounding direction to autotune() by @vadiklyutiy in #2958
- [CuTe DSL] Fix FP8 MLA persistent perf regression and ProxyKind cu13 wheel breakage by @pgera in #3132
New Contributors
- @TomerBN-Nvidia made their first contribution in #3024
- @aniskumar-nv made their first contribution in #3032
- @Vinnie6167 made their first contribution in #2940
- @meena-at-work made their first contribution in #3113
- @xueweilnvidia made their first contribution in #2630
- @Aalanli made their first contribution in #3009
- @vadiklyutiy made their first contribution in #2958
Full Changelog: v0.6.8rc1...v0.6.9rc1