What's Changed
- Loosened trtllm_ragged_attention_deepseek shape assertion by @nvjullin in #3064
- Update moe gemm by @IwakuraRein in #3239
- perf: optimize per-token nvfp4 quantization kernel. by @IwakuraRein in #3237
- build: add sccache-backed jit-cache builds and AOT diagnostics by @dierksen in #3205
- non-override tactic control by @yanqinz2 in #3260
- ci(jit-cache): limit sm110 builds to aarch64 by @dierksen in #3275
- feat(moe): add SM120 W4A16 b12x kernels by @lukealonso in #3271
- Add dynamic tokens-per-page TRTLLM-GEN GQA kernels by @PerkzZheng in #3259
- fix(cute_dsl/moe): unbias autotuner profiling for tile_size enumeration by @leejnau in #3252
- Support Kimi K2.5 H64 CuTe DSL MLA decode by @saltyminty in #3235
- feat: FP8 output support for CUTLASS MLA paged attention by @carlyou in #2779
- fix(jit): propagate -DNDEBUG to host-side cflags by @arpera in #3278
- feat: add SM120 fmha_v2 kernels to AOT pip wheel builds by @blake-snc in #2885
- bench(moe_deepseek): fix moe benchmark (supersedes #2886) by @leejnau in #3292
- fix(gdn_decode): widen pool indices to Int64 to prevent int32 element-offset overflow by @vadiklyutiy in #3230
- [chore] Add guard to blackwell GDN prefill by @jiahanc in #3267
- fix: remove over-strict K%4 assert in get_shuffle_matrix_sf_a_row_indices by @jimmyzho in #3163
- ci: isolate nightly package tests from source tree by @dierksen in #3274
- Fix [Spark unit test CI]: defer torch._dynamo.disable to avoid import-time crash in CI by @kahyunnam in #3290
- bench(moe_deepseek): scope autotune(True) to pre-warm only by @leejnau in #3301
- Improved simple mamba SSU kernel by @ishovkun in #2962
- add cuda tile dependency for cuda 13.0 by @nv-yunzheq in #3305
- [Fix] Fix XQA V tile reading from wrong page when nbVItersPerXIter > 1 by @qsang-nv in #3022
- fix: MNNVL Allreduce uses bitwise sentinel checking to avoid subnormal value issue (#3053) by @timlee0212 in #3304
- Fix: remove nvfp4 llama4 blocker by @IwakuraRein in #3313
- [chore] add mamba codeowners list by @jimmyzho in #3318
- Modify release deletion command in workflow by @aleozlx in #3307
- Add to code owners by @dhiraj113 in #3326
- feat: Add CuTe DSL grouped-gemm + combine fusion support by @nvcastet in #2944
- fix(gdn): allow importing gdn_decode without a CUDA device by @kahyunnam in #3293
- feat: enable glm5 router gemm by @b8zhong in #3185
- fix(fmha_v2): fix FP8 V-scratch pipeline and varlen scheduler on SM90 by @jimmyzho in #3276
- fix typo llama routing issue in trtllm-gen moe by @IwakuraRein in #3303
- feat(logging,trace): cuda-graph-compatible level-5/10 logging + fi_trace template additions/fixes by @yyihuang in #3172
- Use cudnn 9.23 new API to query workspace with override shape by @yanqinz2 in #3291
- feat: Expose unpacked topk weights for routed moe (fp4) by @aleozlx in #2425
- Reland support lse in trtllm paged attn kernels by @murphymatt in #3116
- fix(CI unit tests, cute_dsl, spark): set USER env var before torch._dynamo import for unmapped UIDs by @kahyunnam in #3314
- feat(trace): embed runnable init() in every TraceTemplate by @yyihuang in #3221
- feat(cute_dsl/moe): deterministic balanced autotune profile inputs by @leejnau in #3286
- feat(cute_dsl/moe): add moe_output_memset_inplace dense memset wrapper by @leejnau in #3328
- Fix/3170 dense blockscaled sm12x by @leonardHONG in #3180
- test: enable bmm_mxfp8 cutlass backend coverage on SM12x by @leonardHONG in #3183
- Ep api design - Build Infra dependencies by @Anerudhan in #3315
- [feat] Add gemma RMS AR fusion by @jiahanc in #3322
- checkpointing_ssu kernel: fused replay + conditional state-write for Mamba2 by @ishovkun in #3324
- Ameyn/gdn bf16 dispatcher and 4d pool by @ameynaik-hub in #3268
- Update trtllm FMHA cubins by @djmmoss in #3317
- fix(trace): repair TGV and XQA MLA reference tests by @yyihuang in #3365
- feat: Add 8x4 swizzle layout support to MXFP4 and MXFP8 CuTe-DSL kernels by @bkryu in #3357
- Add AGENTS.md shim by @aleozlx in #3342
- Add list_api script by @aleozlx in #3341
- Support 4over6 nvfp4 for quantizer and fused MoE by @zianglih in #3264
- Add DeepSeek V4 sparse MLA TRTLLM-GEN kernels by @PerkzZheng in #3269
- Reject EP configurations in b12x MoE with a clear error by @kahyunnam in #3302
- fix(cute_dsl): avoid MoE wrapper runner reference cycle by @leejnau in #3340
- feat: Add support for LoRa delta in MOE mxint4 x bf16, MXFP8 & BF16 to trtllm backend by @djns99 in #3153
- Restore monolithic CuTe-DSL MLA decode alongside modular, gated by cute_dsl_impl= by @pgera in #3296
- feat: RMSNorm + RoPE fusion for WAN: flashinfer.diffusion_ops.fused_qk_rmsnorm_rope by @kahyunnam in #3148
- fix deprecation warnings from cute-dsl by @b8zhong in #3333
- feat(cute_dsl/moe): re-enable use_cold_l2_cache in CuteDslMoEWrapper TuningConfig by @leejnau in #3384
- Add torch.compile-compatible custom op for fp4_quantize by @Kh4L in #3081
- Replace SM120 W4A16 MoE kernels by @lukealonso in #3336
- bump version to 0.6.12 by @aleozlx in #3388
New Contributors
- @carlyou made their first contribution in #2779
- @nvcastet made their first contribution in #2944
- @Kh4L made their first contribution in #3081
Full Changelog: v0.6.11rc1...v0.6.12rc1