What's Changed
- Run high-likelihood OOM culprits separately, record memory usage and test duration for analysis by @dierksen in #2961
- fix(autotuner): differentiate file cache entries by runner specific kernel parameters by @qiching in #3367
- feat: integrate cute-dsl Blackwell GQA decode into BatchDecodeWithPagedKVCacheWrapper by @richardmcai in #3360
- Fix returning reference to temporary in moe gemm by @benbarsdell in #3332
- Enable compression of GPU device binaries by @benbarsdell in #2949
- feat: MNNVL Allreduce quant fusion and performance optimization by @timlee0212 in #3385
- fix(norm): widen address arithmetic to int64 for large contiguous inputs > 2**31 elements by @bkryu in #3392
- MLA Decode Autotuning Across TRTLLM-Gen and CuTe Backends by @Vinnie6167 in #3355
- profiler: group perfetto traces by SM, one row per block by @Edenzzzz in #3038
- Make cute dsl mxfp8/nvfp4 quantizer bitwise exact by @zianglih in #3387
- Extend autotuner delay kernel length by @yanqinz2 in #3373
- Enable smaller tile N for SM100 Cute-DSL NVFP4 GEMM by @b8zhong in #3403
- test: align test_fmha_v2_prefill SM gating with is_sm12x_supported by @leonardHONG in #3182
- feat(autotuner): enable per-op autotune bypass for faster framework warmup by @qiching in #3396
- bench: Unify PDL behavior, add missing norm routines, and misc improvements by @bkryu in #3435
- feat(kda): add recurrent KDA decode kernel with per-K gating by @djmmoss in #2572
- Add mHC post mapping and pre big-fuse kernels by @jmydurant in #3285
- [Feat] Add num_heads < 128 support for mla decode kernel by @Observer007 in #3309
- Fix cross-warp race in checkpointing SSU kernel (mamba) by @ishovkun in #3439
- feat(trace): add check callbacks to trace templates by @yyihuang in #3330
- Add BGMV MoE CUDA kernels for multi-LoRA by @taehokim20 in #3249
- replace deprecated APIs: cute.make_fragment and cute.core.ThrMma by @brandon-yujie-sun in #3430
- [ci feat] Support /bot run tests/ to scope CI test runs by @kahyunnam in #3422
- perf(attention): Speed up FP8 KV-cache prefill (FA2 BatchPrefill) by repacking K/V to BF16 in shared memory by @bkryu in #3485
- Ep api design -- Adding the actual code and tests by @Anerudhan in #3453
- Optimize mxfp8 quantization on sm100 by @IwakuraRein in #3289
- Fix silent no-op autotuning for cuBLAS bmm_fp8 and cuDNN bmm_fp8/mm_fp4 by @bkryu in #3437
- fix(quantization): nvfp4_quantize(backend='cuda') silently corrupts scale factors when global_scale is not float32 by @bkryu in #3497
- feat(moe): add SWIGLUSTEP activation to CUTLASS fused MoE by @bkryu in #3492
- Support LSE buffers in TRTLLM API by @saltyminty in #3410
- NFC: replace deprecated API: cute.make_fragment by @brandon-yujie-sun in #3473
- feat(moe): write routing_replay_out from custom routing kernels by @jdebache in #3382
- Add CuTe DSL NVFP4 quantization with 4over6 FP16 scoring by @zianglih in #3448
- fix intermittent exit 141 (SIGPIPE) in test resource summary by @yongwww in #3498
- add_cudnn_mxfp8 by @yanqinz2 in #3489
- fix: use routedScalingFactor to initialize mRouteScale by @yweng0828 in #3499
- fix: Simple approach to restore support for bias for fp4 block scale types by @djns99 in #3416
- docs(infra): document env vars, refresh SM list, fix stale paths in CLAUDE.md by @kangbintNV in #3440
- docs(quant/sampling/activation): canonicalize quantization.rst, stub fp4_quantization.rst by @kangbintNV in #3447
- docs(gemm): document GEMM + grouped_mm public surface; canonicalize aliases by @kangbintNV in #3442
- Optimize MoE routing top-k reduction and non-power-of-two sorting by @jiahanc in #3476
- [bugfix] TorchDistBackend.bcast uses global rank instead of local rank by @xuanyu-mistral in #3418
- docs(comm): structural RST refactor for MoeAlltoAll/DCP/Mixed comm by @kangbintNV in #3445
- [docs] Backfill missing docstrings and decorators across kernels by @kangbintNV in #3456
- docs(attention): backfill missing/stale Attention/POD/cuDNN/CuteDSL APIs; restore single_prefill_with_kv_cache_return_lse by @kangbintNV in #3441
- docs(moe): close fused_moe / trtllm_*_moe / CuteDSL MoE doc gaps by @kangbintNV in #3443
- docs(comm): NumPy-style docstrings + Deprecated leads for 21 STALE comm APIs (no decorator changes) by @kangbintNV in #3444
- bump version to 0.6.13 by @aleozlx in #3513
New Contributors
- @richardmcai made their first contribution in #3360
- @jmydurant made their first contribution in #3285
- @taehokim20 made their first contribution in #3249
- @brandon-yujie-sun made their first contribution in #3430
- @xuanyu-mistral made their first contribution in #3418
Full Changelog: v0.6.12rc3...v0.6.13rc1