What's Changed
- chore: MoE benchmark effective BW fix for trtllm_block_scale_moe by @rosenrodt in #2341
- Update Docker CI tags to 20260114-cc1a362 by @flashinfer-bot in #2351
- [perf] Improve gemm_fp8_nt_groupwise (cutlass backend) by 10-40% for batch sizes <= 32 by @aidando73 in #2327
- feat: Add auto-fixing pre-commit to Claude Code workflows by @yzh119 in #2331
- Add minimal support for GLM routing by @b8zhong in #2313
- fix: Handle zeros in Mistral Large 3 MoE inference by @dbari in #2238
- benchmarks: Add norm and quantization routines to microbenchmark harness by @bkryu in #2362
- [CI] Add support for testing dependency commits before release by @yongwww in #2353
- feat: introduce GitHub Actions workflow for PR testing by @yongwww in #2326
- chore: Add TRTLLM MoE A2A benchmark by @rosenrodt in #2354
- Add the cuDNN backend Ragged KV Cache wrapper by @Anerudhan in #2352
- Enable fp16/bf16/f32 support for selective_state_update (mamba) by @ishovkun in #2366
- ci: increase nightly release build timeout by @yongwww in #2371
- chore: fix claude git actions by @yzh119 in #2384
- chore: add script to run unittests/benchmarks on Modal GPU runners by @yzh119 in #2377
- bugfix: hotfix of PR 2366 (mamba kernel) by @yzh119 in #2378
- ci: add docker cleanup before running tests by @yongwww in #2386
- chore: Refactor benchmark imports to be lazy-loaded by @bkryu in #2388
- fix: ensure each CTA processes full numHeadsQPerKv for trtllm decode kernel by @dongjiyingdjy in #2380
- ci: add Docker Hub authentication to mitigate pull rate limits by @yongwww in #2393
- A Blackwell-optimized version of selective_state_update (mamba) by @ishovkun in #2387
- fix: In-place Residual Update for add_rmsnorm_fp4quant by @bkryu in #2385
- hotfix: remove uv.lock and add it to .gitignore by @yzh119 in #2399
- feat: [Qwen3-Next] Add Cute DSL GDN decode kernel and tests by @HongliMi in #2370
- Update Mamba selective_state_scan API signature by @shaharmor98 in #2392
- Optimize quantization function for large problem sizes by @Shunkangz in #2343
- feat: Add output_both_sf_layouts option to add_rmsnorm_fp4quant API by @bkryu in #2395
- release: bump version to 0.6.2 by @yzh119 in #2411
New Contributors
- @rosenrodt made their first contribution in #2341
- @aidando73 made their first contribution in #2327
- @HongliMi made their first contribution in #2370
- @shaharmor98 made their first contribution in #2392
- @Shunkangz made their first contribution in #2343
Full Changelog: v0.6.1...v0.6.2