What's Changed
- feat: BF16 GEMM benchmarking support by @raayandhar in #2525
- [bugfix] Correct chunk_end calculation in multi-CTA collaboration when max_len > length by @huangzhilin-hzl in #2489
- test: Skip test_decode_delta_rule.py by @bkryu in #2600
- feat: add issue self-claim workflow for external contributors by @jwu1980 in #2586
- ci: add cleanup step to nightly release self-hosted runner jobs by @yongwww in #2510
- ci: fix H100 cleanup by @yongwww in #2590
- tests: add bias testing to nvfp4 moe by @jimmyzho in #2585
- feat: cute dsl mmfp4 for blackwell by @nv-yunzheq in #2540
- fix: correct #pragma unoll typo to #pragma unroll in vec_dtypes.cuh by @Bias92 in #2611
- fix: get tensors by const ref to not rely on deleted move constructor for TensorView by @hypdeb in #2602
- Mamba SSM: better automatic kernel selection + algorithm selection optionally exposed to the user by @ishovkun in #2591
- chore/feat: Add do_finalize to trtllm-gen fp8/f16 MoE APIs by @IwakuraRein in #2548
- docs: Document setuptools upgrade requirement for editable installs with --no-build-isolation by @bkryu in #2541
- docs: resolve TODO by documenting log2f vs logf performance rationale in sampling by @Bias92 in #2609
- Ameyn/gdn bf16 tolerance parallel reduction by @ameynaik-hub in #2610
- feat: trtllm tinygemm2 in flashinfer as bf16 routergemm by @jimmyzho in #2587
- fix: cute dsl nvfp4 moe routing index error by @nv-yunzheq in #2629
- [bugfix] Fix FilteredTopK overflow correctness by @jiangyinzuo in #2605
- fix: add SM121 support to SM120 version guards by @Yuening-wa in #2631
- benchmark: Enable speculative decode microbenchmarking for paged decode by @bkryu in #2628
- feat: add is_sm12x_supported() helper for SM12x family detection by @blake-snc in #2574
- benchmark: Add MXFP4/MXFP8 quantization mode support to FP4 MoE benchmark by @bkryu in #2635
- fix: duplicate username bug in codeowners_analyzer.py by @sricketts in #2637
- Perf: Optimize GDN decode pretranspose kernel for all batch sizes by @ameynaik-hub in #2588
- support qk_nope_head_dim for 192 check for GLM-5 by @rainj-me in #2607
- fix: trtllm_mxint4_block_scale_moe unit test to index output list by @jimmyzho in #2627
- chore: Update CODEOWNERS by @flashinfer-bot in #2286
- fix: Add fused MOE and GEMM AOT modules for SM121 by @blake-snc in #2654
- refactor: pull trtllm-gen batch-gemm/gemm headers from artifactory; update tma descriptor shape init by @jimmyzho in #2235
- fix: Add tests for the AutoTuner and fix bug in _find_nearest_profile by @danisereb in #2617
- Bf16 routed moe by @IwakuraRein in #2594
- perf: Update trtllm-gen batched GEMM kernels - faster, more NVFP4 tile dims, MXFP8 with relu2 act by @amitz-nv in #2667
- Add code owner for scripts/codeownder_overrides.js by @aleozlx in #2656
- feat: Autotuner support CUDA graph and cold L2 cache by @amitz-nv in #2663
- benchmarks: Add FP8 input / BF16 output in ragged prefill benchmark by @bkryu in #2666
- Fix ImportError in AllReduceFusionWorkspace destructor during Python shutdown by @chaunceyjiang in #2659
- Version bump to 0.6.5 by @aleozlx in #2668
New Contributors
- @jwu1980 made their first contribution in #2586
- @Bias92 made their first contribution in #2611
- @jiangyinzuo made their first contribution in #2605
- @chaunceyjiang made their first contribution in #2659
Full Changelog: v0.6.4...v0.6.5