What's Changed
- ci: add permission control for public ci tests by @yongwww in #2397
- Remove cudaMalloc/Free in GDN prefill kernel by @KevinZeng08 in #2415
- Update cudnn prefill to use correct sequence strides by @vedaanta in #2414
- perf: mm_fp4 heuristic prioritizes CUTLASS over cuDNN on SM103 by @bkryu in #2404
- test: add coverage for all cli commands by @sricketts in #1848
- feat: BF16 GEMM using cuDNN backend by @raayandhar in #2376
- refactor: simplify fp4 rmsnorm by @yzh119 in #2421
- feat: update trtllm-gen MoE cubins by @nekorobov in #2416
- chore/feat: A2A + MoE benchmark; add routed counterpart for trtllm_gen_fp8_fused_moe by @rosenrodt in #2379
- [CI] Add on-demand rerun for spot-terminated jobs by @yongwww in #2403
- fix: Fix NaN output in mxfp8_quantize for very small input values by @bkryu in #2441
- feat: Support Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron by @amitz-nv in #2304
- infra: add manual code owner override support in codeowner_analyzer.py by @sricketts in #2418
- fix: improve numerical stability of Gumbel sampling by @ixlmar in #2438
- ci: CI build workflow should always pull fresh and not cache by @bkryu in #2454
- Update Docker CI tags to 20260131-a52eff1 by @flashinfer-bot in #2457
- Revert "feat: Support Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron" by @nv-yunzheq in #2451
- Skip trtllm_alltoall tests on Thor by @dierksen in #2448
- Fix argument type error in _cudnn_gemm_fp4_requirement by @Kangyan-Zhou in #2450
- fix: set_log_level now properly sets logger level to enable DEBUG logs by @kahyunnam in #2449
- bugfix: fix stub generation directory in fused_moe module by @yzh119 in #2445
- [Perf][Feature] Add SM103-specific schedulers for NVFP4 CUTLASS kernels by @LopezCastroRoberto in #2303
- ci: set LD_LIBRARY_PATH in Docker images for correct cuBLAS detection by @bkryu in #2468
- add sgl_kernel.fast_topk_v2 to top_k benchmark by @huangzhilin-hzl in #2461
- Update Docker CI tags to 20260203-9b5901e by @flashinfer-bot in #2475
- MTP for mamba by @ishovkun in #2444
- Add sm90 guard to fence ptx by @jhalabi-nv in #2439
- perf: improve gdn decode cute-dsl kernels by @yzh119 in #2405
- ci: migrate release workflows to ci-infra runners by @yongwww in #2467
- fix: blockscale moe routine supports non-DS routing by @hypdeb in #2476
- Fix autotuner OOM by @zack041 in #2442
- refactor: reduce Hopper's gdn prefill compilation time and fix docstring by @yzh119 in #2422
- fix: Fix memory bandwidth calculation in MLA benchmarks by @bkryu in #2479
- fix: Rename tests/mamba/test_utils.py to tests/mamba/utils.py to fix CI test discovery by @bkryu in #2481
- Add/update multi node/multi GPU test scripts by @dierksen in #2410
- feat: Support Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron, fixed by @amitz-nv in #2462
- ci: fix permission errors in release workflow on ci-infra runner by @yongwww in #2488
- benchmarks: Expand microbenchmark harness to include sampling and RoPE APIs by @bkryu in #2484
- fix: add support check for gemm config for cutlass moe by @nv-yunzheq in #2495
- Allow non-DeepSeekV3 routing with one group by @dbari in #2502
- bump version to 0.6.3 by @aleozlx in #2497
New Contributors
- @KevinZeng08 made their first contribution in #2415
- @vedaanta made their first contribution in #2414
- @ixlmar made their first contribution in #2438
- @Kangyan-Zhou made their first contribution in #2450
- @LopezCastroRoberto made their first contribution in #2303
- @huangzhilin-hzl made their first contribution in #2461
- @zack041 made their first contribution in #2442
Full Changelog: v0.6.2...v0.6.3