What's Changed
- Reduce the JIT compilation time of gen_gemm_sm100_module by @jinyangyuan-nvidia in #1251
- fix: correctly pass k_scale and v_scale to run() in forward_return_lse (#1023) by @vlev02 in #1254
- Made AR output optional + esthetic changes by @nvmbreughe in #1265
- init add gemm fp8 using cudnn backend by @ttyio in #1264
- Feature/sm100 low latency nvfp4 kernels by @azhurkevich in #1214
- CI: install nvidia-nvshmem-cu12 by @EmilienM in #1262
- feat: enable trtllm-gen mla MTP by @yyihuang in #1258
- Add trtllm-gen attention mha kernel with FP8 Q/K/V and FP8 output by @weireweire in #1242
- add trtllm-gen context attention by @IwakuraRein in #1239
- feat: add masked deepgemm support and benchmarking by @cyx-6 in #1266
- Add missing import in comm/__init__.py by @joker-eph in #1275
- hotfix: fix deepgemm artifactory hash by @cyx-6 in #1278
- Unify groupwise fp8 GEMM test by @cyx-6 in #1281
- fix: update trtllm-gen fmha benchmark by @yyihuang in #1280
- fix multiCtasKvScratchPtr misalignment issue (new one) by @nvpohanh in #1286
- Fix install folder regression, and JIT-vs-AOT differences by @directhex in #1279
- Add shuffle matrix flag by @aleozlx in #1272
- Convert scale_factor from scalar to Tensor in trt_allreduce_fusion by @ilmarkov in #1284
- patch error handling by @aleozlx in #1293
- Bug fix: guard fp8 e8m0 and e2m1 compile by @Edenzzzz in #1287
- refactor: Improved metainfo for trtllm-gen fmha by @cyx-6 in #1292
- add mm_fp4 use cudnn backend by @ttyio in #1288
- fix: minor errors in cubin loader by @yyihuang in #1295
- perfix: use lightweight API to query device property by @azhurkevich in #1298
- refactor: refactor trtllm-gen attention kernel integration code by @yzh119 in #1289
- Remove FAST_BUILD FLAG for MOE by @wenscarl in #1291
- bugfix: ensure graph is captured and executed on the same stream to avoid rep… by @elfiegg in #1303
- minor: some fix and cleanup for trtllm-gen mha by @yyihuang in #1302
- [Feature] SM level profiler by @Edenzzzz in #1305
- Heuristics + testing unification + CUDA Graphs by @azhurkevich in #1306
- Update cutlass fp4 moe kernels by @wenscarl in #1294
New Contributors
- @vlev02 made their first contribution in #1254
- @ttyio made their first contribution in #1264
- @azhurkevich made their first contribution in #1214
- @weireweire made their first contribution in #1242
- @IwakuraRein made their first contribution in #1239
- @nvpohanh made their first contribution in #1286
- @directhex made their first contribution in #1279
- @ilmarkov made their first contribution in #1284
- @elfiegg made their first contribution in #1303
Full Changelog: v0.2.8...v0.2.9rc1