What's Changed
- Reduce the JIT compilation time of gen_gemm_sm100_module by @jinyangyuan-nvidia in #1251
- fix: correctly pass k_scale and v_scale to run() in forward_return_lse (#1023) by @vlev02 in #1254
- Made AR output optional + esthetic changes by @nvmbreughe in #1265
- init add gemm fp8 using cudnn backend by @ttyio in #1264
- Feature/sm100 low latency nvfp4 kernels by @azhurkevich in #1214
- CI: install nvidia-nvshmem-cu12 by @EmilienM in #1262
- feat: enable trtllm-gen mla MTP by @yyihuang in #1258
- Add trtllm-gen attention mha kernel with FP8 Q/K/V and FP8 output by @weireweire in #1242
- add trtllm-gen context attention by @IwakuraRein in #1239
- feat: add masked deepgemm support and benchmarking by @cyx-6 in #1266
- Add missing import in comm/__init__.py by @joker-eph in #1275
- hotfix: fix deepgemm artifactory hash by @cyx-6 in #1278
- Unify groupwise fp8 GEMM test by @cyx-6 in #1281
- fix: update trtllm-gen fmha benchmark by @yyihuang in #1280
- fix multiCtasKvScratchPtr misalignment issue (new one) by @nvpohanh in #1286
- Fix install folder regression, and JIT-vs-AOT differences by @directhex in #1279
- Add shuffle matrix flag by @aleozlx in #1272
- Convert scale_factor from scalar to Tensor in trt_allreduce_fusion by @ilmarkov in #1284
- patch error handling by @aleozlx in #1293
- Bug fix: guard fp8 e8m0 and e2m1 compile by @Edenzzzz in #1287
- refactor: Improved metainfo for trtllm-gen fmha by @cyx-6 in #1292
- add mm_fp4 use cudnn backend by @ttyio in #1288
- fix: minor errors in cubin loader by @yyihuang in #1295
- perfix: use lightweight API to query device property by @azhurkevich in #1298
- refactor: refactor trtllm-gen attention kernel integration code by @yzh119 in #1289
- Remove FAST_BUILD FLAG for MOE by @wenscarl in #1291
- bugfix: ensure graph is captured and executed on the same stream to avoid rep… by @elfiegg in #1303
- minor: some fix and cleanup for trtllm-gen mha by @yyihuang in #1302
- [Feature] SM level profiler by @Edenzzzz in #1305
- Heuristics + testing unification + CUDA Graphs by @azhurkevich in #1306
- Update cutlass fp4 moe kernels by @wenscarl in #1294
- Fix the bug of the kernel-selection heuristic in trtllm-gen by @PerkzZheng in #1307
- test qkvo quantization not equal to 1. by @weireweire in #1314
- [fix] fix integer overflow in FA2 customized_mask & add buffer overflow warning. by @happierpig in #1290
- Addition of flashinfer_benchmark.py for benchmarking routines by @bkryu in #1323
- minor: update devcontainer by @yyihuang in #1329
- Fix redundant argument in TrtllmGenDecodeModule by @IwakuraRein in #1326
- Optimizations for TRTLLM MNNVL Allreduce by @timlee0212 in #1321
- add torch float4_e2m1fn_x2 check for cudnn fp4 backend by @ttyio in #1333
- only add cudnn dependency for x86 platform by @ttyio in #1332
- Make Fp8 MoE routing_bias optional by @aleozlx in #1319
- feat: Add weight layout option for trtllm-gen fused moe by @aleozlx in #1297
- [Fix] remove torch 2.8 requirement for FP4 GEMM by @elfiegg in #1334
- Bug fix: fix duplicate launch in POD by @Edenzzzz in #1267
- Add blockwise-scaled FP8 GEMM via TRTLLM-Gen. by @sergachev in #1320
- feat: support output nvfp4 in trtllm-gen function call. by @weireweire in #1318
- Fix bench deepgemm setting by @cyx-6 in #1344
- fix: fix trtllm-gen mla error on new interface by @yyihuang in #1348
- [Bugfix] Change max_size for LRU by @elfiegg in #1349
- Support loading autotuned results from json for cutlass fp4 moe backends by @kaixih in #1310
- Refactor scripts in benchmarks to use flashinfer.testing.bench_gpu_time by @bkryu in #1337
- bugfix: Change default index in routingTopKExperts by @amirkl94 in #1347
- Support passing kv_data_type to MultiLevelCascadeAttentionWrapper.plan() by @sarckk in #1350
- Add trtllm-gen prefill test. Fix related wrapper issue. by @weireweire in #1346
- feat: Support logits_soft_cap for Persistent attn; fix kv split limit by @Edenzzzz in #1324
- chore: remove cpp benchmarks, tests, cmake path, as they are deprecated by @hypdeb in #1345
- minor: add trtllm_gen_mla benchmark by @yyihuang in #1316
- cleanup: retire aot-build-utils by @yzh119 in #1354
- minor: more informative error message for buffer overflow by @Edenzzzz in #1357
- gen_trtllm_comm_module: fix device capability detection by @dtrifiro in #1356
- Refactor Fused Moe Module by @wenscarl in #1309
- Add native cudnn_decode for improved cudnn decode performance by @Anerudhan in #1283
- Update CI docker container to use latest cudnn by @yzh119 in #1362
- feature: add fp4 mm using trtllm backend by @ttyio in #1355
- support trtllm-gen prefill fp4 output by @weireweire in #1360
- Allow cudnn prefill kernels to be called natively by @Anerudhan in #1317
- bugfix: fix ci for aot-compile by @yzh119 in #1364
- feat: auto deduce use_oneshot from token_num in all-reduce by @yyihuang in #1365
- add cutlass backend for mm_fp4 by @ttyio in #1296
- Support scale factor start index for fp4 mha prefill/decode by @weireweire in #1363
- test: add cuda graph to comm test by @yyihuang in #1366
- ci: add requests to ci docker container by @yzh119 in #1370
- Artifact downloading and single sourced artifact path by @cyx-6 in #1369
- [fix] remove (view) transpose to keep consistent with majorness MN requirement. by @elfiegg in #1358
- hotfix: update mxfp4 groupwise-scaled gemm unittests by @yzh119 in #1359
- bugfix: fixed cutlass fused moe usage of FP4QuantizationSFLayout::SWIZZLED by @yzh119 in #1371
- ci: add blackwell unittest scripts by @yzh119 in #1372
- Update documentation index by @cyx-6 in #1374
- bugfix: do cudnn related error check only when cudnn backend is enabled. by @ttyio in #1377
- bugfix: Add guard for fp4/fp8 related include headers by @yzh119 in #1376
- refactor: download trtllm gemm metadata from server by @ttyio in #1378
- Fix sphinx error by @cyx-6 in #1380
- release: bump version to v0.2.9 by @yzh119 in #1381
New Contributors
- @vlev02 made their first contribution in #1254
- @ttyio made their first contribution in #1264
- @azhurkevich made their first contribution in #1214
- @weireweire made their first contribution in #1242
- @IwakuraRein made their first contribution in #1239
- @nvpohanh made their first contribution in #1286
- @directhex made their first contribution in #1279
- @ilmarkov made their first contribution in #1284
- @elfiegg made their first contribution in #1303
- @PerkzZheng made their first contribution in #1307
- @bkryu made their first contribution in #1323
- @timlee0212 made their first contribution in #1321
- @sergachev made their first contribution in #1320
- @amirkl94 made their first contribution in #1347
- @sarckk made their first contribution in #1350
- @hypdeb made their first contribution in #1345
- @dtrifiro made their first contribution in #1356
Full Changelog: v0.2.8...v0.2.9