What's Changed
- Fix flag order by @nandor in #1392
- Add flags to trim down AoT builds by @nandor in #1393
- Force upgrade cuDNN to latest by @paul841029 in #1401
- Adding FP8 benchmark on attention and matmul testing by @bkryu in #1390
- feature: enable cublas for fp4 gemm when cudnn == 9.11.1 or >= 9.13 by @ttyio in #1405
- Relax the clear_cuda_cache by @yongwww in #1406
- Update autotune results for the nvfp4 cutlass moe backends for v0.2.9 by @kaixih in #1361
- fix shared memory alignment conflict in sampling.cuh by @842974287 in #1402
- Fix trtllm moe launcher local_num_experts by @wenscarl in #1398
- [bugfix] Fix compilation failure when compiling csrc/trtllm_moe_allreduce_fusion.cu by @nvpohanh in #1410
- install: remove nvidia-cudnn-12 from package dependency by @yzh119 in #1409
- Add mypy to pre-commit by @cyx-6 in #1179
- feat(aot): add nvshmem module for aot compilation by @EmilienM in #1261
- Add ruff to pre-commit by @cyx-6 in #1201
- install: remove nvidia-nvshmem-cu12 from package dependency by @EmilienM in #1426
- Fix redundant kernels in moe by @fzyzcjy in #1428
- ci: add arm64 to release-ci-docker.yml by @yzh119 in #1429
- Fix crash when pos_encoding_mode is passed as int by @kaixih in #1413
- Fix trtllm_ar failure by @nvpohanh in #1423
- Use self hosted runner for arm image build by @yongwww in #1433
- Remove const qualifier to avoid compilation error by @842974287 in #1421
- Add multi-arch Docker image for x86-64 and arm64 by @yongwww in #1431
- Add NOTICE with copyrights by @sricketts in #1432
- Fix FusedMoeRunner does not exist error by @nvpohanh in #1424
- Putting back cudnn_batch_prefill_with_kv_cache that was deleted by ruff by @bkryu in #1438
- Decouple cutlass config version from flashinfer version by @kaixih in #1441
- feat: Fused rope fp8 quantize kernel for MLA by @yzh119 in #1339
- Add disk cleanup for Docker builds by @yongwww in #1442
- ci: Add ARM AOT test by @yongwww in #1418
- bugfix: fix perf issue by using fp8 graph that can use cublaslt by @ttyio in #1435
- Faster weight processing (moe nvfp4) by @aleozlx in #1412
- Add alignment in MxFP8Quantization by @Qiaolin-Yu in #1445
- misc: remove unused dependency by @yzh119 in #1443
- fix: remove redundant zero_init from trtllm-gen attn by @yyihuang in #1444
- benchmark: trtllm-gen mha with sink, add benchmark args by @yyihuang in #1415
- Fixes for Blackwell Tests by @paul841029 in #1434
- Fix missing v_scale for prefill wrapper by @weireweire in #1416
- ci: add github actions to upload sdist to pypi by @yzh119 in #1270
- 3rdparty: upgrade cutlass dependency to v4.1.0 by @yzh119 in #1299
- feature: add cutlass as bmm_fp8 backend by @ttyio in #1397
- release: bump version to v0.2.11 by @yongwww in #1447
- ci: bugfix on sdist pypi workflow by @yzh119 in #1449
New Contributors
- @paul841029 made their first contribution in #1401
- @842974287 made their first contribution in #1402
- @fzyzcjy made their first contribution in #1428
- @sricketts made their first contribution in #1432
- @Qiaolin-Yu made their first contribution in #1445
Full Changelog: v0.2.10...v0.2.11