NVIDIA/TensorRT-LLM v1.3.0rc2

Pre-release

Highlights:

  • Model Support

    • Enable MTP for Nemotron Super (#10754)
    • Make TRTLLM MoE the default for GPTOSS on Blackwell (#11074)
    • Add missing absolute position embeddings in Qwen3-VL vision encoder (#11065)
  • API

    • Change context params and disagg params (#10495)
    • Add KVCacheManagerV2 APIs for Transceiver (#11003)
  • Feature

    • Add Skip Softmax MLA kernels for Blackwell and fix NVFP4 KV accuracy bug (#10813)
    • Fuse AllGather for expert statistics required by EPLB (#10885)
    • Add first-iteration streaming for GPT-OSS in trtllm-serve (#10808)
    • Integrate CuteDSL argmax kernel (#10476)
    • Update Mamba decode kernel to FlashInfer (#10757)
    • Improve effective memory bandwidth with TMA.RED (#10987)
    • Reorganize AutoTuner cache file for distributed tuning (#10956)
    • Support attention DP + Helix CP (#10477)
    • Improve performance of _write_finish_reasons in TorchSampler (#10459)
    • Add gRPC server for high-performance external router integration (#11037)
    • Prepare for future KVCacheV2 MTP support (#11029)
  • Fix

    • Fix CuteDSL MoE unit test (#10983)
    • Fix overlap scheduler pause() timing (#10943)
    • Fix Pydantic deepcopy bug (#11004)
    • Restore IPv6 support in serve.py (#10929)
    • Fix conditional compilation for sm10x cubins (#10839)
    • Add graceful fallbacks for NCCL symmetric mode (#11042)
    • Fix enable_alltoall passed to CutlassFusedMoE (#11016)
    • Fix kvCacheManager isLeaf() assertion failure (#10922)
    • Add null pointer check to parseNpyHeader (#10944)
    • Fix attention DP scheduling sort order to prioritize non-relaxed requests (#11106)
  • Documentation

    • Update Qwen2/3-VL models in supported_models.md (#10797)
  • Benchmark

    • Add performance alignment to layer-wise benchmarks (#11018)
    • Clean up layer-wise benchmarks code (#11092)
    • Add DGX-Spark VLM Gemma3-12B bf16/fp4/fp8 accuracy and perf cases (#11096)
  • Test & Infra

    • Add 250K-token NVFP4 MoE + PDL regression tests (#10911)
    • Add timeout for SeedOSS test (#8683)
    • Add Fake Ops for one-sided AlltoAll (#11002)
    • Refactor setup for RNN cache transceiver (#10957)
    • Change SLURM config access to use resolvePlatform (#11006)
    • Update CI allowList (#11040)
    • Add Mamba and MLA layers to sharding tests (#10364)
    • Remove pybind11 bindings and references (#10550, #11026)
    • Add multi-acc and Lyris GB200 test support (#11024)
    • Package triton-kernels as a dependency (#10471)
    • Fix Qwen3 Eagle test (#11030)
    • Dump thread stacks for hanging tests before timeout (#10708)
    • Remove -ccache from build_wheel.py args (#11064)
    • Fix trtllm-serve guided decoding test (#11101)
    • Remove invalid account for Blossom CI (#11126)
    • Add source code pulse scan to PLC nightly pipeline (#10961)

Full Changelog: v1.3.0rc1...v1.3.0rc2
