NVIDIA/TensorRT-LLM v1.3.0rc5

Pre-release · 10 hours ago

Highlights

  • Model Support

    • Add support for Qwen3.5 with AutoDeploy (#11394)
    • Read mamba_ssm_cache_dtype from HF config when set to auto (#11582)
    • Add NVFP4 dynamic quantization support for visual_gen models (#11563)
  • API

    • Use new index API; add block scale support; fix max sequence length estimation; add flash MLA support (#11334)
    • Add dynamic LLMAPI defaults system (#11035)
    • Use smg-grpc-proto package for gRPC proto definitions (#11578)
    • Move SaveHiddenStates spec-dec mode to one model (#11241)
  • Feature

    • Add cache transfer setup for Mamba states (#10934)
    • Optimize MoE export by tracing with reduced experts and expanding graph (#11504)
    • Add new Helix kernels for MNNVL-based codepath (#11433)
    • Add line_profiler tool for host overhead analysis (#11232)
    • Enable multi-stream MoE; add multi-stream MLA attention (#11520)
    • Add MoE all-to-all paradigm (#10985)
    • Add support for multiple instances in the Triton backend with the PyTorch backend (#11153)
    • Add KV cache metrics to MetricsCollector for more Prometheus metrics (#11243)
    • Account for reusable KV cache blocks in capacity calculation (#11490)
    • Add CUDA graphs, torch compile, NVTX, and warmup for Visual Gen (#11554)
    • Make preprocessing async (#11459)
    • Split up TorchSampler.Store (#11566)
  • Fix

    • Fix multimodal placeholder counts (#11461)
    • Add cacheSaltID property to BlockKey serialization (#11457)
    • Fix cache transceiver (#11409)
    • Declare the variable in the correct scope (#11066)
    • Fix spec-dec mode flag and related C++ requirements (#10996)
    • Fix Qwen3-VL-Dense/MoE accuracy drop (#11134)
    • Complete the workaround for popen in the QA environment (#11214)
    • Improve error message for mismatched MPI world size (#11294)
    • Use the torch_dtype set by ModelOpt (#11525)
    • Fix silent MPI failures on models with custom tokenizers (#11399)
    • Fix Nemotron issues (#11425)
    • Fix pipeline parallelism + disaggregated serving (#11509)
    • Fix broken LLMAPI config (#11571)
    • Fix illegal memory access with Helix CP=64 (#11593)
    • Validate requests outside sampling loop (#11584)
    • Correct chunked prefill handling in TorchSampler (#11544)
    • Fix SpecDec sampling seed (#11081)
    • Prevent NIXL agent name collision in containerized disaggregated serving (#11552)
  • Documentation

    • Add doc for TRTLLM AIGV initial release (#11489)
    • Update hardware support (#10719)
    • Add documentation on configuring CPU affinity in TRT-LLM (#10678)
    • Add warning about 2-model MTP deprecation (#11043)
    • Update media file paths in Skip Softmax blog (#11540)
    • Update TAVA architecture diagrams for visual gen flow and auto deploy flow (#11523)
    • Add Qwen3.5 and GLM 4.7 Flash to support matrix (#11594)
  • Benchmark

    • Add ctx-only and gen-only disaggregated perf tests (#11361)
  • Test & Infra

    • Add CUTEDSL MoE backend for DeepSeek R1 NVFP4 checkpoint in stress test (#10920)
    • Update MIG tests (#11014)
    • Fix Slurm job name (#11265)
    • Ensure TorchSampler does not sync (#11508)
    • Revert MoE unit tests refactor: add unified ConfigurableMoE test framework (#11532)
    • Re-upgrade GHA for blossom-ci workflow (#11483)
    • Stop using remotes in the Conan install build step (#11516)
    • Update PLC pipeline (#11547, #11597)
    • Fix testdb file for l0_b200_multi_gpus_perf_sanity (#11603)
    • Add visual_gen CODEOWNERS paths (#11606)
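
Among the features above, #11490 accounts for reusable KV cache blocks in the capacity calculation. As an illustration of the general idea only, here is a minimal sketch; the function name, parameters, and formula are hypothetical and do not reflect TensorRT-LLM's actual implementation:

```python
# Hypothetical sketch: crediting reusable KV-cache blocks when estimating
# how many requests fit in the cache. Illustrative only, not the actual
# TensorRT-LLM logic.

def estimate_capacity(total_blocks: int,
                      blocks_per_request: int,
                      reusable_blocks: int) -> int:
    """Estimate how many requests fit in a paged KV cache.

    Blocks that can be reused (e.g. a shared prompt prefix already in the
    cache) do not need to be allocated again, so they are credited back to
    the effective pool before dividing by the per-request requirement.
    """
    if blocks_per_request <= 0:
        raise ValueError("blocks_per_request must be positive")
    # Cap the credit at the pool size so the estimate stays bounded.
    effective_pool = total_blocks + min(reusable_blocks, total_blocks)
    return effective_pool // blocks_per_request

# Example: a 1024-block pool with 64 blocks needed per request.
# Ignoring reuse gives 1024 // 64 = 16 requests; crediting 256 reusable
# blocks raises the estimate to (1024 + 256) // 64 = 20.
print(estimate_capacity(1024, 64, 0))    # → 16
print(estimate_capacity(1024, 64, 256))  # → 20
```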

Full Changelog: v1.3.0rc4...v1.3.0rc5
