NVIDIA/TensorRT-LLM v1.0.0rc3

Pre-release

Announcement Highlights:

  • Model Support
    • Support Mistral3.1 VLM model (#5529)
    • Add TensorRT-Engine Qwen3 (dense) model support (#5650)
  • Feature
    • Add support for MXFP8xMXFP4 in pytorch (#5411)
    • Log stack trace on error in openai server (#5749)
    • Refactor the topk parallelization part for the routing kernels (#5705)
    • Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests (#5774)
    • Support FP8 row-wise dense GEMM in torch flow (#5615)
    • Move DeepEP from Docker images to wheel building (#5534)
    • Add user-provided speculative decoding support (#5204)
    • Add optional module cache for TRT-LLM Gen Gemm interfaces (#5743)
    • Add streaming scaffolding_llm.generate_async support (#5345)
    • Add a detokenize option to /v1/completions requests (#5382); see the request sketch after this list
    • Support n-gram speculative decoding with disagg (#5732)
    • Return context response immediately when stream_interval > 1 (#5836)
    • Add support for sm121 (#5524)
    • Add LLM speculative decoding example (#5706)
    • Update xgrammar version to 0.1.19 (#5830)
    • Refactor parts of WideEP (#5727)
    • Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner (#5764)
    • Update transformers to 4.53.0 (#5747)
    • Share PyTorch tensor between processes (#5396)
    • Custom masking utils for Gemma3 VLM (#5853)
    • Remove support for llmapi + TRT backend in Triton (#5856)
    • Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE (#5723)
    • Enable KV cache reuse during request generation (#4028)
    • Simplify speculative decoding configs (#5639)
    • Add binding type build argument (pybind, nanobind) (#5802)
    • Add the ability to write a request timeline (#5258)
    • Support DeepEP FP4 post-quant all-to-all dispatch (#5881)
    • Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend (#5771)
    • Move vision parts from processor to model for Gemma3 (#5888)
  • API
    • [BREAKING CHANGE] Rename mixed_sampler to enable_mixed_sampler (#5751)
    • [BREAKING CHANGE] Rename LLM.autotuner_enabled to enable_autotuner (#5876); a migration sketch for both renames follows after this list
  • Bug Fixes
    • Fix test_generate_with_seed CI failure (#5772)
    • Improve fp4_block_scale_moe_runner type check (#5681)
    • Fix prompt adapter TP2 case (#5782)
    • Fix disaggregate serving with attention DP (#4993)
    • Ignore nvshmem_src_*.txz from confidentiality-scan (#5831)
    • Fix a quote error introduced in #5534 (#5816)
    • Fix the accuracy issue when reduce_fusion is enabled for the GEMMA model (#5801)
    • Fix lost requests for disaggregated serving (#5815)
    • Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
    • Fix GEMM+AR fusion on blackwell (#5563)
    • Catch inference failures in trtllm-bench (#5841)
    • Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) (#5813)
    • Skip rope scaling for local layers in Gemma3 VLM (#5857)
    • Fix llama4 multimodal support (#5809)
    • Fix Llama4 Scout FP4 crash issue (#5925)
    • Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
    • Fix moe regression for sm120 (#5823)
    • Fix Qwen2.5VL FP8 support (#5029)
    • Fix the illegal memory access issue in moe gemm on SM120 (#5636)
    • Avoid nesting NCCL group in allgather and reduce scatter OPs (#5866)
    • Fix cases where tileN is not divisible by 16, and support sm89 DeepGEMM BMM (#5531)
    • Fix incremental detokenization (#5825)
    • Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
    • Make the bench serving script compatible with different usages (#5905)
    • Fix mistral unit tests due to transformers upgrade (#5904)
    • Fix the Llama3.1 405B hanging issue (#5698, #5925)
    • Fix Gemma3 unit tests due to transformers upgrade (#5921)
    • Extend triton exit time for test_llava (#5971)
    • Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
    • Remove SpecConfig and fix thread leak issues (#5931)
    • Fast redux detection in trtllm gen routing kernel (#5941)
    • Fix cancel request logic (#5800)
    • Fix errors in wide-ep scripts (#5992)
    • Fix error in post-merge-tests (#5949)
  • Benchmark
  • Performance
    • Optimize TRTLLM Sampler performance for single-beam, single-step decoding (#5550)
  • Infrastructure
    • Fix a syntax issue in the image check (#5775)
    • Speedup fused moe tests (#5726)
    • Set the label community action to only run on upstream TRTLLM (#5806)
    • Update namelist in blossom-ci (#5838)
    • Update nspect version (#5832)
    • Reduce redundant test cases for TRTLLM Gen FP8 MoE (#5845)
    • Parallelize torch unittests (#5714)
    • Use current_image_tags.properties in rename_docker_images.py (#5846)
    • Fix two known NSPECT high vulnerability issues and reduce image size (#5434)
  • Documentation
    • Update the Qwen3 and cuda_graph usage documentation (#5705)
    • Update the cuda_graph_config usage section in the DeepSeek R1 docs (#5796)
    • Add a llama4 Maverick Eagle3 benchmark guide covering max-throughput and low-latency configurations (#5810)
    • Fix the link in the llama4 Maverick example (#5864)
    • Add instructions for running Gemma in disaggregated serving (#5922)
    • Add Qwen3 disaggregated serving performance metrics (#5822)
    • Update the disaggregated serving doc (#5938)
    • Update the link to the diagram (#5953)
  • Known Issues
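
For the two breaking API renames listed above, the sketch below shows what a migration could look like. It is a minimal, hedged example: it assumes both renamed options are accepted as keyword arguments of the PyTorch-backend `LLM` constructor (the release items only state the renames, not the exact call sites), and the model path is a placeholder.

```python
from tensorrt_llm import LLM, SamplingParams

# Minimal migration sketch for the renames in #5751 and #5876.
# Assumption: both options are plain LLM() keyword arguments;
# "/path/to/model" is a placeholder, not a real checkpoint.

# Old spellings, per the release items:
#   llm = LLM(model="/path/to/model", mixed_sampler=True)
#   llm.autotuner_enabled = False

# New spellings (v1.0.0rc3 and later):
llm = LLM(
    model="/path/to/model",
    enable_mixed_sampler=True,   # was: mixed_sampler
    enable_autotuner=False,      # was: LLM.autotuner_enabled
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```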
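
The detokenize option added to /v1/completions (#5382) is a per-request switch on the OpenAI-compatible server. Below is a small, hedged request sketch: it assumes a server listening on localhost:8000, that the request field is literally named "detokenize" as in the item title, and that the model name is a placeholder. The exact response contents when detokenization is disabled are not spelled out by the release note.

```python
import requests

# Hedged sketch of the new per-request "detokenize" field (#5382).
# Assumptions: an OpenAI-compatible TensorRT-LLM server is running on
# localhost:8000 and "placeholder-model-name" stands in for the served model.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "placeholder-model-name",
        "prompt": "Hello, TensorRT-LLM",
        "max_tokens": 8,
        "detokenize": False,  # the new option from this release
    },
    timeout=60,
)
print(resp.json())
```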

What's Changed

  • feat: Add support for MXFP8xMXFP4 in pytorch by @djns99 in #5535
  • [Doc] update the document of qwen3 and cuda_graph usage by @byshiue in #5703
  • [Infra] - Fix a syntax issue in the image check by @chzblych in #5775
  • chore: log stack trace on error in openai server by @zhengd-nv in #5749
  • fix: [nvbug/5368507] Fix test_generate_with_seed CI failure. by @bobboli in #5772
  • Refactor the topk parallelization part for the routing kernels by @ChristinaZ in #5567
  • test: [CI] remove closed bugs by @xinhe-nv in #5770
  • [TRTLLM-5530][BREAKING CHANGE] refactor: LLM arglist rename mixed_sampler to enable_mixed_sampler by @Superjomn in #5751
  • fix: Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests by @yizhang-nv in #5774
  • [TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow by @DylanChen-NV in #5615
  • feat: Optimize TRTLLM Sampler perf single beam single step by @dcampora in #5550
  • Refactor: move DeepEP from Docker images to wheel building by @yuantailing in #5534
  • [TRTLLM-6291] feat: Add user-provided speculative decoding support by @Funatiq in #5204
  • [ci] speedup fused moe tests by @omera-nv in #5726
  • [feat] Adds optional module cache for TRT-LLM Gen Gemm interfaces by @davidclark-nv in #5743
  • chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… by @nv-guomingz in #5795
  • feat: add MultimodalParams & putting all multimodal params into it and refactor HyperCLOVAX & Qwen2/2.5-VL by @yechank-nvidia in #5522
  • Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" by @nv-guomingz in #5818
  • [fix] https://nvbugs/5333654 Unwaive to check ci status and improve torch compile multi-gpu coverage by @liji-nv in #5700
  • [fix] improve fp4_block_scale_moe_runner type check by @Alcanderian in #5681
  • feat(scaffolding): add streaming scaffolding_llm.generate_async support by @dc3671 in #5345
  • [None][infra] Set the label community action to only run on upstream TRTLLM by @poweiw in #5806
  • Waive some test_llama_eagle3 unittests by @venkywonka in #5811
  • [NvBug 5362426] fix: Fix prompt adapter TP2 case by @syuoni in #5782
  • chore: bump version to 1.0.0rc3 by @yiqingy0 in #5819
  • doc: update cuda_graph_config usage part in DS R1 docs by @nv-guomingz in #5796
  • fix: Disaggregate serving with attention DP by @VALLIS-NERIA in #4993
  • Fix: ignore nvshmem_src_*.txz from confidentiality-scan by @yuantailing in #5831
  • tests: waive failed cases on main by @xinhe-nv in #5781
  • [Infra] - Waive L0 test by @yiqingy0 in #5837
  • update namelist in blossom-ci by @niukuo in #5838
  • Fix a quote error introduced in #5534 by @yuantailing in #5816
  • [feat]: Detokenize option in /v1/completions request by @Wokzy in #5382
  • [5305318] fix: Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. by @hyukn in #5801
  • [TRTLLM-5847][feat] Support n-gram speculative decoding with disagg by @raayandhar in #5732
  • [TRTLLM-5878] update nspect version by @niukuo in #5832
  • feat: Return context response immediately when stream_interval > 1 by @kaiyux in #5836
  • test: reduce redundant test cases for TRTLLM Gen FP8 MoE by @DomBrown in #5845
  • [nvbug/5308432] fix: triton_backend test_llava timeout by @chang-l in #5814
  • [TRTLLM-5366][feat]Add support for sm121 by @pamelap-nvidia in #5524
  • chore [TRTLLM-6161]: add LLM speculative decoding example by @Superjomn in #5706
  • Fix lost requests for disaggregated serving by @Tabrizian in #5815
  • fix: [5376140] [AutoDeploy] Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test by @Fridah-nv in #5855
  • Fix GEMM+AR fusion on blackwell by @xavier-nvidia in #5563
  • [fix] Catch inference failures in trtllm-bench by @omera-nv in #5841
  • Doc: Add llama4 Maverick eagle3 and max-throughput and low_latency benchmark guide by @jiahanc in #5810
  • test: Validate and add accuracy& perf tests for Ministral-8B-Instruct[-FP8](pytorch only) by @venkywonka in #5654
  • Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) by @jhaotingc in #5813
  • feat: TRTLLM-6224 update xgrammar version to 0.1.19 by @Wanli-Jiang in #5830
  • Doc: fix link in llama4 Maverick example by @jiahanc in #5864
  • fix: Skip rope scaling for local layers in Gemma3 VLM by @brb-nv in #5857
  • fix: [https://nvbugspro.nvidia.com/bug/5375656] Unwaive for bug 5375656. by @bobboli in #5842
  • [AutoDeploy] re-enable waive for flaky AD test by @lucaslie in #5867
  • Remove unnecessary benchmarking results by @qiaoxj07 in #5852
  • chores: merge examples for v1.0 doc by @hchings in #5736
  • [Bugfix] LLama4: fix for llama4 multimodal support by @chang-l in #5809
  • [TRTLLM-6262] Fix Llama4 Scout FP4 crash issue by @chenfeiz0326 in #5834
  • chore: some refactor on WideEP by @dongxuy04 in #5727
  • [TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner by @DomBrown in #5764
  • [ci] parallelize torch unittests by @omera-nv in #5714
  • fix: use current_image_tags.properties in rename_docker_images.py by @ixlmar in #5846
  • [TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H by @tomeras91 in #5371
  • Fix : fix moe regression for sm120 by @peaceh-nv in #5823
  • [NVBUG-5304516/5319741] Qwen2.5VL FP8 support by @DylanChen-NV in #5029
  • Update transformers to 4.53.0 by @Wanli-Jiang in #5747
  • [1/N][TRTLLM-5195][feat] Share PyTorch tensor between processes by @chang-l in #5396
  • feat(models): Mistral3.1 VLM pytorch backend support by @2ez4bz in #5529
  • feat: Custom masking utils for Gemma3 VLM by @brb-nv in #5853
  • [fix] WAR to fix the illegal memory access issue in moe gemm on SM120 by @peaceh-nv in #5636
  • Waive unittest failures introduced by PR#5345 (removal of ScaffoldingOutput class) by @venkywonka in #5886
  • [feat] Add TensorRT-Engine Qwen3 (dense) model support by @gkswns0531 in #5650
  • [TRTLLM-5530] chore: rename LLM.autotuner_enabled to enable_autotuner by @Superjomn in #5876
  • avoid nesting NCCL group in allgather and reduce scatter OPs by @QiJune in #5866
  • chore: remove support for llmapi + TRT backend in Triton by @achartier in #5856
  • [feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE by @rosenrodt in #5723
  • [fix] fix tileN cannot % 16==0 & support sm89 deepgemm bmm by @CarstyYou in #5531
  • [NvBug 5370718, 5371538] fix: Fix incremental detokenization by @syuoni in #5825
  • [None] - Waive L0 tests by @yiqingy0 in #5915
  • [fix] Fix MoE workspace info by storing Torch tensor itself instead of data_ptr by @jinyangyuan-nvidia in #5900
  • fix: Make the bench serving script compatible with different usages by @kaiyux in #5905
  • feat: enable kvcache to be reused during request generation by @narutolhy in #4028
  • infra: [TRTLLM-6054][TRTLLM-5804] Fix two known NSPECT high vulnerability issues and reduce image size by @ZhanruiSunCh in #5434
  • [refactor] Simplification of Speculative decoding configs by @wili-65535 in #5639
  • feat: binding type build argument (pybind, nanobind) by @Linda-Stadter in #5802
  • doc: Add instructions for running gemma in disaggregated serving by @Tabrizian in #5922
  • [fix] Fix mistral unit tests due to transformers upgrade by @2ez4bz in #5904
  • [nvbugs/5321981] Cherrypick fix: Fix the Llama3.1 405B hanging issue. (#5698) by @nvzhihanj in #5925
  • [enhance] Add the ability to write a request timeline. by @FrankD412 in #5258
  • deepEP fp4 post quant all2all dispatch by @yilin-void in #5881
  • test: Fix Gemma3 unit tests due to transformers upgrade by @brb-nv in #5921
  • [TRTLLM-4770][feat] Enhance cpp executor cmake to listen to ENABLE_MU… by @WilliamTambellini in #5104
  • blog: add qwen3 disagg perf metrics by @Shixiaowei02 in #5822
  • Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend by @ChristinaZ in #5771
  • [TRTLLM-5673] Doc: ensure the disagg doc is up to date by @Shixiaowei02 in #5938
  • doc: update the link of the diagram by @Shixiaowei02 in #5953
  • tests: update sanity tests & fix tests by @xinhe-nv in #5906
  • [refactor] Move vision parts from processor to model for Gemma3 by @2ez4bz in #5888
  • [TRTLLM-6264] Fix flaky test_e2e.py::test_openai_lora by @thorjohnsen in #5885
  • Added code owners for LLM API by @juney-nvidia in #5960
  • [nvbug/5308432] fix: extend triton exit time for test_llava by @chang-l in #5971
  • [NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) by @syuoni in #5902
  • [fix] Remove SpecConfig and fix thread leak issues by @mikeiovine in #5931
  • [BUG5374319][fix] WAR for draft-target-model unit tests error by @wili-65535 in #5958
  • fix: fast redux detection in trtllm gen routing kernel by @tongyuantongyu in #5941
  • fix cancel request logic by @QiJune in #5800
  • Fix errors in wide-ep scripts by @qiaoxj07 in #5992
  • [BUG5388075][fix] Fix error in post-merge-tests by @wili-65535 in #5949

New Contributors

Full Changelog: v1.0.0rc2...v1.0.0rc3
