NVIDIA/TensorRT-LLM v1.0.0rc3

Pre-release

Announcement Highlights:

  • Model Support
    • Support Mistral3.1 VLM model (#5529)
    • Add TensorRT-Engine Qwen3 (dense) model support (#5650)
  • Feature
    • Add support for MXFP8xMXFP4 in pytorch (#5411)
    • Log stack trace on error in openai server (#5749)
    • Refactor the topk parallelization part for the routing kernels (#5705)
    • Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests (#5774)
    • Support FP8 row-wise dense GEMM in torch flow (#5615)
    • Move DeepEP from Docker images to wheel building (#5534)
    • Add user-provided speculative decoding support (#5204)
    • Add optional module cache for TRT-LLM Gen Gemm interfaces (#5743)
    • Add streaming scaffolding_llm.generate_async support (#5345)
    • Add a detokenize option to /v1/completions requests (#5382); see the request sketch after this list
    • Support n-gram speculative decoding with disagg (#5732)
    • Return context response immediately when stream_interval > 1 (#5836)
    • Add support for sm121 (#5524)
    • Add LLM speculative decoding example (#5706)
    • Update xgrammar version to 0.1.19 (#5830)
    • Refactor parts of WideEP (#5727)
    • Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner (#5764)
    • Update transformers to 4.53.0 (#5747)
    • Share PyTorch tensor between processes (#5396)
    • Custom masking utils for Gemma3 VLM (#5853)
    • Remove support for llmapi + TRT backend in Triton (#5856)
    • Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE (#5723)
    • Enable KV cache reuse during request generation (#4028)
    • Simplify speculative decoding configs (#5639)
    • Add binding type build argument (pybind, nanobind) (#5802)
    • Add the ability to write a request timeline (#5258)
    • Support DeepEP FP4 post-quant all-to-all dispatch (#5881)
    • Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend (#5771)
    • Move vision parts from processor to model for Gemma3 (#5888)
  • API
    • [BREAKING CHANGE] Rename mixed_sampler to enable_mixed_sampler (#5751)
    • [BREAKING CHANGE] Rename LLM.autotuner_enabled to enable_autotuner (#5876); a migration sketch for both renames follows after this list
  • Bug Fixes
    • Fix test_generate_with_seed CI failure (#5772)
    • Improve fp4_block_scale_moe_runner type check (#5681)
    • Fix prompt adapter TP2 case (#5782)
    • Fix disaggregate serving with attention DP (#4993)
    • Ignore nvshmem_src_*.txz from confidentiality-scan (#5831)
    • Fix a quote error introduced in #5534 (#5816)
    • Fix the accuracy issue when reduce_fusion is enabled for the GEMMA model (#5801)
    • Fix lost requests for disaggregated serving (#5815)
    • Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
    • Fix GEMM+AR fusion on blackwell (#5563)
    • Catch inference failures in trtllm-bench (#5841)
    • Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) (#5813)
    • Skip rope scaling for local layers in Gemma3 VLM (#5857)
    • Fix llama4 multimodal support (#5809)
    • Fix Llama4 Scout FP4 crash issue (#5925)
    • Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
    • Fix moe regression for sm120 (#5823)
    • Fix Qwen2.5VL FP8 support (#5029)
    • Fix the illegal memory access issue in moe gemm on SM120 (#5636)
    • Avoid nesting NCCL group in allgather and reduce scatter OPs (#5866)
    • Fix cases where tileN is not divisible by 16, and support sm89 DeepGEMM BMM (#5531)
    • Fix incremental detokenization (#5825)
    • Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
    • Make the bench serving script compatible with different usages (#5905)
    • Fix mistral unit tests due to transformers upgrade (#5904)
    • Fix the Llama3.1 405B hanging issue (#5698, #5925)
    • Fix Gemma3 unit tests due to transformers upgrade (#5921)
    • Extend triton exit time for test_llava (#5971)
    • Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
    • Remove SpecConfig and fix thread leak issues (#5931)
    • Fast redux detection in trtllm gen routing kernel (#5941)
    • Fix cancel request logic (#5800)
    • Fix errors in wide-ep scripts (#5992)
    • Fix error in post-merge-tests (#5949)
  • Benchmark
  • Performance
    • Optimize TRTLLM Sampler performance for single-beam, single-step decoding (#5550)
  • Infrastructure
    • Fix a syntax issue in the image check (#5775)
    • Speedup fused moe tests (#5726)
    • Set the label community action to only run on upstream TRTLLM (#5806)
    • Update namelist in blossom-ci (#5838)
    • Update nspect version (#5832)
    • Reduce redundant test cases for TRTLLM Gen FP8 MoE (#5845)
    • Parallelize torch unittests (#5714)
    • Use current_image_tags.properties in rename_docker_images.py (#5846)
    • Fix two known NSPECT high vulnerability issues and reduce image size (#5434)
  • Documentation
    • Update the Qwen3 and cuda_graph usage documentation (#5705)
    • Update the cuda_graph_config usage section in the DeepSeek R1 docs (#5796)
    • Add a llama4 Maverick Eagle3 benchmark guide covering max-throughput and low-latency configurations (#5810)
    • Fix the link in the llama4 Maverick example (#5864)
    • Add instructions for running Gemma in disaggregated serving (#5922)
    • Add Qwen3 disaggregated serving performance metrics (#5822)
    • Update the disaggregated serving doc (#5938)
    • Update the link to the diagram (#5953)
  • Known Issues
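
For the two breaking API renames listed above, the sketch below shows what a migration could look like. It is a minimal, hedged example: it assumes both renamed options are accepted as keyword arguments of the PyTorch-backend `LLM` constructor (the release items only state the renames, not the exact call sites), and the model path is a placeholder.

```python
from tensorrt_llm import LLM, SamplingParams

# Minimal migration sketch for the renames in #5751 and #5876.
# Assumption: both options are plain LLM() keyword arguments;
# "/path/to/model" is a placeholder, not a real checkpoint.

# Old spellings, per the release items:
#   llm = LLM(model="/path/to/model", mixed_sampler=True)
#   llm.autotuner_enabled = False

# New spellings (v1.0.0rc3 and later):
llm = LLM(
    model="/path/to/model",
    enable_mixed_sampler=True,   # was: mixed_sampler
    enable_autotuner=False,      # was: LLM.autotuner_enabled
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)
```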
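
The detokenize option added to /v1/completions (#5382) is a per-request switch on the OpenAI-compatible server. Below is a small, hedged request sketch: it assumes a server listening on localhost:8000, that the request field is literally named "detokenize" as in the item title, and that the model name is a placeholder. The exact response contents when detokenization is disabled are not spelled out by the release note.

```python
import requests

# Hedged sketch of the new per-request "detokenize" field (#5382).
# Assumptions: an OpenAI-compatible TensorRT-LLM server is running on
# localhost:8000 and "placeholder-model-name" stands in for the served model.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "placeholder-model-name",
        "prompt": "Hello, TensorRT-LLM",
        "max_tokens": 8,
        "detokenize": False,  # the new option from this release
    },
    timeout=60,
)
print(resp.json())
```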

What's Changed

  • feat: Add support for MXFP8xMXFP4 in pytorch by @djns99 in #5535
  • [Doc] update the document of qwen3 and cuda_graph usage by @byshiue in #5703
  • [Infra] - Fix a syntax issue in the image check by @chzblych in #5775
  • chore: log stack trace on error in openai server by @zhengd-nv in #5749
  • fix: [nvbug/5368507] Fix test_generate_with_seed CI failure. by @bobboli in #5772
  • Refactor the topk parallelization part for the routing kernels by @ChristinaZ in #5567
  • test: [CI] remove closed bugs by @xinhe-nv in #5770
  • [TRTLLM-5530][BREAKING CHANGE] refactor: LLM arglist rename mixed_sampler to enable_mixed_sampler by @Superjomn in #5751
  • fix: Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests by @yizhang-nv in #5774
  • [TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow by @DylanChen-NV in #5615
  • feat: Optimize TRTLLM Sampler perf single beam single step by @dcampora in #5550
  • Refactor: move DeepEP from Docker images to wheel building by @yuantailing in #5534
  • [TRTLLM-6291] feat: Add user-provided speculative decoding support by @Funatiq in #5204
  • [ci] speedup fused moe tests by @omera-nv in #5726
  • [feat] Adds optional module cache for TRT-LLM Gen Gemm interfaces by @davidclark-nv in #5743
  • chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… by @nv-guomingz in #5795
  • feat: add MultimodalParams & putting all multimodal params into it and refactor HyperCLOVAX & Qwen2/2.5-VL by @yechank-nvidia in #5522
  • Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" by @nv-guomingz in #5818
  • [fix] https://nvbugs/5333654 Unwaive to check ci status and improve torch compile multi-gpu coverage by @liji-nv in #5700
  • [fix] improve fp4_block_scale_moe_runner type check by @Alcanderian in #5681
  • feat(scaffolding): add streaming scaffolding_llm.generate_async support by @dc3671 in #5345
  • [None][infra] Set the label community action to only run on upstream TRTLLM by @poweiw in #5806
  • Waive some test_llama_eagle3 unittests by @venkywonka in #5811
  • [NvBug 5362426] fix: Fix prompt adapter TP2 case by @syuoni in #5782
  • chore: bump version to 1.0.0rc3 by @yiqingy0 in #5819
  • doc: update cuda_graph_config usage part in DS R1 docs by @nv-guomingz in #5796
  • fix: Disaggregate serving with attention DP by @VALLIS-NERIA in #4993
  • Fix: ignore nvshmem_src_*.txz from confidentiality-scan by @yuantailing in #5831
  • tests: waive failed cases on main by @xinhe-nv in #5781
  • [Infra] - Waive L0 test by @yiqingy0 in #5837
  • update namelist in blossom-ci by @niukuo in #5838
  • Fix a quote error introduced in #5534 by @yuantailing in #5816
  • [feat]: Detokenize option in /v1/completions request by @Wokzy in #5382
  • [5305318] fix: Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. by @hyukn in #5801
  • [TRTLLM-5847][feat] Support n-gram speculative decoding with disagg by @raayandhar in #5732
  • [TRTLLM-5878] update nspect version by @niukuo in #5832
  • feat: Return context response immediately when stream_interval > 1 by @kaiyux in #5836
  • test: reduce redundant test cases for TRTLLM Gen FP8 MoE by @DomBrown in #5845
  • [nvbug/5308432] fix: triton_backend test_llava timeout by @chang-l in #5814
  • [TRTLLM-5366][feat]Add support for sm121 by @pamelap-nvidia in #5524
  • chore [TRTLLM-6161]: add LLM speculative decoding example by @Superjomn in #5706
  • Fix lost requests for disaggregated serving by @Tabrizian in #5815
  • fix: [5376140] [AutoDeploy] Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test by @Fridah-nv in #5855
  • Fix GEMM+AR fusion on blackwell by @xavier-nvidia in #5563
  • [fix] Catch inference failures in trtllm-bench by @omera-nv in #5841
  • Doc: Add llama4 Maverick eagle3 and max-throughput and low_latency benchmark guide by @jiahanc in #5810
  • test: Validate and add accuracy& perf tests for Ministral-8B-Instruct[-FP8](pytorch only) by @venkywonka in #5654
  • Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) by @jhaotingc in #5813
  • feat: TRTLLM-6224 update xgrammar version to 0.1.19 by @Wanli-Jiang in #5830
  • Doc: fix link in llama4 Maverick example by @jiahanc in #5864
  • fix: Skip rope scaling for local layers in Gemma3 VLM by @brb-nv in #5857
  • fix: [https://nvbugspro.nvidia.com/bug/5375656] Unwaive for bug 5375656. by @bobboli in #5842
  • [AutoDeploy] re-enable waive for flaky AD test by @lucaslie in #5867
  • Remove unnecessary benchmarking results by @qiaoxj07 in #5852
  • chores: merge examples for v1.0 doc by @hchings in #5736
  • [Bugfix] LLama4: fix for llama4 multimodal support by @chang-l in #5809
  • [TRTLLM-6262] Fix Llama4 Scout FP4 crash issue by @chenfeiz0326 in #5834
  • chore: some refactor on WideEP by @dongxuy04 in #5727
  • [TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner by @DomBrown in #5764
  • [ci] parallelize torch unittests by @omera-nv in #5714
  • fix: use current_image_tags.properties in rename_docker_images.py by @ixlmar in #5846
  • [TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H by @tomeras91 in #5371
  • Fix : fix moe regression for sm120 by @peaceh-nv in #5823
  • [NVBUG-5304516/5319741] Qwen2.5VL FP8 support by @DylanChen-NV in #5029
  • Update transformers to 4.53.0 by @Wanli-Jiang in #5747
  • [1/N][TRTLLM-5195][feat] Share PyTorch tensor between processes by @chang-l in #5396
  • feat(models): Mistral3.1 VLM pytorch backend support by @2ez4bz in #5529
  • feat: Custom masking utils for Gemma3 VLM by @brb-nv in #5853
  • [fix] WAR to fix the illegal memory access issue in moe gemm on SM120 by @peaceh-nv in #5636
  • Waive unittest failures introduced by PR#5345 (removal of ScaffoldingOutput class) by @venkywonka in #5886
  • [feat] Add TensorRT-Engine Qwen3 (dense) model support by @gkswns0531 in #5650
  • [TRTLLM-5530] chore: rename LLM.autotuner_enabled to enable_autotuner by @Superjomn in #5876
  • avoid nesting NCCL group in allgather and reduce scatter OPs by @QiJune in #5866
  • chore: remove support for llmapi + TRT backend in Triton by @achartier in #5856
  • [feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE by @rosenrodt in #5723
  • [fix] fix tileN cannot % 16==0 & support sm89 deepgemm bmm by @CarstyYou in #5531
  • [NvBug 5370718, 5371538] fix: Fix incremental detokenization by @syuoni in #5825
  • [None] - Waive L0 tests by @yiqingy0 in #5915
  • [fix] Fix MoE workspace info by storing Torch tensor itself instead of data_ptr by @jinyangyuan-nvidia in #5900
  • fix: Make the bench serving script compatible with different usages by @kaiyux in #5905
  • feat: enable kvcache to be reused during request generation by @narutolhy in #4028
  • infra: [TRTLLM-6054][TRTLLM-5804] Fix two known NSPECT high vulnerability issues and reduce image size by @ZhanruiSunCh in #5434
  • [refactor] Simplification of Speculative decoding configs by @wili-65535 in #5639
  • feat: binding type build argument (pybind, nanobind) by @Linda-Stadter in #5802
  • doc: Add instructions for running gemma in disaggregated serving by @Tabrizian in #5922
  • [fix] Fix mistral unit tests due to transformers upgrade by @2ez4bz in #5904
  • [nvbugs/5321981] Cherrypick fix: Fix the Llama3.1 405B hanging issue. (#5698) by @nvzhihanj in #5925
  • [enhance] Add the ability to write a request timeline. by @FrankD412 in #5258
  • deepEP fp4 post quant all2all dispatch by @yilin-void in #5881
  • test: Fix Gemma3 unit tests due to transformers upgrade by @brb-nv in #5921
  • [TRTLLM-4770][feat] Enhance cpp executor cmake to listen to ENABLE_MU… by @WilliamTambellini in #5104
  • blog: add qwen3 disagg perf metrics by @Shixiaowei02 in #5822
  • Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend by @ChristinaZ in #5771
  • [TRTLLM-5673] Doc: ensure the disagg doc is up to date by @Shixiaowei02 in #5938
  • doc: update the link of the diagram by @Shixiaowei02 in #5953
  • tests: update sanity tests & fix tests by @xinhe-nv in #5906
  • [refactor] Move vision parts from processor to model for Gemma3 by @2ez4bz in #5888
  • [TRTLLM-6264] Fix flaky test_e2e.py::test_openai_lora by @thorjohnsen in #5885
  • Added code owners for LLM API by @juney-nvidia in #5960
  • [nvbug/5308432] fix: extend triton exit time for test_llava by @chang-l in #5971
  • [NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) by @syuoni in #5902
  • [fix] Remove SpecConfig and fix thread leak issues by @mikeiovine in #5931
  • [BUG5374319][fix] WAR for draft-target-model unit tests error by @wili-65535 in #5958
  • fix: fast redux detection in trtllm gen routing kernel by @tongyuantongyu in #5941
  • fix cancel request logic by @QiJune in #5800
  • Fix errors in wide-ep scripts by @qiaoxj07 in #5992
  • [BUG5388075][fix] Fix error in post-merge-tests by @wili-65535 in #5949

New Contributors

Full Changelog: v1.0.0rc2...v1.0.0rc3
