Announcement Highlights:
- Model Support
- Feature
  - Add support for MXFP8xMXFP4 in pytorch (#5411)
  - Log stack trace on error in openai server (#5749)
  - Refactor the topk parallelization part for the routing kernels (#5705)
  - Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests (#5774)
  - Support FP8 row-wise dense GEMM in torch flow (#5615)
  - Move DeepEP from Docker images to wheel building (#5534)
  - Add user-provided speculative decoding support (#5204)
  - Add optional module cache for TRT-LLM Gen Gemm interfaces (#5743)
  - Add streaming scaffolding_llm.generate_async support (#5345)
  - Detokenize option in /v1/completions request (#5382) (see the request sketch after this list)
  - Support n-gram speculative decoding with disagg (#5732)
  - Return context response immediately when stream_interval > 1 (#5836)
  - Add support for sm121 (#5524)
  - Add LLM speculative decoding example (#5706)
  - Update xgrammar version to 0.1.19 (#5830)
  - Some refactor on WideEP (#5727)
  - Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner (#5764)
  - Update transformers to 4.53.0 (#5747)
  - Share PyTorch tensor between processes (#5396)
  - Custom masking utils for Gemma3 VLM (#5853)
  - Remove support for llmapi + TRT backend in Triton (#5856)
  - Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE (#5723)
  - Enable kvcache to be reused during request generation (#4028)
  - Simplify speculative decoding configs (#5639)
  - Add binding type build argument (pybind, nanobind) (#5802)
  - Add the ability to write a request timeline (#5258)
  - Support deepEP fp4 post quant all2all dispatch (#5881)
  - Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend (#5771)
  - Move vision parts from processor to model for Gemma3 (#5888)
- API
- Bug Fixes
  - Fix test_generate_with_seed CI failure (#5772)
  - Improve fp4_block_scale_moe_runner type check (#5681)
  - Fix prompt adapter TP2 case (#5782)
  - Fix disaggregated serving with attention DP (#4993)
  - Ignore nvshmem_src_*.txz from confidentiality-scan (#5831)
  - Fix a quote error introduced in #5534 (#5816)
  - Fix the accuracy issue when reduce_fusion is enabled for GEMMA model (#5801)
  - Fix lost requests for disaggregated serving (#5815)
  - Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
  - Fix GEMM+AR fusion on Blackwell (#5563)
  - Catch inference failures in trtllm-bench (#5841)
  - Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) (#5813)
  - Skip rope scaling for local layers in Gemma3 VLM (#5857)
  - Fix llama4 multimodal support (#5809)
  - Fix Llama4 Scout FP4 crash issue (#5925)
  - Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
  - Fix moe regression for sm120 (#5823)
  - Fix Qwen2.5VL FP8 support (#5029)
  - Fix the illegal memory access issue in moe gemm on SM120 (#5636)
  - Avoid nesting NCCL group in allgather and reduce scatter OPs (#5866)
  - Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
  - Fix incremental detokenization (#5825)
  - Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
  - Make the bench serving script compatible with different usages (#5905)
  - Fix mistral unit tests due to transformers upgrade (#5904)
  - Fix the Llama3.1 405B hanging issue (#5698) (#5925)
  - Fix Gemma3 unit tests due to transformers upgrade (#5921)
  - Extend triton exit time for test_llava (#5971)
  - Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
  - Remove SpecConfig and fix thread leak issues (#5931)
  - Fast redux detection in trtllm gen routing kernel (#5941)
  - Fix cancel request logic (#5800)
  - Fix errors in wide-ep scripts (#5992)
  - Fix error in post-merge-tests (#5949)
- Benchmark
- Performance
  - Optimize TRTLLM Sampler perf single beam single step (#5550)
- Infrastructure
  - Fix a syntax issue in the image check (#5775)
  - Speed up fused moe tests (#5726)
  - Set the label community action to only run on upstream TRTLLM (#5806)
  - Update namelist in blossom-ci (#5838)
  - Update nspect version (#5832)
  - Reduce redundant test cases for TRTLLM Gen FP8 MoE (#5845)
  - Parallelize torch unittests (#5714)
  - Use current_image_tags.properties in rename_docker_images.py (#5846)
  - Fix two known NSPECT high vulnerability issues and reduce image size (#5434)
- Documentation
  - Update the document of qwen3 and cuda_graph usage (#5705)
  - Update cuda_graph_config usage part in DS R1 docs (#5796)
  - Add llama4 Maverick eagle3 and max-throughput and low_latency benchmark guide (#5810)
  - Fix link in llama4 Maverick example (#5864)
  - Add instructions for running gemma in disaggregated serving (#5922)
  - Add qwen3 disagg perf metrics (#5822)
  - Update the disagg doc (#5938)
  - Update the link of the diagram (#5953)
- Known Issues
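
As a quick, non-authoritative illustration of the new detokenize option on the OpenAI-compatible /v1/completions endpoint (#5382), the sketch below sends a completion request with detokenization turned off. It assumes a server started with `trtllm-serve` listening on the default localhost:8000, and that the request field is named `detokenize` as the PR title suggests; confirm the exact field name and response shape against your build.

```python
# Hypothetical client sketch for the /v1/completions detokenize option.
# The field name `detokenize`, the model id, and the port are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "The capital of France is",
        "max_tokens": 8,
        # Skipping detokenization lets callers that post-process token IDs
        # themselves avoid an extra decode step on the server.
        "detokenize": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0])
```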
What's Changed
- feat: Add support for MXFP8xMXFP4 in pytorch by @djns99 in #5535
- [Doc] update the document of qwen3 and cuda_graph usage by @byshiue in #5703
- [Infra] - Fix a syntax issue in the image check by @chzblych in #5775
- chore: log stack trace on error in openai server by @zhengd-nv in #5749
- fix: [nvbug/5368507] Fix test_generate_with_seed CI failure. by @bobboli in #5772
- Refactor the topk parallelization part for the routing kernels by @ChristinaZ in #5567
- test: [CI] remove closed bugs by @xinhe-nv in #5770
- [TRTLLM-5530][BREAKING CHANGE] refactor: LLM arglist rename mixed_sampler to enable_mixed_sampler by @Superjomn in #5751
- fix: Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests by @yizhang-nv in #5774
- [TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow by @DylanChen-NV in #5615
- feat: Optimize TRTLLM Sampler perf single beam single step by @dcampora in #5550
- Refactor: move DeepEP from Docker images to wheel building by @yuantailing in #5534
- [TRTLLM-6291] feat: Add user-provided speculative decoding support by @Funatiq in #5204
- [ci] speedup fused moe tests by @omera-nv in #5726
- [feat] Adds optional module cache for TRT-LLM Gen Gemm interfaces by @davidclark-nv in #5743
- chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… by @nv-guomingz in #5795
- feat: add MultimodalParams & putting all multimodal params into it and refactor HyperCLOVAX & Qwen2/2.5-VL by @yechank-nvidia in #5522
- Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" by @nv-guomingz in #5818
- [fix] https://nvbugs/5333654 Unwaive to check ci status and improve torch compile multi-gpu coverage by @liji-nv in #5700
- [fix] improve fp4_block_scale_moe_runner type check by @Alcanderian in #5681
- feat(scaffolding): add streaming scaffolding_llm.generate_async support by @dc3671 in #5345
- [None][infra] Set the label community action to only run on upstream TRTLLM by @poweiw in #5806
- Waive some test_llama_eagle3 unittests by @venkywonka in #5811
- [NvBug 5362426] fix: Fix prompt adapter TP2 case by @syuoni in #5782
- chore: bump version to 1.0.0rc3 by @yiqingy0 in #5819
- doc: update cuda_graph_config usage part in DS R1 docs by @nv-guomingz in #5796
- fix: Disaggregate serving with attention DP by @VALLIS-NERIA in #4993
- Fix: ignore nvshmem_src_*.txz from confidentiality-scan by @yuantailing in #5831
- tests: waive failed cases on main by @xinhe-nv in #5781
- [Infra] - Waive L0 test by @yiqingy0 in #5837
- update namelist in blossom-ci by @niukuo in #5838
- Fix a quote error introduced in #5534 by @yuantailing in #5816
- [feat]: Detokenize option in /v1/completions request by @Wokzy in #5382
- [5305318] fix: Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. by @hyukn in #5801
- [TRTLLM-5847][feat] Support n-gram speculative decoding with disagg by @raayandhar in #5732
- [TRTLLM-5878] update nspect version by @niukuo in #5832
- feat: Return context response immediately when stream_interval > 1 by @kaiyux in #5836
- test: reduce redundant test cases for TRTLLM Gen FP8 MoE by @DomBrown in #5845
- [nvbug/5308432] fix: triton_backend test_llava timeout by @chang-l in #5814
- [TRTLLM-5366][feat]Add support for sm121 by @pamelap-nvidia in #5524
- chore [TRTLLM-6161]: add LLM speculative decoding example by @Superjomn in #5706
- Fix lost requests for disaggregated serving by @Tabrizian in #5815
- fix: [5376140] [AutoDeploy] Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test by @Fridah-nv in #5855
- Fix GEMM+AR fusion on blackwell by @xavier-nvidia in #5563
- [fix] Catch inference failures in trtllm-bench by @omera-nv in #5841
- Doc: Add llama4 Maverick eagle3 and max-throughput and low_latency benchmark guide by @jiahanc in #5810
- test: Validate and add accuracy & perf tests for Ministral-8B-Instruct[-FP8] (pytorch only) by @venkywonka in #5654
- Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) by @jhaotingc in #5813
- feat: TRTLLM-6224 update xgrammar version to 0.1.19 by @Wanli-Jiang in #5830
- Doc: fix link in llama4 Maverick example by @jiahanc in #5864
- fix: Skip rope scaling for local layers in Gemma3 VLM by @brb-nv in #5857
- fix: [https://nvbugspro.nvidia.com/bug/5375656] Unwaive for bug 5375656. by @bobboli in #5842
- [AutoDeploy] re-enable waive for flaky AD test by @lucaslie in #5867
- Remove unnecessary benchmarking results by @qiaoxj07 in #5852
- chores: merge examples for v1.0 doc by @hchings in #5736
- [Bugfix] LLama4: fix for llama4 multimodal support by @chang-l in #5809
- [TRTLLM-6262] Fix Llama4 Scout FP4 crash issue by @chenfeiz0326 in #5834
- chore: some refactor on WideEP by @dongxuy04 in #5727
- [TRTLLM-5881] feat: Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner by @DomBrown in #5764
- [ci] parallelize torch unittests by @omera-nv in #5714
- fix: use current_image_tags.properties in rename_docker_images.py by @ixlmar in #5846
- [TRTLLM-5838][fix] fix max batch size and max tokens in kv cache estimations for Nemotron-H by @tomeras91 in #5371
- Fix : fix moe regression for sm120 by @peaceh-nv in #5823
- [NVBUG-5304516/5319741]Qwen2.5VL FP8 support by @DylanChen-NV in #5029
- Update transformers to 4.53.0 by @Wanli-Jiang in #5747
- [1/N][TRTLLM-5195][feat] Share PyTorch tensor between processes by @chang-l in #5396
- feat(models): Mistral3.1 VLM pytorch backend support by @2ez4bz in #5529
- feat: Custom masking utils for Gemma3 VLM by @brb-nv in #5853
- [fix] WAR to fix the illegal memory access issue in moe gemm on SM120 by @peaceh-nv in #5636
- Waive unittest failures introduced by PR#5345 (removal of ScaffoldingOutput class) by @venkywonka in #5886
- [feat] Add TensorRT-Engine Qwen3 (dense) model support by @gkswns0531 in #5650
- [TRTLLM-5530] chore: rename LLM.autotuner_enabled to enable_autotuner by @Superjomn in #5876
- avoid nesting NCCL group in allgather and reduce scatter OPs by @QiJune in #5866
- chore: remove support for llmapi + TRT backend in Triton by @achartier in #5856
- [feat] Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE by @rosenrodt in #5723
- [fix] fix tileN cannot % 16==0 & support sm89 deepgemm bmm by @CarstyYou in #5531
- [NvBug 5370718, 5371538] fix: Fix incremental detokenization by @syuoni in #5825
- [None] - Waive L0 tests by @yiqingy0 in #5915
- [fix] Fix MoE workspace info by storing Torch tensor itself instead of data_ptr by @jinyangyuan-nvidia in #5900
- fix: Make the bench serving script compatible with different usages by @kaiyux in #5905
- feat: enable kvcache to be reused during request generation by @narutolhy in #4028 (see the configuration sketch after this list)
- infra: [TRTLLM-6054][TRTLLM-5804] Fix two known NSPECT high vulnerability issues and reduce image size by @ZhanruiSunCh in #5434
- [refactor] Simplification of Speculative decoding configs by @wili-65535 in #5639
- feat: binding type build argument (pybind, nanobind) by @Linda-Stadter in #5802
- doc: Add instructions for running gemma in disaggregated serving by @Tabrizian in #5922
- [fix] Fix mistral unit tests due to transformers upgrade by @2ez4bz in #5904
- [nvbugs/5321981] Cherrypick fix: Fix the Llama3.1 405B hanging issue. (#5698) by @nvzhihanj in #5925
- [enhance] Add the ability to write a request timeline. by @FrankD412 in #5258
- deepEP fp4 post quant all2all dispatch by @yilin-void in #5881
- test: Fix Gemma3 unit tests due to transformers upgrade by @brb-nv in #5921
- [TRTLLM-4770][feat] Enhance cpp executor cmake to listen to ENABLE_MU… by @WilliamTambellini in #5104
- blog: add qwen3 disagg perf metrics by @Shixiaowei02 in #5822
- Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend by @ChristinaZ in #5771
- [TRTLLM-5673] Doc: ensure the disagg doc is up to date by @Shixiaowei02 in #5938
- doc: update the link of the diagram by @Shixiaowei02 in #5953
- tests: update sanity tests & fix tests by @xinhe-nv in #5906
- [refactor] Move vision parts from processor to model for Gemma3 by @2ez4bz in #5888
- [TRTLLM-6264] Fix flaky test_e2e.py::test_openai_lora by @thorjohnsen in #5885
- Added code owners for LLM API by @juney-nvidia in #5960
- [nvbug/5308432] fix: extend triton exit time for test_llava by @chang-l in #5971
- [NvBug 5378370] fix: Fix alltoall for llama4 (apply_router_weight_on_input=True) by @syuoni in #5902
- [fix] Remove SpecConfig and fix thread leak issues by @mikeiovine in #5931
- [BUG5374319][fix] WAR for draft-target-model unit tests error by @wili-65535 in #5958
- fix: fast redux detection in trtllm gen routing kernel by @tongyuantongyu in #5941
- fix cancel request logic by @QiJune in #5800
- Fix errors in wide-ep scripts by @qiaoxj07 in #5992
- [BUG5388075][fix] Fix error in post-merge-tests by @wili-65535 in #5949
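
Several entries above touch the KV-cache path, notably the free-GPU-memory-fraction adjustment in KvCacheConfig (#5774) and KV-cache reuse during request generation (#4028). The sketch below shows how these knobs are typically set through the LLM API; the class and field names (KvCacheConfig, free_gpu_memory_fraction, enable_block_reuse) follow the public tensorrt_llm.llmapi API as understood for this release candidate, the model path is a placeholder, and this is an illustration rather than the exact code behind these PRs.

```python
# Illustrative sketch only: configure the KV-cache memory fraction and block
# reuse through the LLM API. Names are assumed from the public llmapi surface;
# the model path is a placeholder.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.7,  # leave headroom for activations and CUDA graphs
    enable_block_reuse=True,       # allow cached blocks to be reused across requests
)

llm = LLM(model="path/to/model", kv_cache_config=kv_cache_config)
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=16),
)
print(outputs[0].outputs[0].text)
```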
New Contributors
- @davidclark-nv made their first contribution in #5743
- @Alcanderian made their first contribution in #5681
- @Wokzy made their first contribution in #5382
- @raayandhar made their first contribution in #5732
- @gkswns0531 made their first contribution in #5650
Full Changelog: v1.0.0rc2...v1.0.0rc3