NVIDIA/TensorRT-LLM v1.1.0rc5

Pre-release

Announcement Highlights

  • Model Support
    • Enable NvFP4/FP8 quantization for the Nemotron-H architecture (#7589)
    • Enable KV-cache reuse and add E2E tests for llava-next (#7349); see the usage sketch after this list
    • Support gpt-oss with FP8 KV cache (#7612)
    • Support KV-cache reuse for phi4mm (#7563)
  • API
    • Add TorchLlmArgs to the connector API (#7493)
  • Benchmark
    • Extend test_perf.py to add disagg-serving perf tests (#7503)
    • Add accuracy test for DeepSeek-R1 with chunked_prefill (#7365)
  • Feature
    • Optimize MLA kernels with separate reduction kernels (#7597)
    • Wrap MOE with custom op (#7277)
    • Make the should_use_spec_decode logic a bit smarter (#7112)
    • Use a shell context to install dependencies (#7383)
    • Top-k logprobs for the TRT backend and top-1 logprob for the PyTorch backend (#6097)
    • Support chunked prefill for multimodal models (#6843)
    • Optimize MLA chunked prefill and support FP8 MLA chunked prefill (#7477)
    • Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues (#7616)
    • Add DeepSeek-R1 W4AFP8 quickstart (#7645)
    • Nanobind: allow None types for fields in result (#7672)
    • Use the arrival time from the LLM API when creating LlmRequest in the PyTorch workflow (#7553)
    • Support IPv6 for UCX ZMQ IP addresses (#7530)
    • Refactor quantization transforms to use inheritance (#7227)
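
For orientation on the KV-cache items above, the sketch below shows one way the LLM API can be configured for KV-cache block reuse and an FP8 KV cache. It is illustrative only, not taken from this release: the model path, the KvCacheConfig dtype field, and the commented chunked-prefill flag are assumptions that should be checked against the v1.1.0rc5 LLM API documentation.

```python
# Minimal sketch (assumptions noted below): enable KV-cache block reuse and an
# FP8 KV cache through the TensorRT-LLM LLM API.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,        # reuse cached KV blocks across requests
    dtype="fp8",                    # assumed field name for the FP8 KV cache
    free_gpu_memory_fraction=0.8,
)

llm = LLM(
    model="openai/gpt-oss-20b",     # hypothetical model path, for illustration only
    kv_cache_config=kv_cache_config,
    # enable_chunked_prefill=True,  # assumed flag for the chunked-prefill highlight
)

outputs = llm.generate(
    ["Summarize the benefit of KV-cache block reuse in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```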

What's Changed

  • [None][chore] Remove closed bugs by @xinhe-nv in #7591
  • [https://nvbugs/5454559][fix] handle bias term in fuse_gate_mlp by @Linda-Stadter in #7449
  • [None][fix] enable NvFP4/FP8 quantization for Nemotron-H architecture by @tomeras91 in #7589
  • [None][feat] Optimize MLA kernels with separate reduction kernels by @PerkzZheng in #7597
  • [https://nvbugs/5445466][fix] unwaive DS R1 test cases with bug already fixed by @lancelly in #7429
  • [#6798][fix] fix compilation error in ub_allocator in single device build by @WilliamTambellini in #6874
  • [https://nvbugs/5434424][fix] A quick fix for the wrong output issue of SM89 blocked scaling batched GEMM when the input tensor is non-contiguous. by @StudyingShao in #7615
  • [None][chore] add TorchLlmArgs to the connector api by @richardhuo-nv in #7493
  • [TRTLLM-6707][fix] nanobind fix for executor exit call by @Linda-Stadter in #7565
  • [None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline by @QiJune in #7629
  • [TRTLLM-7408][feat] Wrap MOE with custom op. by @liji-nv in #7277
  • [TRTLLM-5059][feat] Enable KV-cache reuse and add E2E tests for llava-next by @chang-l in #7349
  • [None][fix] fix post-merge issue raised by #5488 by @nv-guomingz in #7655
  • [https://nvbugs/5410687][test] Add deepseek r1-w4afp8 quickstart by @fredricz-20070104 in #7645
  • [None][fix] UCX zmq ip support ipv6 by @chuangz0 in #7530
  • [None][feat] Make the should_use_spec_decode logic a bit smarter by @zheyuf in #7112
  • [#5861][autodeploy] Refactor: Quantization Transforms with Inheritance by @Fridah-nv in #7227
  • [#7208][fix] Fix config type of MedusaConfig by @karljang in #7320
  • [None][infra] Bump version to 1.1.0rc5 by @yiqingy0 in #7668
  • [TRTLLM-7871][infra] Extend test_perf.py to add disagg-serving perf tests. by @bo-nv in #7503
  • [https://nvbugs/5494698][fix] skip gemma3 27b on blackwell by @xinhe-nv in #7505
  • [https://nvbugs/5477359][fix] Nanobind: Allow none types for fields in result by @Linda-Stadter in #7672
  • [None][chore] remove executor config in kv cache creator by @leslie-fang25 in #7526
  • [https://nvbugs/5488212][waive] Waive failed tests for L20 by @nvamyt in #7664
  • [None][feat] Use a shell context to install dependencies by @v-shobhit in #7383
  • [https://nvbugs/5505402][fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues by @DomBrown in #7616
  • [None][infra] Waive failed cases on main 0910 by @EmmaQiaoCh in #7676
  • [None][infra] Adjust labeling llm prompt for bug issues by @karljang in #7385
  • [None][ci] move some test cases from l40s to a30 by @QiJune in #7684
  • [None][fix] Fix the incorrect header file import in dataType.h by @Fan-Yunfan in #7133
  • [https://nvbugs/5498165][fix] fix permission error for config file lock by @chang-l in #7656
  • [https://nvbugs/5513192][fix] Add the missing param for kv_cache_tran… by @nv-guomingz in #7679
  • [TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend by @LinPoly in #6097
  • [TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name by @ZhanruiSunCh in #6856
  • [TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file by @ZhanruiSunCh in #6742
  • [None][ci] Some improvements for Slurm CI by @chzblych in #7689
  • [None][ci] Test waives for the main branch 09/14 by @chzblych in #7698
  • [None][feat] support gpt-oss with fp8 kv cache by @PerkzZheng in #7612
  • [TRTLLM-6903][feat] Support chunked prefill for multimodal models by @chang-l in #6843
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7682
  • [None][chore] Enable multiple postprocess workers tests for chat completions api by @JunyiXu-nv in #7602
  • [TRTLLM-7279][test] add accuracy test for deepseek-r1 with chunked_prefill by @crazydemo in #7365
  • [https://nvbugs/5467981][fix] Fix Qwen2.5-VL fails with cuda graph padding by @DylanChen-NV in #7122
  • [None][chore] move some cases from post-merge to pre-merge to detect errors in early stage by @HuiGao-NV in #7699
  • [TRTLLM-7918][feat] Support kvcache reuse for phi4mm by @Wanli-Jiang in #7563
  • [None][test] add test for min_tokens by @ixlmar in #7678
  • [TRTLLM-7918][feat] Revert "Support kvcache reuse for phi4mm (#7563)" by @Wanli-Jiang in #7722
  • [None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow by @zhengd-nv in #7553
  • [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill by @jmydurant in #7477
  • [None][ci] Test waives for the main branch 09/15 by @chzblych in #7709

Full Changelog: v1.1.0rc4...v1.1.0rc5
