Announcement Highlights
- Model Support
- API
  - Add TorchLlmArgs to the connector API (#7493)
- Benchmark
- Feature
  - Optimize MLA kernels with separate reduction kernels (#7597)
  - Wrap MOE with custom op (#7277)
  - Make the should_use_spec_decode logic a bit smarter (#7112)
  - Use a shell context to install dependencies (#7383)
  - Top-k logprobs for the TRT backend and top-1 logprob for the PyTorch backend (#6097); see the sketch after this list
  - Support chunked prefill for multimodal models (#6843)
  - Optimize MLA chunked prefill and support FP8 MLA chunked prefill (#7477)
  - Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues (#7616)
  - Add DeepSeek R1 W4AFP8 quickstart (#7645)
  - Nanobind: Allow None types for fields in result (#7672)
  - Use arrival time from the LLM API when creating LlmRequest in the PyTorch workflow (#7553)
  - UCX ZMQ IP: support IPv6 (#7530)
  - Refactor: Quantization Transforms with Inheritance (#7227)
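For the logprobs highlight (#6097), the sketch below shows how per-token logprobs might be requested through the LLM API. It is a minimal, unverified example: the integer `logprobs` count on `SamplingParams`, the `logprobs` field on the completion output, and the model path are assumptions rather than confirmed details of this release.

```python
from tensorrt_llm import LLM, SamplingParams

# Minimal sketch: request per-token logprobs alongside generated text.
# Assumes SamplingParams accepts an integer `logprobs` count (top-k on the
# TRT backend, top-1 on the PyTorch backend per #6097). The model path is
# a placeholder.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=16, logprobs=1)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)  # per-token logprob entries, if returned
```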
What's Changed
- [None][chore] Remove closed bugs by @xinhe-nv in #7591
- [https://nvbugs/5454559][fix] handle bias term in fuse_gate_mlp by @Linda-Stadter in #7449
- [None][fix] enable NvFP4/FP8 quantization for Nemotron-H architecture by @tomeras91 in #7589
- [None][feat] Optimize MLA kernels with separate reduction kernels by @PerkzZheng in #7597
- [https://nvbugs/5445466][fix] unwaive DS R1 test cases with bug already fixed by @lancelly in #7429
- [#6798][fix] fix compilation error in ub_allocator in single device build by @WilliamTambellini in #6874
- [https://nvbugs/5434424][fix] A quick fix for the wrong output issue of SM89 blocked scaling batched GEMM when the input tensor is non-contiguous. by @StudyingShao in #7615
- [None][chore] add TorchLlmArgs to the connector api by @richardhuo-nv in #7493
- [TRTLLM-6707][fix] nanobind fix for executor exit call by @Linda-Stadter in #7565
- [None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline by @QiJune in #7629
- [TRTLLM-7408][feat] Wrap MOE with custom op. by @liji-nv in #7277
- [TRTLLM-5059][feat] Enable KV-cache reuse and add E2E tests for llava-next by @chang-l in #7349
- [None][fix] fix post-merge issue raised by #5488 by @nv-guomingz in #7655
- [https://nvbugs/5410687][test] Add deepseek r1-w4afp8 quickstart by @fredricz-20070104 in #7645
- [None][fix] UCX zmq ip support ipv6 by @chuangz0 in #7530
- [None][feat] Make the should_use_spec_decode logic a bit smarter by @zheyuf in #7112
- [#5861][autodeploy] Refactor: Quantization Transforms with Inheritance by @Fridah-nv in #7227
- [#7208][fix] Fix config type of MedusaConfig by @karljang in #7320
- [None][infra] Bump version to 1.1.0rc5 by @yiqingy0 in #7668
- [TRTLLM-7871][infra] Extend test_perf.py to add disagg-serving perf tests. by @bo-nv in #7503
- [https://nvbugs/5494698][fix] skip gemma3 27b on blackwell by @xinhe-nv in #7505
- [https://nvbugs/5477359][fix] Nanobind: Allow none types for fields in result by @Linda-Stadter in #7672
- [None][chore] remove executor config in kv cache creator by @leslie-fang25 in #7526
- [https://nvbugs/5488212][waive] Waive failed tests for L20 by @nvamyt in #7664
- [None][feat] Use a shell context to install dependencies by @v-shobhit in #7383
- [https://nvbugs/5505402][fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues by @DomBrown in #7616
- [None][infra] Waive failed cases on main 0910 by @EmmaQiaoCh in #7676
- [None][infra] Adjust labeling llm prompt for bug issues by @karljang in #7385
- [None][ci] move some test cases from l40s to a30 by @QiJune in #7684
- [None][fix] Fix the incorrect header file import in dataType.h by @Fan-Yunfan in #7133
- [https://nvbugs/5498165][fix] fix permission error for config file lock by @chang-l in #7656
- [https://nvbugs/5513192][fix] Add the missing param for kv_cache_tran… by @nv-guomingz in #7679
- [TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend by @LinPoly in #6097
- [TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name by @ZhanruiSunCh in #6856
- [TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file by @ZhanruiSunCh in #6742
- [None][ci] Some improvements for Slurm CI by @chzblych in #7689
- [None][ci] Test waives for the main branch 09/14 by @chzblych in #7698
- [None][feat] support gpt-oss with fp8 kv cache by @PerkzZheng in #7612
- [TRTLLM-6903][feat] Support chunked prefill for multimodal models by @chang-l in #6843
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7682
- [None][chore] Enable multiple postprocess workers tests for chat completions api by @JunyiXu-nv in #7602
- [TRTLLM-7279][test] add accuracy test for deepseek-r1 with chunked_prefill by @crazydemo in #7365
- [https://nvbugs/5467981][fix] Fix Qwen2.5-VL fails with cuda graph padding by @DylanChen-NV in #7122
- [None][chore] move some cases from post-merge to pre-merge to detect errors in early stage by @HuiGao-NV in #7699
- [TRTLLM-7918][feat] Support kvcache reuse for phi4mm by @Wanli-Jiang in #7563
- [None][test] add test for min_tokens by @ixlmar in #7678
- [TRTLLM-7918][feat] Revert "Support kvcache reuse for phi4mm (#7563)" by @Wanli-Jiang in #7722
- [None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow by @zhengd-nv in #7553
- [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill by @jmydurant in #7477
- [None][ci] Test waives for the main branch 09/15 by @chzblych in #7709
New Contributors
Full Changelog: v1.1.0rc4...v1.1.0rc5