NVIDIA/TensorRT-LLM v1.1.0rc5

Pre-release

Announcement Highlights

  • Model Support
    • Enable NvFP4/FP8 quantization for the Nemotron-H architecture (#7589)
    • Enable KV-cache reuse and add E2E tests for llava-next (#7349); see the usage sketch after this list
    • Support gpt-oss with FP8 KV cache (#7612)
    • Support KV-cache reuse for phi4mm (#7563)
  • API
    • Add TorchLlmArgs to the connector API (#7493)
  • Benchmark
    • Extend test_perf.py to add disagg-serving perf tests (#7503)
    • Add accuracy test for DeepSeek-R1 with chunked_prefill (#7365)
  • Feature
    • Optimize MLA kernels with separate reduction kernels (#7597)
    • Wrap MOE with custom op (#7277)
    • Make the should_use_spec_decode logic a bit smarter (#7112)
    • Use a shell context to install dependencies (#7383)
    • Top-k logprobs for the TRT backend and top-1 logprob for the PyTorch backend (#6097)
    • Support chunked prefill for multimodal models (#6843)
    • Optimize MLA chunked prefill and support FP8 MLA chunked prefill (#7477)
    • Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues (#7616)
    • Add DeepSeek-R1 W4AFP8 quickstart (#7645)
    • Nanobind: allow None types for fields in result (#7672)
    • Use the arrival time from the LLM API when creating LlmRequest in the PyTorch workflow (#7553)
    • Support IPv6 for UCX ZMQ IP addresses (#7530)
    • Refactor quantization transforms to use inheritance (#7227)
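
For orientation on the KV-cache items above, the sketch below shows one way the LLM API can be configured for KV-cache block reuse and an FP8 KV cache. It is illustrative only, not taken from this release: the model path, the KvCacheConfig dtype field, and the commented chunked-prefill flag are assumptions that should be checked against the v1.1.0rc5 LLM API documentation.

```python
# Minimal sketch (assumptions noted below): enable KV-cache block reuse and an
# FP8 KV cache through the TensorRT-LLM LLM API.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,        # reuse cached KV blocks across requests
    dtype="fp8",                    # assumed field name for the FP8 KV cache
    free_gpu_memory_fraction=0.8,
)

llm = LLM(
    model="openai/gpt-oss-20b",     # hypothetical model path, for illustration only
    kv_cache_config=kv_cache_config,
    # enable_chunked_prefill=True,  # assumed flag for the chunked-prefill highlight
)

outputs = llm.generate(
    ["Summarize the benefit of KV-cache block reuse in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```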

What's Changed

  • [None][chore] Remove closed bugs by @xinhe-nv in #7591
  • [https://nvbugs/5454559][fix] handle bias term in fuse_gate_mlp by @Linda-Stadter in #7449
  • [None][fix] enable NvFP4/FP8 quantization for Nemotron-H architecture by @tomeras91 in #7589
  • [None][feat] Optimize MLA kernels with separate reduction kernels by @PerkzZheng in #7597
  • [https://nvbugs/5445466][fix] unwaive DS R1 test cases with bug already fixed by @lancelly in #7429
  • [#6798][fix] fix compilation error in ub_allocator in single device build by @WilliamTambellini in #6874
  • [https://nvbugs/5434424][fix] A quick fix for the wrong output issue of SM89 blocked scaling batched GEMM when the input tensor is non-contiguous. by @StudyingShao in #7615
  • [None][chore] add TorchLlmArgs to the connector api by @richardhuo-nv in #7493
  • [TRTLLM-6707][fix] nanobind fix for executor exit call by @Linda-Stadter in #7565
  • [None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline by @QiJune in #7629
  • [TRTLLM-7408][feat] Wrap MOE with custom op. by @liji-nv in #7277
  • [TRTLLM-5059][feat] Enable KV-cache reuse and add E2E tests for llava-next by @chang-l in #7349
  • [None][fix] fix post-merge issue raised by #5488 by @nv-guomingz in #7655
  • [https://nvbugs/5410687][test] Add deepseek r1-w4afp8 quickstart by @fredricz-20070104 in #7645
  • [None][fix] UCX zmq ip support ipv6 by @chuangz0 in #7530
  • [None][feat] Make the should_use_spec_decode logic a bit smarter by @zheyuf in #7112
  • [#5861][autodeploy] Refactor: Quantization Transforms with Inheritance by @Fridah-nv in #7227
  • [#7208][fix] Fix config type of MedusaConfig by @karljang in #7320
  • [None][infra] Bump version to 1.1.0rc5 by @yiqingy0 in #7668
  • [TRTLLM-7871][infra] Extend test_perf.py to add disagg-serving perf tests. by @bo-nv in #7503
  • [https://nvbugs/5494698][fix] skip gemma3 27b on blackwell by @xinhe-nv in #7505
  • [https://nvbugs/5477359][fix] Nanobind: Allow none types for fields in result by @Linda-Stadter in #7672
  • [None][chore] remove executor config in kv cache creator by @leslie-fang25 in #7526
  • [https://nvbugs/5488212][waive] Waive failed tests for L20 by @nvamyt in #7664
  • [None][feat] Use a shell context to install dependencies by @v-shobhit in #7383
  • [https://nvbugs/5505402][fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues by @DomBrown in #7616
  • [None][infra] Waive failed cases on main 0910 by @EmmaQiaoCh in #7676
  • [None][infra] Adjust labeling llm prompt for bug issues by @karljang in #7385
  • [None][ci] move some test cases from l40s to a30 by @QiJune in #7684
  • [None][fix] Fix the incorrect header file import in dataType.h by @Fan-Yunfan in #7133
  • [https://nvbugs/5498165][fix] fix permission error for config file lock by @chang-l in #7656
  • [https://nvbugs/5513192][fix] Add the missing param for kv_cache_tran… by @nv-guomingz in #7679
  • [TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend by @LinPoly in #6097
  • [TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name by @ZhanruiSunCh in #6856
  • [TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file by @ZhanruiSunCh in #6742
  • [None][ci] Some improvements for Slurm CI by @chzblych in #7689
  • [None][ci] Test waives for the main branch 09/14 by @chzblych in #7698
  • [None][feat] support gpt-oss with fp8 kv cache by @PerkzZheng in #7612
  • [TRTLLM-6903][feat] Support chunked prefill for multimodal models by @chang-l in #6843
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7682
  • [None][chore] Enable multiple postprocess workers tests for chat completions api by @JunyiXu-nv in #7602
  • [TRTLLM-7279][test] add accuracy test for deepseek-r1 with chunked_prefill by @crazydemo in #7365
  • [https://nvbugs/5467981][fix] Fix Qwen2.5-VL fails with cuda graph padding by @DylanChen-NV in #7122
  • [None][chore] move some cases from post-merge to pre-merge to detect errors in early stage by @HuiGao-NV in #7699
  • [TRTLLM-7918][feat] Support kvcache reuse for phi4mm by @Wanli-Jiang in #7563
  • [None][test] add test for min_tokens by @ixlmar in #7678
  • [TRTLLM-7918][feat] Revert "Support kvcache reuse for phi4mm (#7563)" by @Wanli-Jiang in #7722
  • [None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow by @zhengd-nv in #7553
  • [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill by @jmydurant in #7477
  • [None][ci] Test waives for the main branch 09/15 by @chzblych in #7709

Full Changelog: v1.1.0rc4...v1.1.0rc5
