NVIDIA/TensorRT-LLM v1.3.0rc7

Pre-release · 7 hours ago

Highlights

  • Model Support

    • Support tensor parallelism in the TRTLLM MoE backend for the Nemotron-H model (#11470)
    • Add Kimi-K2.5 text model support (NVFP4) (#11777)
    • Add Helix CP support for DSV3.2 (#11507)
    • Support mixed quantization between shared experts and routed experts for DSV3 (#11215)
    • Support Cohere Command A model (#11505)
    • Extract embeddings as .safetensors and support float8-quantized models (#11180)
  • API

    • Add --served-model-name option to serve command (#11711)
    • Add flag to trtllm serve to override KV cache dtype (#11487)
    • Use string stop/bad words in gRPC proto instead of pre-tokenized TokenSequence (#11888)
    • Support multimodal image input in gRPC server (#11800)
    • Expose use_python_scheduler in SchedulerConfig and add associated tests (#11884)
    • Add max_gpu_total_bytes to control KVCacheManagerV2 capacity (#11907)
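
  For example, the new `--served-model-name` flag lets clients address the model by an alias rather than by its checkpoint path. A minimal launch sketch (the checkpoint path and alias below are placeholders, not taken from this release):

  ```shell
  # Hypothetical invocation: serve a local checkpoint under an alias that
  # clients then pass as the "model" field in OpenAI-compatible requests.
  trtllm serve ./checkpoints/my-model \
      --served-model-name my-model-alias
  ```
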
  • Feature

    • Support PARD (Parallel Draft Model) in one-model speculative decoding (#11438)
    • Enable autotuner for VisualGen and compilation config support (#11660)
    • Add globaltimer-based timing backend for autotuner profiling (#11657)
    • Support heterogeneous tokens_per_block (#11751)
    • Refactor KVCacheManagerV2 to simplify new model support (#11749)
    • Support Helix CP with GQA (#11570)
    • Add option to skip KV cache memory estimation (#11714)
    • Implement suffix automaton on device for speculative decoding and one-model support (#11434)
    • Separate radix search tree implementation (#10862)
    • Add support for expert_number ≤ 2048 and K ≤ 32 (#11510)
    • Add support for bidirectional sliding window attention mask to fmha_v2 (#11212)
    • Avoid duplicated computation with ADP + Helix CP in GQA (#11891)
    • Add explicit video encode format support (#11830)
    • Refactor video encoding to use ffmpeg CLI or pure Python fallback (#11672)
    • Integrate CuTe DSL top-k kernel for Blackwell (#11900)
    • Integrate suffix automaton with EAGLE3 and PARD (#11878)
    • Add 5D A2A for fused Ulysses (#11787)
    • Add SiLU to trtllm-gen MoE (#11663)
    • Optimize by fusing nvfp4_quant into layernorm_gated for mamba2_mixer (#11473)
    • Wire KVCacheBlock to UnifiedBlockTree using lookup-node pointers (#11919)
    • Run an extra general warmup pass to warm up the memory pool (#10340)
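
  Several of the speculative-decoding features above build on a suffix automaton over the generated token history, implemented on-device (#11434) and integrated with EAGLE3 and PARD (#11878). The pure-Python sketch below only illustrates the underlying data structure, not the CUDA implementation: after extending the automaton token by token, any contiguous substring of the history can be matched in time linear in its length.

  ```python
  class SuffixAutomaton:
      """Illustrative host-side suffix automaton over a token sequence."""

      def __init__(self):
          # Per-state arrays: outgoing transitions, suffix link, and the
          # length of the longest substring ending in that state.
          self.next = [{}]
          self.link = [-1]
          self.len = [0]
          self.last = 0

      def extend(self, token):
          # Standard online construction: add one token to the history.
          cur = len(self.len)
          self.next.append({})
          self.link.append(-1)
          self.len.append(self.len[self.last] + 1)
          p = self.last
          while p != -1 and token not in self.next[p]:
              self.next[p][token] = cur
              p = self.link[p]
          if p == -1:
              self.link[cur] = 0
          else:
              q = self.next[p][token]
              if self.len[p] + 1 == self.len[q]:
                  self.link[cur] = q
              else:
                  # Clone q so shorter suffixes keep correct transitions.
                  clone = len(self.len)
                  self.next.append(dict(self.next[q]))
                  self.link.append(self.link[q])
                  self.len.append(self.len[p] + 1)
                  while p != -1 and self.next[p].get(token) == q:
                      self.next[p][token] = clone
                      p = self.link[p]
                  self.link[q] = clone
                  self.link[cur] = clone
          self.last = cur

      def contains(self, tokens):
          # From the root, every substring of the history is reachable.
          s = 0
          for t in tokens:
              if t not in self.next[s]:
                  return False
              s = self.next[s][t]
          return True

  sam = SuffixAutomaton()
  for tok in [3, 1, 4, 1, 5, 9, 2, 6]:
      sam.extend(tok)
  assert sam.contains([1, 5, 9])   # contiguous substring of the history
  assert not sam.contains([9, 9])  # never appeared
  ```

  In the speculative-decoding setting, matching the current suffix of the generated text against this structure is what lets the drafter propose continuations drawn from earlier context.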
  • Fix

    • Add async worker to MTP/EAGLE3 sampler (#11573)
    • Fix disaggregated cancellation (#11730)
    • Use prefer_pinned() in pard.py (#11762)
    • Release KVCacheManagerV2 memory immediately on shutdown (#11746)
    • Remove duplicated MoE computation with Helix CP+DP (#11167)
    • Register add+norm fallback pass for torch.compile in multi-GPU mode (#11739)
    • Propagate logprobs from prefill to decode in disaggregated serving (#11727)
    • Propagate logits from prefill to decode in disaggregated serving (#11767)
    • Enable separate draft KV cache pool for aggregated mode and KVBM (#11689)
    • Fix warnings when building moe_kernels.cu (#11703)
    • Fix available_blocks typo in scheduler (#11801)
    • Clean up memory in rollout process (#11658)
    • Warm up maybe_compiled_cat in forward_context_with_chunked_prefill (#11743)
    • Fix DeepEPLowLatency with CuTe DSL MoE backend (#11769)
    • Fix FP8 per-tensor torch.compile graph break in dynamic quantization (#11759)
    • Fix streaming generation logits and speed up logits testcase (#10637)
    • Fix overly aggressive capacity scheduler (#11731)
    • Use proper tokens when exclude_input_in_output is true (#9453)
    • Move launch_dependent_grids after tmem free to fix race (#11812)
    • Fix E/PD disaggregated chunked prefill bug (#11805)
    • Fix SM120 issue for rms_norm with nvfp4_quant_fusion (#11774)
    • Remove dead code (#11813)
    • Fix KVCacheManagerV2 OOM and dummy request allocation in chunked prefill / pipeline parallel (#11710)
    • Fix AttributeError when DSA indexer accesses non-DSA KVCacheManager (#11858)
    • Override mMaxAttentionWindow with actual largest window size (#11842)
    • Update check_is_moe to support mlp_layer_types after config.json update (#11477)
    • Fix incorrect GPU timing in time breakdown under overlap scheduler (#11860)
    • Fix OOM hang with NCCL_SYMMETRIC fallback during long-context inference (#11870)
    • Fix position IDs input for Qwen3.5 text-only usage (#11877)
    • Disable preload for Llama4 Scout (#11873)
    • Fix formatting issue in tensorrt_llm/serve/openai_server.py (#11920)
    • Prevent RuntimeError from dict mutation during iteration in EXAONE MoE weight mapper (#11862)
    • Fix Nemotron MTP crash on SM90 (#11807)
    • Fix Mistral Large3 + EAGLE bug (#11942, #11885)
    • Fix TeaCache broken caching for FLUX.1 and FLUX.2 (#11868)
    • Fix FLUX.1 TeaCache polynomial coefficients and defaults (#12007)
    • Implement workaround for ClientPayloadError (#12018)
    • Fix duplicate model entry in model list (#12029)
    • Fix Python string truthiness bug in FMHA cubin selection (#11909)
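
  The last fix above (#11909) is an instance of a classic Python pitfall: an empty string is falsy, so a bare `if s:` check conflates "not provided" with "provided but empty". The sketch below uses a hypothetical `pick_kernel` helper (not the actual cubin-selection code) to show the buggy and corrected checks:

  ```python
  # Illustrative only: `pick_kernel` is a hypothetical helper, not the
  # actual TensorRT-LLM FMHA cubin-selection code.

  def pick_kernel(override=None):
      # Buggy: an empty string is falsy, so `if override:` treats ""
      # the same as None and silently falls back to the default.
      if override:
          return override
      return "default"

  def pick_kernel_fixed(override=None):
      # Fixed: only fall back when no override was supplied at all.
      if override is not None:
          return override
      return "default"

  assert pick_kernel("") == "default"   # empty override silently dropped
  assert pick_kernel_fixed("") == ""    # empty override honored
  ```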
  • Documentation

    • Fix typos, grammar, and accuracy across documentation (#11766)
    • Add sparse attention tech blog (#11644)
    • Add known issue for disaggregated serving hang with asymmetric PP/TP (#11789)
    • Fix documentation links (#11912)
    • Replace “TensorRT-LLM” with “TensorRT LLM” (#11914)
    • Add CI trigger and test-failure retrieval instructions to AGENTS.md (#11803)
  • Benchmark

    • Vectorize quantize_fp8_blockwise with CUDA kernel (#11724)
    • Use F.rms_norm for per-head QK normalization in VisualGen (#11798)
    • Short-sequence MHA optimization for DSA MLA prefill (#11677)
    • Parallel VAE harness and implementation for WAN (#11875)
    • Add Triton FP8 blockwise quant kernel and autotuner bucket-skip for VisualGen (#11854)
    • Optimize _prepare_inputs host time (#11704)
    • Improve are_stop_words performance (#11196)
    • Add DeepSeek RCCA performance test case (#11736)
    • Add VisualGen benchmarking script (#11651)
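
  For readers unfamiliar with blockwise FP8 quantization (as in the vectorized `quantize_fp8_blockwise` kernel above), the sketch below shows the numerics in pure Python: each fixed-size block gets one scale chosen so its absolute maximum maps to the FP8 E4M3 max-representable value. The 4-element block size and helper names are illustrative; the real kernel works on GPU tensors and also rounds values onto the FP8 grid, which this sketch omits.

  ```python
  # Illustrative pure-Python sketch of blockwise FP8-style quantization.
  FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3
  BLOCK = 4             # real kernels typically use 128-element blocks

  def quantize_blockwise(x, block=BLOCK):
      scales, q = [], []
      for i in range(0, len(x), block):
          chunk = x[i:i + block]
          amax = max(abs(v) for v in chunk) or 1.0  # avoid divide-by-zero
          scale = amax / FP8_E4M3_MAX               # one scale per block
          scales.append(scale)
          q.extend(v / scale for v in chunk)        # now within [-448, 448]
      return q, scales

  def dequantize_blockwise(q, scales, block=BLOCK):
      return [v * scales[i // block] for i, v in enumerate(q)]

  x = [0.1, -2.0, 0.5, 3.0, 100.0, -7.0, 0.0, 1.5]
  q, scales = quantize_blockwise(x)
  assert all(abs(v) <= FP8_E4M3_MAX for v in q)
  x_rt = dequantize_blockwise(q, scales)
  ```

  Because this sketch skips the final cast to FP8, the round trip is exact; the real kernel trades that exactness for 8-bit storage, with the per-block scale bounding the rounding error.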
  • Test & Infra

    • Add tests for all database configs (#11653)
    • Move B200 test stage to AIHub (#11692)
    • Support local wheel installation and add GB300 demo cases (#11742)
    • Remove submodule pulls from TRT-LLM git checkouts (#11693)
    • Add back WAN VBench test in CI (#11804)
    • Add E2E test for cancelled disaggregated generation requests with overlap scheduler (#11795)
    • Pass Nsight options to ray_executor and trigger profiling through collective_rpc (#11493)
    • Add B200 multi-node tests DB (#11783)
    • Add sanity tests for release 1.2 version (#11738)
    • Add QA test case for trust-remote-code on multi-node failure (#11905)
    • Fix model_name Starcoder 15B allowed-models issue (#11981)
    • Upgrade xgrammar from 0.1.25 to 0.1.32 (#12016)
    • Limit TileIRAS to CUDA 13.1 (#12042)
    • Remove VisualGen benchmark test from YAML (#12027)

What's Changed

  • [None][feat] Support tensor parallelism for nemotron-h model by @Wanli-Jiang in #11470
  • [None][test] Add tests for all database configs. by @fsaady in #11653
  • [https://nvbugs/5911143][fix] add async worker to MTP/Eagle3 sampler,… by @dhansen-nvidia in #11573
  • [TRTLLM-10886][feat] Support PARD(Parallel Draft Model) in one-model spec dec by @ziyixiong-nv in #11438
  • [None][fix] Fix disagg cancellation by @Tabrizian in #11730
  • [None][fix] Use prefer_pinned() in pard.py by @mikeiovine in #11762
  • [None][fix] Make KVCacheManagerV2 release mem immediately on shutdown by @lowsfer in #11746
  • [TRTLLM-11115][feat] enable autotuner for visual gen + Compilation Config by @NVShreyas in #11660
  • [None][chore] Minor fix in w4a8 mxfp4 mxfp8 test. by @Tracin in #11745
  • [None][infra] Move B200 test stage to AIHub by @yuanjingx87 in #11692
  • [None][infra] Waive failed cases for main on 02/27 by @EmmaQiaoCh in #11770
  • [TRTLLM-11064][fix] Remove duplicated MoE Computation with Helix CP+DP by @brb-nv in #11167
  • [TRTLLM-10386][fix] torch.compile: register add+norm fallback pass in multi-GPU mode by @luyiyun1021 in #11739
  • [None][feat] Support heterogeneous tokens_per_block by @lowsfer in #11751
  • [None][chore] Remove closed bugs by @xinhe-nv in #11527
  • [None][test] local wheel installation support and add gb300 cases demo by @fredricz-20070104 in #11742
  • [None][feat] Refactor cache manager v2 to simplify new model support by @jiaganc in #11749
  • [https://nvbugs/5879614][fix] Waive test_guided_decoding_with_eagle3 xgrammar in disaggregated serving by @ziyixiong-nv in #11773
  • [https://nvbugs/5911788][test] Waive test_llm_partial_update_weights[Qwen3/Qwen3-8B] by @liji-nv in #11785
  • [None][feat] add globaltimer-based timing backend for autotuner profi… by @dhansen-nvidia in #11657
  • [https://nvbugs/5926823][fix] Propagate logprobs from prefill to decode in disagg by @brb-nv in #11727
  • [TRTLLMINF-9][chore] Remove submodule pulls from TRT-LLM git checkouts by @dpitman-nvda in #11693
  • [https://nvbugs/5685010][fix] Delete test_eagle3_output_repetition_4gpus flaky assertions. by @zheyuf in #11725
  • [None][fix] enable separate draft KV cache pool for aggregated + KVBM… by @zyang-Modular in #11689
  • [TRTLLM-11058][feat] Support Helix CP with GQA by @brb-nv in #11570
  • [None][perf] Vectorize quantize_fp8_blockwise with CUDA kernel by @karljang in #11724
  • [https://nvbugs/5868616][fix] Fix warnings when building moe_kernels.cu by @yumin066 in #11703
  • [None][chore] Add CI trigger and test failure retrieval instructions to AGENTS.md by @lucaslie in #11803
  • [None][fix] Fix typo: avaiable_blocks -> available_blocks in scheduler by @kaiyux in #11801
  • [TRTLLM-11568][feat] Fix collective calls by @greg-kwasniewski1 in #11632
  • [None][perf] Use F.rms_norm for per-head QK normalization in visual gen by @karljang in #11798
  • [TRTLLM-11185][test] Add back WAN VBench test in CI by @chang-l in #11804
  • [TRTLLM-9782][feat] Support to skip KV cache memory estimation by @HuiGao-NV in #11714
  • [None][doc] Fix typos, grammar, and accuracy across documentation by @kaiyux in #11766
  • [None][fix] cleanup mem in rollout process by @hchings in #11658
  • [None][feat] Add --served-model-name option to serve command by @slin1237 in #11711
  • [None][chore] Update AGENTS.md by @lucaslie in #11809
  • [None][fix] AutoDeploy: Fix shape handling for singleton prefill by @galagam in #11679
  • [None][infra] Waive failed cases for main on 03/01 by @EmmaQiaoCh in #11811
  • [None][feat] TRT-LLM Gen MoE finalize kernel optimization by @nekorobov in #11501
  • [None][test] Add E2E test for cancelled disagg gen request with overlap scheduler by @Tabrizian in #11795
  • [None][chore] pass nsight options to ray_executor and trigger profiling through collective_rpc by @davidmlw in #11493
  • [TRTLLM-10962][feat] Refactor video encoding to use ffmpeg CLI or pur… by @JunyiXu-nv in #11672
  • [https://nvbugs/5823212][fix] Warmup maybe_compiled_cat in forward_context_with_chunked_prefill by @yuantailing in #11743
  • [None][feat] Extract embeding as .savetensors and support float8 quantized model by @nvyocox in #11180
  • [https://nvbugs/5885070][fix] fix deepeplowlatency with cutedsl moe backend by @leslie-fang25 in #11769
  • [None][fix] Fix FP8 per-tensor torch.compile graph break in dynamic quantization by @karljang in #11759
  • [TRTLLM-9687][feat] Improve are_stop_words performance by @stnie in #11196
  • [https://nvbugs/5883738][fix] fix bug for illegal memory access on Qwen3-235B-A22B-Thinking-2507-NVFP4 + Eagle3 by @sunnyqgg in #11474
  • [#10693][chore] AutoDeploy: Add L1 tests from coverage dashboard by @marinayanov in #11530
  • [https://nvbugs/5764627][fix] Fix generation logits with streaming and improve runtime of logits testcase. Also fixes https://nvbugs/5573238 by @stnie in #10637
  • [https://nvbugs/5934461][fix] Propagate logits from prefill to decode in disagg by @brb-nv in #11767
  • [#11726][feat] AutoDeploy: Fuse gemms of mixed children by @taylor-yb-lee in #11793
  • [None][fix] Fix overly aggressive capacity scheduler by @jthomson04 in #11731
  • [https://nvbugs/5689262][fix] use proper tokens when exclude_input_in_output is true by @lazykyama in #9453
  • [https://nvbugs/5863912][fix] Fix with move launch_dependent_grids after tmem free by @benzh-2025 in #11812
  • [https://nvbugs/5938603][fix] Fix E/PD disagg chunked prefill bug by @2ez4bz in #11805
  • [None][test] add deepseek RCCA perf test case by @ruodil in #11736
  • [None][fix] remove torch compile models arg by @NVShreyas in #11836
  • [None][test] add b200 multi nodes tests db by @xinhe-nv in #11783
  • [None][fix] Fix SM120 issue for rms_norm with nvfp4_quant_fusion by @Wanli-Jiang in #11774
  • [None][infra] Waive failed cases for main for post-merge 2564 by @ZhanruiSunCh in #11848
  • [https://nvbugs/5936502][fix] remove dead codes by @bo-nv in #11813
  • [None][chore] a GitHub Action to assign the PR to the author by @zhenhuaw-me in #11673
  • [None][infra] Fix a typo in waives.txt by @EmmaQiaoCh in #11852
  • [None][test] Fix wrong lora config by @yufeiwu-nv in #11818
  • [None][test] fix flaky issues by @xinhe-nv in #11814
  • [None][fix] Fix OOM issue/dummy request allocation/chunked prefill/pp for KV Cache Manager V2 by @yizhang-nv in #11710
  • [None][test] update waive list by @xinhe-nv in #11815
  • [TRTLLM-9939][perf] Short-sequence MHA optimization for DSA MLA prefill by @kaiyux in #11677
  • [None][refactor] Revisit attention interface for AutoDeploy by @lucaslie in #11796
  • [None][feat] Add a flag in trtllm serve to support overriding kv cache dtype by @cjluo-nv in #11487
  • [TRTLLMINF-9][chore] Use checkoutFile in mergeWaiveList to avoid full clone by @dpitman-nvda in #11794
  • [None][chore] Refresh inferenceX configs in recipes by @venkywonka in #11595
  • [TRTLLM-11042][feat] Implement suffix automaton on device for spec and support one model by @cascade812 in #11434
  • [https://nvbugs/5941681][fix] Handle dict type for speculative_config by @ziyixiong-nv in #11828
  • [None][feat] Add Kimi-K2.5 text model support (NVFP4) by @lancelly in #11777
  • [None][chore] Bump version to 1.3.0rc7 by @yuanjingx87 in #11864
  • [https://nvbugs/5919026][fix] Fix AttributeError when DSA indexer accesses non-DSA kv_cache_manager by @ziyixiong-nv in #11858
  • [TRTLLM-11184][feat] Explicit video encode format support by @JunyiXu-nv in #11830
  • [None][test] Enable DeepGemm + DeepEPLowLatency MoE test combination by @Tabrizian in #11876
  • [#10009][fix] Fix json_schema response_format to support OpenAI API w… by @JunyiXu-nv in #11497
  • [https://nvbugs/5927620][fix] Override mMaxAttentionWindow with the actual largest window size by @ziyixiong-nv in #11842
  • [None][feat] Support mix quantization between shared experts and routed experts for dsv3 by @dmtri35 in #11215
  • [#11666][fix] Fix inmemory model dir detection by @capyun007 in #11753
  • [None][infra] Waive 3 failed cases for main in post-merge 2566 by @ZhanruiSunCh in #11881
  • [None][doc] Add sparse attention tech blog by @heyuhhh in #11644
  • [TRTLLM-9392][feat] Support MoE output to alltoall's workspace for all the quantization recipe of trtllm-gen. by @bobboli in #11449
  • [TRTLLM-10852][feat] Enhance logprobs functionality to always return prompt token logprobs in prompt logprobs by @stnie in #11235
  • [None][fix] Fix typos, grammar, and formatting in comments and docstrings by @kaiyux in #11826
  • [None][fix] Update check_is_moe into support mlp_layer_types after config.json update by @eagle705 in #11477
  • [https://nvbugs/5946303][fix] Fix incorrect GPU timing in time breakdown under overlap scheduler by @luyiyun1021 in #11860
  • [None][chore] Update autotuner by @jiahanc in #11859
  • [None][chore] Handle failure in auto-assign author workflow by @zhenhuaw-me in #11906
  • [https://nvbugs/5930934][fix] Fix OOM hang with NCCL_SYMMETRIC fallback during long-context inference by @peihu-nv in #11870
  • [None][fix] Qwen3.5 fix positions ids input for text-only usage by @bmarimuthu-nv in #11877
  • [None][fix] Refactor nanoV3+superV3 accuracy tests to load example config by @galagam in #11458
  • [None][chore] Deprecate eagle3 2-model by @mikeiovine in #11761
  • [#11819][fix] Disable preload for Llama4 scout by @taylor-yb-lee in #11873
  • [None][chore] Fix format issue in tensorrt_llm/serve/openai_server.py by @chienchunhung in #11920
  • [None][feat] Separate radix search tree implementation by @thorjohnsen in #10862
  • [None][feat] Add support for expert_number<=2048 and K<=32 by @ChristinaZ in #11510
  • [None][infra] Waive 1 failed cases for main in pre-merge 29212 by @ZhanruiSunCh in #11929
  • [None][fix] remove leak check for kimi by @xinhe-nv in #11825
  • [https://nvbugs/5907477][chore] unwaive test by @reasonsolo in #11896
  • [TRTLLM-10956][infra] Support build-only mode for GenPostMergeBuilds job by @mzweilz in #11895
  • [#11755][feat] AutoDeploy onboarding agent + Kimi K2.5 AD modeling code by @bmarimuthu-nv in #11780
  • [None][fix] Prevent RuntimeError from dict mutation during iteration in EXAONE MoE weight mapper by @Bias92 in #11862
  • [TRTLLM-11101][feat] VisualGen benchmarking script by @zhenhuaw-me in #11651
  • [https://nvbugs/5820734][fix] Run extra general warmup to warm up memory pool by @liji-nv in #10340
  • [None][fix] Fix nemotron super MTP crash on SM90 by @sunnyqgg in #11807
  • [None][chore] Use cluster service discover in disagg CI tests by @ekou24 in #11242
  • [None][feat] External Drafter One Model by @IzzyPutterman in #11758
  • [None][chore] Update model list by @tcherckez-nvidia in #11827
  • [#11578][fix] Use string stop/bad words in gRPC proto instead of pre-tokenized TokenSequence by @CatherineSue in #11888
  • [None][feat] Add support for bidirectional sliding window attention mask to fmha_v2 by @djns99 in #11212
  • [TRTLLM-11036][feat] Enable new moe test and clean the legacy moe test in the CI by @xxi-nv in #11817
  • [None][infra] Waive 4 failed cases for main in post-merge 2571 by @ZhanruiSunCh in #11968
  • [None][test] Fix deepseek-r1 OOM issue for H100 perf test by @yufeiwu-nv in #11948
  • [None][fix] Remove incorrect Python import style rule from AGENTS.md by @yuxianq in #11940
  • [https://nvbugs/5896577][fix] fix bug of mistral large3 with eagle by @byshiue in #11942
  • [https://nvbugs/5819048][fix] unwaive test of qwen3-235b eagle3 by @byshiue in #11969
  • [None][feat] Avoid duplicated computation with ADP + Helix CP in GQA by @brb-nv in #11891
  • [https://nvbugs/5624818][fix] Add unittest for GPT-OSS non-paged_context_fmha by @pengbowang-nv in #11415
  • [#10245][feat] AutoDeploy: Support Finegrained FP8 quantization by @bmarimuthu-nv in #10897
  • [TRTLLM-11284][infra] Move large models test to post-merge by @EmmaQiaoCh in #11933
  • [TRTLLM-11155][infra] Run multi-GPU tests even single-GPU tests are failed when use --disable-fail-fast by @yiqingy0 in #11740
  • [None][fix] Refine tests/unittest/_torch/flashinfer/test_trtllm_flashinfer_symbol_collision.py to reduce jit-compile time by @yihwang-nv in #11890
  • [#11422][feat] AutoDeploy: Piecewise cudagraph support Prototype by @nvchenghaoz in #11515
  • [TRTLLM-11189][fix] VisualGen isolated TeaCache Wan fix by @o-stoner in #11964
  • [https://nvbugs/5846166][fix] Update Perf Triage Scripts to Fix gen_only issue by @chenfeiz0326 in #11802
  • [TRTLLM-11057][feat] Add Helix CP support for DSV3.2 by @brb-nv in #11507
  • [#2912][feat] Support Cohere Command A model by @torotoki in #11505
  • [TRTLLM-11259][perf] Parallel VAE harness and implementation for WAN by @NVShreyas in #11875
  • [#11578][feat] support multimodal image input in gRPC server by @CatherineSue in #11800
  • [TRTLLM-11093][feat] add 5D A2A for fused ulysses by @NVShreyas in #11787
  • [TRTLLM-11189][fix] Fix TeaCache broken caching for FLUX.1 and FLUX.2 by @karljang in #11868
  • [None][refactor] Request management in ScheduledRequests by @Funatiq in #11784
  • [None][perf] Add Triton FP8 blockwise quant kernel and autotuner bucket-skip for visual gen by @chang-l in #11854
  • [TRTLLM-11290][feat] Enable trtllm-serve E2E tests by @JunyiXu-nv in #11985
  • [None][feat] Optimize by fuse nvfp4_quant to layernorm_gated for mamba2_mixer by @Wanli-Jiang in #11473
  • [None][chore] Autodeploy: add models for sprint by @nvchenghaoz in #11999
  • [None][infra] Update CI allow list 20260305 by @yuanjingx87 in #11965
  • [None][chore] Mass integration of release/1.2 weekly - 6th by @dominicshanshan in #11934
  • [None][fix] Fix Collect Perf Sanity Result's import requests Error by @chenfeiz0326 in #12002
  • [TRTLLM-10956][infra] Skip updating gitlab status for GenPostMergeBuilds by @mzweilz in #11954
  • [None][feat] add ReLU2 NVFP4 fusion for AutoDeploy with tests by @tcherckez-nvidia in #11957
  • [TRTLLM-11159][feat] Wire KVCacheBlock to UnifiedBlockTree, replacing mPrevBlock/mNextBlocks with lookup-node pointers. by @SimengLiu-nv in #11919
  • [#11166][infra] AutoDeploy: improve test organization in CI and add overview doc by @lucaslie in #11291
  • [None][chore] Model update 260308 by @tcherckez-nvidia in #12011
  • [None][infra] Update AutoDeploy CODEOWNERS coverage by @lucaslie in #12013
  • [https://nvbugs/5732958][bug] Fix TestLlama4MinLatency::test_llama_allclose_to_hf failure by @nvpohanh in #10191
  • [None][chore] Unwaive some skip for trtllm moe backend by @leslie-fang25 in #11975
  • [TRTLLM-11134][feat] export VisualGen API and update doc by @zhenhuaw-me in #11911
  • [https://nvbugs/5823783][test] add qa test case for trust-remote-code on multinode failure by @crazydemo in #11905
  • [None][feat] Use max_gpu_total_bytes to control v2's capacity by @jiaganc in #11907
  • [TRTLLM-11342][fix] Fix FLUX.1 TeaCache polynomial coefficients and default t… by @karljang in #12007
  • [None][fix] Use try/except fallback for Pydantic ValidatorIterator in chat message parsing by @Wanli-Jiang in #11903
  • [None][infra] Unwaive 2 cases on rtx-pro-6000d by @EmmaQiaoCh in #12003
  • [TRTLLM-11276][chore] Expose use_python_scheduler in SchedulerConfig and add UTs/ITs for python scheduler by @lancelly in #11884
  • [None][infra] Waive 7 failed cases for main in post-merge 2576 by @ZhanruiSunCh in #12014
  • [https://nvbugs/5948878][fix] Implement workaround for ClientPayloadError by @yingguo-trt in #12018
  • [TRTLLM-10407][feat] Integrate CuTE DSL top-k kernel for Blackwell by @limin2021 in #11900
  • [TRTLLM-11148][perf] _prepare_inputs host time optimization by @hyukn in #11704
  • [None][test] Fix model_name starcoder_15b is not in allowed_models issue by @yufeiwu-nv in #11981
  • [None][infra] Waive 5 failed cases for main in post-merge 2578 by @ZhanruiSunCh in #12023
  • [None][chore] AutoDeploy: re-enable nvfp4 superv3 accuracy test by @galagam in #11945
  • [None][chore] Remove visual_gen benchmark test from YAML by @zhenhuaw-me in #12027
  • [None][fix] Fix the model list as it had a dup model by @tcherckez-nvidia in #12029
  • [https://nvbugs/5863806][fix] Fix Python string truthiness bug in FMHA cubin selection by @luyiyun1021 in #11909
  • [None][feat] Upgrade xgrammar from 0.1.25 to 0.1.32 by @sunnyqgg in #12016
  • [https://nvbugs/5924144][test] unwaive cpp/test_unit_tests.py::test_unit_tests[kernels-80] by @Funatiq in #11902
  • [None][chore] limit tileiras to CUDA13.1 by @tburt-nv in #12042
  • [None][feat] Add silu to trtllm-gen MoE by @IwakuraRein in #11663
  • [TRTLLM-11045][feat] Integrate SA with EAGLE3 and PARD by @cascade812 in #11878
  • [None][chore] waive test_visual_gen_quickstart by @tburt-nv in #12043
  • [None][feat] NIXL support for hybrid model cache transfer by @NVShreyas in #11608

New Contributors

Full Changelog: v1.3.0rc6...v1.3.0rc7
