NVIDIA/TensorRT-LLM v1.3.0rc19 on GitHub

Known Issues
- Llama 3.1 8B FP8 can hang during the autotuner warmup on GB200.
Model Support
- Support NVIDIA Wan2.2-T2V quantized checkpoints (#15093)
- Enable MTP for Step-3.7 NVFP4 and port Step-3.7VL vision tower to TRT-LLM modules (#14926)
- Support T5 and BART in the PyTorch backend (#13919)
- Support MiniMax-M3 in the PyTorch backend (#15292)
API
- Align VisualGen serve request schema with VisualGenParams (#14733)
- Support multi-item scoring in LLM.encode (#14693)
- Drop legacy --extra_visual_gen_options CLI alias (#15262)
Feature
- Enable TRTLLM MoE backend for Nemotron-H BF16 checkpoint (#14944)
- Add async Ulysses pipeline (enabled for LTX-2 and WAN) (#13978)
- Make TrtllmGenAttention the default decode backend on Blackwell+ (#14618)
- Skip redundant data expand in DeepGemmFusedMoE via fused expand+quant Triton kernel (#14591)
- Add Prometheus metrics for prompt cache, speculative decoding, perplexity, and batch occupancy (#12636)
- Add Indexer TopK single-block / multi-pass radix implementation (#14268)
- Enable gen-only speculative decoding for disagg setups (#14546)
- Support EAGLE3 dynamic trees on Blackwell (#12958)
- Add CUDA graph support for per-expert LoRA in Cutlass backend (#14881)
- Add support for beam search in disaggregated serving (#14876)
- Add maximal LLMAPI capture in usage telemetry (#14398)
- Optimize Qwen2.5/3/3.5-VL performance (#11943)
- Add skip-softmax TMA-load + sync-MMA warp-specialized context FMHA for sm_120/sm_121 (#15163)
- Enable TRTLLM cross attention backend (#15345)
- Support per-request mm_processor_kwargs for Qwen3-VL (#14702)
- Add prefetch_reuse_blocks and configurable prefetch count (#15149)
- Add MegaMoECuteDsl NVFP4 MoE backend (#14608)
- Make EAGLE3 honor sampling params by default (#14745)
- Add multiple FMHA library support to TRTLLM attention backend (#15204)
- Add checkpointing variant of replay for MTP for mamba models (#14203)
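Note on the Indexer TopK item (#14268): these notes do not describe the kernel itself, so the following is only a single-threaded Python sketch of the general multi-pass radix select idea. It assumes non-negative integer keys and 8-bit digits; the function name and parameters are hypothetical and are not taken from the TensorRT-LLM code base.

```python
from typing import List


def radix_top_k(values: List[int], k: int, bits: int = 32, digit_bits: int = 8) -> List[int]:
    """Return the k largest values (order unspecified) using MSB-first radix select.

    Assumes every value is in [0, 2**bits) and bits is a multiple of digit_bits.
    """
    if k <= 0:
        return []
    if k >= len(values):
        return list(values)

    mask = (1 << digit_bits) - 1
    selected: List[int] = []      # values already known to be in the top k
    candidates = list(values)     # values still tied on all digits seen so far
    remaining = k                 # how many more values must come from candidates

    # One pass per digit, most significant digit first.
    for shift in range(bits - digit_bits, -1, -digit_bits):
        # Histogram of the current digit over the surviving candidates.
        hist = [0] * (1 << digit_bits)
        for v in candidates:
            hist[(v >> shift) & mask] += 1

        # Walk buckets from the largest digit down to find the bucket
        # that contains the remaining-th largest candidate.
        taken = 0
        pivot = 0
        for d in range(mask, -1, -1):
            if taken + hist[d] >= remaining:
                pivot = d
                break
            taken += hist[d]

        # Everything above the pivot bucket is in the top k; the pivot
        # bucket itself is refined by the next pass.
        selected.extend(v for v in candidates if ((v >> shift) & mask) > pivot)
        candidates = [v for v in candidates if ((v >> shift) & mask) == pivot]
        remaining -= taken

        if len(candidates) == remaining:
            break  # the whole bucket is needed, no further passes required

    selected.extend(candidates[:remaining])
    return selected


if __name__ == "__main__":
    print(sorted(radix_top_k([5, 1, 9, 3, 9, 7], 3)))
```

Each pass narrows the candidate set to a single digit bucket, which is why a small k over a large input can finish without a full sort. A GPU implementation would build the histograms in parallel; that part is not modeled here.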
Fix
- Remove redundant TikTokenTokenizer shim from Kimi-K2.5 input processor (#14741)
- Rename misnamed tunable_fp4_quantize kwarg and add real SF-swizzle control (#15002)
- Gate FlashInfer GDN kernels to supported configurations (#15094)
- Count DSA indexer K-cache correctly as UINT8 in KV cache size estimate (#15088)
- Select CUTLASS MoE backend on non-Blackwell SMs for Qwen3.5-35B-A3B FP8 (#15081)
- Fix SageAttention kernel regression by using static scheduler (#15047)
- Fall back to local cache when loading tokenizer for gated models (#12998)
- Fix PyExecutor FPM iteration timing (#14922)
- Register multimodal placeholders for Qwen3.5 MoE VLM serving (#15079)
- Fix and unwaive Nemotron-related bugs (#15085)
- Guard DSA DSL atom-split against MTP draft next (#14891)
- Scope disagg-ctx cache-transfer quorum vote to TP instead of WORLD (#15136)
- Clear workspace in run_mla_generation to avoid illegal memory access (#15173)
- Fix MAX_UTILIZATION reuse token budget (#15066)
- Add kv_transfer_timeout_ms to avoid timeout (#15152)
- Preserve ip:port for trtllm-serve visual-gen (#14355)
- Fix guided decoding (xgrammar) + EAGLE-3 + draft_len_schedule crash during CUDA graph capture (#15023)
- Stabilize Mamba replay state update (#14841)
- Fix max_context_length value for attention workspace sizing (#15156)
- Fix issue where host KV cache usage would double when speculative decoding is used (#14373)
- Disable NCCL_SYMMETRIC tactic on GB10 (DGX Spark) (#12902)
- Fix attentionOp FP8 MLA KV-reuse workspace calculation (#14852)
- Fix beam search log_probs non-determinism with batch_size > 1 (#15125)
- Forward secondary_offload_min_priority to KVCacheManager in PyTorch executor (#13768)
- Enable multi-block mode for XQA HMMA spec-dec (#15312)
- Fix TinyGEMM barrier bug (#15338)
- Fix stale sparse attention kwargs (#15460)
- Fix CppMambaHybridCacheManager to handle dp dummy request (#15054)
- Fix embedding vocab mask for rejection sampling in Kimi-K2.5 (#15233)
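Note on the KV cache size estimate fix (#15088): the arithmetic below is a minimal sketch of why the element size of each cache pool matters when estimating how many tokens fit in a memory budget. All layer counts, element counts, and the budget are made-up illustration values, the helper names are hypothetical, and the notes do not state which dtype was assumed before the fix.

```python
from typing import Iterable, Tuple

# Bytes per element for a few common cache dtypes.
DTYPE_BYTES = {"uint8": 1, "fp8": 1, "fp16": 2, "bf16": 2, "fp32": 4}

# A pool is (num_layers, elements_per_token_per_layer, dtype).
Pool = Tuple[int, int, str]


def bytes_per_token(pool: Pool) -> int:
    """Bytes one token occupies in a single cache pool."""
    num_layers, elems_per_token, dtype = pool
    return num_layers * elems_per_token * DTYPE_BYTES[dtype]


def max_cached_tokens(budget_bytes: int, pools: Iterable[Pool]) -> int:
    """Largest token count whose combined pools fit in budget_bytes."""
    per_token = sum(bytes_per_token(p) for p in pools)
    return budget_bytes // per_token


if __name__ == "__main__":
    budget = 1 << 20                      # hypothetical 1 MiB budget
    main_kv = (4, 512, "bf16")            # hypothetical main KV pool
    indexer_uint8 = (4, 128, "uint8")     # indexer K-cache counted as UINT8
    indexer_wide = (4, 128, "bf16")       # same pool counted with a 2-byte dtype

    print(max_cached_tokens(budget, [main_kv, indexer_uint8]))
    print(max_cached_tokens(budget, [main_kv, indexer_wide]))
```

With these illustration values, counting the indexer pool at two bytes per element instead of one lowers the estimated token capacity, so the dtype used in the estimate directly changes how much cache the executor believes it can allocate.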
Documentation
- Add FLUX visual generation examples (#14987)
- Add Qwen3.5 deployment guide doc (#15111)
- Fix stale --disable_xqa reference in legacy docs (#13395)
- Add Cache-DiT documentation (#15268)
Benchmark
- Weight trtllm-bench AR/AL averages by output length (#14998)
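Note on the benchmark change (#14998): the sketch below shows the difference between a plain mean and a mean weighted by output length. It assumes AR/AL are per-request acceptance metrics; the exact formula used inside trtllm-bench is not given in these notes, and the function name is hypothetical.

```python
from typing import Sequence


def length_weighted_mean(values: Sequence[float], output_lengths: Sequence[int]) -> float:
    """Average per-request values, weighting each request by its output length."""
    if len(values) != len(output_lengths):
        raise ValueError("values and output_lengths must have the same length")
    total_length = sum(output_lengths)
    if total_length == 0:
        raise ValueError("total output length must be positive")
    return sum(v * n for v, n in zip(values, output_lengths)) / total_length


if __name__ == "__main__":
    per_request_metric = [2.0, 4.0]   # hypothetical per-request values
    output_lengths = [100, 300]       # tokens generated per request

    plain = sum(per_request_metric) / len(per_request_metric)
    weighted = length_weighted_mean(per_request_metric, output_lengths)
    print(plain, weighted)
```

Weighting by output length keeps a handful of very short requests from skewing the reported average, since each generated token contributes equally to the result.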
Test & Infra
- Add accuracy tests for nemotron-v3-ultra (#14808)
- Remove TestLlama4ScoutInstruct tests (#15144)
- Require minimum of 4 GPUs in llm_perf_core.yml and add new performance tests (#15090)
- Add DFlash coverage for Qwen3.5 MoE variant (#15132)
- Add e2e example tests for flux1/2, ltx2, wan_i2v, and cosmos3 (#15126)
- Enable disagg cancellation stress test (#15174)
- Fix periodic-junit in unittest pytest (#14075)
- Update K2.5 and GLM-5 into CI perf test (#14960)
- Add Qwen3-32B FP8 disagg stress test (#14278)
- Sunset old disagg test cases for the QA side (#15290)
- Add e2e Tensor Parallel LPIPS tests for VisualGen (#15208)
- Remove TensorRT performance baseline and update to PyTorch only (#15256)
- Add integration tests for MoE LoRA and bugfixes (#15271)

What's Changed

[None][infra] Waive TestQwen3NextInstruct nvfp4 cases by @mzweilz in #15086
[https://nvbugs/6248757][fix] Avoid running all reduce in aux stream by @tensorrt-cicd in #14917
[https://nvbugs/6221483][fix] AutoDeploy: Fix Eagle metadata host syncs by @govind-ramnarayan in #14714
[None][feat] add FLUX visual generation examples by @karljang in #14987
[https://nvbugs/6261164][fix] AutoDeploy: Don't allocate speculative caches when speculation is off by @tensorrt-cicd in #15020
[https://nvbugs/6211189][fix] Lower the reference to 46.5 (matching cross-GPU empirical mean) and remove the t by @tensorrt-cicd in #14799
[None][refactor] split VisualGen pipeline and model configs by @bobboli in #14956
[TRTLLM-11457][feat] Async Ulysses pipeline (Enabled for LTX-2 + WAN) by @luyiyun1021 in #13978
[TRTLLM-11548][doc] Add Qwen3.5 deployment guide doc by @nv-guomingz in #15111
[https://nvbugs/6181383][fix] Build inner text/vision/audio sub-configs as empty PretrainedConfig() then setat by @tensorrt-cicd in #14399
[https://nvbugs/6273850][chore] waive TestQwen3_5_4B::test_bf16 for all GPUs by @tburt-nv in #15112
[None][doc] Add docs for AutoDeploy transforms by @bmarimuthu-nv in #15122
[None][infra] Waive 4 failed cases for main in post-merge 2769 by @ZhanruiSunCh in #15140
[https://nvbugs/6227203][fix] Remove redundant TikTokenTokenizer shim from KimiK25InputProcessor by @tianyuxbear in #14741
[None][fix] tunable_fp4_quantize: rename misnamed kwarg + add real SF-swizzle control by @luyiyun1021 in #15002
[None][test] Fix gen_only missing prev_device_step_time race in perf sanity by @tensorrt-cicd in #15108
[None][test] Fix disagg test result dir by @fredricz-20070104 in #14864
[TRTLLM-13332][test] Remove TestLlama4ScoutInstruct tests by @QiJune in #15144
[https://nvbugs/6266705][fix] Gate FlashInfer GDN kernels to supporte… by @nv-guomingz in #15094
[https://nvbugs/6255037][fix] Count DSA indexer K-cache correctly as UINT8 in KV cache size estimate by @eopXD in #15088
[https://nvbugs/6194812][test] Update llm_perf_core.yml to require a minimum of 4 GPUs and add new performance tests by @yufeiwu-nv in #15090
[TRTLLMINF-112][infra] Reduce the waiting time between check node is online or not by @EmmaQiaoCh in #14819
[None][infra] Waive 1 failed cases for main in pre-merge 41821 by @ZhanruiSunCh in #15135
[None][infra] CBTS Layer 3: pass test-db via Artifactory instead of env var by @crazydemo in #15142
[TRTLLM-13264][feat] Add native bias epilogue to NVFP4 GEMM by @luyiyun1021 in #15053
[https://nvbugs/6278380][unwaive] unwaive ad cases by @crazydemo in #15148
[https://nvbugs/6244474][fix] AutoDeploy: Remove llama perf test from CI by @MrGeva in #15107
[https://nvbugs/6212252][fix] Select CUTLASS MoE backend on non-Blackwell SMs in TestQwen3_5_35B_A3B::test_fp8 by @xxi-nv in #15081
[TRTLLM-13302][feat] Register NVIDIA Wan2.2-T2V quantized checkpoints by @zhenhuaw-me in #15093
[None][chore] add VisualGen team as the codeowner of the VisualGen Attention by @zhenhuaw-me in #15150
[None][feat] Default on FlashInferTrtllmGenAttention by @yihwang-nv in #14618
[None][infra] Test DFW with BSL branch by @yuanjingx87 in #14597
[TRTLLM-12214][perf] customMoeRoutingKernel: lower BLOCK_SIZE to 128, raise maxNumBlocks by @xwang233 in #14590
[TRTLLM-12214][perf] DeepGemmFusedMoE: skip redundant data expand via fused expand+quant Triton kernel by @xwang233 in #14591
[TRTLLM-12648][test] implement disagg cancellation load thread by @chienchunhung in #15124
[None][fix] Fix regression from SageAttention kernel: Use static scheduler by @xrq-phys in #15047
[TRTLLM-12467][feat] EPD improvements by @venkywonka in #13864
[None][feat] Expose stored block-hash chain to KV cache connector by @jthomson04 in #14806
[#12805][fix] Fall back to local cache when loading tokenizer for gated models by @1MrazorT1 in #12998
[None][feat] Support partial RoPE fusion for Hopper kernels in XQA for Laguna by @DomBrown in #15110
[None][infra] Add nv-xtf, rahul-steiger-nv, tedzhouhk, tensorrt-cicd to blossom-ci allowlist by @ZhanruiSunCh in #14955
[None][feat] Add Prometheus metrics for prompt cache, speculative decoding, perplexity, and batch occupancy by @vedularaghu in #12636
[None][chore] Unwaive DSV32 helix tests by @brb-nv in #14871
[None][fix] unset UCX_TLS=tcp by @tburt-nv in #15008
[None][feat] Port 13 AutoDeploy custom models to sharding IR + opt them in via registry by @greg-kwasniewski1 in #14778
[None][chore] Make image paths absolute in blog22 by @brb-nv in #15177
Fix PyExecutor FPM iteration timing by @tedzhouhk in #14922
[#13816][feat] AutoDeploy: Optimize gpt-oss-120b perf by @taylor-yb-lee in #14202
[None][fix] Register Multimodal Placeholders for Qwen3.5 MoE VLM Serving by @anurags25 in #15079
[None][feat] Weight trtllm-bench AR/AL averages by output length by @zhaoyangwang-nvidia in #14998
[TRTLLM-13052][feat] Enable TRTLLM moe backend for nemotron-h BF16 ckpt by @Wanli-Jiang in #14944
[None][fix] Fix and unwaive nemotron related bugs by @Wanli-Jiang in #15085
[https://nvbugs/6140226][test] Add DFlash coverage for Qwen3.5 MoE variant by @yingguo-trt in #15132
[None][test] temporarily waive Cosmos3 B200 failures by @bobboli in #15195
[NVBUG-6241842][fix] DSA DSL atom-split: guard against MTP draft next… by @limin2021 in #14891
[#11423][feat] AutoDeploy: Basic Disagg Support by @govind-ramnarayan in #14057
[https://nvbugs/6280060][fix] Scope disagg-ctx cache-transfer quorum vote to TP instead of WORLD by @tensorrt-cicd in #15136
[None][test] Add e2e example tests for flux1/2, ltx2, wan_i2v, and cosmos3 by @chang-l in #15126
[#12632][feat] Add pipeline cache support for AutoDeploy by @nvchenghaoz in #13729
[None][test] Add support for nemotron_3_ultra_550b_nvfp4 model in performance tests and configurations by @yufeiwu-nv in #15166
[None][feat] Indexer TopK: single-block / multi-pass radix by @dcampora in #14268
[None][fix] Clear workspace in run_mla_generation to avoid potential illegal memory access issue by @yihwang-nv in #15173
[None][chore] Unwaive AutoDeploy accuracy tests by @bmarimuthu-nv in #14971
[None][test] Increase kv_transfer_timeout_ms for b200 deepseek-r1 disagg gen_only perf test by @tensorrt-cicd in #15205
[None][feat] Enable MTP for Step-3.7 NVFP4 and port Step-3.7VL vision tower to TRT-LLM modules by @kaiyux in #14926
[https://nvbugs/6266370][fix] Fix MAX_UTILIZATION reuse token budget on main by @brb-nv in #15066
[https://nvbugs/6272573][ci] Unwaive skipped test by @2ez4bz in #15118
[https://nvbugs/6245279][fix] AutoDeploy: Unwaive accuracy tests by @galagam in #15214
[TRTLLM-12491][feat] Align VisualGen serve request schema with VisualGenParams by @zhenhuaw-me in #14733
[None][test] Add MLA chunked-prefill SM dispatch regression coverage by @DhineshPonnarasan in #13904
[TRTLLM-12648][test] enable disagg cancellation stress test by @chienchunhung in #15174
[None][feat] Preserve cache_salt string in KV cache events by @jthomson04 in #13051
[https://nvbugs/6104831][fix] Port dataTransceiver shared_ptr lifetime fix by @chienchunhung in #14979
[None][fix] Fix AutoDeploy transform docs generation by @bmarimuthu-nv in #15228
[None][feat] Targeted warmup-waste cleanup by @dominicshanshan in #14609
[None][fix] Remove TLLM_RUBIN_FEATURES by @yuxianq in #15143
[https://nvbugs/6108994][fix] add kv_transfer_timeout_ms to avoid timeout by @bo-nv in #15152
[TRTLLM-12657][infra] Fix periodic-junit in unittest pytest by @yiqingy0 in #14075
[https://nvbugs/6143883][fix] Preserve ip:port for trtllm-serve visual-gen by @JunyiXu-nv in #14355
[TRTLLM-12958][feat] Enable gen-only spec dec by @bo-nv in #14546
[https://nvbugs/6162120][test] Remove 78 closed-bug waive entries for main by @tensorrt-cicd in #15061
[https://nvbugs/6278399][fix] Add x86_64 path using CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR with… by @tensorrt-cicd in #15129
[TRTLLM-11538][feat] Blackwell custom mask fmha support by @sunnyqgg in #12958
[None][infra] Waive 6 failed cases for main in post-merge 2773 by @ZhanruiSunCh in #15250
[None][feat] Enhance CuteDSL NVF4 MOE by @liyuhannnnn in #15092
[None][infra] Waive 3 failed cases for main in post-merge 2772 by @ZhanruiSunCh in #15253
[None][test] Update K2.5 and GLM-5 into CI Perf Test by @chenfeiz0326 in #14960
[None][feat] enable GQA and cross-attention for attn2d by @NVShreyas in #14961
[#12230][fix] Add bounds checking in autotuner _find_nearest_profile for SM121 by @mihai-chiorean in #12310
[None][refactor] visual_gen Attention: drop redundant enable_ulysses kwarg (rebase artifact from #13978) by @luyiyun1021 in #15141
[None][fix] Generalize FP8 checkpoint loading for Qwen3.5 by @amukkara in #15067
[#13858][fix] AutoDeploy fix the piecewise vlm issue by @nvchenghaoz in #14006
[TRTLLM-12507][feat] Cudagraph support for per-expert lora in Cutlass backend - Part 2 by @brb-nv in #14881
[None][test] Remove stale perf sanity waives by @cascade812 in #15269
[None][infra] Waive 8 failed cases for main in pre-merge 42699 by @ZhanruiSunCh in #15273
[None][fix] Install processor-output validation filter at module import by @aswinvisva in #14832
[None][infra] Waive 10 failed cases for main in pre-merge 42753 by @ZhanruiSunCh in #15275
[TRTLLM-12534][fix] Nemotron Nano - properly account for text prompts in inflight batching with EVS on by @moraxu in #15016
[None][doc] Fix stale --disable_xqa reference in legacy docs by @Erfandarzi in #13395
[TRTLLM-11403][doc] Cache-DiT documentation by @o-stoner in #15268
[#15022][fix] Guided decoding (xgrammar) + EAGLE-3 + draft_len_schedule reaching 0 crashes during CUDA graph capture, "bitmask must have the same batch size as logits" by @chungen04 in #15023
[TRTLLM-12154][test] Add Qwen3-32B FP8 disagg stress test by @brnguyen2 in #14278
[TRTLLM-13141][feat] Add backend-agnostic SourceIdentity gate for weight sharing by @chienchunhung in #14878
[None][feat] Add PyTorch reset_prefix_cache API by @milesial in #14970
[None][fix] Stabilize Mamba replay state update by @sunnyqgg in #14841
[None][infra] Waive remaining AutoDeploy Disagg tests until fix lands by @govind-ramnarayan in #15282
[None][test] Sunset the old disagg test cases for the qa side by @fredricz-20070104 in #15290
[None][infra] Waive 1 failed cases for main in pre-merge 42836 by @ZhanruiSunCh in #15293
[None][fix] Fix max_context_length value for attention workspace sizing by @pengbowang-nv in #15156
[TRTLLM-12038][feat] Add accuracy tests for nemotron-v3-ultra by @Wanli-Jiang in #14808
[#14672][fix] AutoDeploy: Vendor OpenELMConfig locally to fix OpenELM config loading by @plapagesse in #15175
[https://nvbugs/6035425][fix] Fix KV cache host splitting logic by @mikeiovine in #14373
[None][refactor] Move KV cache manager V2 to separate file by @jiaganc in #14680
[TRTLLM-12963][refactor] LTX-2 attention: drop dead k_pe parameter; require cached cross-attn by @luyiyun1021 in #14555
[TRTLLM-10184][chore] Remove legacy XQA precompiled path by @pengbowang-nv in #14941
[TRTLLM-35882][feat] cute dsl gvr-top multi-cta optimization by @limin2021 in #15198
[None][fix] Revert "Add PyTorch reset_prefix_cache API (#14970)" by @xxi-nv in #15306
Revert "[None][test] Add support for nemotron_3_ultra_550b_nvfp4 model in performance tests and configurations" by @tburt-nv in #15310
[https://nvbugs/6309375][test] AutoDeploy: Remove stale fallback test by @govind-ramnarayan in #15316
[None][fix] AutoDeploy: set enable_spec_decode on ADEngine for disagg by @Shixiaowei02 in #15260
[TRTLLM-12498][feat] Add support for beam search in disaggregated serving by @athena-nv in #14876
[None][chore] 2 more WAN multi-gpu tests by @NVShreyas in #15223
[TRTLLM-12721][feat] Add disagg transfer state consensus by @chienchunhung in #15139
[None][infra] Waive 1 failed cases for main in pre-merge 43047 by @ZhanruiSunCh in #15326
[#12715][fix] disable NCCL_SYMMETRIC tactic on GB10 (DGX Spark) by @nv-lschneider in #12902
[None][feat] AutoDeploy: Qwen3.5: Apply whitelist-based sharding and apply lm_head sharding by @taylor-yb-lee in #15185
[https://nvbugs/6293015][fix] Add a delegating `@property def vocab_size_padded(self) -> int: return… by @tensorrt-cicd in #15219
[TRTLLM-12842][feat] Maximal LLMAPI capture in usage telemetry by @venkywonka in #14398
[TRTLLM-12427][perf] Qwen2.5/3/3.5-VL Performance Optimization by @yechank-nvidia in #11943
[TRTLLM-11408][test] Add e2e Tensor Parallel LPIPS tests for VisualGen by @yingguo-trt in #15208
[None][infra] Waive 1 failed cases for main in pre-merge 43173 by @ZhanruiSunCh in #15358
[None][infra] Record CBTS decision to OpenSearch for CI-health monitoring by @crazydemo in #15210
[None][feat] MNNVL Performance Optimization and FP8/NVFP4 Quant Fusion by @timlee0212 in #14476
[None][refactor] Remove TensorRT performance baseline and update to PyTorch only by @yufeiwu-nv in #15256
[None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15315
[https://nvbugs/6029882][fix] Fix attentionOp fp8 mla kvreuse workspace calculation by @pengbowang-nv in #14852
[None][infra] pin pytest and click workaround by @cascade812 in #15357
[None][feat] skip-softmax on SM120: TMA-load + sync-MMA warp-specialized context FMHA for sm_120/sm_121 by @dcampora in #15163
[None][fix] Fix beam search log_probs non-determinism with batch_size > 1 by @achartier in #15125
[Bugfix] Forward secondary_offload_min_priority to KVCacheManager in PyTorch executor by @Saddss in #13768
[None][chore] Bump version to 1.3.0rc19 by @yuanjingx87 in #15188
[TRTLLMINF-103][feat] Keep SLURM timeouts non-retryable by @dpitman-nvda in #15183
[TRTLLM-12982][feat] support multi item scoring in LLM.encode by @ixlmar in #14693
[https://nvbugs/6281014][fix] fix the repeated cute.compile and simplify the test by @JadoTu in #15331
[None][chore] Integration tests for MoE lora & bugfixes by @brb-nv in #15271
[TRTLLM-12339][feat] enable TRTLLM cross attention backend by @cascade812 in #15345
[TRTLLM-12807][test] Guard thop attention kwarg aliases by @yuxianq in #15335
[None][infra] Waive 21 failed cases for main in post-merge 2780 by @ZhanruiSunCh in #15373
[None][fix] pool-qualify KV cache transfer pending keys by @chienchunhung in #15272
[None][refactor] Enhance pytest integration by updating test node generation to support fixture inheritance and dynamic collection by @yufeiwu-nv in #15374
[None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15377
[https://nvbugs/312578][fix] split test_cache_transceiver_single_process by @chuangz0 in #15369
[None][infra] Update the new duration base on opensearch result by @EmmaQiaoCh in #15364
[https://nvbugs/6245861][fix] Gate the two ID None-checks on finish_reason in _GEN_PENDING_FINISH_REASONS… by @tensorrt-cicd in #14908
[https://nvbugs/6223556][fix] Propagate gen-first ctx usage via aux buffer to postproc by @reasonsolo in #15246
[None][test] Fix Mamba hybrid transceiver helper by @chienchunhung in #15323
[None][feat] Qwen3-VL: support per-request mm_processor_kwargs by @aswinvisva in #14702
[TRTLLM-12982][chore] NVTX-annotate logits processor by @ixlmar in #15408
[TRTLLM-12339][feat] Support T5 and BART in the PyTorch backend by @cascade812 in #13919
[TRTLLM-13333][feat] Add prefetch_reuse_blocks and configurable prefetch count by @reasonsolo in #15149
[None][feat] DSv4 prep: attention op plumbing by @lfr-0531 in #15384
[None][test] Waive 8 failed cases for main in post-merge by @tensorrt-cicd in #15389
[#15182][fix] Fix embedding vocab mask for handling rejection sampling in Kimi-K2.5 by @chungen04 in #15233
[None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15320
[None][refactor] Refactor Skip Softmax Attention Interface by @bobboli in #14687
[None][infra] Waive 1 failed cases for main in pre-merge 43656 by @ZhanruiSunCh in #15439
[None][infra] Waive 11 failed cases for main in post-merge 2782 by @ZhanruiSunCh in #15395
[https://nvbugs/6248837][fix] Densify trtllm-gen fmha warmup grid to catch missing kernels by @pengbowang-nv in #15305
[TRTLLM-13378][feat] Drop legacy --extra_visual_gen_options CLI alias by @zhenhuaw-me in #15262
[TRTLLM-12950][feat] Add MegaMoECuteDsl NVFP4 MoE backend by @xxi-nv in #14608
[None][perf] DSv4 prep: attention fusion custom ops by @lfr-0531 in #15390
[TRTLLM-12669][refactor] Eagle3 sampling: auto-detect greedy fast-path, mixed-batch rejection sampling, draft honors target params by @zhaoyangwang-nvidia in #14745
[TRTLLMINF-137][infra] Skip to create perf report when there is not perf test results by @yiqingy0 in #15446
[https://nvbugs/6270671][fix] Replace the hardcoded multiBlock=1 with a call to… by @tensorrt-cicd in #15312
[TRTLLMINF-113][infra] Add timeout protection to Setup/Initialize stages by @ZhanruiSunCh in #14682
[None][infra] Waive 1 failed cases for main in pre-merge 43720 by @ZhanruiSunCh in #15449
[None][infra] Waive 2 failed cases for main in post-merge 2785 by @ZhanruiSunCh in #15450
[None][perf] executor: avoid deepcopy of prompt_token_ids on enqueue by @lancelly in #14895
[None][infra] Waive 1 failed cases for main in pre-merge 43712 by @ZhanruiSunCh in #15447
[None][ci] tighten VisualGen CBTS routing by @zhenhuaw-me in #15259
[None][fix] fix tinygemm barrier bug by @yweng0828 in #15338
[TRTLLM-12199][feat] WideEP FT: add EPGroupHealth thread-safe rank mask (1a.1) by @chienchunhung in #13302
[None][infra] Waive 18 failed cases for main in pre-merge 43878 by @ZhanruiSunCh in #15469
[None][fix] Fix stale sparse attention kwargs by @bobboli in #15460
[None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15411
[TRTLLM-12807][feat] Add multiple FMHA library support to TRTLLM attention backend by @yuxianq in #15204
[None][infra] Waive 1 failed cases for main in pre-merge 43917 by @ZhanruiSunCh in #15478
[None][feat] Side-stream for MM encoder by @2ez4bz in #14322
[None][feat] BREAKING: Add MiniMax-M3 PyTorch backend bring-up with API changes by @WeiHaocheng in #15292
[https://nvbugs/6215678][fix] Point --output-artifact-dir at a unique per-run subdir `{model}-openai-complet by @tensorrt-cicd in #14742
[None][fix] fix CppMambaHybridCacheManager to handle dp dummy request by @bo-nv in #15054
[None][test] Waive 5 failed cases for main in post-merge by @tensorrt-cicd in #15392
[None][test] Waive 9 failed cases for main in post-merge by @tensorrt-cicd in #15391
[None][test] Waive 5 failed cases for main in QA CI by @tensorrt-cicd in #15360
[None][test] Waive 8 failed cases for main in QA CI by @tensorrt-cicd in #15342
[None][feat] Checkpointing variant of replay for MTP for mamba models by @hnover-nv in #14203
[None][test] Waive 23 failed cases for main in QA CI by @tensorrt-cicd in #15337
[None][test] Waive 3 failed cases for main in QA CI by @tensorrt-cicd in #15319

New Contributors

@1MrazorT1 made their first contribution in #12998
@vedularaghu made their first contribution in #12636
@tedzhouhk made their first contribution in #14922
@anurags25 made their first contribution in #15079
@Erfandarzi made their first contribution in #13395
@chungen04 made their first contribution in #15023
@brnguyen2 made their first contribution in #14278
@plapagesse made their first contribution in #15175
@athena-nv made their first contribution in #14876
@Saddss made their first contribution in #13768

Full Changelog: v1.3.0rc18...v1.3.0rc19