Known Issues
- Llama 3.1 8B FP8 can hang during the autotuner warmup on GB200.
Model Support
API
Feature
- Enable TRTLLM MoE backend for Nemotron-H BF16 checkpoint (#14944)
- Add async Ulysses pipeline (enabled for LTX-2 and WAN) (#13978)
- Make `TrtllmGenAttention` the default decode backend on Blackwell+ (#14618)
- Skip redundant data expand in `DeepGemmFusedMoE` via fused expand+quant Triton kernel (#14591)
- Add Prometheus metrics for prompt cache, speculative decoding, perplexity, and batch occupancy (#12636)
- Add Indexer TopK single-block / multi-pass radix implementation (#14268)
- Enable gen-only speculative decoding for disagg setups (#14546)
- Support EAGLE3 dynamic trees on Blackwell (#12958)
- Add CUDA graph support for per-expert LoRA in Cutlass backend (#14881)
- Add support for beam search in disaggregated serving (#14876)
- Add maximal LLMAPI capture in usage telemetry (#14398)
- Optimize Qwen2.5/3/3.5-VL performance (#11943)
- Add skip-softmax TMA-load + sync-MMA warp-specialized context FMHA for sm_120/sm_121 (#15163)
- Enable TRTLLM cross attention backend (#15345)
- Support per-request `mm_processor_kwargs` for Qwen3-VL (#14702)
- Add `prefetch_reuse_blocks` and configurable prefetch count (#15149)
- Add MegaMoECuteDsl NVFP4 MoE backend (#14608)
- Make EAGLE3 honor sampling params by default (#14745)
- Add multiple FMHA library support to TRTLLM attention backend (#15204)
- Add checkpointing variant of replay for MTP for mamba models (#14203)
Fix
- Remove redundant `TikTokenTokenizer` shim from Kimi-K2.5 input processor (#14741)
- Rename misnamed `tunable_fp4_quantize` kwarg and add real SF-swizzle control (#15002)
- Gate FlashInfer GDN kernels to supported configurations (#15094)
- Count DSA indexer K-cache correctly as UINT8 in KV cache size estimate (#15088)
- Select CUTLASS MoE backend on non-Blackwell SMs for Qwen3.5-35B-A3B FP8 (#15081)
- Fix SageAttention kernel regression by using static scheduler (#15047)
- Fall back to local cache when loading tokenizer for gated models (#12998)
- Fix PyExecutor FPM iteration timing (#14922)
- Register multimodal placeholders for Qwen3.5 MoE VLM serving (#15079)
- Fix and unwaive Nemotron-related bugs (#15085)
- Guard DSA DSL atom-split against MTP draft next (#14891)
- Scope disagg-ctx cache-transfer quorum vote to TP instead of WORLD (#15136)
- Clear workspace in `run_mla_generation` to avoid illegal memory access (#15173)
- Fix `MAX_UTILIZATION` reuse token budget (#15066)
- Add `kv_transfer_timeout_ms` to avoid timeout (#15152)
- Preserve ip:port for `trtllm-serve` visual-gen (#14355)
- Fix guided decoding (xgrammar) + EAGLE-3 + `draft_len_schedule` crash during CUDA graph capture (#15023)
- Stabilize Mamba replay state update (#14841)
- Fix `max_context_length` value for attention workspace sizing (#15156)
- Fix issue where host KV cache usage would double when speculative decoding is used (#14373)
- Disable `NCCL_SYMMETRIC` tactic on GB10 (DGX Spark) (#12902)
- Fix `attentionOp` FP8 MLA KV-reuse workspace calculation (#14852)
- Fix beam search `log_probs` non-determinism with `batch_size > 1` (#15125)
- Forward `secondary_offload_min_priority` to `KVCacheManager` in PyTorch executor (#13768)
- Enable multi-block mode for XQA HMMA spec-dec (#15312)
- Fix TinyGEMM barrier bug (#15338)
- Fix stale sparse attention kwargs (#15460)
- Fix `CppMambaHybridCacheManager` to handle dp dummy request (#15054)
- Fix embedding vocab mask for rejection sampling in Kimi-K2.5 (#15233)
- Remove redundant
Documentation
Benchmark
- Weight trtllm-bench AR/AL averages by output length (#14998)
Test & Infra
- Add accuracy tests for nemotron-v3-ultra (#14808)
- Remove `TestLlama4ScoutInstruct` tests (#15144)
- Require minimum of 4 GPUs in `llm_perf_core.yml` and add new performance tests (#15090)
- Add DFlash coverage for Qwen3.5 MoE variant (#15132)
- Add e2e example tests for flux1/2, ltx2, wan_i2v, and cosmos3 (#15126)
- Enable disagg cancellation stress test (#15174)
- Fix periodic-junit in unittest pytest (#14075)
- Update K2.5 and GLM-5 into CI perf test (#14960)
- Add Qwen3-32B FP8 disagg stress test (#14278)
- Sunset old disagg test cases for the QA side (#15290)
- Add e2e Tensor Parallel LPIPS tests for VisualGen (#15208)
- Remove TensorRT performance baseline and update to PyTorch only (#15256)
- Add integration tests for MoE LoRA and bugfixes (#15271)
What's Changed
- [None][infra] Waive TestQwen3NextInstruct nvfp4 cases by @mzweilz in #15086
- [https://nvbugs/6248757][fix] Avoid running all reduce in aux stream by @tensorrt-cicd in #14917
- [https://nvbugs/6221483][fix] AutoDeploy: Fix Eagle metadata host syncs by @govind-ramnarayan in #14714
- [None][feat] add FLUX visual generation examples by @karljang in #14987
- [https://nvbugs/6261164][fix] AutoDeploy: Don't allocate speculative caches when speculation is off by @tensorrt-cicd in #15020
- [https://nvbugs/6211189][fix] Lower the reference to 46.5 (matching cross-GPU empirical mean) and remove the t by @tensorrt-cicd in #14799
- [None][refactor] split VisualGen pipeline and model configs by @bobboli in #14956
- [TRTLLM-11457][feat] Async Ulysses pipeline (Enabled for LTX-2 + WAN) by @luyiyun1021 in #13978
- [TRTLLM-11548][doc] Add Qwen3.5 deployment guide doc by @nv-guomingz in #15111
- [https://nvbugs/6181383][fix] Build inner text/vision/audio sub-configs as empty PretrainedConfig() then setat by @tensorrt-cicd in #14399
- [https://nvbugs/6273850][chore] waive TestQwen3_5_4B::test_bf16 for all GPUs by @tburt-nv in #15112
- [None][doc] Add docs for AutoDeploy transforms by @bmarimuthu-nv in #15122
- [None][infra] Waive 4 failed cases for main in post-merge 2769 by @ZhanruiSunCh in #15140
- [https://nvbugs/6227203][fix] Remove redundant TikTokenTokenizer shim from KimiK25InputProcessor by @tianyuxbear in #14741
- [None][fix] tunable_fp4_quantize: rename misnamed kwarg + add real SF-swizzle control by @luyiyun1021 in #15002
- [None][test] Fix gen_only missing prev_device_step_time race in perf sanity by @tensorrt-cicd in #15108
- [None][test] Fix disagg test result dir by @fredricz-20070104 in #14864
- [TRTLLM-13332][test] Remove TestLlama4ScoutInstruct tests by @QiJune in #15144
- [https://nvbugs/6266705][fix] Gate FlashInfer GDN kernels to supporte… by @nv-guomingz in #15094
- [https://nvbugs/6255037][fix] Count DSA indexer K-cache correctly as UINT8 in KV cache size estimate by @eopXD in #15088
- [https://nvbugs/6194812][test] Update llm_perf_core.yml to require a minimum of 4 GPUs and add new performance tests by @yufeiwu-nv in #15090
- [TRTLLMINF-112][infra] Reduce the waiting time between check node is online or not by @EmmaQiaoCh in #14819
- [None][infra] Waive 1 failed cases for main in pre-merge 41821 by @ZhanruiSunCh in #15135
- [None][infra] CBTS Layer 3: pass test-db via Artifactory instead of env var by @crazydemo in #15142
- [TRTLLM-13264][feat] Add native bias epilogue to NVFP4 GEMM by @luyiyun1021 in #15053
- [https://nvbugs/6278380][unwaive] unwaive ad cases by @crazydemo in #15148
- [https://nvbugs/6244474][fix] AutoDeploy: Remove llama perf test from CI by @MrGeva in #15107
- [https://nvbugs/6212252][fix] Select CUTLASS MoE backend on non-Blackwell SMs in TestQwen3_5_35B_A3B::test_fp8 by @xxi-nv in #15081
- [TRTLLM-13302][feat] Register NVIDIA Wan2.2-T2V quantized checkpoints by @zhenhuaw-me in #15093
- [None][chore] add VisualGen team as the codeowner of the VisualGen Attention by @zhenhuaw-me in #15150
- [None][feat] Default on FlashInferTrtllmGenAttention by @yihwang-nv in #14618
- [None][infra] Test DFW with BSL branch by @yuanjingx87 in #14597
- [TRTLLM-12214][perf] customMoeRoutingKernel: lower BLOCK_SIZE to 128, raise maxNumBlocks by @xwang233 in #14590
- [TRTLLM-12214][perf] DeepGemmFusedMoE: skip redundant data expand via fused expand+quant Triton kernel by @xwang233 in #14591
- [TRTLLM-12648][test] implement disagg cancellation load thread by @chienchunhung in #15124
- [None][fix] Fix regression from SageAttention kernel: Use static scheduler by @xrq-phys in #15047
- [TRTLLM-12467][feat] EPD improvements by @venkywonka in #13864
- [None][feat] Expose stored block-hash chain to KV cache connector by @jthomson04 in #14806
- [#12805][fix] Fall back to local cache when loading tokenizer for gated models by @1MrazorT1 in #12998
- [None][feat] Support partial RoPE fusion for Hopper kernels in XQA for Laguna by @DomBrown in #15110
- [None][infra] Add nv-xtf, rahul-steiger-nv, tedzhouhk, tensorrt-cicd to blossom-ci allowlist by @ZhanruiSunCh in #14955
- [None][feat] Add Prometheus metrics for prompt cache, speculative decoding, perplexity, and batch occupancy by @vedularaghu in #12636
- [None][chore] Unwaive DSV32 helix tests by @brb-nv in #14871
- [None][fix] unset UCX_TLS=tcp by @tburt-nv in #15008
- [None][feat] Port 13 AutoDeploy custom models to sharding IR + opt them in via registry by @greg-kwasniewski1 in #14778
- [None][chore] Make image paths absolute in blog22 by @brb-nv in #15177
- Fix PyExecutor FPM iteration timing by @tedzhouhk in #14922
- [#13816][feat] AutoDeploy: Optimize gpt-oss-120b perf by @taylor-yb-lee in #14202
- [None][fix] Register Multimodal Placeholders for Qwen3.5 MoE VLM Serving by @anurags25 in #15079
- [None][feat] Weight trtllm-bench AR/AL averages by output length by @zhaoyangwang-nvidia in #14998
- [TRTLLM-13052][feat] Enable TRTLLM moe backend for nemotron-h BF16 ckpt by @Wanli-Jiang in #14944
- [None][fix] Fix and unwaive nemotron related bugs by @Wanli-Jiang in #15085
- [https://nvbugs/6140226][test] Add DFlash coverage for Qwen3.5 MoE variant by @yingguo-trt in #15132
- [None][test] temporarily waive Cosmos3 B200 failures by @bobboli in #15195
- [NVBUG-6241842][fix] DSA DSL atom-split: guard against MTP draft next… by @limin2021 in #14891
- [#11423][feat] AutoDeploy: Basic Disagg Support by @govind-ramnarayan in #14057
- [https://nvbugs/6280060][fix] Scope disagg-ctx cache-transfer quorum vote to TP instead of WORLD by @tensorrt-cicd in #15136
- [None][test] Add e2e example tests for flux1/2, ltx2, wan_i2v, and cosmos3 by @chang-l in #15126
- [#12632][feat] Add pipeline cache support for AutoDeploy by @nvchenghaoz in #13729
- [None][test] Add support for nemotron_3_ultra_550b_nvfp4 model in performance tests and configurations by @yufeiwu-nv in #15166
- [None][feat] Indexer TopK: single-block / multi-pass radix by @dcampora in #14268
- [None][fix] Clear workspace in run_mla_generation to avoid potential illegal memory access issue by @yihwang-nv in #15173
- [None][chore] Unwaive AutoDeploy accuracy tests by @bmarimuthu-nv in #14971
- [None][test] Increase kv_transfer_timeout_ms for b200 deepseek-r1 disagg gen_only perf test by @tensorrt-cicd in #15205
- [None][feat] Enable MTP for Step-3.7 NVFP4 and port Step-3.7VL vision tower to TRT-LLM modules by @kaiyux in #14926
- [https://nvbugs/6266370][fix] Fix MAX_UTILIZATION reuse token budget on main by @brb-nv in #15066
- [https://nvbugs/6272573][ci] Unwaive skipped test by @2ez4bz in #15118
- [https://nvbugs/6245279][fix] AutoDeploy: Unwaive accuracy tests by @galagam in #15214
- [TRTLLM-12491][feat] Align VisualGen serve request schema with VisualGenParams by @zhenhuaw-me in #14733
- [None][test] Add MLA chunked-prefill SM dispatch regression coverage by @DhineshPonnarasan in #13904
- [TRTLLM-12648][test] enable disagg cancellation stress test by @chienchunhung in #15174
- [None][feat] Preserve cache_salt string in KV cache events by @jthomson04 in #13051
- [https://nvbugs/6104831][fix] Port dataTransceiver shared_ptr lifetime fix by @chienchunhung in #14979
- [None][fix] Fix AutoDeploy transform docs generation by @bmarimuthu-nv in #15228
- [None][feat] Targeted warmup-waste cleanup by @dominicshanshan in #14609
- [None][fix] Remove TLLM_RUBIN_FEATURES by @yuxianq in #15143
- [https://nvbugs/6108994][fix] add kv_transfer_timeout_ms to avoid timeout by @bo-nv in #15152
- [TRTLLM-12657][infra] Fix periodic-junit in unittest pytest by @yiqingy0 in #14075
- [https://nvbugs/6143883][fix] Preserve ip:port for trtllm-serve visual-gen by @JunyiXu-nv in #14355
- [TRTLLM-12958][feat] Enable gen-only spec dec by @bo-nv in #14546
- [https://nvbugs/6162120][test] Remove 78 closed-bug waive entries for main by @tensorrt-cicd in #15061
- [https://nvbugs/6278399][fix] Add x86_64 path using CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR with… by @tensorrt-cicd in #15129
- [TRTLLM-11538][feat] Blackwell custom mask fmha support by @sunnyqgg in #12958
- [None][infra] Waive 6 failed cases for main in post-merge 2773 by @ZhanruiSunCh in #15250
- [None][feat] Enhance CuteDSL NVF4 MOE by @liyuhannnnn in #15092
- [None][infra] Waive 3 failed cases for main in post-merge 2772 by @ZhanruiSunCh in #15253
- [None][test] Update K2.5 and GLM-5 into CI Perf Test by @chenfeiz0326 in #14960
- [None][feat] enable GQA and cross-attention for attn2d by @NVShreyas in #14961
- [#12230][fix] Add bounds checking in autotuner _find_nearest_profile for SM121 by @mihai-chiorean in #12310
- [None][refactor] visual_gen Attention: drop redundant enable_ulysses kwarg (rebase artifact from #13978) by @luyiyun1021 in #15141
- [None][fix] Generalize FP8 checkpoint loading for Qwen3.5 by @amukkara in #15067
- [#13858][fix] AutoDeploy fix the piecewise vlm issue by @nvchenghaoz in #14006
- [TRTLLM-12507][feat] Cudagraph support for per-expert lora in Cutlass backend - Part 2 by @brb-nv in #14881
- [None][test] Remove stale perf sanity waives by @cascade812 in #15269
- [None][infra] Waive 8 failed cases for main in pre-merge 42699 by @ZhanruiSunCh in #15273
- [None][fix] Install processor-output validation filter at module import by @aswinvisva in #14832
- [None][infra] Waive 10 failed cases for main in pre-merge 42753 by @ZhanruiSunCh in #15275
- [TRTLLM-12534][fix] Nemotron Nano - properly account for text prompts in inflight batching with EVS on by @moraxu in #15016
- [None][doc] Fix stale --disable_xqa reference in legacy docs by @Erfandarzi in #13395
- [TRTLLM-11403][doc] Cache-DiT documentation by @o-stoner in #15268
- [#15022][fix] Guided decoding (xgrammar) + EAGLE-3 + draft_len_schedule reaching 0 crashes during CUDA graph capture, "bitmask must have the same batch size as logits" by @chungen04 in #15023
- [TRTLLM-12154][test] Add Qwen3-32B FP8 disagg stress test by @brnguyen2 in #14278
- [TRTLLM-13141][feat] Add backend-agnostic SourceIdentity gate for weight sharing by @chienchunhung in #14878
- [None][feat] Add PyTorch reset_prefix_cache API by @milesial in #14970
- [None][fix] Stabilize Mamba replay state update by @sunnyqgg in #14841
- [None][infra] Waive remaining AutoDeploy Disagg tests until fix lands by @govind-ramnarayan in #15282
- [None][test] Sunset the old disagg test cases for the qa side by @fredricz-20070104 in #15290
- [None][infra] Waive 1 failed cases for main in pre-merge 42836 by @ZhanruiSunCh in #15293
- [None][fix] Fix max_context_length value for attention workspace sizing by @pengbowang-nv in #15156
- [TRTLLM-12038][feat] Add accuracy tests for nemotron-v3-ultra by @Wanli-Jiang in #14808
- [#14672][fix] AutoDeploy: Vendor OpenELMConfig locally to fix OpenELM config loading by @plapagesse in #15175
- [https://nvbugs/6035425][fix] Fix KV cache host splitting logic by @mikeiovine in #14373
- [None][refactor] Move KV cache manager V2 to separate file by @jiaganc in #14680
- [TRTLLM-12963][refactor] LTX-2 attention: drop dead k_pe parameter; require cached cross-attn by @luyiyun1021 in #14555
- [TRTLLM-10184][chore] Remove legacy XQA precompiled path by @pengbowang-nv in #14941
- [TRTLLM-35882][feat] cute dsl gvr-top multi-cta optimization by @limin2021 in #15198
- [None][fix] Revert "Add PyTorch reset_prefix_cache API (#14970)" by @xxi-nv in #15306
- Revert "[None][test] Add support for nemotron_3_ultra_550b_nvfp4 model in performance tests and configurations" by @tburt-nv in #15310
- [https://nvbugs/6309375][test] AutoDeploy: Remove stale fallback test by @govind-ramnarayan in #15316
- [None][fix] AutoDeploy: set enable_spec_decode on ADEngine for disagg by @Shixiaowei02 in #15260
- [TRTLLM-12498][feat] Add support for beam search in disaggregated serving by @athena-nv in #14876
- [None][chore] 2 more WAN multi-gpu tests by @NVShreyas in #15223
- [TRTLLM-12721][feat] Add disagg transfer state consensus by @chienchunhung in #15139
- [None][infra] Waive 1 failed cases for main in pre-merge 43047 by @ZhanruiSunCh in #15326
- [#12715][fix] disable NCCL_SYMMETRIC tactic on GB10 (DGX Spark) by @nv-lschneider in #12902
- [None][feat] AutoDeploy: Qwen3.5: Apply whitelist based sharding and apply lm_head sharding by @taylor-yb-lee in #15185
- [https://nvbugs/6293015][fix] Add a delegating `@property def vocab_size_padded(self) -> int: return… by @tensorrt-cicd in #15219
- [TRTLLM-12842][feat] Maximal LLMAPI capture in usage telemetry by @venkywonka in #14398
- [TRTLLM-12427][perf] Qwen2.5/3/3.5-VL Performance Optimization by @yechank-nvidia in #11943
- [TRTLLM-11408][test] Add e2e Tensor Parallel LPIPS tests for VisualGen by @yingguo-trt in #15208
- [None][infra] Waive 1 failed cases for main in pre-merge 43173 by @ZhanruiSunCh in #15358
- [None][infra] Record CBTS decision to OpenSearch for CI-health monitoring by @crazydemo in #15210
- [None][feat] MNNVL Performance Optimization and FP8/NVFP4 Quant Fusion by @timlee0212 in #14476
- [None][refactor] Remove TensorRT performance baseline and update to PyTorch only by @yufeiwu-nv in #15256
- [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15315
- [https://nvbugs/6029882][fix] Fix attentionOp fp8 mla kvreuse workspace calculation by @pengbowang-nv in #14852
- [None][infra] pin pytest and click workaround by @cascade812 in #15357
- [None][feat] skip-softmax on SM120: TMA-load + sync-MMA warp-specialized context FMHA for sm_120/sm_121 by @dcampora in #15163
- [None][fix] Fix beam search log_probs non-determinism with batch_size > 1 by @achartier in #15125
- [Bugfix] Forward secondary_offload_min_priority to KVCacheManager in PyTorch executor by @Saddss in #13768
- [None][chore] Bump version to 1.3.0rc19 by @yuanjingx87 in #15188
- [TRTLLMINF-103][feat] Keep SLURM timeouts non-retryable by @dpitman-nvda in #15183
- [TRTLLM-12982][feat] support multi item scoring in LLM.encode by @ixlmar in #14693
- [https://nvbugs/6281014][fix] fix the repeated cute.compile and simplify the test by @JadoTu in #15331
- [None][chore] Integration tests for MoE lora & bugfixes by @brb-nv in #15271
- [TRTLLM-12339][feat] enable TRTLLM cross attention backend by @cascade812 in #15345
- [TRTLLM-12807][test] Guard thop attention kwarg aliases by @yuxianq in #15335
- [None][infra] Waive 21 failed cases for main in post-merge 2780 by @ZhanruiSunCh in #15373
- [None][fix] pool-qualify KV cache transfer pending keys by @chienchunhung in #15272
- [None][refactor] Enhance pytest integration by updating test node generation to support fixture inheritance and dynamic collection by @yufeiwu-nv in #15374
- [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15377
- [https://nvbugs/312578][fix] split test_cache_transceiver_single_process by @chuangz0 in #15369
- [None][infra] Update the new duration base on opensearch result by @EmmaQiaoCh in #15364
- [https://nvbugs/6245861][fix] Gate the two ID None-checks on `finish_reason` in _GEN_PENDING_FINISH_REASONS… by @tensorrt-cicd in #14908
- [https://nvbugs/6223556][fix] Propagate gen-first ctx usage via aux buffer to postproc by @reasonsolo in #15246
- [None][test] Fix Mamba hybrid transceiver helper by @chienchunhung in #15323
- [None][feat] Qwen3-VL: support per-request mm_processor_kwargs by @aswinvisva in #14702
- [TRTLLM-12982][chore] NVTX-annotate logits processor by @ixlmar in #15408
- [TRTLLM-12339][feat] Support T5 and BART in the PyTorch backend by @cascade812 in #13919
- [TRTLLM-13333][feat] Add prefetch_reuse_blocks and configurable prefetch count by @reasonsolo in #15149
- [None][feat] DSv4 prep: attention op plumbing by @lfr-0531 in #15384
- [None][test] Waive 8 failed cases for main in post-merge by @tensorrt-cicd in #15389
- [#15182][fix] Fix embedding vocab mask for handling rejection sampling in Kimi-K2.5 by @chungen04 in #15233
- [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15320
- [None][refactor] Refactor Skip Softmax Attention Interface by @bobboli in #14687
- [None][infra] Waive 1 failed cases for main in pre-merge 43656 by @ZhanruiSunCh in #15439
- [None][infra] Waive 11 failed cases for main in post-merge 2782 by @ZhanruiSunCh in #15395
- [https://nvbugs/6248837][fix] Densify trtllm-gen fmha warmup grid to catch missing kernels by @pengbowang-nv in #15305
- [TRTLLM-13378][feat] Drop legacy --extra_visual_gen_options CLI alias by @zhenhuaw-me in #15262
- [TRTLLM-12950][feat] Add MegaMoECuteDsl NVFP4 MoE backend by @xxi-nv in #14608
- [None][perf] DSv4 prep: attention fusion custom ops by @lfr-0531 in #15390
- [TRTLLM-12669][refactor] Eagle3 sampling: auto-detect greedy fast-path, mixed-batch rejection sampling, draft honors target params by @zhaoyangwang-nvidia in #14745
- [TRTLLMINF-137][infra] Skip to create perf report when there is not perf test results by @yiqingy0 in #15446
- [https://nvbugs/6270671][fix] Replace the hardcoded multiBlock=1 with a call to… by @tensorrt-cicd in #15312
- [TRTLLMINF-113][infra] Add timeout protection to Setup/Initialize stages by @ZhanruiSunCh in #14682
- [None][infra] Waive 1 failed cases for main in pre-merge 43720 by @ZhanruiSunCh in #15449
- [None][infra] Waive 2 failed cases for main in post-merge 2785 by @ZhanruiSunCh in #15450
- [None][perf] executor: avoid deepcopy of prompt_token_ids on enqueue by @lancelly in #14895
- [None][infra] Waive 1 failed cases for main in pre-merge 43712 by @ZhanruiSunCh in #15447
- [None][ci] tighten VisualGen CBTS routing by @zhenhuaw-me in #15259
- [None][fix] fix tinygemm barrier bug by @yweng0828 in #15338
- [TRTLLM-12199][feat] WideEP FT: add EPGroupHealth thread-safe rank mask (1a.1) by @chienchunhung in #13302
- [None][infra] Waive 18 failed cases for main in pre-merge 43878 by @ZhanruiSunCh in #15469
- [None][fix] Fix stale sparse attention kwargs by @bobboli in #15460
- [None][test] Waive 1 failed cases for main in QA CI by @tensorrt-cicd in #15411
- [TRTLLM-12807][feat] Add multiple FMHA library support to TRTLLM attention backend by @yuxianq in #15204
- [None][infra] Waive 1 failed cases for main in pre-merge 43917 by @ZhanruiSunCh in #15478
- [None][feat] Side-stream for MM encoder by @2ez4bz in #14322
- [None][feat] BREAKING: Add MiniMax-M3 PyTorch backend bring-up with API changes by @WeiHaocheng in #15292
- [https://nvbugs/6215678][fix] Point `--output-artifact-dir` at a unique per-run subdir `{model}-openai-complet by @tensorrt-cicd in #14742
- [None][fix] fix CppMambaHybridCacheManager to handle dp dummy request by @bo-nv in #15054
- [None][test] Waive 5 failed cases for main in post-merge by @tensorrt-cicd in #15392
- [None][test] Waive 9 failed cases for main in post-merge by @tensorrt-cicd in #15391
- [None][test] Waive 5 failed cases for main in QA CI by @tensorrt-cicd in #15360
- [None][test] Waive 8 failed cases for main in QA CI by @tensorrt-cicd in #15342
- [None][feat] Checkpointing variant of replay for MTP for mamba models by @hnover-nv in #14203
- [None][test] Waive 23 failed cases for main in QA CI by @tensorrt-cicd in #15337
- [None][test] Waive 3 failed cases for main in QA CI by @tensorrt-cicd in #15319
New Contributors
- @1MrazorT1 made their first contribution in #12998
- @vedularaghu made their first contribution in #12636
- @tedzhouhk made their first contribution in #14922
- @anurags25 made their first contribution in #15079
- @Erfandarzi made their first contribution in #13395
- @chungen04 made their first contribution in #15023
- @brnguyen2 made their first contribution in #14278
- @plapagesse made their first contribution in #15175
- @athena-nv made their first contribution in #14876
- @Saddss made their first contribution in #13768
Full Changelog: v1.3.0rc18...v1.3.0rc19