Highlights
- Model Support
  - Support tensor parallelism of TRTLLM MoE backend for Nemotron-H model (#11470)
  - Add Kimi-K2.5 text model support (NVFP4) (#11777)
  - Add Helix CP support for DSV3.2 (#11507)
  - Support mixed quantization between shared experts and routed experts for DSV3 (#11215)
  - Support Cohere Command A model (#11505)
  - Extract embeddings as `.safetensors` and support float8-quantized models (#11180)
- API
  - Add `--served-model-name` option to the `serve` command (#11711)
  - Add flag to `trtllm serve` to override KV cache dtype (#11487)
  - Use string stop/bad words in gRPC proto instead of pre-tokenized `TokenSequence` (#11888)
  - Support multimodal image input in gRPC server (#11800)
  - Expose `use_python_scheduler` in `SchedulerConfig` and add associated tests (#11884)
  - Add `max_gpu_total_bytes` to control KVCacheManagerV2 capacity (#11907)
- Feature
  - Support PARD (Parallel Draft Model) in one-model speculative decoding (#11438)
  - Enable autotuner for VisualGen and compilation config support (#11660)
  - Add globaltimer-based timing backend for autotuner profiling (#11657)
  - Support heterogeneous `tokens_per_block` (#11751)
  - Refactor KVCacheManagerV2 to simplify new model support (#11749)
  - Support Helix CP with GQA (#11570)
  - Add option to skip KV cache memory estimation (#11714)
  - Implement suffix automaton on device for speculative decoding and one-model support (#11434)
  - Separate radix search tree implementation (#10862)
  - Add support for `expert_number <= 2048` and `K <= 32` (#11510)
  - Add support for bidirectional sliding window attention mask to `fmha_v2` (#11212)
  - Avoid duplicated computation with ADP + Helix CP in GQA (#11891)
  - Add explicit video encode format support (#11830)
  - Refactor video encoding to use ffmpeg CLI or pure Python fallback (#11672)
  - Integrate CuTe DSL top-k kernel for Blackwell (#11900)
  - Integrate suffix automaton with EAGLE3 and PARD (#11878)
  - Add 5D A2A for fused Ulysses (#11787)
  - Add SiLU to `trtllm-gen` MoE (#11663)
  - Optimize by fusing `nvfp4_quant` into `layernorm_gated` for `mamba2_mixer` (#11473)
  - Wire `KVCacheBlock` to `UnifiedBlockTree` using lookup-node pointers (#11919)
  - Run extra general warmup to warm up memory pool (#10340)
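The suffix-automaton drafting feature (#11434, #11878) can be illustrated with a minimal host-side Python sketch. All names below are illustrative, not the TensorRT-LLM device implementation: the automaton indexes the tokens generated so far, the longest suffix that already occurred earlier in the sequence is located, and the tokens that followed that earlier occurrence are proposed as drafts.

```python
class SuffixAutomaton:
    """Classic online suffix-automaton construction over a token sequence."""

    def __init__(self):
        self.next = [{}]        # per-state transitions: token -> state
        self.link = [-1]        # suffix links
        self.length = [0]       # length of the longest string in each state
        self.end_pos = [-1]     # one position where the state's strings end
        self.last = 0

    def _new_state(self, length, pos):
        self.next.append({})
        self.link.append(-1)
        self.length.append(length)
        self.end_pos.append(pos)
        return len(self.next) - 1

    def extend(self, token, pos):
        cur = self._new_state(self.length[self.last] + 1, pos)
        p = self.last
        while p != -1 and token not in self.next[p]:
            self.next[p][token] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][token]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                clone = self._new_state(self.length[p] + 1, self.end_pos[q])
                self.next[clone] = dict(self.next[q])
                self.link[clone] = self.link[q]
                while p != -1 and self.next[p].get(token) == q:
                    self.next[p][token] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        self.last = cur


def propose_drafts(tokens, num_draft=4):
    """Find the longest suffix of `tokens` that occurred earlier and return
    the tokens that followed that earlier occurrence as draft candidates."""
    sam = SuffixAutomaton()
    state = 0
    for i, token in enumerate(tokens):
        # Advance the match against the automaton built from tokens[:i] only,
        # so the matched state always points at an *earlier* occurrence.
        while state != 0 and token not in sam.next[state]:
            state = sam.link[state]
        if token in sam.next[state]:
            state = sam.next[state][token]
        sam.extend(token, i)
    if state == 0:
        return []  # the current suffix never occurred before
    start = sam.end_pos[state] + 1
    return tokens[start:start + num_draft]
```

For example, with history `[1, 2, 3, 9, 1, 2, 3]` the suffix `1, 2, 3` already occurred at the start, so the drafts are the tokens that followed it there (`9`, then `1`). The drafts are then verified in a single target-model forward pass, as in other lookup-based speculative decoding schemes.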
- Fix
  - Add async worker to MTP/EAGLE3 sampler (#11573)
  - Fix disaggregated cancellation (#11730)
  - Use `prefer_pinned()` in `pard.py` (#11762)
  - Release KVCacheManagerV2 memory immediately on shutdown (#11746)
  - Remove duplicated MoE computation with Helix CP+DP (#11167)
  - Register add+norm fallback pass for `torch.compile` in multi-GPU mode (#11739)
  - Propagate logprobs from prefill to decode in disaggregated serving (#11727)
  - Propagate logits from prefill to decode in disaggregated serving (#11767)
  - Enable separate draft KV cache pool for aggregated mode and KVBM (#11689)
  - Fix warnings when building `moe_kernels.cu` (#11703)
  - Fix `available_blocks` typo in scheduler (#11801)
  - Clean up memory in rollout process (#11658)
  - Warm up `maybe_compiled_cat` in `forward_context_with_chunked_prefill` (#11743)
  - Fix DeepEPLowLatency with CuTe DSL MoE backend (#11769)
  - Fix FP8 per-tensor `torch.compile` graph break in dynamic quantization (#11759)
  - Fix streaming generation logits and speed up logits testcase (#10637)
  - Fix overly aggressive capacity scheduler (#11731)
  - Use proper tokens when `exclude_input_in_output` is true (#9453)
  - Move `launch_dependent_grids` after `tmem` free to fix race (#11812)
  - Fix E/PD disaggregated chunked prefill bug (#11805)
  - Fix SM120 issue for `rms_norm` with `nvfp4_quant_fusion` (#11774)
  - Remove dead code (#11813)
  - Fix KVCacheManagerV2 OOM and dummy request allocation in chunked prefill / pipeline parallel (#11710)
  - Fix AttributeError when DSA indexer accesses non-DSA KVCacheManager (#11858)
  - Override `mMaxAttentionWindow` with actual largest window size (#11842)
  - Update `check_is_moe` to support `mlp_layer_types` after `config.json` update (#11477)
  - Fix incorrect GPU timing in time breakdown under overlap scheduler (#11860)
  - Fix OOM hang with `NCCL_SYMMETRIC` fallback during long-context inference (#11870)
  - Fix position IDs input for Qwen3.5 text-only usage (#11877)
  - Disable preload for Llama4 Scout (#11873)
  - Fix formatting issue in `tensorrt_llm/serve/openai_server.py` (#11920)
  - Prevent RuntimeError from dict mutation during iteration in EXAONE MoE weight mapper (#11862)
  - Fix Nemotron MTP crash on SM90 (#11807)
  - Fix Mistral Large3 + EAGLE bug (#11942, #11885)
  - Fix TeaCache broken caching for FLUX.1 and FLUX.2 (#11868)
  - Fix FLUX.1 TeaCache polynomial coefficients and defaults (#12007)
  - Implement workaround for `ClientPayloadError` (#12018)
  - Fix duplicate model entry in model list (#12029)
  - Fix Python string truthiness bug in FMHA cubin selection (#11909)
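The string-truthiness fix (#11909) is an instance of a classic Python pitfall worth a short illustration. The function and file names below are made up, not the actual TensorRT-LLM code: `value or default` substitutes the default for *every* falsy value, including a legitimately empty string, when only `None` should trigger the fallback.

```python
def pick_cubin_buggy(suffix):
    # BUG: `or` treats "" (a valid "no suffix" value) the same as None.
    return f"fmha_kernel{suffix or '_fallback'}.cubin"


def pick_cubin_fixed(suffix):
    # Only substitute the default when no suffix was provided at all.
    if suffix is None:
        suffix = "_fallback"
    return f"fmha_kernel{suffix}.cubin"


pick_cubin_buggy("")   # -> 'fmha_kernel_fallback.cubin'  (wrong file selected)
pick_cubin_fixed("")   # -> 'fmha_kernel.cubin'
```

The same pitfall applies to any optional string, list, or zero-valued numeric argument: prefer an explicit `is None` check over truthiness when the empty value is meaningful.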
- Documentation
  - Fix typos, grammar, and accuracy across documentation (#11766)
  - Add sparse attention tech blog (#11644)
  - Add known issue for disaggregated serving hang with asymmetric PP/TP (#11789)
  - Fix documentation links (#11912)
  - Replace “TensorRT-LLM” with “TensorRT LLM” (#11914)
  - Add CI trigger and test-failure retrieval instructions to `AGENTS.md` (#11803)
- Benchmark
  - Vectorize `quantize_fp8_blockwise` with CUDA kernel (#11724)
  - Use `F.rms_norm` for per-head QK normalization in VisualGen (#11798)
  - Short-sequence MHA optimization for DSA MLA prefill (#11677)
  - Parallel VAE harness and implementation for WAN (#11875)
  - Add Triton FP8 blockwise quant kernel and autotuner bucket-skip for VisualGen (#11854)
  - Optimize `_prepare_inputs` host time (#11704)
  - Improve `are_stop_words` performance (#11196)
  - Add DeepSeek RCCA performance test case (#11736)
  - Add VisualGen benchmarking script (#11651)
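Blockwise FP8 quantization of the kind `quantize_fp8_blockwise` (#11724) vectorizes computes one scale per contiguous block so that each block uses the full FP8 dynamic range. A minimal pure-Python sketch of the arithmetic (the real code is a CUDA kernel; the final cast to `float8_e4m3` is omitted here):

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3


def quantize_fp8_blockwise(values, block_size=128):
    """Return (scales, quantized): one scale per block, chosen so the
    block's largest magnitude maps to the FP8 maximum."""
    scales, quantized = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / FP8_E4M3_MAX if amax > 0.0 else 1.0
        scales.append(scale)
        # A real kernel would round/cast each v / scale to float8 here.
        quantized.append([v / scale for v in block])
    return scales, quantized


def dequantize_fp8_blockwise(scales, quantized):
    """Inverse: multiply each block by its stored scale."""
    out = []
    for scale, block in zip(scales, quantized):
        out.extend(v * scale for v in block)
    return out
```

Per-block (rather than per-tensor) scales limit the blast radius of outliers: one large activation only degrades the precision of its own block, which is why the kernel is worth vectorizing on the hot path.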
- Test & Infra
  - Add tests for all database configs (#11653)
  - Move B200 test stage to AIHub (#11692)
  - Support local wheel installation and add GB300 demo cases (#11742)
  - Remove submodule pulls from TRT-LLM git checkouts (#11693)
  - Add back WAN VBench test in CI (#11804)
  - Add E2E test for cancelled disaggregated generation requests with overlap scheduler (#11795)
  - Pass Nsight options to `ray_executor` and trigger profiling through `collective_rpc` (#11493)
  - Add B200 multi-node tests DB (#11783)
  - Add sanity tests for release 1.2 version (#11738)
  - Add QA test case for `trust-remote-code` on multi-node failure (#11905)
  - Fix `model_name` Starcoder 15B allowed-models issue (#11981)
  - Upgrade `xgrammar` from 0.1.25 to 0.1.32 (#12016)
  - Limit TileIRAS to CUDA 13.1 (#12042)
  - Remove VisualGen benchmark test from YAML (#12027)
What's Changed
- [None][feat] Support tensor parallelism for nemotron-h model by @Wanli-Jiang in #11470
- [None][test] Add tests for all database configs. by @fsaady in #11653
- [https://nvbugs/5911143][fix] add async worker to MTP/Eagle3 sampler,… by @dhansen-nvidia in #11573
- [TRTLLM-10886][feat] Support PARD(Parallel Draft Model) in one-model spec dec by @ziyixiong-nv in #11438
- [None][fix] Fix disagg cancellation by @Tabrizian in #11730
- [None][fix] Use prefer_pinned() in pard.py by @mikeiovine in #11762
- [None][fix] Make KVCacheManagerV2 release mem immediately on shutdown by @lowsfer in #11746
- [TRTLLM-11115][feat] enable autotuner for visual gen + Compilation Config by @NVShreyas in #11660
- [None][chore] Minor fix in w4a8 mxfp4 mxfp8 test. by @Tracin in #11745
- [None][infra] Move B200 test stage to AIHub by @yuanjingx87 in #11692
- [None][infra] Waive failed cases for main on 02/27 by @EmmaQiaoCh in #11770
- [TRTLLM-11064][fix] Remove duplicated MoE Computation with Helix CP+DP by @brb-nv in #11167
- [TRTLLM-10386][fix] torch.compile: register add+norm fallback pass in multi-GPU mode by @luyiyun1021 in #11739
- [None][feat] Support heterogeneous tokens_per_block by @lowsfer in #11751
- [None][chore] Remove closed bugs by @xinhe-nv in #11527
- [None][test] local wheel installation support and add gb300 cases demo by @fredricz-20070104 in #11742
- [None][feat] Refactor cache manager v2 to simplify new model support by @jiaganc in #11749
- [https://nvbugs/5879614][fix] Waive test_guided_decoding_with_eagle3 xgrammar in disaggregated serving by @ziyixiong-nv in #11773
- [https://nvbugs/5911788][test] Waive test_llm_partial_update_weights[Qwen3/Qwen3-8B] by @liji-nv in #11785
- [None][feat] add globaltimer-based timing backend for autotuner profi… by @dhansen-nvidia in #11657
- [https://nvbugs/5926823][fix] Propagate logprobs from prefill to decode in disagg by @brb-nv in #11727
- [TRTLLMINF-9][chore] Remove submodule pulls from TRT-LLM git checkouts by @dpitman-nvda in #11693
- [https://nvbugs/5685010][fix] Delete test_eagle3_output_repetition_4gpus flaky assertions. by @zheyuf in #11725
- [None][fix] enable separate draft KV cache pool for aggregated + KVBM… by @zyang-Modular in #11689
- [TRTLLM-11058][feat] Support Helix CP with GQA by @brb-nv in #11570
- [None][perf] Vectorize quantize_fp8_blockwise with CUDA kernel by @karljang in #11724
- [https://nvbugs/5868616][fix] Fix warnings when building moe_kernels.cu by @yumin066 in #11703
- [None][chore] Add CI trigger and test failure retrieval instructions to AGENTS.md by @lucaslie in #11803
- [None][fix] Fix typo: avaiable_blocks -> available_blocks in scheduler by @kaiyux in #11801
- [TRTLLM-11568][feat] Fix collective calls by @greg-kwasniewski1 in #11632
- [None][perf] Use F.rms_norm for per-head QK normalization in visual gen by @karljang in #11798
- [TRTLLM-11185][test] Add back WAN VBench test in CI by @chang-l in #11804
- [TRTLLM-9782][feat] Support to skip KV cache memory estimation by @HuiGao-NV in #11714
- [None][doc] Fix typos, grammar, and accuracy across documentation by @kaiyux in #11766
- [None][fix] cleanup mem in rollout process by @hchings in #11658
- [None][feat] Add --served-model-name option to serve command by @slin1237 in #11711
- [None][chore] Update AGENTS.md by @lucaslie in #11809
- [None][fix] AutoDeploy: Fix shape handling for singleton prefill by @galagam in #11679
- [None][infra] Waive failed cases for main on 03/01 by @EmmaQiaoCh in #11811
- [None][feat] TRT-LLM Gen MoE finalize kernel optimization by @nekorobov in #11501
- [None][test] Add E2E test for cancelled disagg gen request with overlap scheduler by @Tabrizian in #11795
- [None][chore] pass nsight options to ray_executor and trigger profiling through collective_rpc by @davidmlw in #11493
- [TRTLLM-10962][feat] Refactor video encoding to use ffmpeg CLI or pur… by @JunyiXu-nv in #11672
- [https://nvbugs/5823212][fix] Warmup maybe_compiled_cat in forward_context_with_chunked_prefill by @yuantailing in #11743
- [None][feat] Extract embeding as .savetensors and support float8 quantized model by @nvyocox in #11180
- [https://nvbugs/5885070][fix] fix deepeplowlatency with cutedsl moe backend by @leslie-fang25 in #11769
- [None][fix] Fix FP8 per-tensor torch.compile graph break in dynamic quantization by @karljang in #11759
- [TRTLLM-9687][feat] Improve are_stop_words performance by @stnie in #11196
- [https://nvbugs/5883738][fix] fix bug for illegal memory access on Qwen3-235B-A22B-Thinking-2507-NVFP4 + Eagle3 by @sunnyqgg in #11474
- [#10693][chore] AutoDeploy: Add L1 tests from coverage dashboard by @marinayanov in #11530
- [https://nvbugs/5764627][fix] Fix generation logits with streaming and improve runtime of logits testcase. Also fixes https://nvbugs/5573238 by @stnie in #10637
- [https://nvbugs/5934461][fix] Propagate logits from prefill to decode in disagg by @brb-nv in #11767
- [#11726][feat] AutoDeploy: Fuse gemms of mixed children by @taylor-yb-lee in #11793
- [None][fix] Fix overly aggressive capacity scheduler by @jthomson04 in #11731
- [https://nvbugs/5689262][fix] use proper tokens when exclude_input_in_output is true by @lazykyama in #9453
- [https://nvbugs/5863912][fix] Fix with move launch_dependent_grids after tmem free by @benzh-2025 in #11812
- [https://nvbugs/5938603][fix] Fix E/PD disagg chunked prefill bug by @2ez4bz in #11805
- [None][test] add deepseek RCCA perf test case by @ruodil in #11736
- [None][fix] remove torch compile models arg by @NVShreyas in #11836
- [None][test] add b200 multi nodes tests db by @xinhe-nv in #11783
- [None][fix] Fix SM120 issue for rms_norm with nvfp4_quant_fusion by @Wanli-Jiang in #11774
- [None][infra] Waive failed cases for main for post-merge 2564 by @ZhanruiSunCh in #11848
- [https://nvbugs/5936502][fix] remove dead codes by @bo-nv in #11813
- [None][chore] a GitHub Action to assign the PR to the author by @zhenhuaw-me in #11673
- [None][infra] Fix a typo in waives.txt by @EmmaQiaoCh in #11852
- [None][test] Fix wrong lora config by @yufeiwu-nv in #11818
- [None][test] fix flaky issues by @xinhe-nv in #11814
- [None][fix] Fix OOM issue/dummy request allocation/chunked prefill/pp for KV Cache Manager V2 by @yizhang-nv in #11710
- [None][test] update waive list by @xinhe-nv in #11815
- [TRTLLM-9939][perf] Short-sequence MHA optimization for DSA MLA prefill by @kaiyux in #11677
- [None][refactor] Revisit attention interface for AutoDeploy by @lucaslie in #11796
- [None][feat] Add a flag in trtllm serve to support overriding kv cache dtype by @cjluo-nv in #11487
- [TRTLLMINF-9][chore] Use checkoutFile in mergeWaiveList to avoid full clone by @dpitman-nvda in #11794
- [None][chore] Refresh inferenceX configs in recipes by @venkywonka in #11595
- [TRTLLM-11042][feat] Implement suffix automaton on device for spec and support one model by @cascade812 in #11434
- [https://nvbugs/5941681][fix] Handle dict type for speculative_config by @ziyixiong-nv in #11828
- [None][feat] Add Kimi-K2.5 text model support (NVFP4) by @lancelly in #11777
- [None][chore] Bump version to 1.3.0rc7 by @yuanjingx87 in #11864
- [https://nvbugs/5919026][fix] Fix AttributeError when DSA indexer accesses non-DSA kv_cache_manager by @ziyixiong-nv in #11858
- [TRTLLM-11184][feat] Explicit video encode format support by @JunyiXu-nv in #11830
- [None][test] Enable DeepGemm + DeepEPLowLatency MoE test combination by @Tabrizian in #11876
- [#10009][fix] Fix json_schema response_format to support OpenAI API w… by @JunyiXu-nv in #11497
- [https://nvbugs/5927620][fix] Override mMaxAttentionWindow with the actual largest window size by @ziyixiong-nv in #11842
- [None][feat] Support mix quantization between shared experts and routed experts for dsv3 by @dmtri35 in #11215
- [#11666][fix] Fix inmemory model dir detection by @capyun007 in #11753
- [None][infra] Waive 3 failed cases for main in post-merge 2566 by @ZhanruiSunCh in #11881
- [None][doc] Add sparse attention tech blog by @heyuhhh in #11644
- [TRTLLM-9392][feat] Support MoE output to alltoall's workspace for all the quantization recipe of trtllm-gen. by @bobboli in #11449
- [TRTLLM-10852][feat] Enhance logprobs functionality to always return prompt token logprobs in prompt logprobs by @stnie in #11235
- [None][fix] Fix typos, grammar, and formatting in comments and docstrings by @kaiyux in #11826
- [None][fix] Update check_is_moe into support mlp_layer_types after config.json update by @eagle705 in #11477
- [https://nvbugs/5946303][fix] Fix incorrect GPU timing in time breakdown under overlap scheduler by @luyiyun1021 in #11860
- [None][chore] Update autotuner by @jiahanc in #11859
- [None][chore] Handle failure in auto-assign author workflow by @zhenhuaw-me in #11906
- [https://nvbugs/5930934][fix] Fix OOM hang with NCCL_SYMMETRIC fallback during long-context inference by @peihu-nv in #11870
- [None][fix] Qwen3.5 fix positions ids input for text-only usage by @bmarimuthu-nv in #11877
- [None][fix] Refactor nanoV3+superV3 accuracy tests to load example config by @galagam in #11458
- [None][chore] Deprecate eagle3 2-model by @mikeiovine in #11761
- [#11819][fix] Disable preload for Llama4 scout by @taylor-yb-lee in #11873
- [None][chore] Fix format issue in tensorrt_llm/serve/openai_server.py by @chienchunhung in #11920
- [None][feat] Separate radix search tree implementation by @thorjohnsen in #10862
- [None][feat] Add support for expert_number<=2048 and K<=32 by @ChristinaZ in #11510
- [None][infra] Waive 1 failed cases for main in pre-merge 29212 by @ZhanruiSunCh in #11929
- [None][fix] remove leak check for kimi by @xinhe-nv in #11825
- [https://nvbugs/5907477][chore] unwaive test by @reasonsolo in #11896
- [TRTLLM-10956][infra] Support build-only mode for GenPostMergeBuilds job by @mzweilz in #11895
- [#11755][feat] AutoDeploy onboarding agent + Kimi K2.5 AD modeling code by @bmarimuthu-nv in #11780
- [None][fix] Prevent RuntimeError from dict mutation during iteration in EXAONE MoE weight mapper by @Bias92 in #11862
- [TRTLLM-11101][feat] VisualGen benchmarking script by @zhenhuaw-me in #11651
- [https://nvbugs/5820734][fix] Run extra general warmup to warm up memory pool by @liji-nv in #10340
- [None][fix] Fix nemotron super MTP crash on SM90 by @sunnyqgg in #11807
- [None][chore] Use cluster service discover in disagg CI tests by @ekou24 in #11242
- [None][feat] External Drafter One Model by @IzzyPutterman in #11758
- [None][chore] Update model list by @tcherckez-nvidia in #11827
- [#11578][fix] Use string stop/bad words in gRPC proto instead of pre-tokenized TokenSequence by @CatherineSue in #11888
- [None][feat] Add support for bidirectional sliding window attention mask to fmha_v2 by @djns99 in #11212
- [TRTLLM-11036][feat] Enable new moe test and clean the legacy moe test in the CI by @xxi-nv in #11817
- [None][infra] Waive 4 failed cases for main in post-merge 2571 by @ZhanruiSunCh in #11968
- [None][test] Fix deepseek-r1 OOM issue for H100 perf test by @yufeiwu-nv in #11948
- [None][fix] Remove incorrect Python import style rule from AGENTS.md by @yuxianq in #11940
- [https://nvbugs/5896577][fix] fix bug of mistral large3 with eagle by @byshiue in #11942
- [https://nvbugs/5819048][fix] unwaive test of qwen3-235b eagle3 by @byshiue in #11969
- [None][feat] Avoid duplicated computation with ADP + Helix CP in GQA by @brb-nv in #11891
- [https://nvbugs/5624818][fix] Add unittest for GPT-OSS non-paged_context_fmha by @pengbowang-nv in #11415
- [#10245][feat] AutoDeploy: Support Finegrained FP8 quantization by @bmarimuthu-nv in #10897
- [TRTLLM-11284][infra] Move large models test to post-merge by @EmmaQiaoCh in #11933
- [TRTLLM-11155][infra] Run multi-GPU tests even single-GPU tests are failed when use --disable-fail-fast by @yiqingy0 in #11740
- [None][fix] Refine tests/unittest/_torch/flashinfer/test_trtllm_flashinfer_symbol_collision.py to reduce jit-compile time by @yihwang-nv in #11890
- [#11422][feat] AutoDeploy: Piecewise cudagraph support Prototype by @nvchenghaoz in #11515
- [TRTLLM-11189][fix] VisualGen isolated TeaCache Wan fix by @o-stoner in #11964
- [https://nvbugs/5846166][fix] Update Perf Triage Scripts to Fix gen_only issue by @chenfeiz0326 in #11802
- [TRTLLM-11057][feat] Add Helix CP support for DSV3.2 by @brb-nv in #11507
- [#2912][feat] Support Cohere Command A model by @torotoki in #11505
- [TRTLLM-11259][perf] Parallel VAE harness and implementation for WAN by @NVShreyas in #11875
- [#11578][feat] support multimodal image input in gRPC server by @CatherineSue in #11800
- [TRTLLM-11093][feat] add 5D A2A for fused ulysses by @NVShreyas in #11787
- [TRTLLM-11189][fix] Fix TeaCache broken caching for FLUX.1 and FLUX.2 by @karljang in #11868
- [None][refactor] Request management in ScheduledRequests by @Funatiq in #11784
- [None][perf] Add Triton FP8 blockwise quant kernel and autotuner bucket-skip for visual gen by @chang-l in #11854
- [TRTLLM-11290][feat] Enable trtllm-serve E2E tests by @JunyiXu-nv in #11985
- [None][feat] Optimize by fuse nvfp4_quant to layernorm_gated for mamba2_mixer by @Wanli-Jiang in #11473
- [None][chore] Autodeploy: add models for sprint by @nvchenghaoz in #11999
- [None][infra] Update CI allow list 20260305 by @yuanjingx87 in #11965
- [None][chore] Mass integration of release/1.2 weekly - 6th by @dominicshanshan in #11934
- [None][fix] Fix Collect Perf Sanity Result's import requests Error by @chenfeiz0326 in #12002
- [TRTLLM-10956][infra] Skip updating gitlab status for GenPostMergeBuilds by @mzweilz in #11954
- [None][feat] add ReLU2 NVFP4 fusion for AutoDeploy with tests by @tcherckez-nvidia in #11957
- [TRTLLM-11159][feat] Wire KVCacheBlock to UnifiedBlockTree, replacing mPrevBlock/mNextBlocks with lookup-node pointers. by @SimengLiu-nv in #11919
- [#11166][infra] AutoDeploy: improve test organization in CI and add overview doc by @lucaslie in #11291
- [None][chore] Model update 260308 by @tcherckez-nvidia in #12011
- [None][infra] Update AutoDeploy CODEOWNERS coverage by @lucaslie in #12013
- [https://nvbugs/5732958][bug] Fix TestLlama4MinLatency::test_llama_allclose_to_hf failure by @nvpohanh in #10191
- [None][chore] Unwaive some skip for trtllm moe backend by @leslie-fang25 in #11975
- [TRTLLM-11134][feat] export VisualGen API and update doc by @zhenhuaw-me in #11911
- [https://nvbugs/5823783][test] add qa test case for trust-remote-code on multinode failure by @crazydemo in #11905
- [None][feat] Use max_gpu_total_bytes to control v2's capacity by @jiaganc in #11907
- [TRTLLM-11342][fix] Fix FLUX.1 TeaCache polynomial coefficients and default t… by @karljang in #12007
- [None][fix] Use try/except fallback for Pydantic ValidatorIterator in chat message parsing by @Wanli-Jiang in #11903
- [None][infra] Unwaive 2 cases on rtx-pro-6000d by @EmmaQiaoCh in #12003
- [TRTLLM-11276][chore] Expose use_python_scheduler in SchedulerConfig and add UTs/ITs for python scheduler by @lancelly in #11884
- [None][infra] Waive 7 failed cases for main in post-merge 2576 by @ZhanruiSunCh in #12014
- [https://nvbugs/5948878][fix] Implement workaround for ClientPayloadError by @yingguo-trt in #12018
- [TRTLLM-10407][feat] Integrate CuTE DSL top-k kernel for Blackwell by @limin2021 in #11900
- [TRTLLM-11148][perf] _prepare_inputs host time optimization by @hyukn in #11704
- [None][test] Fix model_name starcoder_15b is not in allowed_models issue by @yufeiwu-nv in #11981
- [None][infra] Waive 5 failed cases for main in post-merge 2578 by @ZhanruiSunCh in #12023
- [None][chore] AutoDeploy: re-enable nvfp4 superv3 accuracy test by @galagam in #11945
- [None][chore] Remove visual_gen benchmark test from YAML by @zhenhuaw-me in #12027
- [None][fix] Fix the model list as it had a dup model by @tcherckez-nvidia in #12029
- [https://nvbugs/5863806][fix] Fix Python string truthiness bug in FMHA cubin selection by @luyiyun1021 in #11909
- [None][feat] Upgrade xgrammar from 0.1.25 to 0.1.32 by @sunnyqgg in #12016
- [https://nvbugs/5924144][test] unwaive cpp/test_unit_tests.py::test_unit_tests[kernels-80] by @Funatiq in #11902
- [None][chore] limit tileiras to CUDA13.1 by @tburt-nv in #12042
- [None][feat] Add silu to trtllm-gen MoE by @IwakuraRein in #11663
- [TRTLLM-11045][feat] Integrate SA with EAGLE3 and PARD by @cascade812 in #11878
- [None][chore] waive test_visual_gen_quickstart by @tburt-nv in #12043
- [None][feat] NIXL support for hybrid model cache transfer by @NVShreyas in #11608
New Contributors
- @zyang-Modular made their first contribution in #11689
- @slin1237 made their first contribution in #11711
- @davidmlw made their first contribution in #11493
- @marinayanov made their first contribution in #11530
- @lazykyama made their first contribution in #9453
- @capyun007 made their first contribution in #11753
- @Bias92 made their first contribution in #11862
- @ekou24 made their first contribution in #11242
- @o-stoner made their first contribution in #11964
- @torotoki made their first contribution in #11505
- @IwakuraRein made their first contribution in #11663
Full Changelog: v1.3.0rc6...v1.3.0rc7