Announcement Highlights
- Model Support
- Support nano_v2_vlm in pytorch backend (#7207)
- Add Tencent HunYuanDenseV1 model support (#7081)
- Add model seed-oss (#7496)
- API
- Support new structural tag API (upgrade XGrammar to 0.1.25) (#7893)
- Enable regex and EBNF grammar in trtllm-serve (#7925); see the sketch after this list
- Optionally disable server GC and worker GC (#7995)
- Add serialization/deserialization options for AutoTuner profiling cache (#7738)
- Cherry-pick from #7598: make low_precision_combine an LLM arg (#7898)
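
The grammar items above land guided decoding constraints in trtllm-serve; the same constraints are exposed through the Python LLM API. A minimal sketch, assuming the `GuidedDecodingParams` surface in `tensorrt_llm.llmapi`; the model name and regex are illustrative, not taken from these notes:

```python
# Minimal sketch: regex-constrained generation through guided decoding.
# Assumes the GuidedDecodingParams surface in tensorrt_llm.llmapi; the model
# and pattern below are illustrative.
from tensorrt_llm.llmapi import LLM, SamplingParams, GuidedDecodingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    guided_decoding_backend="xgrammar",  # XGrammar, upgraded to 0.1.25 in #7893
)

# Force the output into a YYYY-MM-DD shape. An EBNF grammar can be passed the
# same way via GuidedDecodingParams(grammar=...).
params = SamplingParams(
    max_tokens=32,
    guided_decoding=GuidedDecodingParams(regex=r"\d{4}-\d{2}-\d{2}"),
)
print(llm.generate("Output an ISO date:", params).outputs[0].text)
```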
- Benchmark
- Add gpt-oss serve benchmark tests (#7638)
- Exit as early as possible and propagate exit status correctly for multi-node testing (#7739)
- Add gpt oss model for trtllm perf test (#7328)
- Add generation logits case for llama3 (#7759)
- Fix model issue for disagg serving (#7785)
- Add deepseek r1/v3 model with chunked prefill cases (#7124)
- Add accuracy benchmark in stress test (#7561)
- Add NoSmem epilogue schedule and dynamic cluster shape for sm10x group gemm (#7757)
- Rename llm_perf_full to llm_perf_core and add missing cases (#7899)
- Update benchmark script (#7860)
- Add multi-nodes test for disagg-serving (#7470)
- Update llm_models_root to improve path handling on BareMetal environment (#7876)
- Add DS-R1/Qwen3 test cases for RTX 6000 (#7662)
- Add NIM perf test cases (#7924)
- Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices (#7419)
- Improve the failure message for accuracy test suite (#7994)
- Update get_sysinfo.py to avoid UnboundLocalError (#7982)
- Update disagg gen-only benchmark. (#7917)
- Feature
- Phi4-mm image modality inference optimization (#7918)
- Add NVFP4 x FP8 moe kernels (#7821)
- Enable KV cache reuse and chunked prefill for mistral3.1 (#7628)
- Enable two-model spec dec for MTP Eagle (#7001)
- Support EPLB in Qwen3 MoE (#7443)
- Eagle3 cuda graph support for the first draft model inference (#7363)
- Enable run_post_quant_allgather for MoE TRTLLM backend (#6794)
- Enable gpt oss on DGX H100. (#6775)
- Add gpt-oss chunked prefill tests (#7779)
- Eagle, use last hidden post norm (#7546)
- Optimize Qwen2/2.5-VL performance (#7250)
- Support kvcache reuse and chunked prefill for phi4mm (#7723)
- Support attention dp for qwen3 dense model (#7618)
- AutoDeploy: fix memory leak in fuse_moe (#7844)
- Enable overlap scheduler for two-model spec decoding (#7651)
- Add support of CUDA13 and sm103 devices (#7568)
- Add Cute DSL nvfp4 linear op (#7632)
- Enable LM tp for MTP, under attention dp case (cherry-pick #7128) (#7571)
- Add an example of KV cache host offloading (#7767); see the sketch after this list
- Helix: make softmax stats pointer available to attention gen (#6865)
- AutoDeploy: graph-less transformers mode for HF (#7635)
- Cherry-pick DeepGEMM related commits from release/1.1.0rc2 (#7716)
- Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm (#7764)
- FP8 Context MLA integration (Cherry-pick #6059 from release/1.1.0rc2) (#7610)
- Update CUTLASS to 4.2 and enable SM103 group gemm (#7832)
- Cherry-pick fix to reuse pytorch memory segments occupied by cudagraph (#7747)
- Helix: add custom position ids to MLA kernels (#6904)
- Support for partial sharding from factory (#7393)
- KV cache transmission in disagg with CP on gen side (#7624)
- Cherry-pick from #7423: support fp8 block wide EP (#7712)
- E-PD Disagg Support via llmapi (3/N) (#7577)
- Add batch waiting when scheduling (#7416)
- Use list instead of torch tensor for new tokens in update requests (#7730)
- Support multi-threaded tokenizers for trtllm-serve (cherry-pick) (#7776)
- Support JIT mha.cu for SPEC_DEC in runtime (#6078)
- Batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294)
- Enable `prompt_logprobs` in pytorch backend (#7580)
- Support SWA KV cache reuse (#6768)
- Return topk logprobs in torch backend (#7756)
- CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) (#7888)
- Revert " Return topk logprobs in torch backend (#7756)" (#7969)
- DeepEP LL fp8 dispatch/combine (#7927)
- Helix: add alltoall op (#6815)
- Optimize kv cache transfer TEP (#7613)
- Add environment variable to adjust block pool allocation ratio under kv cache manager (#7923)
- Add a standalone buffer cache class and reuse buffers between cudagraph and no-graph flow (#7669)
- Add static tree sampling and verification (#7161)
- Add support for KVCache transfer from KVCache reuse path (#6348)
- Added AutoDeploy backend support to test_perf.py (#7588)
- Speed up concat k and copy k_nope in context phase using torch.compile (#8044)
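
The KV cache host offloading item above (#7767) pairs with the LLM API's cache configuration. A minimal sketch, assuming the `KvCacheConfig` fields shown; the host buffer size and model are illustrative:

```python
# Minimal sketch of KV cache host offloading: reserve pinned host memory so
# evicted KV blocks can be kept on CPU and restored on reuse. Assumes the
# KvCacheConfig surface in tensorrt_llm.llmapi; size and model are illustrative.
from tensorrt_llm.llmapi import LLM, KvCacheConfig, SamplingParams

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,      # keep finished blocks eligible for reuse
    host_cache_size=4 * 1024**3,  # 4 GiB host buffer for offloaded blocks (assumed size)
)
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    kv_cache_config=kv_cache_config,
)

# A later request sharing a prefix can reuse cached blocks, including ones
# brought back from host memory.
print(llm.generate("The capital of France is", SamplingParams(max_tokens=8)).outputs[0].text)
```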
- Documentation
- Fix the link in the doc (#7713)
- Clean the doc folder and move the outdated docs into lega… (#7729)
- Add doc for KV cache salting support (#7772)
- Fix section header of llm_kv_cache_offloading example (#7795)
- Update Documentation link to point to docs instead of docs source code (#6495)
- Cherry-pick deployment guide update from 1.1.0rc2 branch to main branch (#7774)
- Tech blog: Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly (#7864)
- Update tech blog12 (#7884)
- Add known issues to llmapi doc (#7560)
- Add blackwell information into support matrix (#6740)
- Fix an invalid link and a typo (#7634)
- Use hash id for external link (#7641)
- Add labels description note into llm api section (#7696)
- Enhance api reference doc by labeling stable APIs (#7751)
- Add 1.0 release notes (#7605)
- Scaffolding tech blog part one (#7835)
- Update docker cmd in quick start guide and trtllm-serve … (#7787)
- Replace `main` in the examples' links with a commit id (#7837)
- Rename TensorRT-LLM to TensorRT LLM for homepage and the … (#7850)
- Add a guide for modifying APIs (#7866)
- Update Perf-Overview.md for release/1.0 (#7848)
- Add stable label to all the un-labelled arguments in LLM class (#7863)
- Fix invalid links in perf benchmarking. (#7933)
- Add Llama PP known issue to release note (#7959)
- Add acknowledgements in scaffolding tech blog (#7983)
- Add scaffolding tech blog to cover (#8021)
- Refine perf overview.md and correct the error link in per… (#8035)
- Scaffolding tech blog fix a typo (#8042)
- Document hang issue caused by `UnpicklingError` (#8049)
What's Changed
- [None][feat] Eagle, use last hidden post norm by @IzzyPutterman in #7546
- [None][infra] AutoDeploy: codeowners for autodeploy unit tests by @lucaslie in #7743
- [TRTLLM-6668][feat] Enable overlap scheduler for two-model spec decoding by @ziyixiong-nv in #7651
- [None][ci] move qwen3 tests from GB200 to B200 by @QiJune in #7733
- [None][feat] support attention dp for qwen3 dense model by @Nekofish-L in #7618
- [None][doc] Fix the link in the doc by @Shixiaowei02 in #7713
- [TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices by @VALLIS-NERIA in #7568
- [TRTLLM-6295][test] Exit as early as possible and propagate exit status correctly for multi-node testing by @chzblych in #7739
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7735
- [None][fix] Ensure that the W4A8 custom input scale remains aligned across all ranks by @yilin-void in #7614
- [None][chore] Fix error when running trtllm-bench without cuda graph. by @bobboli in #7725
- [None][doc] Clean the doc folder and move the outdated docs into lega… by @nv-guomingz in #7729
- [TRTLLM-6898][feat] Add Cute DSL nvfp4 linear op by @limin2021 in #7632
- [None] [chore] cherry pick changes on slurm scripts from `release/1.1.0rc2` by @kaiyux in #7750
- [https://nvbugs/5503529][fix] Change test_llmapi_example_multilora to get adapters path from cmd line to avoid downloading from HF by @amitz-nv in #7740
- [TRTLLM-7070][feat] add gpt-oss serve benchmark tests by @xinhe-nv in #7638
- [None][fix] waive hang tests on main by @xinhe-nv in #7720
- [https://nvbugs/5471106][fix] Remove the waivers by @ziyixiong-nv in #7711
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7746
- Revert "[None][feat] support attention dp for qwen3 dense model" by @byshiue in #7765
- [TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver by @Tabrizian in #7659
- [None][chore] AutoDeploy: neat disablement of transforms in pipeline by @lucaslie in #7736
- [None][chore] Remove unused get_quant_scales methods by @achartier in #7687
- [None][infra] add nspect allow list for false positive secrets by @yuanjingx87 in #5797
- [TRTLLM-7398][doc] Add doc for KV cache salting support by @chang-l in #7772
- [None][infra] Update CI allowlist 2025-09-16 by @yuanjingx87 in #7773
- [None][infra] Add nightly pipeline to generate lock files by @yuanjingx87 in #5798
- [https://nvbugs/5516666][fix] cherrypick fix to the CUDA graph warmup issue when using speculative decoding by @HuiGao-NV in #7737
- [None][waive] Waive tests by @Tabrizian in #7775
- [https://nvbugs/5489015][fix] Support communicator split in MNNVL allreduce and fix the binding issues. by @timlee0212 in #7387
- [https://nvbugs/5488582][fix] Cherry-pick 7495: Avoid unexpected Triton recompilation in DG fused_moe by @hyukn in #7708
- [TRTLLM-6741] [feat] enable LM tp for MTP, under attention dp case (cherry-pick #7128) by @kaiyux in #7571
- [None][chore] AutoDeploy: clean up of model unit test configuration by @lucaslie in #7742
- [None][ci] waive test_llm_gemma_1gpu_summary_vswa by @QiJune in #7781
- [https://nvbugs/5517260][fix] move scaffolding contrib module's import to subdirectory by @dc3671 in #7758
- [None][feat] add an example of KV cache host offloading by @QiJune in #7767
- [https://nvbugs/5485325][fix] Cherry-pick #7373: fix the CUDA graph warmup issue when using speculative decoding by @lfr-0531 in #7734
- [None][ci] waive test_llama_eagle3[True-FLASHINFER-False-False-False-False-True] by @QiJune in #7788
- [None][chore] Remove closed bugs by @xinhe-nv in #7697
- [None][test] add gpt oss model for trtllm perf test by @ruodil in #7328
- [TRTLLM-7250][fix] waive block tests by @xinhe-nv in #7782
- [None][doc] fix section header of llm_kv_cache_offloading example by @QiJune in #7795
- [TRTLLM-7410][feat] Enable KV cache reuse and chunked prefill for mistral3.1 by @2ez4bz in #7628
- [None][infra] Waive failed tests on main 09/17 by @EmmaQiaoCh in #7812
- [None][doc] Update Documentation link to point to docs instead of docs source code by @asrivas in #6495
- [TRTLLM-5966][feat] Helix: make softmax stats pointer available to attention gen by @MatthiasKohl in #6865
- [https://nvbugs/5516661][fix] Drop waive case 5516661 by @yunruis in #7791
- [https://nvbugs/5508536][fix] Revert #7041: Move stop_criteria to sample_async (#7041) by @netanel-haber in #7796
- [#7308] [feat] AutoDeploy: graph-less transformers mode for HF by @lucaslie in #7635
- [None][ci] restore unwaive list by @Superjomn in #7802
- [None][fix] Make tile_tokens_dim calculation just in time before kernel launching. by @hyukn in #7529
- [None][chore] Version bump for 1.1.0rc6 by @chzblych in #7824
- [https://nvbugs/5519544][fix] fix invalid expression for disabling pa… by @nv-guomingz in #7806
- [TRTLLM-8070][test] add generation logits case for llama3 by @crazydemo in #7759
- [https://nvbugs/5523080][fix] Correct the batch index in device tensors by @ziyixiong-nv in #7803
- [None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2 by @Barry-Delaney in #7716
- [None][fix] Fix CI issue for dsl pkg install by @limin2021 in #7784
- [https://nvbugs/5508890][fix] gen. result cleanup when using PostprocWorker by @ixlmar in #7771
- [None][infra] update ci allow list 2025/09/17 by @yuanjingx87 in #7816
- [None][chore] Remove executor config in create_py_executor by @leslie-fang25 in #7599
- [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7801
- [https://nvbugs/5519530][fix] Fix gptoss 2-gpu test by @dongfengy in #7819
- [TRTLLM-6577][feat] Support nano_v2_vlm in pytorch backend by @Wanli-Jiang in #7207
- [None][fix] Add TP information in weight scale loading in WeightOnlyQuantLinearMethod by @stnie in #7732
- [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7807
- [TRTLLM-7918][feat] Support kvcache reuse and chunk prefill for phi4mm by @Wanli-Jiang in #7723
- [https://nvbugs/5519462][fix] skip deepseek test on preHopper by @xinhe-nv in #7817
- [None][chore] remove generated fmha_cubin.h from source tree by @QiJune in #7836
- [None][fix] Revert "Revert "[None][feat] support attention dp for qwen3 dense model"" by @byshiue in #7780
- [TRTLLM-6898][feat] Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm by @limin2021 in #7764
- [None][doc] Cherry-pick deployment guide update from 1.1.0rc2 branch to main branch by @dongfengy in #7774
- [TRTLLM-6746][feat] Enable two-model spec dec for MTP Eagle by @sunnygqq in #7001
- [None][ci] set TORCHINDUCTOR_COMPILE_THREADS correctly by @QiJune in #7800
- [https://nvbugs/5522851][fix] Correct the logic to update kv_lens_cuda by @ziyixiong-nv in #7790
- [TRTLLM-6994][feat] FP8 Context MLA integration (Cherry-pick #6059 from release/1.1.0rc2) by @yuxianq in #7610
- [TRTLLM-6286] [feat] Update CUTLASS to 4.2 and enable SM103 group gemm by @VALLIS-NERIA in #7832
- [None][fix] get Local IP by connect remote by @chuangz0 in #7719
- [TRTLLM-7183][test] Feature fix model issue for disagg serving by @fredricz-20070104 in #7785
- [https://nvbugs/5481434][feat] cherry-pick fix to reuse pytorch memory segments occupied by cudagraph by @HuiGao-NV in #7747
- [None][test] add deepseek r1/v3 model with chunked prefill cases by @ruodil in #7124
- [None][chore] polish error message in cute_dsl_utils.py by @QiJune in #7852
- [None][fix] fix load_model_on_cpu on qwen/convert_checkpoint.py by @lkm2835 in #2382
- [None][infra] Waive failed tests in post-merge by @EmmaQiaoCh in #7859
- [None][ci] Waive llama3 auto dtype test bug in https://nvbugs/5527956. by @dominicshanshan in #7853
- [None][test] Add accuracy benchmark in stress test by @crazydemo in #7561
- [None][chore] remove cli cases for rtx6k by @crazydemo in #7833
- [None][feat] Support EPLB in Qwen3 MoE by @lucifer1004 in #7443
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7841
- [https://nvbugs/5503440][fix] Fix potential hang due to wrong type of ZMQ socket and protocol for worker_init_status_queue by @lancelly in #7646
- [None][doc] Tech blog: Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly by @syuoni in #7864
- [https://nvbugs/5522332][fix] Pin numpy version for Gemma. (cherry-pick #7783) by @yuxianq in #7797
- [TRTLLM-5966][feat] Helix: add custom position ids to MLA kernels by @MatthiasKohl in #6904
- [https://nvbugs/5471108][chore] Unwaiving disagg acc test by @pcastonguay in #7686
- [https://nvbugs/5522462][fix] Fix FP8 scout illegal memory access by @mikeiovine in #7845
- [#7704][chore] Enable MathJax to fix formulas in documentation by @karljang in #7744
- [TRTLLM-6342][feat] Support for partial sharding from factory by @greg-kwasniewski1 in #7393
- [https://nvbugs/5520490][fix] Fix intermittent test failures by avoiding external web data pulls by @chang-l in #7879
- [None][doc] Update tech blog12 by @syuoni in #7884
- [TRTLLM-7731][feat] KV cache transmission in disagg with CP on gen side by @brb-nv in #7624
- [TRTLLM-8188][chore] refactor GenerationExecutorWorker with WorkerBase for better code reusing by @Superjomn in #7840
- [https://nvbugs/5517404][fix] Use the correct cuda graph for dynamic spec dec by @ziyixiong-nv in #7728
- [TRTLLM-6286] [perf] Add NoSmem epilogue schedule and dynamic cluster shape for sm10x group gemm by @VALLIS-NERIA in #7757
- [TRTLLM-7008][fix] cherrypick to main Add automatic shared memory delete if already exist by @dongxuy04 in #7727
- [None][fix] Disable torch.compile for CapturableGuidedDecoder by @syuoni in #7871
- [None][fix] cherrypick to main: Fix possible mpi broadcast and gather issue on large object by @dongxuy04 in #7854
- [https://nvbugs/5512556][unwaive] Unwaive DeepSeek PP tests by @peaceh-nv in #7828
- [https://nvbugs/5513423][fix] Correctly respect min_tokens in PyTorch Workflow by @stnie in #7808
- [None][fix] Fix DeepGEMM commit by @Barry-Delaney in #7875
- [None][chore] Mass integration of release/1.0 - 5th by @dominicshanshan in #7640
- [TRTLLM-7070][feat] add gpt-oss chunked prefill tests by @xinhe-nv in #7779
- [None][infra] Waive a failed case on main by @EmmaQiaoCh in #7901
- [TRTLLM-7989][infra] Bundle UCX and NIXL libs in the TRTLLM python package by @bo-nv in #7766
- [https://nvbugs/5525849][fix] Cherry-pick to fix mismatch of max seq len between kv cache manager and dummy requests by @HuiGao-NV in #7855
- [TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance by @yechank-nvidia in #7250
- [None][infra] Skip failed test for nvbugs 5532023 by @EmmaQiaoCh in #7905
- [https://nvbugs/5351244][fix] CHERRY-PICK test_mpi_session (#7501) by @Superjomn in #7900
- [None][chore] Upgrade transformers to 4.56.0 by @Wanli-Jiang in #7523
- [https://nvbugs/5477359][fix] Removing test waivers by @Linda-Stadter in #7877
- [https://nvbugs/5516665][fix] Fix CUTLASS moe fake impl errors by @liji-nv in #7714
- [None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend by @ChristinaZ in #6794
- [https://nvbugs/5504086][fix] Fix MTP vanilla by @syuoni in #7904
- [TRTLLM-7831][feat] Cherry-pick from #7423 Support fp8 block wide ep cherry pick by @xxi-nv in #7712
- [TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) by @syuoni in #7893
- [https://nvbugs/5522847][fix] Disable GC on disagg server and client by @yuantailing in #7858
- [None][feat] Add Tencent HunYuanDenseV1 model support by @sorenwu in #7081
- [TRTLLM-7328][feat] E-PD Disagg Support via llmapi (3/N) by @chang-l in #7577
- [None][opt] Add batch waiting when scheduling by @yunruis in #7416
- [https://nvbugs/5355128][fix] Add missing wgmma intrinsic for starcoder by @pengbowang-nv in #7643
- [None][fix] Read eos_token_id from generation_config for kimi_k2 by @pengbowang-nv in #7120
- [None][fix] Fix and add test for TRTLLM MoE backend by @pengbowang-nv in #7755
- [None][test] rename llm_perf_full to llm_perf_core and add missing cases by @ruodil in #7899
- [None][fix] CHERRY-PICK trtllm-serve yaml loading (#7551) by @Superjomn in #7897
- [https://nvbugs/5367180][fix] Fix xgrammar import before loading tensorrt_llm binary by @syuoni in #7906
- [None][fix] fix a bug with trtllm-gen kernels + attention sinks by @PerkzZheng in #7919
- [https://nvbugs/5532023][fix] executor with-statement bug by @Superjomn in #7895
- [None][fix] Re-add the import for allgather that was mistakenly removed. by @ChristinaZ in #7920
- [None][chore] Update benchmark script by @zerollzeng in #7860
- [None][fix] Assign [] to req.py_draft_tokens instead of None when spec decode is off by @zheyuf in #7511
- [None][test] Waive another intermittent OOM test by @chzblych in #7930
- [None][feat] Use list instead of torch tensor for new tokens in update requests by @dcampora in #7730
- [None][feat] Enable gpt oss on DGX H100. by @Tracin in #6775
- [TRTLLM-7292][feat] Support multi-threaded tokenizers for trtllm-serve (cherry-pick) by @nv-yilinf in #7776
- [TRTLLM-6549][fix] add kv cache time output back by @zhengd-nv in #7798
- [None][feat] support JIT mha.cu for SPEC_DEC in runtime by @jhaotingc in #6078
- [TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) by @ixlmar in #7294
- [TRTLLM-7182][test] add multi-nodes test for disagg-serving by @reasonsolo in #7470
- [TRTLLM-7015] [feat] Enable `prompt_logprobs` in pytorch backend by @venkywonka in #7580
- [https://nvbugs/5528405][fix] Set up draft_tokens before scheduling by @ziyixiong-nv in #7903
- [https://nvbugs/5477404][chore] unwaive test_disaggregated_single_gpu.py::test_disaggregated_llama_context_capacity by @reasonsolo in #7857
- [None][fix] refine `backend` option handling for commands by @tongyuantongyu in #7829
- [#7692][fix] recognize RequestError as per-request error in background handler by @tongyuantongyu in #7726
- [None][chore] Make sampler type beta. by @dcampora in #7934
- [TRTLLM-6341][feature] Support SWA KV cache by @eopXD in #6768
- [https://nvbugs/5532225] [fix] MoE use stream-dependent workspace by @VALLIS-NERIA in #7940
- [None][infra] Skip failed test for nvbugs 5537738 by @pengbowang-nv in #7946
- [None][chore] remove cubins for ci cases by @qsang-nv in #7902
- [None][chore] update chunked prefill cases by @xinhe-nv in #7921
- [None][feat] Return topk logprobs in torch backend by @dcaox in #7756
- [None][ci] optimize test cases of dgx b200 by @QiJune in #7948
- [None][chore] Recover cutlass-dsl pkg install and dsl op testing. by @limin2021 in #7945
- [https://nvbugs/5521799][fix] Trim incorrectly generated harmony messages by @JunyiXu-nv in #7849
- [https://nvbugs/5532248][fix] Fix fused_moe OOM by @HuiGao-NV in #7931
- [None][test] Update llm_models_root to improve path handling on BareMetal environment by @yufeiwu-nv in #7876
- [None][ci] remove duplicate test cases by @QiJune in #7956
- [None][chore] add test_w4_1gpu[True-True-cutlass-fp8] & TestKimiK2::test_fp8_blocks… by @xinhe-nv in #7944
- [TRTLLM-5235][feat] Enable regex and EBNF grammar in trtllm-serve by @syuoni in #7925
- [None][feat] add model seed-oss by @Nekofish-L in #7496
- [None][ci] Waive some intermittent failures by @HuiGao-NV in #7955
- [None][fix] trtllm-gen cubins compiled with wrong arch. by @PerkzZheng in #7953
- [None][chore] cleanup build script by @tongyuantongyu in #7865
- [#7675][feat] CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) by @MrGeva in #7888
- [None][fix] fix get_iteration_stats IndexError by @macrocell in #7216
- [None][fix] Fix dummy load format for DeepSeek. by @yuxianq in #7874
- [TRTLLM-7399][test] Add DS-R1/Qwen3 test cases for RTX 6000 by @pamelap-nvidia in #7662
- [https://nvbugs/5473781][fix] Fix llama 4 FP8 for PP>1 by @mikeiovine in #7220
- [None][bug] Fix transformers version for Triton backend by @Tabrizian in #7964
- [OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels by @sychen52 in #7821
- [None][fix] Revert "[None][feat] Return topk logprobs in torch backend (#7756)" by @Tabrizian in #7969
- [None][chore] Validate features combination by @leslie-fang25 in #7630
- [https://nvbugs/5456485][bug] unwaive triton test by @Tabrizian in #7966
- [None][feat] DeepEP LL fp8 dispatch/combine by @yilin-void in #7927
- [None][chore] Update trtllm-bench documentation on setting FP8 KV cache by @achartier in #7885
- [None][chore] Update the cuda and tensorrt version in homepage icons. by @nv-guomingz in #7963
- [TRTLLM-6541][test] Add NIM perf test cases by @fredricz-20070104 in #7924
- [None][doc] scaffolding tech blog part one by @WeiHaocheng in #7835
- [TRTLLM-7758][feat] Optimize phi4-mm image modality inference by @Wanli-Jiang in #7918
- [None][infra] Unwaive some tests since dev already have a PR to collect more info by @EmmaQiaoCh in #7984
- [None][perf] Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices by @jinyangyuan-nvidia in #7419
- [https://nvbugs/5536141][fix] fix_disagg_single_gpu_test by @chuangz0 in #7990
- [https://nvbugs/4955671][fix] update test list by @xinhe-nv in #7980
- [None][chore] Mass integration of release/1.0 - 6th by @dominicshanshan in #7928
- [None][chore] Remove developer name in comment by @eopXD in #7981
- [None][chore] relax version constraints on fastapi by @PeganovAnton in #7935
- [TRTLLM-5966][feat] Helix: add alltoall op by @MatthiasKohl in #6815
- [None][fix] fix a bug in wideEp use DeepEP with num_chunks > 1 by @xxi-nv in #7954
- [None][doc] Add acknowledgements in scaffolding tech blog by @WeiHaocheng in #7983
- [None][infra] Waive failed tests on main 09/25 by @EmmaQiaoCh in #8001
- [None][chore] extract weights loading related logic to model loader by @QiJune in #7579
- [https://nvbugs/5525951][fix] Clarify that PP is not supported for GPTOSS by @dongfengy in #7911
- [None][chore] Some clean-ups for CUDA 13.0 dependencies by @chzblych in #7979
- [TRTLLM-7999][infra] Add B300/GB300 single gpu test by @yiqingy0 in #7951
- [None][infra] Improve the failure message for accuracy test suite by @syuoni in #7994
- [#6102][fix] support non-system python installation by @tongyuantongyu in #7763
- [None][ci] Waive test_mm_encoder_standalone.py::test_multi_request_batch_chat[llava-v1.6-mistral-7b-hf] by @QiJune in #8010
- [None][feat] Optimize kv cache transfer TEP by @chuangz0 in #7613
- [TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference by @sunnygqq in #7363
- [https://nvbugs/5527956][fix] AutoDeploy: fix IMA due to outdated metadata by @lucaslie in #8002
- [https://nvbugs/5451740][fix] Add DP padding back on SM120 by @peaceh-nv in #7965
- [None][chore] Report NCCL error message but not OOM when NCCL error happens by @HuiGao-NV in #8009
- [None][feature] Add environment variable to adjust block pool allocation ratio under kv cache manager by @eopXD in #7923
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7986
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #8004
- [None][doc] Add scaffolding tech blog to cover by @WeiHaocheng in #8021
- [None][chore] Require NVIDIA developers to use their full name or NVIDIA account in GitHub profiles by @MartinMarciniszyn in #8022
- [https://nvbugs/5495789][feat] Optionally disable server GC and worker GC by @yuantailing in #7995
- [None][feat] Add a standalone buffer cache class and reuse buffers between cudagraph and no-graph flow by @HuiGao-NV in #7669
- [TRTLLM-6393][feat] add static tree sampling and verification by @yweng0828 in #7161
- [None][infra] Waive failed cases in post-merge 2305 by @EmmaQiaoCh in #8019
- [TRTLLM-8271][fix] Fix CDL overlap scheduling performance by @mikeiovine in #7971
- [https://nvbugs/5518713][fix] Trtllm-gen moe backend for blockwise fp8 ckpt (Qwen3-235B-A22B-FP8) by @jhaotingc in #7856
- [#5860][autodeploy] GPT-OSS MXFP4 support by @Fridah-nv in #7451
- [TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path by @Tabrizian in #6348
- [None] [feat] Update disagg gen-only benchmark. by @qiaoxj07 in #7917
- [https://nvbugs/5461712] [fix] Use DG for Qwen3 Linear layers by @achartier in #8030
- [https://nvbugs/5537738][fix] Add fp8 post-quant allgather support by @ChristinaZ in #8008
- [None][doc] Refine perf overview.md and correct the error link in per… by @nv-guomingz in #8035
- [None][infra] Skip failed test for main branch on 9/28 by @EmmaQiaoCh in #8040
- [None][chore] Disable concurrent weights loading for _load_weights_im… by @nv-guomingz in #8034
- [None][doc] Scaffolding tech blog fix a typo by @WeiHaocheng in #8042
- [TRTLLM-4500][feat] Add serialization/deserialization options for AutoTuner profiling cache by @hyukn in #7738
- [None][chore] Cherry-pick from (#7598) Make low_precision_combine as a llm arg by @zongfeijing in #7898
- [None][chore] Update chunked prefill test case configs by @crazydemo in #7868
- [None][chore] Update cron schedule for closing inactive issues by @zhenhuaw-me in #8048
- [None] [doc] Document hang issue caused by `UnpicklingError` by @kaiyux in #8049
- [#7288][feat] Added AutoDeploy backend support to test_perf.py by @MrGeva in #7588
- [None][chore] update test case constraint by @crazydemo in #8020
- [TRTLLM-8348][feat] Speed up concat k and copy k_nope in context phase using torch.compile by @yuantailing in #8044
- [https://nvbugs/5532087][ci] Enable test case by @HuiGao-NV in #8029
- [None][ci] Disable tensorRT cases in post-merge by @HuiGao-NV in #8028
- [None][fix] only support deepep post quant all2all on nvfp4 by @yilin-void in #8041
- [None][infra] Waive failed cases for main on 0929 by @EmmaQiaoCh in #8053
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #8043
- [#4674][bugfix] AutoDeploy Fix memory leak in fuse_moe by @galagam in #7844
- [None][test] Update get_sysinfo.py to avoid UnboundLocalError by @yufeiwu-nv in #7982
- [https://nvbugs/5541494] [fix] add back missing sm100f bmm kernels by @VALLIS-NERIA in #8051
- [None][chore] Bump version to 1.2.0rc0 by @yiqingy0 in #7941
New Contributors
- @Nekofish-L made their first contribution in #7618
- @asrivas made their first contribution in #6495
- @sunnygqq made their first contribution in #7001
- @yufeiwu-nv made their first contribution in #7876
- @macrocell made their first contribution in #7216
- @PeganovAnton made their first contribution in #7935
Full Changelog: v1.1.0rc5...v1.2.0rc0