NVIDIA/TensorRT-LLM v1.2.0rc0

Pre-release · 21 hours ago

Announcement Highlights

  • Model Support
    • Support nano_v2_vlm in pytorch backend (#7207)
    • Add Tencent HunYuanDenseV1 model support (#7081)
    • Support Seed-OSS model in pytorch backend (#7496)
    • GPT-OSS MXFP4 support (#7451)
  • API
    • Support new structural tag API (upgrade XGrammar to 0.1.25) (#7893)
    • Enable regex and EBNF grammar in trtllm-serve (#7925); see the guided-decoding sketch after these highlights
    • Optionally disable server GC and worker GC (#7995)
    • Add serialization/deserialization options for AutoTuner profiling cache (#7738)
    • Make low_precision_combine an LLM arg (cherry-pick from #7598) (#7898)
  • Benchmark
    • Add gpt-oss serve benchmark tests (#7638)
    • Exit as early as possible and propagate exit status correctly for multi-node testing (#7739)
    • Add gpt oss model for trtllm perf test (#7328)
    • Add generation logits case for llama3 (#7759)
    • Fix model issues for disagg serving (#7785)
    • Add deepseek r1/v3 model with chunked prefill cases (#7124)
    • Add accuracy benchmark in stress test (#7561)
    • Add NoSmem epilogue schedule and dynamic cluster shape for sm10x group gemm (#7757)
    • Rename llm_perf_full to llm_perf_core and add missing cases (#7899)
    • Update benchmark script (#7860)
    • Add multi-node tests for disagg-serving (#7470)
    • Update llm_models_root to improve path handling on BareMetal environment (#7876)
    • Add DS-R1/Qwen3 test cases for RTX 6000 (#7662)
    • Add NIM perf test cases (#7924)
    • Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices (#7419)
    • Improve the failure message for accuracy test suite (#7994)
    • Update get_sysinfo.py to avoid UnboundLocalError (#7982)
    • Update disagg gen-only benchmark. (#7917)
  • Feature
    • Phi4-mm image modality inference optimization (#7918)
    • Add NVFP4 x FP8 moe kernels (#7821)
    • Enable KV cache reuse and chunked prefill for mistral3.1 (#7628)
    • Enable two-model spec dec for MTP Eagle (#7001)
    • Support EPLB in Qwen3 MoE (#7443)
    • Eagle3 cuda graph support for the first draft model inference (#7363)
    • Enable run_post_quant_allgather for MoE TRTLLM backend (#6794)
    • Enable gpt oss on DGX H100. (#6775)
    • Add gpt-oss chunked prefill tests (#7779)
    • Eagle: use last hidden state post norm (#7546)
    • Optimize Qwen2/2.5-VL performance (#7250)
    • Support KV cache reuse and chunked prefill for phi4mm (#7723)
    • Support attention dp for qwen3 dense model (#7618)
    • AutoDeploy: fix memory leak in fuse_moe (#7844)
    • Enable overlap scheduler for two-model spec decoding (#7651)
    • Add support of CUDA13 and sm103 devices (#7568)
    • Add Cute DSL nvfp4 linear op (#7632)
    • Enable LM tp for MTP, under attention dp case (cherry-pick #7128) (#7571)
    • Add an example of KV cache host offloading (#7767); see also the config sketch after these highlights
    • Helix: make softmax stats pointer available to attention gen (#6865)
    • AutoDeploy: graph-less transformers mode for HF (#7635)
    • Cherry-pick DeepGEMM related commits from release/1.1.0rc2 (#7716)
    • Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm (#7764)
    • FP8 Context MLA integration (Cherry-pick #6059 from release/1.1.0rc2) (#7610)
    • Update CUTLASS to 4.2 and enable SM103 group gemm (#7832)
    • Cherry-pick fix to reuse pytorch memory segments occupied by cudagraph (#7747)
    • Helix: add custom position ids to MLA kernels (#6904)
    • Support for partial sharding from factory (#7393)
    • KV cache transmission in disagg with CP on gen side (#7624)
    • Support fp8 block-wide EP (cherry-pick from #7423) (#7712)
    • E-PD Disagg Support via llmapi (3/N) (#7577)
    • Add batch waiting when scheduling (#7416)
    • Use list instead of torch tensor for new tokens in update requests (#7730)
    • Support multi-threaded tokenizers for trtllm-serve (cherry-pick) (#7776)
    • Support JIT mha.cu for SPEC_DEC in runtime (#6078)
    • Batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294)
    • Enable prompt_logprobs in pytorch backend (#7580)
    • Support SWA KV cache reuse (#6768)
    • Return topk logprobs in torch backend (#7756)
    • CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) (#7888)
    • Revert " Return topk logprobs in torch backend (#7756)" (#7969)
    • DeepEP LL fp8 dispatch/combine (#7927)
    • Helix: add alltoall op (#6815)
    • Optimize kv cache transfer TEP (#7613)
    • Add environment variable to adjust block pool allocation ratio under kv cache manager (#7923)
    • Add a standalone buffer cache class and reuse buffers between cudagraph and no-graph flow (#7669)
    • Add static tree sampling and verification (#7161)
    • Add support for KVCache transfer from KVCache reuse path (#6348)
    • Add AutoDeploy backend support to test_perf.py (#7588)
    • Speed up concat k and copy k_nope in context phase using torch.compile (#8044)
  • Documentation
    • Fix the link in the doc (#7713)
    • Clean the doc folder and move the outdated docs into lega… (#7729)
    • Add doc for KV cache salting support (#7772)
    • Fix section header of llm_kv_cache_offloading example (#7795)
    • Update Documentation link to point to docs instead of docs source code (#6495)
    • Cherry-pick deployment guide update from 1.1.0rc2 branch to main branch (#7774)
    • Tech blog: Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly (#7864)
    • Update tech blog12 (#7884)
    • Add known issues to llmapi doc (#7560)
    • Add blackwell information into support matrix (#6740)
    • Fix an invalid link and a typo (#7634)
    • Use hash id for external link (#7641)
    • Add labels description note into llm api section (#7696)
    • Enhance api reference doc by labeling stable APIs (#7751)
    • Add 1.0 release notes (#7605)
    • Scaffolding tech blog part one (#7835)
    • Update docker cmd in quick start guide and trtllm-serve … (#7787)
    • Replace main in the examples' links with a commit id (#7837)
    • Rename TensorRT-LLM to TensorRT LLM for homepage and the … (#7850)
    • Add a guide for modifying APIs (#7866)
    • Update Perf-Overview.md for release/1.0 (#7848)
    • Add stable label to all the un-labelled arguments in LLM class (#7863)
    • Fix invalid links in perf benchmarking. (#7933)
    • Add Llama PP known issue to release note (#7959)
    • Add acknowledgements in scaffolding tech blog (#7983)
    • Add scaffolding tech blog to cover (#8021)
    • Refine perf overview.md and correct the error link in per… (#8035)
    • Scaffolding tech blog fix a typo (#8042)
    • Document hang issue caused by UnpicklingError (#8049)
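
Two quick sketches of features called out above. Both are minimal illustrations rather than drop-in recipes: the class and parameter names (GuidedDecodingParams, KvCacheConfig, guided_decoding_backend, host_cache_size) reflect the current LLM API as we understand it and may differ slightly between versions, so verify them against your installed build.

Guided decoding with a regex constraint, the same mechanism now exposed for regex and EBNF grammars in trtllm-serve:

```python
# Sketch: constrain generation with a regex via xgrammar-backed guided decoding.
# Assumes GuidedDecodingParams exposes `regex` (and `grammar` for EBNF) fields.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.sampling_params import GuidedDecodingParams

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # any supported HF model
    guided_decoding_backend="xgrammar",
)

# Force the completion to match an ISO date such as "1969-07-20".
sampling_params = SamplingParams(
    max_tokens=16,
    guided_decoding=GuidedDecodingParams(regex=r"\d{4}-\d{2}-\d{2}"),
)

for output in llm.generate(
    ["When did Apollo 11 land on the Moon? Answer with a date:"],
    sampling_params,
):
    print(output.outputs[0].text)
```

KV cache host offloading (see the new example and doc entries above) is configured through the KV cache config rather than a separate API:

```python
# Sketch: reserve host memory so evicted KV cache blocks can be offloaded to CPU
# instead of being dropped, which helps block reuse under GPU memory pressure.
# `host_cache_size` is assumed to be given in bytes.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,       # allow reuse of offloaded blocks
    host_cache_size=4 * 1024**3,   # 4 GiB of host memory for offloading
)

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    kv_cache_config=kv_cache_config,
)
```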

What's Changed

  • [None][feat] Eagle, use last hidden post norm by @IzzyPutterman in #7546
  • [None][infra] AutoDeploy: codeowners for autodeploy unit tests by @lucaslie in #7743
  • [TRTLLM-6668][feat] Enable overlap scheduler for two-model spec decoding by @ziyixiong-nv in #7651
  • [None][ci] move qwen3 tests from GB200 to B200 by @QiJune in #7733
  • [None][feat] support attention dp for qwen3 dense model by @Nekofish-L in #7618
  • [None][doc] Fix the link in the doc by @Shixiaowei02 in #7713
  • [TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices by @VALLIS-NERIA in #7568
  • [TRTLLM-6295][test] Exit as early as possible and propagate exit status correctly for multi-node testing by @chzblych in #7739
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7735
  • [None][fix] Ensure that the W4A8 custom input scale remains aligned across all ranks by @yilin-void in #7614
  • [None][chore] Fix error when running trtllm-bench without cuda graph. by @bobboli in #7725
  • [None][doc] Clean the doc folder and move the outdated docs into lega… by @nv-guomingz in #7729
  • [TRTLLM-6898][feat] Add Cute DSL nvfp4 linear op by @limin2021 in #7632
  • [None] [chore] cherry pick changes on slurm scripts from release/1.1.0rc2 by @kaiyux in #7750
  • [https://nvbugs/5503529][fix] Change test_llmapi_example_multilora to get adapters path from cmd line to avoid downloading from HF by @amitz-nv in #7740
  • [TRTLLM-7070][feat] add gpt-oss serve benchmark tests by @xinhe-nv in #7638
  • [None][fix] waive hang tests on main by @xinhe-nv in #7720
  • [https://nvbugs/5471106][fix] Remove the waivers by @ziyixiong-nv in #7711
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7746
  • Revert "[None][feat] support attention dp for qwen3 dense model" by @byshiue in #7765
  • [TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver by @Tabrizian in #7659
  • [None][chore] AutoDeploy: neat disablement of transforms in pipeline by @lucaslie in #7736
  • [None][chore] Remove unused get_quant_scales methods by @achartier in #7687
  • [None][infra] add nspect allow list for false positive secrets by @yuanjingx87 in #5797
  • [TRTLLM-7398][doc] Add doc for KV cache salting support by @chang-l in #7772
  • [None][infra] Update CI allowlist 2025-09-16 by @yuanjingx87 in #7773
  • [None][infra] Add nightly pipeline to generate lock files by @yuanjingx87 in #5798
  • [https://nvbugs/5516666][fix] cherrypick fix to the CUDA graph warmup issue when using speculative decoding by @HuiGao-NV in #7737
  • [None][waive] Waive tests by @Tabrizian in #7775
  • [https://nvbugs/5489015][fix] Support communicator split in MNNVL allreduce and fix the binding issues. by @timlee0212 in #7387
  • [https://nvbugs/5488582][fix] Cherry-pick 7495: Avoid unexpected Triton recompilation in DG fused_moe by @hyukn in #7708
  • [TRTLLM-6741] [feat] enable LM tp for MTP, under attention dp case (cherry-pick #7128) by @kaiyux in #7571
  • [None][chore] AutoDeploy: clean up of model unit test configuration by @lucaslie in #7742
  • [None][ci] waive test_llm_gemma_1gpu_summary_vswa by @QiJune in #7781
  • [https://nvbugs/5517260][fix] move scaffolding contrib module's import to subdirectory by @dc3671 in #7758
  • [None][feat] add an example of KV cache host offloading by @QiJune in #7767
  • [https://nvbugs/5485325][fix] Cherry-pick #7373: fix the CUDA graph warmup issue when using speculative decoding by @lfr-0531 in #7734
  • [None][ci] waive test_llama_eagle3[True-FLASHINFER-False-False-False-False-True] by @QiJune in #7788
  • [None][chore] Remove closed bugs by @xinhe-nv in #7697
  • [None][test] add gpt oss model for trtllm perf test by @ruodil in #7328
  • [TRTLLM-7250][fix] waive block tests by @xinhe-nv in #7782
  • [None][doc] fix section header of llm_kv_cache_offloading example by @QiJune in #7795
  • [TRTLLM-7410][feat] Enable KV cache reuse and chunked prefill for mistral3.1 by @2ez4bz in #7628
  • [None][infra] Waive failed tests on main 09/17 by @EmmaQiaoCh in #7812
  • [None][doc] Update Documentation link to point to docs instead of docs source code by @asrivas in #6495
  • [TRTLLM-5966][feat] Helix: make softmax stats pointer available to attention gen by @MatthiasKohl in #6865
  • [https://nvbugs/5516661][fix] Drop waive case 5516661 by @yunruis in #7791
  • [https://nvbugs/5508536][fix] Revert #7041: Move stop_criteria to sample_async (#7041) by @netanel-haber in #7796
  • [#7308] [feat] AutoDeploy: graph-less transformers mode for HF by @lucaslie in #7635
  • [None][ci] restore unwaive list by @Superjomn in #7802
  • [None][fix] Make tile_tokens_dim calculation just in time before kernel launching. by @hyukn in #7529
  • [None][chore] Version bump for 1.1.0rc6 by @chzblych in #7824
  • [https://nvbugs/5519544][fix] fix invalid expression for disabling pa… by @nv-guomingz in #7806
  • [TRTLLM-8070][test] add generation logits case for llama3 by @crazydemo in #7759
  • [https://nvbugs/5523080][fix] Correct the batch index in device tensors by @ziyixiong-nv in #7803
  • [None][feat] Cherry-pick DeepGEMM related commits from release/1.1.0rc2 by @Barry-Delaney in #7716
  • [None][fix] Fix CI issue for dsl pkg install by @limin2021 in #7784
  • [https://nvbugs/5508890][fix] gen. result cleanup when using PostprocWorker by @ixlmar in #7771
  • [None][infra] update ci allow list 2025/09/17 by @yuanjingx87 in #7816
  • [None][chore] Remove executor config in create_py_executor by @leslie-fang25 in #7599
  • [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7801
  • [https://nvbugs/5519530][fix] Fix gptoss 2-gpu test by @dongfengy in #7819
  • [TRTLLM-6577][feat] Support nano_v2_vlm in pytorch backend by @Wanli-Jiang in #7207
  • [None][fix] Add TP information in weight scale loading in WeightOnlyQuantLinearMethod by @stnie in #7732
  • [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7807
  • [TRTLLM-7918][feat] Support kvcache reuse and chunk prefill for phi4mm by @Wanli-Jiang in #7723
  • [https://nvbugs/5519462][fix] skip deepseek test on preHopper by @xinhe-nv in #7817
  • [None][chore] remove generated fmha_cubin.h from source tree by @QiJune in #7836
  • [None][fix] Revert "Revert "[None][feat] support attention dp for qwen3 dense model"" by @byshiue in #7780
  • [TRTLLM-6898][feat] Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm by @limin2021 in #7764
  • [None][doc] Cherry-pick deployment guide update from 1.1.0rc2 branch to main branch by @dongfengy in #7774
  • [TRTLLM-6746][feat] Enable two-model spec dec for MTP Eagle by @sunnygqq in #7001
  • [None][ci] set TORCHINDUCTOR_COMPILE_THREADS correctly by @QiJune in #7800
  • [https://nvbugs/5522851][fix] Correct the logic to update kv_lens_cuda by @ziyixiong-nv in #7790
  • [TRTLLM-6994][feat] FP8 Context MLA integration (Cherry-pick #6059 from release/1.1.0rc2) by @yuxianq in #7610
  • [TRTLLM-6286] [feat] Update CUTLASS to 4.2 and enable SM103 group gemm by @VALLIS-NERIA in #7832
  • [None][fix] get Local IP by connect remote by @chuangz0 in #7719
  • [TRTLLM-7183][test] Feature fix model issue for disagg serving by @fredricz-20070104 in #7785
  • [https://nvbugs/5481434][feat] cherry-pick fix to reuse pytorch memory segments occupied by cudagraph by @HuiGao-NV in #7747
  • [None][test] add deepseek r1/v3 model with chunked prefill cases by @ruodil in #7124
  • [None][chore] polish error message in cute_dsl_utils.py by @QiJune in #7852
  • [None][fix] fix load_model_on_cpu on qwen/convert_checkpoint.py by @lkm2835 in #2382
  • [None][infra] Waive failed tests in post-merge by @EmmaQiaoCh in #7859
  • [None][ci] Waive llama3 auto dtype test bug in https://nvbugs/5527956. by @dominicshanshan in #7853
  • [None][test] Add accuracy benchmark in stress test by @crazydemo in #7561
  • [None][chore] remove cli cases for rtx6k by @crazydemo in #7833
  • [None][feat] Support EPLB in Qwen3 MoE by @lucifer1004 in #7443
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7841
  • [https://nvbugs/5503440][fix] Fix potential hang due to wrong type of ZMQ socket and protocol for worker_init_status_queue by @lancelly in #7646
  • [None][doc] Tech blog: Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly by @syuoni in #7864
  • [https://nvbugs/5522332][fix] Pin numpy version for Gemma. (cherry-pick #7783) by @yuxianq in #7797
  • [TRTLLM-5966][feat] Helix: add custom position ids to MLA kernels by @MatthiasKohl in #6904
  • [https://nvbugs/5471108][chore] Unwaiving disagg acc test by @pcastonguay in #7686
  • [https://nvbugs/5522462][fix] Fix FP8 scout illegal memory access by @mikeiovine in #7845
  • [#7704][chore] Enable MathJax to fix formulas in documentation by @karljang in #7744
  • [TRTLLM-6342][feat] Support for partial sharding from factory by @greg-kwasniewski1 in #7393
  • [https://nvbugs/5520490][fix] Fix intermittent test failures by avoiding external web data pulls by @chang-l in #7879
  • [None][doc] Update tech blog12 by @syuoni in #7884
  • [TRTLLM-7731][feat] KV cache transmission in disagg with CP on gen side by @brb-nv in #7624
  • [TRTLLM-8188][chore] refactor GenerationExecutorWorker with WorkerBase for better code reusing by @Superjomn in #7840
  • [https://nvbugs/5517404][fix] Use the correct cuda graph for dynamic spec dec by @ziyixiong-nv in #7728
  • [TRTLLM-6286] [perf] Add NoSmem epilogue schedule and dynamic cluster shape for sm10x group gemm by @VALLIS-NERIA in #7757
  • [TRTLLM-7008][fix] cherrypick to main Add automatic shared memory delete if already exist by @dongxuy04 in #7727
  • [None][fix] Disable torch.compile for CapturableGuidedDecoder by @syuoni in #7871
  • [None][fix] cherrypick to main: Fix possible mpi broadcast and gather issue on large object by @dongxuy04 in #7854
  • [https://nvbugs/5512556][unwaive] Unwaive DeepSeek PP tests by @peaceh-nv in #7828
  • [https://nvbugs/5513423][fix] Correctly respect min_tokens in PyTorch Workflow by @stnie in #7808
  • [None][fix] Fix DeepGEMM commit by @Barry-Delaney in #7875
  • [None][chore] Mass integration of release/1.0 - 5th by @dominicshanshan in #7640
  • [TRTLLM-7070][feat] add gpt-oss chunked prefill tests by @xinhe-nv in #7779
  • [None][infra] Waive a failed case on main by @EmmaQiaoCh in #7901
  • [TRTLLM-7989][infra] Bundle UCX and NIXL libs in the TRTLLM python package by @bo-nv in #7766
  • [https://nvbugs/5525849][fix] Cherry-pick to fix mismatch of max seq len between kv cache manager and dummy requests by @HuiGao-NV in #7855
  • [TRTLLM-7385][feat] Optimize Qwen2/2.5-VL performance by @yechank-nvidia in #7250
  • [None][infra] Skip failed test for nvbugs 5532023 by @EmmaQiaoCh in #7905
  • [https://nvbugs/5351244][fix] CHERRY-PICK test_mpi_session (#7501) by @Superjomn in #7900
  • [None][chore] Upgrade transformers to 4.56.0 by @Wanli-Jiang in #7523
  • [https://nvbugs/5477359][fix] Removing test waivers by @Linda-Stadter in #7877
  • [https://nvbugs/5516665][fix] Fix CUTLASS moe fake impl errors by @liji-nv in #7714
  • [None] [feat] Enable run_post_quant_allgather for MoE TRTLLM backend by @ChristinaZ in #6794
  • [https://nvbugs/5504086][fix] Fix MTP vanilla by @syuoni in #7904
  • [TRTLLM-7831][feat] Cherry-pick from #7423 Support fp8 block wide ep cherry pick by @xxi-nv in #7712
  • [TRTLLM-8209][feat] Support new structural tag API (upgrade XGrammar to 0.1.25) by @syuoni in #7893
  • [https://nvbugs/5522847][fix] Disable GC on disagg server and client by @yuantailing in #7858
  • [None][feat] Add Tencent HunYuanDenseV1 model support by @sorenwu in #7081
  • [TRTLLM-7328][feat] E-PD Disagg Support via llmapi (3/N) by @chang-l in #7577
  • [None][opt] Add batch waiting when scheduling by @yunruis in #7416
  • [https://nvbugs/5355128][fix] Add missing wgmma intrinsic for starcoder by @pengbowang-nv in #7643
  • [None][fix] Read eos_token_id from generation_config for kimi_k2 by @pengbowang-nv in #7120
  • [None][fix] Fix and add test for TRTLLM MoE backend by @pengbowang-nv in #7755
  • [None][test] rename llm_perf_full to llm_perf_core and add missing cases by @ruodil in #7899
  • [None][fix] CHERRY-PICK trtllm-serve yaml loading (#7551) by @Superjomn in #7897
  • [https://nvbugs/5367180][fix] Fix xgrammar import before loading tensorrt_llm binary by @syuoni in #7906
  • [None][fix] fix a bug with trtllm-gen kernels + attention sinks by @PerkzZheng in #7919
  • [https://nvbugs/5532023][fix] executor with-statement bug by @Superjomn in #7895
  • [None][fix] Re-add the import for allgather that was mistakenly removed. by @ChristinaZ in #7920
  • [None][chore] Update benchmark script by @zerollzeng in #7860
  • [None][fix] Assign [] to req.py_draft_tokens instead of None when spec decode is off by @zheyuf in #7511
  • [None][test] Waive another intermittent OOM test by @chzblych in #7930
  • [None][feat] Use list instead of torch tensor for new tokens in update requests by @dcampora in #7730
  • [None][feat] Enable gpt oss on DGX H100. by @Tracin in #6775
  • [TRTLLM-7292][feat] Support multi-threaded tokenizers for trtllm-serve (cherry-pick) by @nv-yilinf in #7776
  • [TRTLLM-6549][fix] add kv cache time output back by @zhengd-nv in #7798
  • [None][feat] support JIT mha.cu for SPEC_DEC in runtime by @jhaotingc in #6078
  • [TRTLLM-7728][feat] batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) by @ixlmar in #7294
  • [TRTLLM-7182][test] add multi-nodes test for disagg-serving by @reasonsolo in #7470
  • [TRTLLM-7015] [feat] Enable prompt_logprobs in pytorch backend by @venkywonka in #7580
  • [https://nvbugs/5528405][fix] Set up draft_tokens before scheduling by @ziyixiong-nv in #7903
  • [https://nvbugs/5477404][chore] unwaive test_disaggregated_single_gpu.py::test_disaggregated_llama_context_capacity by @reasonsolo in #7857
  • [None][fix] refine backend option handling for commands by @tongyuantongyu in #7829
  • [#7692][fix] recognize RequestError as per-request error in background handler by @tongyuantongyu in #7726
  • [None][chore] Make sampler type beta. by @dcampora in #7934
  • [TRTLLM-6341][feature] Support SWA KV cache by @eopXD in #6768
  • [https://nvbugs/5532225] [fix] MoE use stream-dependent workspace by @VALLIS-NERIA in #7940
  • [None][infra] Skip failed test for nvbugs 5537738 by @pengbowang-nv in #7946
  • [None][chore] remove cubins for ci cases by @qsang-nv in #7902
  • [None][chore] update chunked prefill cases by @xinhe-nv in #7921
  • [None][feat] Return topk logprobs in torch backend by @dcaox in #7756
  • [None][ci] optimize test cases of dgx b200 by @QiJune in #7948
  • [None][chore] Recover cutlass-dsl pkg install and dsl op testing. by @limin2021 in #7945
  • [https://nvbugs/5521799][fix] Trim incorrectly generated harmony messages by @JunyiXu-nv in #7849
  • [https://nvbugs/5532248][fix] Fix fused_moe OOM by @HuiGao-NV in #7931
  • [None][test] Update llm_models_root to improve path handling on BareMetal environment by @yufeiwu-nv in #7876
  • [None][ci] remove duplicate test cases by @QiJune in #7956
  • [None][chore] add test_w4_1gpu[True-True-cutlass-fp8] & TestKimiK2::test_fp8_blocks… by @xinhe-nv in #7944
  • [TRTLLM-5235][feat] Enable regex and EBNF grammar in trtllm-serve by @syuoni in #7925
  • [None][feat] add model seed-oss by @Nekofish-L in #7496
  • [None][ci] Waive some intermittent failures by @HuiGao-NV in #7955
  • [None][fix] trtllm-gen cubins compiled with wrong arch. by @PerkzZheng in #7953
  • [None][chore] cleanup build script by @tongyuantongyu in #7865
  • [#7675][feat] CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) by @MrGeva in #7888
  • [None][fix] fix get_iteration_stats IndexError by @macrocell in #7216
  • [None][fix] Fix dummy load format for DeepSeek. by @yuxianq in #7874
  • [TRTLLM-7399][test] Add DS-R1/Qwen3 test cases for RTX 6000 by @pamelap-nvidia in #7662
  • [https://nvbugs/5473781][fix] Fix llama 4 FP8 for PP>1 by @mikeiovine in #7220
  • [None][bug] Fix transformers version for Triton backend by @Tabrizian in #7964
  • [OMNIML-2336][feat] Add NVFP4 x FP8 moe kernels by @sychen52 in #7821
  • [None][fix] Revert "[None][feat] Return topk logprobs in torch backend (#7756)" by @Tabrizian in #7969
  • [None][chore] Validate features combination by @leslie-fang25 in #7630
  • [https://nvbugs/5456485][bug] unwaive triton test by @Tabrizian in #7966
  • [None][feat] DeepEP LL fp8 dispatch/combine by @yilin-void in #7927
  • [None][chore] Update trtllm-bench documentation on setting FP8 KV cache by @achartier in #7885
  • [None][chore] Update the CUDA and TensorRT versions in homepage icons. by @nv-guomingz in #7963
  • [TRTLLM-6541][test] Add NIM perf test cases by @fredricz-20070104 in #7924
  • [None][doc] scaffolding tech blog part one by @WeiHaocheng in #7835
  • [TRTLLM-7758][feat] Optimize phi4-mm image modality inference by @Wanli-Jiang in #7918
  • [None][infra] Unwaive some tests since dev already have a PR to collect more info by @EmmaQiaoCh in #7984
  • [None][perf] Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices by @jinyangyuan-nvidia in #7419
  • [https://nvbugs/5536141][fix] fix_disagg_single_gpu_test by @chuangz0 in #7990
  • [https://nvbugs/4955671][fix] update test list by @xinhe-nv in #7980
  • [None][chore] Mass integration of release/1.0 - 6th by @dominicshanshan in #7928
  • [None][chore] Remove developer name in comment by @eopXD in #7981
  • [None][chore] relax version constraints on fastapi by @PeganovAnton in #7935
  • [TRTLLM-5966][feat] Helix: add alltoall op by @MatthiasKohl in #6815
  • [None][fix] fix a bug in wideEp use DeepEP with num_chunks > 1 by @xxi-nv in #7954
  • [None][doc] Add acknowledgements in scaffolding tech blog by @WeiHaocheng in #7983
  • [None][infra] Waive failed tests on main 09/25 by @EmmaQiaoCh in #8001
  • [None][chore] extract weights loading related logic to model loader by @QiJune in #7579
  • [https://nvbugs/5525951][fix] Clarify that PP is not supported for GPTOSS by @dongfengy in #7911
  • [None][chore] Some clean-ups for CUDA 13.0 dependencies by @chzblych in #7979
  • [TRTLLM-7999][infra] Add B300/GB300 single gpu test by @yiqingy0 in #7951
  • [None][infra] Improve the failure message for accuracy test suite by @syuoni in #7994
  • [#6102][fix] support non-system python installation by @tongyuantongyu in #7763
  • [None][ci] Waive test_mm_encoder_standalone.py::test_multi_request_batch_chat[llava-v1.6-mistral-7b-hf] by @QiJune in #8010
  • [None][feat] Optimize kv cache transfer TEP by @chuangz0 in #7613
  • [TRTLLM-7330][feat] Eagle3 cuda graph support for the first draft model inference by @sunnygqq in #7363
  • [https://nvbugs/5527956][fix] AutoDeploy: fix IMA due to outdated metadata by @lucaslie in #8002
  • [https://nvbugs/5451740][fix] Add DP padding back on SM120 by @peaceh-nv in #7965
  • [None][chore] Report NCCL error message but not OOM when NCCL error happens by @HuiGao-NV in #8009
  • [None][feature] Add environment variable to adjust block pool allocation ratio under kv cache manager by @eopXD in #7923
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7986
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #8004
  • [None][doc] Add scaffolding tech blog to cover by @WeiHaocheng in #8021
  • [None][chore] Require NVIDIA developers to use their full name or NVIDIA account in GitHub profiles by @MartinMarciniszyn in #8022
  • [https://nvbugs/5495789][feat] Optionally disable server GC and worker GC by @yuantailing in #7995
  • [None][feat] Add a standalone buffer cache class and reuse buffers between cudagraph and no-graph flow by @HuiGao-NV in #7669
  • [TRTLLM-6393][feat] add static tree sampling and verification by @yweng0828 in #7161
  • [None][infra] Waive failed cases in post-merge 2305 by @EmmaQiaoCh in #8019
  • [TRTLLM-8271][fix] Fix CDL overlap scheduling performance by @mikeiovine in #7971
  • [https://nvbugs/5518713][fix] Trtllm-gen moe backend for blockwise fp8 ckpt (Qwen3-235B-A22B-FP8) by @jhaotingc in #7856
  • [#5860][autodeploy] GPT-OSS MXFP4 support by @Fridah-nv in #7451
  • [TRTLLM-6106][feat] Add support for KVCache transfer from KVCache reuse path by @Tabrizian in #6348
  • [None] [feat] Update disagg gen-only benchmark. by @qiaoxj07 in #7917
  • [https://nvbugs/5461712] [fix] Use DG for Qwen3 Linear layers by @achartier in #8030
  • [https://nvbugs/5537738][fix] Add fp8 post-quant allgather support by @ChristinaZ in #8008
  • [None][doc] Refine perf overview.md and correct the error link in per… by @nv-guomingz in #8035
  • [None][infra] Skip failed test for main branch on 9/28 by @EmmaQiaoCh in #8040
  • [None][chore] Disable concurrent weights loading for _load_weights_im… by @nv-guomingz in #8034
  • [None][doc] Scaffolding tech blog fix a typo by @WeiHaocheng in #8042
  • [TRTLLM-4500][feat] Add serialization/deserialization options for AutoTuner profiling cache by @hyukn in #7738
  • [None][chore] Cherry-pick from (#7598) Make low_precision_combine as a llm arg by @zongfeijing in #7898
  • [None][chore] Update chunked prefill test case configs by @crazydemo in #7868
  • [None][chore] Update cron schedule for closing inactive issues by @zhenhuaw-me in #8048
  • [None] [doc] Document hang issue caused by UnpicklingError by @kaiyux in #8049
  • [#7288][feat] Added AutoDeploy backend support to test_perf.py by @MrGeva in #7588
  • [None][chore] update test case constraint by @crazydemo in #8020
  • [TRTLLM-8348][feat] Speed up concat k and copy k_nope in context phase using torch.compile by @yuantailing in #8044
  • [https://nvbugs/5532087][ci] Enable test case by @HuiGao-NV in #8029
  • [None][ci] Disable tensorRT cases in post-merge by @HuiGao-NV in #8028
  • [None][fix] only support deepep post quant all2all on nvfp4 by @yilin-void in #8041
  • [None][infra] Waive failed cases for main on 0929 by @EmmaQiaoCh in #8053
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #8043
  • [#4674][bugfix] AutoDeploy Fix memory leak in fuse_moe by @galagam in #7844
  • [None][test] Update get_sysinfo.py to avoid UnboundLocalError by @yufeiwu-nv in #7982
  • [https://nvbugs/5541494] [fix] add back missing sm100f bmm kernels by @VALLIS-NERIA in #8051
  • [None][chore] Bump version to 1.2.0rc0 by @yiqingy0 in #7941

New Contributors

  • @Nekofish-L made their first contribution in #7618
  • @asrivas made their first contribution in #6495
  • @sunnygqq made their first contribution in #7001
  • @yufeiwu-nv made their first contribution in #7876
  • @macrocell made their first contribution in #7216
  • @PeganovAnton made their first contribution in #7935

Full Changelog: v1.1.0rc5...v1.2.0rc0
