# TensorRT LLM Release 1.0
TensorRT LLM 1.0 brings two major changes: the PyTorch-based architecture is now stable and the default experience, and the LLM API is now stable. For more details on the new developments in 1.0, see below.
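As a quick orientation, here is a minimal sketch of the now-stable LLM API running on the default PyTorch backend. It assumes the quickstart-style entry points (`LLM`, `SamplingParams`); the model id and sampling values are illustrative only.

```python
from tensorrt_llm import LLM, SamplingParams

# Model id is an assumption for illustration; any supported HF checkpoint works.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# generate() returns one result per prompt; print the first completion of each.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```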
## Key Features and Enhancements

### Model Support
- Add Mistral3.1 VLM model support
- Add TensorRT-Engine Qwen3 (dense) model support
- Add phi-4-multimodal model support
- Add EXAONE 4.0 model support
- Add Qwen3 MoE support to TensorRT backend
### Features
- Add support for sm121
- Add LoRA support for Gemma3
- Support PyTorch LoRA adapter eviction
- Add LoRA support for PyTorch backend in trtllm-serve
- Add support for scheduling attention DP requests
- Remove padding of FusedMoE in attention DP
- Support torch compile for attention dp
- Add KV events support for sliding window attention
- Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
- Add Piecewise CUDA Graph support for MLA
- Support multiCtasKvMode for high-throughput MLA kernels
- Enable kvcache to be reused during request generation
- Add ADP schedule balance optimization
- Add chunked prefill support for MLA (Blackwell)
- Enable Multi-block mode for Hopper spec dec XQA kernel
- Add vLLM KV Pool support for XQA kernel
- Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5
- Add support for fused gate_up_proj scales for FP8 blockwise
- Support FP8 row-wise dense GEMM in torch flow
- Enable fp8 SwiGLU to minimize host overhead
- Add Deepseek R1 FP8 Support on Blackwell
- Add support for MXFP8xMXFP4 in pytorch
- Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
- Open-source the MoE MXFP8-MXFP4 implementation
- Add support for Modelopt fp8_pb_wo quantization scheme
- Support deepEP fp4 post quant all2all dispatch
- Fuse w4a8 moe pre-quant scale on Hopper
- Support Weight-Only-Quantization in PyTorch Workflow
- Add support for per expert activation scaling factors
- Add ReDrafter support for Qwen
- Enable CUDA Graph for Nemotron-H
- Add support for YARN in NemotronNAS models
- Switch to internal version of MMProjector in Gemma3
- Disable add_special_tokens for Llama3.3 70B
- Auto-enable ngram with concurrency <= 32
- Support turning on/off spec decoding dynamically
- Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21
- Add support for external multimodal embeddings
- Add support for disaggregation with pipeline parallelism in the PyTorch backend
- Add status tags to LLM API reference
- Support JSON Schema in the OpenAI-Compatible API (see the request sketch after the Documentation list below)
- Support chunked prefill for two-model speculative decoding
- Add KV cache reuse support for multimodal models
- Support nanobind bindings
- Add support for two-model engine KV cache reuse
- Add Eagle-3 support for qwen3 dense model
- Migrate Eagle-3 and draft/target speculation to Drafter
- Enable guided decoding with overlap scheduler
- Support n-gram speculative decoding with disagg
- Add beam search support to the PyTorch Workflow
- Add LLGuidance Support for PyTorch Backend
- Add NGrams V2 support
- Add MTP support for Online EPLB
- Support disaggregated serving in TRTLLM Sampler
- Add core infrastructure to enable loading of custom checkpoint formats
- Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running DeepEP on memory-constrained GPUs
- Use huge page mapping for host accessible memory on GB200
- Add user-provided speculative decoding support
- Add streaming scaffolding_llm.generate_async support
- Detokenize option in /v1/completions request
- Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner
- Remove support for llmapi + TRT backend in Triton
- Add request_perf_metrics to triton LLMAPI backend
- Add support for Triton request cancellation
### Benchmark
- Add support for benchmarking individual gemms in MOE benchmark (#6080)
- Add speculative metrics for trtllm-bench
- Add the ability to write a request timeline for trtllm-bench
- Add no_kv_cache_reuse option and streaming support for trtllm-serve bench
- Add latency support for trtllm-bench
- Add Acceptance Rate calculation to benchmark_serving
- Add wide-ep benchmarking scripts
- Update trtllm-bench to support new Pytorch default
- Add support for TRTLLM CustomDataset
- Make benchmark_serving part of the library
### Documentation
- Refactored the doc structure to focus on the PyTorch workflow.
- Improved the LLMAPI and API reference documentation. Stable APIs are now protected and will remain consistent in subsequent versions following v1.0.
- Removed legacy documentation related to the TensorRT workflow.
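To illustrate the OpenAI-compatible serving features listed above (for example, JSON Schema support and the detokenize option in /v1/completions), here is a hedged sketch of a plain completions request against a local trtllm-serve endpoint. The host, port, and model name are assumptions; options such as guided decoding or detokenization are passed as additional request fields per the trtllm-serve documentation for your build.

```python
import requests

# Endpoint and model name are assumptions for illustration; trtllm-serve
# exposes an OpenAI-compatible API once started (e.g. `trtllm-serve <model>`).
url = "http://localhost:8000/v1/completions"
payload = {
    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "prompt": "The three primary colors are",
    "max_tokens": 32,
    "temperature": 0.0,
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```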
## Infrastructure Changes

- The base Docker image for TensorRT-LLM is updated to `nvcr.io/nvidia/pytorch:25.06-py3`.
- The base Docker image for the TensorRT-LLM Backend is updated to `nvcr.io/nvidia/tritonserver:25.06-py3`.
- The dependent NVIDIA ModelOpt version is updated to 0.33.
- The dependent xgrammar version is updated to 0.1.21.
- The dependent transformers version is updated to 4.53.1.
## API Changes
- BREAKING CHANGE Promote PyTorch to be the default LLM backend
- BREAKING CHANGE Change default backend to PyTorch in trtllm-serve
- BREAKING CHANGE Unify KvCacheConfig in the LLM class for the PyTorch backend (see the sketch after this list)
- BREAKING CHANGE Rename cuda_graph_config padding_enabled field
- BREAKING CHANGE Rename mixed_sampler to enable_mixed_sampler
- BREAKING CHANGE Rename LLM.autotuner_enabled to enable_autotuner
- Add back allreduce_strategy parameter into TorchLlmArgs
- Add LlmArgs option to force using dynamic quantization
- Change default LoRA cache sizes and change peft_cache_config cache size fields to take effect when not explicitly set in lora_config
- Remove deprecated LoRA LLM args that are already specified in lora_config
- Add request_perf_metrics to LLMAPI
- Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
- Remove TrtGptModelOptionalParams
- Remove ptuning knobs from TorchLlmArgs
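As a hedged illustration of the renamed knobs above, the sketch below constructs an LLM with the new PyTorch-backend defaults. The `enable_mixed_sampler` name reflects the rename called out in the notes; the `KvCacheConfig` field shown is an assumption drawn from the public LLM API, so verify it against the 1.0 API reference.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# PyTorch is now the default backend, so no backend flag is needed.
# `free_gpu_memory_fraction` is an assumed KvCacheConfig field for illustration.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),
    enable_mixed_sampler=True,  # renamed from `mixed_sampler`
)
```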
## Fixed Issues
- Fix illegal memory access in MLA (#6437)
- Fix nemotronNAS loading for TP>1 (#6447)
- Fix wide EP when using DeepEP with online EPLB (#6429)
- Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
- Fix PD + MTP + overlap scheduler accuracy issue (#6136)
- Fix bug of Qwen3 when using fp4 on sm120 (#6065)
- Fix TMA error with GEMM+AR on TP=2 (#6075)
- Fix scaffolding aime test in test_e2e (#6140)
- Fix KV Cache overrides in trtllm-bench (#6103)
- Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
- Fix eagle3 two model disaggregated serving test (#6014)
- Fix chunked prefill + overlap scheduling (#5761)
- Fix mgmn postprocess error (#5835)
- Fallback to cubins for fp8 fmha kernels on Ada (#5779)
- Fix disagg + speculative decoding (#5558)
- Fix test_generate_with_seed CI failure. (#5772)
- Fix prompt adapter TP2 case (#5782)
- Fix disaggregate serving with attention DP (#4993)
- Fix a quote error introduced in #5534 (#5816)
- Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801)
- Fix lost requests for disaggregated serving (#5815)
- Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
- Fix GEMM+AR fusion on blackwell (#5563)
- Fix llama4 multimodal support (#5809)
- Fix Llama4 Scout FP4 crash issue (#5925)
- Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
- Fix moe regression for sm120 (#5823)
- Fix Qwen2.5VL FP8 support (#5029)
- Fix the illegal memory access issue in moe gemm on SM120 (#5636)
- Fix the case where tileN is not divisible by 16 and add SM89 DeepGEMM BMM support (#5531)
- Fix incremental detokenization (#5825)
- Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
- Fix mistral unit tests due to transformers upgrade (#5904)
- Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
- Fix Gemma3 unit tests due to transformers upgrade (#5921)
- Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
- Remove SpecConfig and fix thread leak issues (#5931)
- Fast redux detection in trtllm gen routing kernel (#5941)
- Fix cancel request logic (#5800)
- Fix errors in wide-ep scripts (#5992)
- Fix error in post-merge-tests (#5949)
- Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
- Fix attention DP not working with embedding TP (#5642)
- Fix broken cyclic reference detection (#5417)
- Fix permission for local user issues in NGC docker container. (#5373)
- Fix mtp vanilla draft inputs (#5568)
- Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
- Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
- Fix the issue where the MoE autotune fallback failed to query the default heuristic (#5520)
- Fix the unexpected keyword argument 'streaming' (#5436)
## Known Issues
- When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release; in the meantime, disabling KV cache reuse avoids the issue (see the sketch after this list).
- Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release.
- For the Llama 3.x and Llama 4 models, there is an issue with pipeline parallelism when using FP8 and NVFP4 weights. As a workaround, you can set the environment variable `export TRTLLM_LLAMA_EAGER_FUSION_DISABLED=1`.
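For the first known issue, a hedged sketch of disabling KV cache reuse through the LLM API follows; the `enable_block_reuse` field name is an assumption about the public KvCacheConfig, so confirm it against the API reference. Disaggregated deployments configure the equivalent option in their serving configs.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Workaround sketch: turn off KV cache block reuse to avoid the hang seen
# with disaggregated serving + pipeline parallelism. Field name assumed.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    kv_cache_config=KvCacheConfig(enable_block_reuse=False),
)
```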
## What's Changed
- Qwen3: Fix eagle hidden states by @IzzyPutterman in #6199
- [None][fix] Upgrade dependencies version to avoid security vulnerability by @yibinl-nvidia in #6506
- [None][chore] update readme for perf release test by @ruodil in #6664
- [None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml by @ruodil in #6662
- [None][fix] Explicitly add tiktoken as required by kimi k2 by @pengbowang-nv in #6663
- [None][doc]: remove the outdated features which marked as Experimental by @nv-guomingz in #5995
- [https://nvbugs/5375966][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one by @yweng0828 in #6658
- [TRTLLM-6892][infra] Run guardwords scan first in Release Check stage by @yiqingy0 in #6659
- [None][chore] optimize kv cache transfer for context TEP and gen DEP by @chuangz0 in #6657
- [None][chore] Bump version to 1.0.0 by @yiqingy0 in #6652
- [None][test] Add Mistral Small 3.1 24B accuracy test to QA test list by @StanleySun639 in #6682
- [None][test] cherry-pick: correct test-db context for perf yaml file and add mistral cases by @ruodil in #6688
- [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6703
- [TRTLLM-6656][chore] Validate FP8 support for Gemma3 by @brb-nv in #6678
- [TRTLLM-5574][test] Add NIM required VLM models multi-gpu test by @crazydemo in #6687
- [TRTLLM-6675][infra] Nixl test completion by @bo-nv in #6623
- [None][test] fix yml condition error under qa folder by @ruodil in #6733
- [None][doc] Add doc for multimodal feature support matrix (#6619) by @nv-guomingz in #6739
- [https://nvbugs/5344910][fix] Corrected memory position when setting buffers to 0 in standalone_stable_radix_topk_ by @stnie in #6712
- [https://nvbugs/5442608][fix] Update CUDA graph config for get_model_yaml_config. by @yuxianq in #6693
- [TRTLLM-4721][test] Add qa test for llm-api by @Superjomn in #6727
- [https://nvbugs/5409420][fix] Fix test_ptp_star_attention_example by @Superjomn in #6584
- [https://nvbugs/5444624][fix] Fix LLM_ROOT in triton_backend build.sh by @yiqingy0 in #6744
- [https://nvbugs/5429689][fix] Fix mllama model structure update with transformers issue by @dominicshanshan in #6699
- [None][chore] remove out-of-date comment in star attention test by @Superjomn in #6773
- [https://nvbugs/5431127][fix] Run test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8] only on hopper by @bo-nv in #6737
- [None][infra] Waive failed tests on release branch 0811 by @EmmaQiaoCh in #6782
- [https://nvbugs/5444095][infra] waive test_ptp_quickstart_multimodal llava test by @yechank-nvidia in #6795
- [TRTLLM-5252][fix] Propagate mapping to intermediate layers (#6611) by @2ez4bz in #6765
- [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6786
- [None][feat] adding support for disaggregated multi-instance tests by @raayandhar in #6674
- [None][infra] Avoid intermittent access broken to nvcr.io by @chzblych in #6715
- [https://nvbugs/5383702][fix] error propagation in GenerationExecutor by @Superjomn in #6793
- [https://nvbugs/5445774][fix] Unwaive Gemma3 27B fp8 test by @brb-nv in #6799
- [None][fix] fix CUDA graph config for test_llm_api_pytorch.py. by @yuxianq in #6826
- [TRTLLM-6975][test] Add multi-turn test cases for VLM models by @crazydemo in #6749
- [None][chore] waive GB300 known issues by @xinhe-nv in #6812
- [None][fix] fix Llama3 eagle3 test case OOM by @crazydemo in #6832
- [https://nvbugs/5375594][fix] fix oom issue on structural_tag test case by @nv-guomingz in #6838
- [https://nvbugs/5401114][fix] Unwaive Gemma3 tests by @brb-nv in #6870
- [TRTLLM-5252][feat] Add fp8 support for Mistral Small 3.1 by @2ez4bz in #6731
- [None][infra] Setup the code review rule on the release branch by @yiqingy0 in #6725
- [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6820
- [None][fix] Fix batching bug in Mistral3 model by @2ez4bz in #6841
- [None][fix] Revert phi4-mm aggregate mode by @amukkara in #6907
- [None][fix] Complete the last missing allreduce op in Llama3/4. by @hyukn in #6850
- [None][chore] Add docs for Gemma3 VLMs by @brb-nv in #6880
- [None][doc] add legacy section for tensorrt engine by @Superjomn in #6724
- [TRTLLM-7048][feat] add benchmark TRT flow test for MIG by @xinhe-nv in #6884
- [https://nvbugs/5451434][fix] Fix triton docker build by @Tabrizian in #6898
- [TRTLLM-6481][fix] Fix deepseek r1 accuracy issue by @pengbowang-nv in #6868
- [None][ci] unwaive test_ptp_star_attention_example by @Superjomn in #6943
- [https://nvbugs/5455836][fix] Fix llama 4 FP4 by @mikeiovine in #6911
- [None][infra] update CODEOWNERS for release by @venkywonka in #6905
- [https://nvbugs/5453667] [fix] reverting a breaking change: make trtllm-bench `enable_chunked_context` defaults backend-dependent by @venkywonka in #6956
- [https://nvbugs/5405041][fix] Update wide ep doc by @qiaoxj07 in #6950
- [https://nvbugs/5412562][feat] Allocate MoE workspace only when necessary (release/1.0 retargeted) by @nv-yilinf in #6955
- [TRTLLM-6835][fix] Fix potential hang caused by python multiprocessing when prefetching weights by @lancelly in #6927
- [https://nvbugs/5448525][fix] Mistral Small 3.1 accuracy tests by @2ez4bz in #6909
- [https://nvbugs/5375646][fix] update waives.txt for nvbug 5375646 by @nv-guomingz in #6847
- [None][fix] update skip config by @crazydemo in #6891
- [https://nvbugs/5449218][fix] Fix KvCacheConfig error in test_perf by @peaceh-nv in #6937
- [None][infra] Waive failed tests for release branch 0818 by @EmmaQiaoCh in #6993
- [None][chore] Remove duplicate test waives by @yiqingy0 in #6999
- [None][infra] Cherry-pick #6836 from main branch and improve SSH connection by @chzblych in #6971
- [https://nvbugs/5462007][ci] Unwaive Mistral Small 3.1 FP8 test by @2ez4bz in #7008
- [https://nvbugs/5449155][fix] Fix DeepSeek R1 weight loading for TP16 by @achartier in #6913
- [https://nvbugs/5374016][fix] improve error message by @QiJune in #6893
- [https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels (release 1.0) by @PerkzZheng in #6946
- [https://nvbugs/5451028][fix] Constrain NemotronSuper test parameters… by @Naveassaf in #6987
- [https://nvbugs/5448579][fix] EXAONE-4.0 accuracy test bugfix by @yechank-nvidia in #6888
- [None][chore] Waive E2E GB200 tests for Gemma3 27B by @brb-nv in #6916
- [https://nvbugs/5451296][bug] Fix a thread leak in test_llm_args.py by @Tabrizian in #7017
- [None][infra] Waive failed tests for release branch 08/19 by @EmmaQiaoCh in #7036
- [None][doc] add status labels to LLM class's api reference by @Superjomn in #6899
- [https://nvbugs/5448437][fix] fix some nixl tests by @bo-nv in #6940
- [https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by @liji-nv in #6978
- [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 by @yifeizhang-c in #6975
- [TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call by @amitz-nv in #7053
- [None][fix] Fix build of tritonbuild/tritonrelease image by @dbari in #7003
- [None][doc] update v1.0 doc for trtllm-serve by @hchings in #7056
- [https://nvbugs/5440241][fix] Fix 70B GSM8K Accuracy drop by @chenfeiz0326 in #7075
- [https://nvbugs/5451296][fix] zmq nonblock bug with retry by @Superjomn in #7019
- [https://nvbugs/5383702][fix] test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::test_fp8_4gpus by @Superjomn in #6889
- [https://nvbugs/5392414] [fix] For release 1.0 cherry pick. Add customized default routing method by @ChristinaZ in #7068
- [https://nvbugs/5464088] [fix] Guard against fp8 activations in lora forward; update perf test config by @venkywonka in #7014
- [None][infra] Skip failed tests for release branch 08/21 by @EmmaQiaoCh in #7130
- [https://nvbugs/5448442][fix] Skip trtllm moe backend for sm120 by @pamelap-nvidia in #7010
- [https://nvbugs/5449032][fix] Add more llm-args to llm_mgmn_trtllm_bench.sh by @brb-nv in #7144
- [https://nvbugs/5410391][bug] Support to share device buffers in attention meta by @HuiGao-NV in #6557
- [https://nvbugs/5467062][fix] pass logitsPostProcessorBatched by reference by @milesial in #7110
- [https://nvbugs/5450074][fix] Reduce the device memory requirements for testing by @Shixiaowei02 in #6990
- [https://nvbugs/5474037][fix] Fix building tritonbuild/tritonrelease images by @dbari in #7157
- [https://nvbugs/5433545][fix] TestPhi4MiniInstruct::test_auto_dtype - Use max_seq_len=4096 to fallback to the short RoPE factor by @moraxu in #6895
- [https://nvbugs/5461712] [fix] Disable deep_gemm for Qwen3 due to accuracy issues by @DomBrown in #7170
- [TRTLLM-6825][fix] Update lora for phi4-mm by @Wanli-Jiang in #7149
- [https://nvbugs/5448426][fix] Fix illegal memory access in cuda graph by @peaceh-nv in #7127
- [None][fix] Switch llm api quickstart example location per workflow. by @nv-guomingz in #7182
- [https://nvbugs/5467232][fix] Fix load_torch_hf_lora to override lora_config.trtllm_modules_to_hf_modules with default only when it has no value by @Wanli-Jiang in #7168
- [None][doc] fix tensorrt legacy quickstart page by @Superjomn in #7190
- [TRTLLM-7030][fix] BREAKING CHANGE: Mismatch between docs and actual commands by @Shixiaowei02 in #7191
- [https://nvbugs/5470840][fix] Disaggregated unit test MPI Init handling by @pcastonguay in #7139
- [None][test] add kv cache size in bench metric and fix failed cases by @ruodil in #7211
- [None][fix] update skip case by @crazydemo in #7193
- [https://nvbugs/5409416][fix] test_openai_multi_chat_example by @Linda-Stadter in #7174
- [https://nvbugs/5473789][bug] install cuda-toolkit to fix sanity check by @HuiGao-NV in #7159
- [None][fix] fix log_once usage by @yuxianq in #7210
- [None][infra] Waive failed cases for release/1.0 08/26 by @EmmaQiaoCh in #7258
- [https://nvbugs/5451342][fix] Use runtime max_batch_size when cuda_graph_config.max_batch_size is not provided in trtllm-bench by @jiaganc in #7031
- [None][feat] Skip prefetching consolidated safetensors when appropriate by @2ez4bz in #7225
- [https://nvbugs/5430125][ci] Unwaive test case for mistral 3.1 small by @2ez4bz in #7265
- [https://nvbugs/5478151][fix] Add missing spec for Llama-3.3 70B by @brb-nv in #7267
- [https://nvbugs/5451426][fix] Avoid torch compile on full eagle3 worker by @liji-nv in #7245
- [https://nvbugs/5448767][fix] fix mpi4py deadlocks in pp event-loop by @reasonsolo in #6976
- [https://nvbugs/5463720][fix] tp-split the inferred `mlp_hidden_size` for nemotron-nas by @venkywonka in #7231
- [https://nvbugs/5480550][fix] Increase timeout for Gemma3 27B test by @brb-nv in #7271
- [https://nvbugs/5434320][bug] Fix disagg pp bug by @Tabrizian in #7099
- [https://nvbugs/5480415][fix] Fix phi4mm multi-gpu test by @Wanli-Jiang in #7275
- [TRTLLM-7346][fix] Improve performance of PyTorchModelEngine._get_lora_params_from_requests by @amitz-nv in #7203
- [https://nvbugs/5467548][fix] DeepSeek illegal memory access. by @bobboli in #7298
- [https://nvbugs/5448767][fix] disable kv cache reuse for disagg pp>1 tests by @reasonsolo in #7354
- [https://nvbugs/5445466][fix] Eliminate race when loading HF dynamic modules (#7268) by @chang-l in #7379
- [https://nvbugs/5474169][fix]Adjust max seq len for kvcache for memory estimation by @HuiGao-NV in #7391
- [https://nvbugs/5448754][fix] Download HF model for all nodes. by @yuxianq in #6824
- [None][infra] Waive failed tests on release branch 0901 by @EmmaQiaoCh in #7448
- [None][doc] add blackwell information into support matrix by @nv-guomingz in #6740
- [TRTLLM-7008][fix] cherrypick fix to 1.0 Add automatic shared memory delete if already exist by @dongxuy04 in #7433
- [https://nvbugs/5351244][fix] test_mpi_session by @Superjomn in #7501
- [https://nvbugs/5461761][fix] Remove the waiver by @ziyixiong-nv in #7427
- [TRTLLM-5930][doc] 1.0 Documentation. by @nv-guomingz in #6696
- [https://nvbugs/5496960][fix] Fix Gemma model forward. by @hyukn in #7509
- [None][doc] Update kvcache part by @nv-guomingz in #7549
- [None][doc] Rename TensorRT-LLM to TensorRT LLM. by @nv-guomingz in #7554
- [https://nvbugs/5416501][doc] add known issues to llmapi doc by @Superjomn in #7560
- [None][doc] Fix a invalid link. by @nv-guomingz in #7617
- [https://nvbugs/5474169][fix] seq_len mismatch between kv cache manager and graph attn metadata by @HuiGao-NV in #7606
- [https://nvbugs/5503423][waive] Waive Llama3.1-70B-FP8 test on RTX PRO 6000 by @peaceh-nv in #7603
- [https://nvbugs/5455140][fix] unwaive release/1.0 DS R1 test cases with bug already fixed by @lancelly in #7432
- [https://nvbugs/5470782][chore] Remove the skip statement in 1.0 rele… by @SimengLiu-nv in #7573
- [None][doc] Fix a invalid link and a typo. by @nv-guomingz in #7634
- [None][doc] Use hash id for external link by @nv-guomingz in #7641
- [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larg… by @WeiHaocheng in #7671
- [https://nvbugs/5436461][fix] Adjust free_gpu_memory_fraction of test_eagle3 by @leslie-fang25 in #7673
- [https://nvbugs/5474409][fix] Disable concurrent loading by default by @nv-guomingz in #7663
- [https://nvbugs/5501557][fix] Fix out-of-bounds vector access for model with multiple layer types by @brb-nv in #7636
- [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #7681
- [None][ci] Test waives for the release/1.0 branch 09/15 by @chzblych in #7700
- [None][doc] Add labels description note into llm api section by @nv-guomingz in #7696
- [https://nvbugs/5437405][fix] cherry-pick PR 7000 (qwen3 235b eagle3 ci) by @byshiue in #7702
- [https://nvbugs/5512734][fix] Update kv cache config for maverick by @crazydemo in #7710
- [https://nvbugs/5355219][fix] Fix trtllm moe backend test config and Qwen3 MoE multi node by @yizhang-nv in #7724
- [None][doc] Fix the link in the doc by @Shixiaowei02 in #7754
- [https://nvbugs/5519525][fix] fix doc invalid link for bug 5519525 by @nv-guomingz in #7753
- [https://nvbugs/5509024][fix] Print full parsed outputs and update keywords for multimodal model by @Wanli-Jiang in #7670
- [None][doc] Enhance api reference doc by labeling stable APIs by @Superjomn in #7751
- [https://nvbugs/5468897][fix] fix invalid expression for disabling pa… by @nv-guomingz in #7762
- [https://nvbugs/5517023][fix] Pass allreduce strategy and force NCCL on pre-Blackwell arch by @hyukn in #7768
- [TRTLLM-7958][doc] add 1.0 release notes by @nv-guomingz in #7605
- [https://nvbugs/5522332][fix] Pin numpy version for Gemma. by @yuxianq in #7783
- [None][doc] Update docker cmd in quick start guide and trtllm-serve … by @nv-guomingz in #7787
- [https://nvbugs/1234567][fix] Revert https://github.com/NVIDIA/TensorRT-LLM/pull/7768/files by @litaotju in #7813
- [https://nvbugs/5516710][fix] fix Llama 3.3 TP PP case by @Superjomn in #7717
- [None][doc] Replace the main in the examples' link with commit id. by @nv-guomingz in #7837
- [None][doc] Rename TensorRT-LLM to TensorRT LLM for homepage and the … by @nv-guomingz in #7850
- [None][doc] add a guide for modifying APIs by @Superjomn in #7866
- [None][doc] Update Perf-Overview.md for release/1.0 by @zbpatel in #7848
- [None][doc] add stable label to all the un-labelled arguments in LLM class by @Superjomn in #7863
- [None][fix] api stability bug in status label by @Superjomn in #7861
- [https://nvbugs/5427043][fix] cherrypick: request length exceeds max_num_tokens by @Superjomn in #7718
- [https://nvbugs/5531963][fix] cherry pick #7725 by @QiJune in #7907
- [None][chroe] Rename TensorRT-LLM to TensorRT LLM for source code. by @nv-guomingz in #7851
- [None][doc] fix invalid links in perf benchmarking. by @nv-guomingz in #7933
Full Changelog: v1.0.0rc6...v1.0.0