NVIDIA/TensorRT-LLM v1.0.0rc5

Pre-release · one month ago

Announcement Highlights:

  • Model Support
  • Feature
    • Deepseek R1 FP8 Support on Blackwell (#6486)
    • Auto-enable ngram with concurrency <= 32. (#6232)
    • Support turning on/off spec decoding dynamically (#6363)
    • Improve LoRA cache memory control (#6220)
    • Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408)
    • Update long rope for phi3.5/phi4-mini/phi4-mm (#6353)
    • Add support for external multimodal embeddings (#6263)
    • Add support for disaggregation with pp with pytorch backend (#6369)
    • Add _prepare_and_schedule_batch function in PyExecutor (#6365)
    • Add status tags to LLM API reference (#5707)
    • Remove cudaStreamSynchronize when using relaxed acceptance (#5262)
    • Support JSON Schema in OpenAI-Compatible API (#6321)
    • Support chunked prefill on spec decode 2 model (#6104)
    • Enhance beam search support with CUDA graph integration (#6217)
    • Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223)
    • Add KV cache reuse support for multimodal models (#5444)
    • Multistream initial support for torch compile flow (#5847)
    • Support nanobind bindings (#6185)
    • Support Weight-Only-Quantization in PyTorch Workflow (#5850)
    • Support pytorch LoRA adapter eviction (#5616)
  • API
    • [BREAKING CHANGE] Change default backend to PyTorch in trtllm-serve (#5717)
  • Bug Fixes
    • fix: remove duplicate layer multiplication in KV cache size calculation (#6481)
    • Fix illegal memory access in MLA (#6437)
    • Fix nemotronNAS loading for TP>1 (#6447)
    • Switch placement of image placeholder for mistral 3.1 (#6435)
    • Fix wide EP when using DeepEP with online EPLB (#6429)
    • Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests (#6463)
    • Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
    • Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
    • Fix PD + MTP + overlap scheduler accuracy issue (#6136)
    • Fix bug of Qwen3 when using fp4 on sm120 (#6065)
  • Benchmark
    • Fixes to parameter usage and low latency configuration. (#6343)
    • Add Acceptance Rate calculation to benchmark_serving (#6240)
  • Performance
    • Enable AllReduce-associated fusion patterns in Llama3/4. (#6205)
    • Optimize Mtp performance (#5689)
    • Customize cublasLt algo for Llama 3.3 70B TP4 (#6315)
    • Add non UB AR + Residual + Norm + Quant fusion (#6320)
  • Infrastructure
    • Remove auto_assign_reviewers option from .coderabbit.yaml (#6490)
    • Use build stage wheels to speed up docker release image build (#4939)
  • Documentation
    • Add README for wide EP (#6356)
    • Update Llama4 deployment guide: update config & note concurrency (#6222)
    • Add Deprecation Policy section (#5784)
  • Known Issues
    • If you encounter the error OSError: CUDA_HOME environment variable is not set, set the CUDA_HOME environment variable to point at your CUDA installation (a minimal Python sketch follows this list)
    • The aarch64 Docker image and wheel package for 1.0.0rc5 are broken. This will be fixed in the upcoming weekly release
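
For the CUDA_HOME known issue above, here is a minimal sketch of one way to work around it from Python, assuming the toolkit lives at /usr/local/cuda (adjust the path to your system); exporting the variable in your shell before launching TensorRT-LLM works equally well.

```python
import os

# Workaround sketch for the CUDA_HOME known issue: make the variable visible
# to the process before importing tensorrt_llm. The path is an assumption --
# point it at wherever your CUDA toolkit is actually installed.
os.environ.setdefault("CUDA_HOME", "/usr/local/cuda")

import tensorrt_llm

print(tensorrt_llm.__version__)  # should report 1.0.0rc5 for this release
```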

What's Changed

  • DeepEP LL support variable hidden size and tokens num by @yilin-void in #6141
  • [Fix][Chore][Qwen3] fix bug of using fp4 on sm120 by @byshiue in #6065
  • fix: Ensure mlx5 library is installed for deep_ep and remove deprecated python bindings by @MartinMarciniszyn in #6189
  • [TRTLLM-5826][feat] Support pytorch LoRA adapter eviction by @amitz-nv in #5616
  • W4A8 GEMM by @danielafrimi in #6005
  • enh: Lift expectation of single image per sample in Gemma3 VLM by @brb-nv in #6195
  • test: add phi-4 multimodal and bielik-11b-v2.2 models for perf test by @ruodil in #5826
  • fix: Flush stale PlanParams with custom attention mask by @brb-nv in #6163
  • doc: remove cuda_graph_config: {} from doc since cuda_graph enabled b… by @nv-guomingz in #6150
  • [fix] Fix can_use_alltoall in fused_moe_wide_ep.py by @jinyangyuan-nvidia in #6173
  • [TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #5850
  • test: [CI] remove closed bugs by @xinhe-nv in #6201
  • feat: nanobind bindings by @Linda-Stadter in #6185
  • infra: [TRTLLM-5250] Add sanity check stage for ngc-release images (Build wheels for devel image) by @ZhanruiSunCh in #4656
  • doc: add Deprecation Policy section by @QiJune in #5784
  • [TRTLLM-4279] feat: Multistream initial support for torch compile flow by @liji-nv in #5847
  • [Infra] - Waive failed cases on recent post-merge by @EmmaQiaoCh in #6212
  • [BREAKING CHANGE]: change default backend to PyTorch in trtllm-serve by @LinPoly in #5717
  • test: Enable GB200 torch compile multi gpu tests by @yizhang-nv in #6145
  • [fix] Correct the returned value of has_spec_drafter by @ziyixiong-nv in #6178
  • [chore] Clean up quickstart_advanced.py by @mikeiovine in #6021
  • [Chore] Replace MODEL_CACHE_DIR with LLM_MODELS_ROOT and unwaive triton_server/test_triton.py::test_gpt_ib[gpt-ib] by @SimengLiu-nv in #5859
  • [TRTLLM-5059][feat] Add KV cache reuse support for multimodal models by @chang-l in #5444
  • feat: Refactor the fetching request logic by @Shunkangz in #5786
  • tests: add timeout_manager to tensorrt flow test cases by @crazydemo in #5942
  • feat: moe prepare support topk % 4 != 0 by @WeiHaocheng in #5742
  • [fix] Fix flaky mistral E2E test by @2ez4bz in #6230
  • bug: [https://nvbugs/5368507] Fix test_generate_with_seed. by @bobboli in #6206
  • chore: Mass integration of release/0.21 (part 4) by @dc3671 in #6211
  • doc: add supported data modality and types on multimodal serve by @yechank-nvidia in #5988
  • chore: bump version to 1.0.0rc5 by @yiqingy0 in #6252
  • [TRTLLM-6537][infra] extend multi-gpu tests related file list by @reasonsolo in #6139
  • test: update test list for RTX6KD by @StanleySun639 in #6213
  • fix: bindings unit tests for nanobind by @Linda-Stadter in #6221
  • Add register_fake for finegrained_mixed_dtype_gemm torch_op by @danielafrimi in #6255
  • [Issue 6193] Fix gemma3vl weight loader by @johncalesp in #6233
  • [feat] Enable TP and batching for PixtralVisionModel / Mistral3VLM by @2ez4bz in #6152
  • set NVIDIA_IMEX_CHANNELS for dlcluster slurm job only by @yuanjingx87 in #6234
  • [nvbug/5361223] doc: Update Llama4 deployment guide: update config & note concurrency by @raayandhar in #6222
  • [AutoDeploy] merge feat/ad-2025-07-07 by @lucaslie in #6196
  • [nvbugs/5401261][fix] Fix Triton backend disaggregated serving support by @Tabrizian in #6224
  • [refactor] Simplification of Speculative decoding configs - Part 2 by @wili-65535 in #5936
  • doc: Refactor documents and examples of disaggregated serving and wide ep by @kaiyux in #6054
  • Add basic Nemo Ckpt Lora Loading in pytorch flow by @venkywonka in #6019
  • [https://nvbugs/5387771] fix deadlocks due to insufficient numSemaphores by @PerkzZheng in #6262
  • fix: nvbug_5398806 by @hchings in #6239
  • chore: set default device to cpu on Multimodal models by @yechank-nvidia in #5994
  • chore: remove duplicate should_stop_processing check by @QiJune in #6242
  • hopper-style context MLA by @zhou-yuxin in #5713
  • [nvbug/5322354] fix PD + MTP + overlap scheduler accuracy issue by @yweng0828 in #6136
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6289
  • [TRTLLM-6651][feat] Enable Overlap scheduler + Beam Search in TRTLLM Sampler by @stnie in #6223
  • [Infra] - Skip failed cases by @EmmaQiaoCh in #6299
  • [AutoDeploy] disable flaky MoE nvfp4 test by @lucaslie in #6302
  • [feat] Update .coderabbit.yaml with review settings and code guidelines by @venkywonka in #6251
  • Waive tests by @Tabrizian in #6312
  • [Infra] - Increase unittest execution time since some test exceeds 1600 by @EmmaQiaoCh in #6277
  • Revert "tests: add timeout_manager to tensorrt flow test cases (#5942)" by @Tabrizian in #6309
  • doc: fix invalid links related with llm api example by @nv-guomingz in #6317
  • chore: remove unused variables in pyexecutor by @QiJune in #6280
  • [TRTLLM-6444] Add some UCX trouble shooting docs and print UCX related logs by @reasonsolo in #6085
  • feat: Add non UB AR + Residual + Norm + Quant fusion by @liji-nv in #6320
  • Update fmhaRunner.cpp to fix guardwords scan error by @zhou-yuxin in #6327
  • tests: only get timeout value from pytest marker by @crazydemo in #6287
  • [Infra] - Waive failed tests in post-merge by @EmmaQiaoCh in #6331
  • [Fix][nvbug 5401163][nvbug 5404726][Qwen3] Fix bug of MoE on tp > 1 with trtllm moe backend by @byshiue in #6235
  • perf: customize cublasLt algo for Llama 3.3 70B TP4 by @zhenhuaw-me in #6315
  • [Fix] the bug in the trtllm-gen heuristic for MLA kernels. by @PerkzZheng in #6284
  • Improve TransferAgentTest.SyncMessage by @bo-nv in #6250
  • [TRTLLM-6650][feat] Enhance beam search support with CUDA graph integration by @stnie in #6217
  • [fix] Update to remove popping of KV cache and other args. by @FrankD412 in #6310
  • [fix][nvbugs/5399355] Fix Lamport buffer clear issue for MNNVL TwoShot Allreduce and add FP16 support. by @timlee0212 in #6237
  • fix: integration tests with nanobind by @Linda-Stadter in #6326
  • [TRTLLM-6453][feat] Support chunked prefill on spec decode 2 model by @mikeiovine in #6104
  • test: skip llama3.3 70b test on cg4 by @xinhe-nv in #6293
  • [TRTLLM-5312] - Add bot run rules for triton tests by @yiqingy0 in #4988
  • tests: add test_chunked_prefill for llama4 by @xinhe-nv in #5549
  • [https://nvbugs/5340941] - fix: Correct custom ops used by Qwen3 Moe … by @liji-nv in #6285
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6333
  • [feat]: support logit_bias by @xq25478 in #5354
  • fix: Fixing kv_cache_events unit tests [nvbug5362412] by @pcastonguay in #6265
  • feat: Support JSON Schema in OpenAI-Compatible API by @noiji in #6321 (a hedged request sketch follows this list)
  • [doc] Add NGram tech blog by @SimengLiu-nv in #6311
  • Mtp optimizations round1 by @ameynaik-hub in #5689
  • [fix][nvbugs/5390810] Improve the check for disaggregated serving test by @Tabrizian in #6301
  • [nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache by @moraxu in #5974
  • fix precompiled multi_query_token kernel not having is_fp8_out hash key by @jhaotingc in #6279
  • [fix] README link directs to intended doc by @lianakoleva in #6340
  • [https://nvbugs/5402719][fix]: Add cuda graph dummy requests to the spec_resource_manager by @ziyixiong-nv in #6258
  • [nvbug/5320234] fix: test_trtllm_bench_llmapi_launch by @Superjomn in #6359
  • fix: remove cudaStreamSynchronize when using relaxed acceptance by @yweng0828 in #5262
  • [TRTLLM-6445] feat: Enable AllReduce-associated fusion patterns in Llama3/4. by @hyukn in #6205
  • DeepEP LL dispatch FP4 by @yilin-void in #6296
  • [nvbugs/5401156][fix] Avoid import all models when import trtllm._common by @chang-l in #6266
  • [fix] Fix perf regression caused by MoE autotuner when using DeepEPLowLatency by @jinyangyuan-nvidia in #6288
  • Add Acceptance Rate calculation to benchmark_serving by @zerollzeng in #6240
  • [Infra] - waive failed cases and fix a typo by @EmmaQiaoCh in #6384
  • [nvbug/5409414, 5355707] tests: adjust batchsize and decoding name by @crazydemo in #6292
  • [TRTLLM-5061] chore: add status tags to LLM API reference by @Superjomn in #5707
  • fix: compatibility with CUDA < 12.9 on __CUDA_ARCH_SPECIFIC__ macro by @tongyuantongyu in #5917
  • chore: add _prepare_and_schedule_batch function in PyExecutor by @QiJune in #6365
  • test: waive failed cases by @xinhe-nv in #6394
  • test: organize perf cases and add missing perflab cases in qa test list by @ruodil in #6283
  • chore: delete useless gitkeep files. by @nv-guomingz in #6400
  • [test] Add accuracy regression test for Mistral3.1 by @2ez4bz in #6322
  • [test] Unwaive mistral3.1 small E2E test by @2ez4bz in #6352
  • [None][infra]Update slurm config keys by @yuanjingx87 in #6370
  • [infra] Add an auto-labeling github action to TRTLLM by @poweiw in #6373
  • [nvbugs/5404000] fix: waive request_perf_metrics_draft test on pre-Hopper GPUs by @achartier in #6339
  • feat: Add Phi-4-Mini-Instruct in Pytorch backend for LLM API accuracy tests by @moraxu in #6303
  • [infra] Remove auto_apply_labels option from .coderabbit.yaml reviews section by @venkywonka in #6416
  • [fix] Add trust_remote_code option to prepare_dataset. by @FrankD412 in #6338
  • infra: [TRTLLM-6499] Split L0_Test into two pipelines by single GPU and multi GPU (for SBSA) by @ZhanruiSunCh in #6132
  • doc: Add README for wide EP by @kaiyux in #6356
  • [fix] Fixes to parameter usage and low latency configuration. by @FrankD412 in #6343
  • test:[nvbug 5415268] add kv_cache_free_gpu_mem_fraction param and llama4 rcca cases by @ruodil in #6430
  • chore: remove unused code in PyExecutor by @QiJune in #6351
  • [5385981] fix: Update the usage of VisionAttention init API. by @hyukn in #6413
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6423
  • test: [CI] remove closed bugs by @xinhe-nv in #6381
  • doc: remove backend parameter for trtllm-bench when backend is set to… by @nv-guomingz in #6428
  • infra: [TRTLLM-5873] Use build stage wheels to speed up docker release image build by @ZhanruiSunCh in #4939
  • [fix] Add detokenization-based stop word logic to LLM API by @moraxu in #5948
  • chore: remove unused kv_cache_dtype in api reference by @Superjomn in #6444
  • chore: disallow arbitrary arguments in llm_args.xxxConfigs by @Superjomn in #6367
  • [FIX] fix bugs caused by None attention_bias during Qwen3 model convert engine by @fyf2016 in #6344
  • fix: support mixture of text & multimodal prompts by @yechank-nvidia in #6345
  • [fix] Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests by @venkywonka in #6463
  • Rename layer to comply with deepseek by @peaceh-nv in #6393
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6457
  • [TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure by @zhengd-nv in #6135
  • [fix] Fix wide EP when using DeepEP with online EPLB by @jinyangyuan-nvidia in #6429
  • [fix] Switch placement of image placeholder for mistral 3.1 by @2ez4bz in #6435
  • chore: clean code of PyExecutor by @QiJune in #6445
  • chore: remove draft_model_engine from init parameter list of PyExecutor by @QiJune in #6325
  • chore: add trtllm-serve json schema example into doc. by @nv-guomingz in #6418
  • tests: add TestNemotronH cuda graph tests by @xinhe-nv in #6390
  • [nvbugs/5414909] fix: Qwen2-VL keyword on L20 by @yechank-nvidia in #6427
  • [TRTLLM-5633] - Merge current waive list with the TOT waive list by @yiqingy0 in #5198
  • [doc] update the doc of feature combination matrix by @leslie-fang25 in #6441
  • [nvbug 5380101][fix] Fix nemotronNAS loading for TP>1 by @tomeras91 in #6447
  • feat: Add support for disaggregation with pp with pytorch backend by @pcastonguay in #6369
  • [TRTLLM-6654][feat] Add support for external multimodal embeddings by @chang-l in #6263
  • chore: update trtllm-serve usage doc by removing the backend parameter when torch is used as the backend. by @nv-guomingz in #6419
  • fix: Unwaive triton cpp test [nvbug 5401088] by @pcastonguay in #6412
  • [Perf]: Add residual, norm for nemotron_nas models by @NVShreyas in #6455
  • feat: TRTLLM-6450 update long rope for phi3.5/phi4-mini/phi4-mm by @Wanli-Jiang in #6353
  • [nvbug/5410296][fix] Fix OOM in Llama 4 disagg-serve tests by @bo-nv in #6439
  • Unwaive Gemma2 LoRA test on H100 by @brb-nv in #6461
  • [nvbug/5409417] Unwaive llava test case by @amukkara in #6460
  • add propagation of trust_remote_code to OpenAIServer by @shaharmor98 in #6446
  • test: Add time logging for lora tests by @brb-nv in #6466
  • [PERF] Move calculation Qwen2-VL's rotary_cos_sin to LLM worker process by @vadiklyutiy in #6004
  • doc: update multimodal models on support-matrix.md by @yechank-nvidia in #6431
  • [doc][ci][Qwen3][nvbugs 5374145] Add Qwen3 235B eagle3 CI by @byshiue in #6477
  • feat: Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 by @syuoni in #6408
  • fix: fix illegal memory access by @dongjiyingdjy in #6437
  • test: add accuracy reference by @xinhe-nv in #6479
  • Fix e2e test failure for RTX6000 Pro by @farazkh80 in #6420
  • doc: add bielik on support-matrix.md by @Wanli-Jiang in #6480
  • [infra] Remove auto_assign_reviewers option from .coderabbit.yaml by @venkywonka in #6490
  • [TRTLLM-5830][feat] Improve LoRA cache memory control by @amitz-nv in #6220
  • [Infra][TRTLLM-5633] - Fix merge waive list by @yiqingy0 in #6504
  • [None][infra] Pin the version for triton to 3.3.1 by @EmmaQiaoCh in #6508
  • Bugfix/fix nemotron nas lora support by @shaharmor98 in #6380
  • [https://nvbugs/5404046][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test by @tomeras91 in #6485
  • [nvbug/5374773] chore: Update nanobind with fail_fast_on_attention_window_too_large changes by @moraxu in #6491
  • [TRTLLM-6392][feat] Support turning on/off spec decoding dynamically by @ziyixiong-nv in #6363
  • [feat] Auto-enable ngram with concurrency <= 32. by @SimengLiu-nv in #6232
  • fix: Fix poor generation with FP8 Gemma3 1B checkpoint by @brb-nv in #6499
  • chore: Improve the AutoTuner log information. by @hyukn in #6368
  • [TRTLLM-6611][feat] Add warnings and stricter validation to LoraManager adapter loading by @venkywonka in #6453
  • Deepseek R1 FP8 Support on Blackwell by @zongfeijing in #6486
  • fix: remove duplicate layer multiplication in KV cache size calculation by @jaedeok-nvidia in #6481
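
For the JSON Schema support in the OpenAI-compatible API (#6321), the sketch below sends a structured-output request to a locally running trtllm-serve endpoint using the openai Python client. The base URL, model name, and the exact response_format payload accepted by the server are assumptions; the shape shown follows the OpenAI structured-output convention rather than a confirmed TensorRT-LLM signature.

```python
from openai import OpenAI

# Assumed endpoint of a local trtllm-serve instance; adjust host/port as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# A small JSON Schema the model's output must conform to.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

resp = client.chat.completions.create(
    model="served-model-name",  # placeholder: the model name trtllm-serve reports
    messages=[{"role": "user", "content": "Name a large city and its population."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)
print(resp.choices[0].message.content)  # expected to be JSON matching the schema
```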

New Contributors

  • @Yuening-wa made their first contribution in #5850
  • @zhou-yuxin made their first contribution in #5713
  • @ameynaik-hub made their first contribution in #5689
  • @lianakoleva made their first contribution in #6340
  • @fyf2016 made their first contribution in #6344
  • @NVShreyas made their first contribution in #6455
  • @vadiklyutiy made their first contribution in #6004

Full Changelog: v1.0.0rc4...v1.0.0rc5
