NVIDIA/TensorRT-LLM v1.0.0rc5

Pre-release · one month ago

Announcement Highlights:

  • Model Support
  • Feature
    • Deepseek R1 FP8 Support on Blackwell (#6486)
    • Auto-enable ngram with concurrency <= 32. (#6232)
    • Support turning on/off spec decoding dynamically (#6363)
    • Improve LoRA cache memory control (#6220)
    • Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408)
    • Update long rope for phi3.5/phi4-mini/phi4-mm (#6353)
    • Add support for external multimodal embeddings (#6263)
    • Add support for disaggregation with pp with pytorch backend (#6369)
    • Add _prepare_and_schedule_batch function in PyExecutor (#6365)
    • Add status tags to LLM API reference (#5707)
    • Remove cudaStreamSynchronize when using relaxed acceptance (#5262)
    • Support JSON Schema in OpenAI-Compatible API (#6321)
    • Support chunked prefill on spec decode 2 model (#6104)
    • Enhance beam search support with CUDA graph integration (#6217)
    • Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223)
    • Add KV cache reuse support for multimodal models (#5444)
    • Multistream initial support for torch compile flow (#5847)
    • Support nanobind bindings (#6185)
    • Support Weight-Only-Quantization in PyTorch Workflow (#5850)
    • Support pytorch LoRA adapter eviction (#5616)
  • API
    • [BREAKING CHANGE] Change default backend to PyTorch in trtllm-serve (#5717)
  • Bug Fixes
    • fix: remove duplicate layer multiplication in KV cache size calculation (#6481)
    • Fix illegal memory access in MLA (#6437)
    • Fix nemotronNAS loading for TP>1 (#6447)
    • Switch placement of image placeholder for mistral 3.1 (#6435)
    • Fix wide EP when using DeepEP with online EPLB (#6429)
    • Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests (#6463)
    • Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
    • Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
    • Fix PD + MTP + overlap scheduler accuracy issue (#6136)
    • Fix bug of Qwen3 when using fp4 on sm120 (#6065)
  • Benchmark
    • Fixes to parameter usage and low latency configuration. (#6343)
    • Add Acceptance Rate calculation to benchmark_serving (#6240)
  • Performance
    • Enable AllReduce-associated fusion patterns in Llama3/4. (#6205)
    • Optimize Mtp performance (#5689)
    • Customize cublasLt algo for Llama 3.3 70B TP4 (#6315)
    • Add non UB AR + Residual + Norm + Quant fusion (#6320)
  • Infrastructure
    • Remove auto_assign_reviewers option from .coderabbit.yaml (#6490)
    • Use build stage wheels to speed up docker release image build (#4939)
  • Documentation
    • Add README for wide EP (#6356)
    • Update Llama4 deployment guide: update config & note concurrency (#6222)
    • Add Deprecation Policy section (#5784)
  • Known Issues
    • If you encounter the error OSError: CUDA_HOME environment variable is not set, set the CUDA_HOME environment variable to point at your CUDA installation (a minimal Python sketch follows this list)
    • The aarch64 Docker image and wheel package for 1.0.0rc5 are broken. This will be fixed in the upcoming weekly release
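
For the CUDA_HOME known issue above, here is a minimal sketch of one way to work around it from Python, assuming the toolkit lives at /usr/local/cuda (adjust the path to your system); exporting the variable in your shell before launching TensorRT-LLM works equally well.

```python
import os

# Workaround sketch for the CUDA_HOME known issue: make the variable visible
# to the process before importing tensorrt_llm. The path is an assumption --
# point it at wherever your CUDA toolkit is actually installed.
os.environ.setdefault("CUDA_HOME", "/usr/local/cuda")

import tensorrt_llm

print(tensorrt_llm.__version__)  # should report 1.0.0rc5 for this release
```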

What's Changed

  • DeepEP LL support variable hidden size and tokens num by @yilin-void in #6141
  • [Fix][Chore][Qwen3] fix bug of using fp4 on sm120 by @byshiue in #6065
  • fix: Ensure mlx5 library is installed for deep_ep and remove deprecated python bindings by @MartinMarciniszyn in #6189
  • [TRTLLM-5826][feat] Support pytorch LoRA adapter eviction by @amitz-nv in #5616
  • W4A8 GEMM by @danielafrimi in #6005
  • enh: Lift expectation of single image per sample in Gemma3 VLM by @brb-nv in #6195
  • test: add phi-4 multimodal and bielik-11b-v2.2 models for perf test by @ruodil in #5826
  • fix: Flush stale PlanParams with custom attention mask by @brb-nv in #6163
  • doc: remove cuda_graph_config: {} from doc since cuda_graph enabled b… by @nv-guomingz in #6150
  • [fix] Fix can_use_alltoall in fused_moe_wide_ep.py by @jinyangyuan-nvidia in #6173
  • [TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #5850
  • test: [CI] remove closed bugs by @xinhe-nv in #6201
  • feat: nanobind bindings by @Linda-Stadter in #6185
  • infra: [TRTLLM-5250] Add sanity check stage for ngc-release images (Build wheels for devel image) by @ZhanruiSunCh in #4656
  • doc: add Deprecation Policy section by @QiJune in #5784
  • [TRTLLM-4279] feat: Multistream initial support for torch compile flow by @liji-nv in #5847
  • [Infra] - Waive failed cases on recent post-merge by @EmmaQiaoCh in #6212
  • [BREAKING CHANGE]: change default backend to PyTorch in trtllm-serve by @LinPoly in #5717
  • test: Enable GB200 torch compile multi gpu tests by @yizhang-nv in #6145
  • [fix] Correct the returned value of has_spec_drafter by @ziyixiong-nv in #6178
  • [chore] Clean up quickstart_advanced.py by @mikeiovine in #6021
  • [Chore] Replace MODEL_CACHE_DIR with LLM_MODELS_ROOT and unwaive triton_server/test_triton.py::test_gpt_ib[gpt-ib] by @SimengLiu-nv in #5859
  • [TRTLLM-5059][feat] Add KV cache reuse support for multimodal models by @chang-l in #5444
  • feat: Refactor the fetching request logic by @Shunkangz in #5786
  • tests: add timeout_manager to tensorrt flow test cases by @crazydemo in #5942
  • feat: moe prepare support topk % 4 != 0 by @WeiHaocheng in #5742
  • [fix] Fix flaky mistral E2E test by @2ez4bz in #6230
  • bug: [https://nvbugs/5368507] Fix test_generate_with_seed. by @bobboli in #6206
  • chore: Mass integration of release/0.21 (part 4) by @dc3671 in #6211
  • doc: add supported data modality and types on multimodal serve by @yechank-nvidia in #5988
  • chore: bump version to 1.0.0rc5 by @yiqingy0 in #6252
  • [TRTLLM-6537][infra] extend multi-gpu tests related file list by @reasonsolo in #6139
  • test: update test list for RTX6KD by @StanleySun639 in #6213
  • fix: bindings unit tests for nanobind by @Linda-Stadter in #6221
  • Add register_fake for finegrained_mixed_dtype_gemm torch_op by @danielafrimi in #6255
  • [Issue 6193] Fix gemma3vl weight loader by @johncalesp in #6233
  • [feat] Enable TP and batching for PixtralVisionModel / Mistral3VLM by @2ez4bz in #6152
  • set NVIDIA_IMEX_CHANNELS for dlcluster slurm job only by @yuanjingx87 in #6234
  • [nvbug/5361223] doc: Update Llama4 deployment guide: update config & note concurrency by @raayandhar in #6222
  • [AutoDeploy] merge feat/ad-2025-07-07 by @lucaslie in #6196
  • [nvbugs/5401261][fix] Fix Triton backend disaggregated serving support by @Tabrizian in #6224
  • [refactor] Simplification of Speculative decoding configs - Part 2 by @wili-65535 in #5936
  • doc: Refactor documents and examples of disaggregated serving and wide ep by @kaiyux in #6054
  • Add basic Nemo Ckpt Lora Loading in pytorch flow by @venkywonka in #6019
  • [https://nvbugs/5387771] fix deadlocks due to insufficient numSemaphores by @PerkzZheng in #6262
  • fix: nvbug_5398806 by @hchings in #6239
  • chore: set default device to cpu on Multimodal models by @yechank-nvidia in #5994
  • chore: remove duplicate should_stop_processing check by @QiJune in #6242
  • hopper-style context MLA by @zhou-yuxin in #5713
  • [nvbug/5322354] fix PD + MTP + overlap scheduler accuracy issue by @yweng0828 in #6136
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6289
  • [TRTLLM-6651][feat] Enable Overlap scheduler + Beam Search in TRTLLM Sampler by @stnie in #6223
  • [Infra] - Skip failed cases by @EmmaQiaoCh in #6299
  • [AutoDeploy] disable flaky MoE nvfp4 test by @lucaslie in #6302
  • [feat] Update .coderabbit.yaml with review settings and code guidelines by @venkywonka in #6251
  • Waive tests by @Tabrizian in #6312
  • [Infra] - Increase unittest execution time since some test exceeds 1600 by @EmmaQiaoCh in #6277
  • Revert "tests: add timeout_manager to tensorrt flow test cases (#5942)" by @Tabrizian in #6309
  • doc: fix invalid links related with llm api example by @nv-guomingz in #6317
  • chore: remove unused variables in pyexecutor by @QiJune in #6280
  • [TRTLLM-6444] Add some UCX trouble shooting docs and print UCX related logs by @reasonsolo in #6085
  • feat: Add non UB AR + Residual + Norm + Quant fusion by @liji-nv in #6320
  • Update fmhaRunner.cpp to fix guardwords scan error by @zhou-yuxin in #6327
  • tests: only get timeout value from pytest marker by @crazydemo in #6287
  • [Infra] - Waive failed tests in post-merge by @EmmaQiaoCh in #6331
  • [Fix][nvbug 5401163][nvbug 5404726][Qwen3] Fix bug of MoE on tp > 1 with trtllm moe backend by @byshiue in #6235
  • perf: customize cublasLt algo for Llama 3.3 70B TP4 by @zhenhuaw-me in #6315
  • [Fix] the bug in the trtllm-gen heuristic for MLA kernels. by @PerkzZheng in #6284
  • Improve TransferAgentTest.SyncMessage by @bo-nv in #6250
  • [TRTLLM-6650][feat] Enhance beam search support with CUDA graph integration by @stnie in #6217
  • [fix] Update to remove popping of KV cache and other args. by @FrankD412 in #6310
  • [fix][nvbugs/5399355] Fix Lamport buffer clear issue for MNNVL TwoShot Allreduce and add FP16 support. by @timlee0212 in #6237
  • fix: integration tests with nanobind by @Linda-Stadter in #6326
  • [TRTLLM-6453][feat] Support chunked prefill on spec decode 2 model by @mikeiovine in #6104
  • test: skip llama3.3 70b test on cg4 by @xinhe-nv in #6293
  • [TRTLLM-5312] - Add bot run rules for triton tests by @yiqingy0 in #4988
  • tests: add test_chunked_prefill for llama4 by @xinhe-nv in #5549
  • [https://nvbugs/5340941] - fix: Correct custom ops used by Qwen3 Moe … by @liji-nv in #6285
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6333
  • [feat]: support logit_bias by @xq25478 in #5354
  • fix: Fixing kv_cache_events unit tests [nvbug5362412] by @pcastonguay in #6265
  • feat: Support JSON Schema in OpenAI-Compatible API by @noiji in #6321 (a hedged request sketch follows this list)
  • [doc] Add NGram tech blog by @SimengLiu-nv in #6311
  • Mtp optimizations round1 by @ameynaik-hub in #5689
  • [fix][nvbugs/5390810] Improve the check for disaggregated serving test by @Tabrizian in #6301
  • [nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache by @moraxu in #5974
  • fix precompiled multi_query_token kernel not having is_fp8_out hash key by @jhaotingc in #6279
  • [fix] README link directs to intended doc by @lianakoleva in #6340
  • [https://nvbugs/5402719][fix]: Add cuda graph dummy requests to the spec_resource_manager by @ziyixiong-nv in #6258
  • [nvbug/5320234] fix: test_trtllm_bench_llmapi_launch by @Superjomn in #6359
  • fix: remove cudaStreamSynchronize when using relaxed acceptance by @yweng0828 in #5262
  • [TRTLLM-6445] feat: Enable AllReduce-associated fusion patterns in Llama3/4. by @hyukn in #6205
  • DeepEP LL dispatch FP4 by @yilin-void in #6296
  • [nvbugs/5401156][fix] Avoid import all models when import trtllm._common by @chang-l in #6266
  • [fix] Fix perf regression caused by MoE autotuner when using DeepEPLowLatency by @jinyangyuan-nvidia in #6288
  • Add Acceptance Rate calculation to benchmark_serving by @zerollzeng in #6240
  • [Infra] - waive failed cases and fix a typo by @EmmaQiaoCh in #6384
  • [nvbug/5409414, 5355707] tests: adjust batchsize and decoding name by @crazydemo in #6292
  • [TRTLLM-5061] chore: add status tags to LLM API reference by @Superjomn in #5707
  • fix: compatibility with CUDA < 12.9 on __CUDA_ARCH_SPECIFIC__ macro by @tongyuantongyu in #5917
  • chore: add _prepare_and_schedule_batch function in PyExecutor by @QiJune in #6365
  • test: waive failed cases by @xinhe-nv in #6394
  • test: organize perf cases and add missing perflab cases in qa test list by @ruodil in #6283
  • chore: delete useless gitkeep files. by @nv-guomingz in #6400
  • [test] Add accuracy regression test for Mistral3.1 by @2ez4bz in #6322
  • [test] Unwaive mistral3.1 small E2E test by @2ez4bz in #6352
  • [None][infra]Update slurm config keys by @yuanjingx87 in #6370
  • [infra] Add an auto-labeling github action to TRTLLM by @poweiw in #6373
  • [nvbugs/5404000] fix: waive request_perf_metrics_draft test on pre-Hopper GPUs by @achartier in #6339
  • feat: Add Phi-4-Mini-Instruct in Pytorch backend for LLM API accuracy tests by @moraxu in #6303
  • [infra] Remove auto_apply_labels option from .coderabbit.yaml reviews section by @venkywonka in #6416
  • [fix] Add trust_remote_code option to prepare_dataset. by @FrankD412 in #6338
  • infra: [TRTLLM-6499] Split L0_Test into two pipelines by single GPU and multi GPU (for SBSA) by @ZhanruiSunCh in #6132
  • doc: Add README for wide EP by @kaiyux in #6356
  • [fix] Fixes to parameter usage and low latency configuration. by @FrankD412 in #6343
  • test:[nvbug 5415268] add kv_cache_free_gpu_mem_fraction param and llama4 rcca cases by @ruodil in #6430
  • chore: remove unused code in PyExecutor by @QiJune in #6351
  • [5385981] fix: Update the usage of VisionAttention init API. by @hyukn in #6413
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6423
  • test: [CI] remove closed bugs by @xinhe-nv in #6381
  • doc: remove backend parameter for trtllm-bench when backend is set to… by @nv-guomingz in #6428
  • infra: [TRTLLM-5873] Use build stage wheels to speed up docker release image build by @ZhanruiSunCh in #4939
  • [fix] Add detokenization-based stop word logic to LLM API by @moraxu in #5948
  • chore: remove unused kv_cache_dtype in api reference by @Superjomn in #6444
  • chore: disallow arbitrary arguments in llm_args.xxxConfigs by @Superjomn in #6367
  • [FIX] fix bugs caused by None attention_bias during Qwen3 model convert engine by @fyf2016 in #6344
  • fix: support mixture of text & multimodal prompts by @yechank-nvidia in #6345
  • [fix] Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests by @venkywonka in #6463
  • Rename layer to comply with deepseek by @peaceh-nv in #6393
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6457
  • [TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure by @zhengd-nv in #6135
  • [fix] Fix wide EP when using DeepEP with online EPLB by @jinyangyuan-nvidia in #6429
  • [fix] Switch placement of image placeholder for mistral 3.1 by @2ez4bz in #6435
  • chore: clean code of PyExecutor by @QiJune in #6445
  • chore: remove draft_model_engine from init parameter list of PyExecutor by @QiJune in #6325
  • chore: add trtllm-serve json schema example into doc. by @nv-guomingz in #6418
  • tests: add TestNemotronH cuda graph tests by @xinhe-nv in #6390
  • [nvbugs/5414909] fix: Qwen2-VL keyword on L20 by @yechank-nvidia in #6427
  • [TRTLLM-5633] - Merge current waive list with the TOT waive list by @yiqingy0 in #5198
  • [doc] update the doc of feature combination matrix by @leslie-fang25 in #6441
  • [nvbug 5380101][fix] Fix nemotronNAS loading for TP>1 by @tomeras91 in #6447
  • feat: Add support for disaggregation with pp with pytorch backend by @pcastonguay in #6369
  • [TRTLLM-6654][feat] Add support for external multimodal embeddings by @chang-l in #6263
  • chore: update trtllm-serve usage doc by removing the backend parameter when torch is used as the backend. by @nv-guomingz in #6419
  • fix: Unwaive triton cpp test [nvbug 5401088] by @pcastonguay in #6412
  • [Perf]: Add residual, norm for nemotron_nas models by @NVShreyas in #6455
  • feat: TRTLLM-6450 update long rope for phi3.5/phi4-mini/phi4-mm by @Wanli-Jiang in #6353
  • [nvbug/5410296][fix] Fix OOM in Llama 4 disagg-serve tests by @bo-nv in #6439
  • Unwaive Gemma2 LoRA test on H100 by @brb-nv in #6461
  • [nvbug/5409417] Unwaive llava test case by @amukkara in #6460
  • add propagation of trust_remote_code to OpenAIServer by @shaharmor98 in #6446
  • test: Add time logging for lora tests by @brb-nv in #6466
  • [PERF] Move calculation Qwen2-VL's rotary_cos_sin to LLM worker process by @vadiklyutiy in #6004
  • doc: update multimodal models on support-matrix.md by @yechank-nvidia in #6431
  • [doc][ci][Qwen3][nvbugs 5374145] Add Qwen3 235B eagle3 CI by @byshiue in #6477
  • feat: Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 by @syuoni in #6408
  • fix: fix illegal memory access by @dongjiyingdjy in #6437
  • test: add accuracy reference by @xinhe-nv in #6479
  • Fix e2e test failure for RTX6000 Pro by @farazkh80 in #6420
  • doc: add bielik on support-matrix.md by @Wanli-Jiang in #6480
  • [infra] Remove auto_assign_reviewers option from .coderabbit.yaml by @venkywonka in #6490
  • [TRTLLM-5830][feat] Improve LoRA cache memory control by @amitz-nv in #6220
  • [Infra][TRTLLM-5633] - Fix merge waive list by @yiqingy0 in #6504
  • [None][infra] Pin the version for triton to 3.3.1 by @EmmaQiaoCh in #6508
  • Bugfix/fix nemotron nas lora support by @shaharmor98 in #6380
  • [https://nvbugs/5404046][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test by @tomeras91 in #6485
  • [nvbug/5374773] chore: Update nanobind with fail_fast_on_attention_window_too_large changes by @moraxu in #6491
  • [TRTLLM-6392][feat] Support turning on/off spec decoding dynamically by @ziyixiong-nv in #6363
  • [feat] Auto-enable ngram with concurrency <= 32. by @SimengLiu-nv in #6232
  • fix: Fix poor generation with FP8 Gemma3 1B checkpoint by @brb-nv in #6499
  • chore: Improve the AutoTuner log information. by @hyukn in #6368
  • [TRTLLM-6611][feat] Add warnings and stricter validation to LoraManager adapter loading by @venkywonka in #6453
  • Deepseek R1 FP8 Support on Blackwell by @zongfeijing in #6486
  • fix: remove duplicate layer multiplication in KV cache size calculation by @jaedeok-nvidia in #6481
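
For the JSON Schema support in the OpenAI-compatible API (#6321), the sketch below sends a structured-output request to a locally running trtllm-serve endpoint using the openai Python client. The base URL, model name, and the exact response_format payload accepted by the server are assumptions; the shape shown follows the OpenAI structured-output convention rather than a confirmed TensorRT-LLM signature.

```python
from openai import OpenAI

# Assumed endpoint of a local trtllm-serve instance; adjust host/port as needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# A small JSON Schema the model's output must conform to.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

resp = client.chat.completions.create(
    model="served-model-name",  # placeholder: the model name trtllm-serve reports
    messages=[{"role": "user", "content": "Name a large city and its population."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)
print(resp.choices[0].message.content)  # expected to be JSON matching the schema
```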

New Contributors

  • @Yuening-wa made their first contribution in #5850
  • @zhou-yuxin made their first contribution in #5713
  • @ameynaik-hub made their first contribution in #5689
  • @lianakoleva made their first contribution in #6340
  • @fyf2016 made their first contribution in #6344
  • @NVShreyas made their first contribution in #6455
  • @vadiklyutiy made their first contribution in #6004

Full Changelog: v1.0.0rc4...v1.0.0rc5
