Announcement Highlights:
- Model Support
- Feature
- Deepseek R1 FP8 Support on Blackwell (#6486)
- Auto-enable ngram with concurrency <= 32. (#6232)
- Support turning on/off spec decoding dynamically (#6363)
- Improve LoRA cache memory control (#6220)
- Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408)
- Update long rope for phi3.5/phi4-mini/phi4-mm (#6353)
- Add support for external multimodal embeddings (#6263)
- Add support for disaggregation with pipeline parallelism (PP) in the PyTorch backend (#6369)
- Add _prepare_and_schedule_batch function in PyExecutor (#6365)
- Add status tags to LLM API reference (#5707)
- Remove cudaStreamSynchronize when using relaxed acceptance (#5262)
- Support JSON Schema in OpenAI-Compatible API (#6321); see the usage sketch after the Known Issues list
- Support chunked prefill on spec decode 2 model (#6104)
- Enhance beam search support with CUDA graph integration (#6217)
- Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223)
- Add KV cache reuse support for multimodal models (#5444)
- Multistream initial support for torch compile flow (#5847)
- Support nanobind bindings (#6185)
- Support Weight-Only-Quantization in PyTorch Workflow (#5850)
- Support pytorch LoRA adapter eviction (#5616)
- API
- [BREAKING CHANGE] Change default backend to PyTorch in trtllm-serve (#5717)
- Bug Fixes
- fix: remove duplicate layer multiplication in KV cache size calculation (#6481)
- Fix illegal memory access in MLA (#6437)
- Fix nemotronNAS loading for TP>1 (#6447)
- Switch placement of image placeholder for mistral 3.1 (#6435)
- Fix wide EP when using DeepEP with online EPLB (#6429)
- Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests (#6463)
- Fix bugs caused by None attention_bias during Qwen3 model engine conversion (#6344)
- Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
- Fix PD + MTP + overlap scheduler accuracy issue (#6136)
- Fix bug of Qwen3 when using fp4 on sm120 (#6065)
- Benchmark
- Performance
- Infrastructure
- Documentation
- Known Issues
- If you encounter the `OSError: CUDA_HOME environment variable is not set` error, set the `CUDA_HOME` environment variable
- The aarch64 Docker image and wheel package for 1.0.0rc5 are broken. This will be fixed in the upcoming weekly release
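For the `CUDA_HOME` known issue above, a minimal sketch; the toolkit path is an assumption, so point it at your actual CUDA installation (or set the variable in your shell before launching):

```python
import os

# Assumed location; adjust to wherever your CUDA toolkit is installed,
# or `export CUDA_HOME=/usr/local/cuda` in the shell before launching.
os.environ.setdefault("CUDA_HOME", "/usr/local/cuda")

import tensorrt_llm  # import only after CUDA_HOME is set

print(tensorrt_llm.__version__)
```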
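For the JSON Schema support in the OpenAI-compatible API (#6321), a hedged sketch using the `openai` Python client against a locally running `trtllm-serve` instance; the port, model name, and the exact `response_format` payload the server accepts are assumptions here, not details from this release:

```python
from openai import OpenAI

# Assumes trtllm-serve is already listening on localhost:8000 (adjust as needed).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# A JSON Schema the generated response must conform to.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="my-served-model",  # hypothetical name; use the model you actually serve
    messages=[{"role": "user", "content": "Name a large city and give its population."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
)
print(response.choices[0].message.content)  # expected to be JSON matching the schema
```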
What's Changed
- DeepEP LL support variable hidden size and tokens num by @yilin-void in #6141
- [Fix][Chore][Qwen3] fix bug of using fp4 on sm120 by @byshiue in #6065
- fix: Ensure mlx5 library is installed for deep_ep and remove deprecated python bindings by @MartinMarciniszyn in #6189
- [TRTLLM-5826][feat] Support pytorch LoRA adapter eviction by @amitz-nv in #5616
- W4A8 GEMM by @danielafrimi in #6005
- enh: Lift expectation of single image per sample in Gemma3 VLM by @brb-nv in #6195
- test: add phi-4 multimodal and bielik-11b-v2.2 models for perf test by @ruodil in #5826
- fix: Flush stale `PlanParams` with custom attention mask by @brb-nv in #6163
- doc: remove cuda_graph_config: {} from doc since cuda_graph enabled b… by @nv-guomingz in #6150
- [fix] Fix can_use_alltoall in fused_moe_wide_ep.py by @jinyangyuan-nvidia in #6173
- [TRTLLM-5863][feat] Support Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #5850
- test: [CI] remove closed bugs by @xinhe-nv in #6201
- feat: nanobind bindings by @Linda-Stadter in #6185
- infra: [TRTLLM-5250] Add sanity check stage for ngc-release images (Build wheels for devel image) by @ZhanruiSunCh in #4656
- doc: add Deprecation Policy section by @QiJune in #5784
- [TRTLLM-4279] feat: Multistream initial support for torch compile flow by @liji-nv in #5847
- [Infra] - Waive failed cases on recent post-merge by @EmmaQiaoCh in #6212
- [BREAKING CHANGE]: change default backend to PyTorch in trtllm-serve by @LinPoly in #5717
- test: Enable GB200 torch compile multi gpu tests by @yizhang-nv in #6145
- [fix] Correct the returned value of has_spec_drafter by @ziyixiong-nv in #6178
- [chore] Clean up quickstart_advanced.py by @mikeiovine in #6021
- [Chore] Replace MODEL_CACHE_DIR with LLM_MODELS_ROOT and unwaive triton_server/test_triton.py::test_gpt_ib[gpt-ib] by @SimengLiu-nv in #5859
- [TRTLLM-5059][feat] Add KV cache reuse support for multimodal models by @chang-l in #5444
- feat: Refactor the fetching request logic by @Shunkangz in #5786
- tests: add timeout_manager to tensorrt flow test cases by @crazydemo in #5942
- feat: moe prepare support topk % 4 != 0 by @WeiHaocheng in #5742
- [fix] Fix flaky mistral E2E test by @2ez4bz in #6230
- bug: [https://nvbugs/5368507] Fix test_generate_with_seed. by @bobboli in #6206
- chore: Mass integration of release/0.21 (part 4) by @dc3671 in #6211
- doc: add supported data modality and types on multimodal serve by @yechank-nvidia in #5988
- chore: bump version to 1.0.0rc5 by @yiqingy0 in #6252
- [TRTLLM-6537][infra] extend multi-gpu tests related file list by @reasonsolo in #6139
- test: update test list for RTX6KD by @StanleySun639 in #6213
- fix: bindings unit tests for nanobind by @Linda-Stadter in #6221
- Add register_fake for finegrained_mixed_dtype_gemm torch_op by @danielafrimi in #6255
- [Issue 6193] Fix gemma3vl weight loader by @johncalesp in #6233
- [feat] Enable TP and batching for PixtralVisionModel / Mistral3VLM by @2ez4bz in #6152
- set NVIDIA_IMEX_CHANNELS for dlcluster slurm job only by @yuanjingx87 in #6234
- [nvbug/5361223] doc: Update Llama4 deployment guide: update config & note concurrency by @raayandhar in #6222
- [AutoDeploy] merge feat/ad-2025-07-07 by @lucaslie in #6196
- [nvbugs/5401261][fix] Fix Triton backend disaggregated serving support by @Tabrizian in #6224
- [refactor] Simplification of Speculative decoding configs - Part 2 by @wili-65535 in #5936
- doc: Refactor documents and examples of disaggregated serving and wide ep by @kaiyux in #6054
- Add basic Nemo Ckpt Lora Loading in pytorch flow by @venkywonka in #6019
- [https://nvbugs/5387771] fix deadlocks due to insufficient numSemaphores by @PerkzZheng in #6262
- fix: nvbug_5398806 by @hchings in #6239
- chore: set default device to cpu on Multimodal models by @yechank-nvidia in #5994
- chore: remove duplicate should_stop_processing check by @QiJune in #6242
- hopper-style context MLA by @zhou-yuxin in #5713
- [nvbug/5322354] fix PD + MTP + overlap scheduler accuracy issue by @yweng0828 in #6136
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6289
- [TRTLLM-6651][feat] Enable Overlap scheduler + Beam Search in TRTLLM Sampler by @stnie in #6223
- [Infra] - Skip failed cases by @EmmaQiaoCh in #6299
- [AutoDeploy] disable flaky MoE nvfp4 test by @lucaslie in #6302
- [feat] Update .coderabbit.yaml with review settings and code guidelines by @venkywonka in #6251
- Waive tests by @Tabrizian in #6312
- [Infra] - Increase unittest execution time since some tests exceed 1600 by @EmmaQiaoCh in #6277
- Revert "tests: add timeout_manager to tensorrt flow test cases (#5942)" by @Tabrizian in #6309
- doc: fix invalid links related with llm api example by @nv-guomingz in #6317
- chore: remove unused variables in pyexecutor by @QiJune in #6280
- [TRTLLM-6444] Add some UCX trouble shooting docs and print UCX related logs by @reasonsolo in #6085
- feat: Add non UB AR + Residual + Norm + Quant fusion by @liji-nv in #6320
- Update fmhaRunner.cpp to fix guardwords scan error by @zhou-yuxin in #6327
- tests: only get timeout value from pytest marker by @crazydemo in #6287
- [Infra] - Waive failed tests in post-merge by @EmmaQiaoCh in #6331
- [Fix][nvbug 5401163][nvbug 5404726][Qwen3] Fix bug of MoE on tp > 1 with trtllm moe backend by @byshiue in #6235
- perf: customize cublasLt algo for Llama 3.3 70B TP4 by @zhenhuaw-me in #6315
- [Fix] the bug in the trtllm-gen heuristic for MLA kernels. by @PerkzZheng in #6284
- Improve TransferAgentTest.SyncMessage by @bo-nv in #6250
- [TRTLLM-6650][feat] Enhance beam search support with CUDA graph integration by @stnie in #6217
- [fix] Update to remove popping of KV cache and other args. by @FrankD412 in #6310
- [fix][nvbugs/5399355] Fix Lamport buffer clear issue for MNNVL TwoShot Allreduce and add FP16 support. by @timlee0212 in #6237
- fix: integration tests with nanobind by @Linda-Stadter in #6326
- [TRTLLM-6453][feat] Support chunked prefill on spec decode 2 model by @mikeiovine in #6104
- test: skip llama3.3 70b test on cg4 by @xinhe-nv in #6293
- [TRTLLM-5312] - Add bot run rules for triton tests by @yiqingy0 in #4988
- tests: add test_chunked_prefill for llama4 by @xinhe-nv in #5549
- [https://nvbugs/5340941] - fix: Correct custom ops used by Qwen3 Moe … by @liji-nv in #6285
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6333
- [feat]: support logit_bias by @xq25478 in #5354
- fix: Fixing kv_cache_events unit tests [nvbug5362412] by @pcastonguay in #6265
- feat: Support JSON Schema in OpenAI-Compatible API by @noiji in #6321
- [doc] Add NGram tech blog by @SimengLiu-nv in #6311
- Mtp optimizations round1 by @ameynaik-hub in #5689
- [fix][nvbugs/5390810] Improve the check for disaggregated serving test by @Tabrizian in #6301
- [nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache by @moraxu in #5974
- fix precompiled multi_query_token kernel not having is_fp8_out hash key by @jhaotingc in #6279
- [fix] README link directs to intended doc by @lianakoleva in #6340
- [https://nvbugs/5402719][fix]: Add cuda graph dummy requests to the spec_resource_manager by @ziyixiong-nv in #6258
- [nvbug/5320234] fix: test_trtllm_bench_llmapi_launch by @Superjomn in #6359
- fix: remove cudaStreamSynchronize when using relaxed acceptance by @yweng0828 in #5262
- [TRTLLM-6445] feat: Enable AllReduce-associated fusion patterns in Llama3/4. by @hyukn in #6205
- DeepEP LL dispatch FP4 by @yilin-void in #6296
- [nvbugs/5401156][fix] Avoid import all models when import trtllm._common by @chang-l in #6266
- [fix] Fix perf regression caused by MoE autotuner when using DeepEPLowLatency by @jinyangyuan-nvidia in #6288
- Add Acceptance Rate calculation to benchmark_serving by @zerollzeng in #6240
- [Infra] - waive failed cases and fix a typo by @EmmaQiaoCh in #6384
- [nvbug/5409414, 5355707] tests: adjust batchsize and decoding name by @crazydemo in #6292
- [TRTLLM-5061] chore: add status tags to LLM API reference by @Superjomn in #5707
- fix: compatibility with CUDA < 12.9 on `__CUDA_ARCH_SPECIFIC__` macro by @tongyuantongyu in #5917
- chore: add _prepare_and_schedule_batch function in PyExecutor by @QiJune in #6365
- test: waive failed cases by @xinhe-nv in #6394
- test: organize perf cases and add missing perflab cases in qa test list by @ruodil in #6283
- chore: delete useless gitkeep files. by @nv-guomingz in #6400
- [test] Add accuracy regression test for Mistral3.1 by @2ez4bz in #6322
- [test] Unwaive mistral3.1 small E2E test by @2ez4bz in #6352
- [None][infra]Update slurm config keys by @yuanjingx87 in #6370
- [infra] Add an auto-labeling github action to TRTLLM by @poweiw in #6373
- [nvbugs/5404000] fix: waive request_perf_metrics_draft test on pre-Hopper GPUs by @achartier in #6339
- feat: Add Phi-4-Mini-Instruct in Pytorch backend for LLM API accuracy tests by @moraxu in #6303
- [infra] Remove auto_apply_labels option from .coderabbit.yaml reviews section by @venkywonka in #6416
- [fix] Add trust_remote_code option to prepare_dataset. by @FrankD412 in #6338
- infra: [TRTLLM-6499] Split L0_Test into two pipeline by single GPU and multi GPU(For SBSA) by @ZhanruiSunCh in #6132
- doc: Add README for wide EP by @kaiyux in #6356
- [fix] Fixes to parameter usage and low latency configuration. by @FrankD412 in #6343
- test:[nvbug 5415268] add kv_cache_free_gpu_mem_fraction param and llama4 rcca cases by @ruodil in #6430
- chore: remove unused code in PyExecutor by @QiJune in #6351
- [5385981] fix: Update the usage of VisionAttention init API. by @hyukn in #6413
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6423
- test: [CI] remove closed bugs by @xinhe-nv in #6381
- doc: remove backend parameter for trtllm-bench when backend is set to… by @nv-guomingz in #6428
- infra: [TRTLLM-5873] Use build stage wheels to speed up docker release image build by @ZhanruiSunCh in #4939
- [fix] Add detokenization-based stop word logic to LLM API by @moraxu in #5948
- chore: remove unused kv_cache_dtype in api reference by @Superjomn in #6444
- chore: disallow arbitrary arguments in llm_args.xxxConfigs by @Superjomn in #6367
- [FIX] fix bugs caused by None attention_bias during Qwen3 model convert engine by @fyf2016 in #6344
- fix: support mixture of text & multimodal prompts by @yechank-nvidia in #6345
- [fix] Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests by @venkywonka in #6463
- Rename layer to comply with deepseek by @peaceh-nv in #6393
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #6457
- [TRTLLM-6549] chore: record delay introduced by disaggregated serving in kv cache measure by @zhengd-nv in #6135
- [fix] Fix wide EP when using DeepEP with online EPLB by @jinyangyuan-nvidia in #6429
- [fix] Switch placement of image placeholder for mistral 3.1 by @2ez4bz in #6435
- chore: clean code of PyExecutor by @QiJune in #6445
- chore: remove draft_model_engine from init parameter list of PyExecutor by @QiJune in #6325
- chore: add trtllm-serve json schema example into doc. by @nv-guomingz in #6418
- tests: add TestNemotronH cuda graph tests by @xinhe-nv in #6390
- [nvbugs/5414909] fix: Qwen2-VL keyword on L20 by @yechank-nvidia in #6427
- [TRTLLM-5633] - Merge current waive list with the TOT waive list by @yiqingy0 in #5198
- [doc] update the doc of feature combination matrix by @leslie-fang25 in #6441
- [nvbug 5380101][fix] Fix nemotronNAS loading for TP>1 by @tomeras91 in #6447
- feat: Add support for disaggregation with pp with pytorch backend by @pcastonguay in #6369
- [TRTLLM-6654][feat] Add support for external multimodal embeddings by @chang-l in #6263
- chore: update trtllm-serve usage doc by removing backend parameter when it uses torch as backend. by @nv-guomingz in #6419
- fix: Unwaive triton cpp test [nvbug 5401088] by @pcastonguay in #6412
- [Perf]: Add residual, norm for nemotron_nas models by @NVShreyas in #6455
- feat: TRTLLM-6450 update long rope for phi3.5/phi4-mini/phi4-mm by @Wanli-Jiang in #6353
- [nvbug/5410296][fix] Fix OOM in Llama 4 disagg-serve tests by @bo-nv in #6439
- Unwaive Gemma2 LoRA test on H100 by @brb-nv in #6461
- [nvbug/5409417] Unwaive llava test case by @amukkara in #6460
- add propagation of trust_remote_code to OpenAIServer by @shaharmor98 in #6446
- test: Add time logging for lora tests by @brb-nv in #6466
- [PERF] Move calculation Qwen2-VL's rotary_cos_sin to LLM worker process by @vadiklyutiy in #6004
- doc: update multimodal models on support-matrix.md by @yechank-nvidia in #6431
- [doc][ci][Qwen3][nvbugs 5374145] Add Qwen3 235B eagle3 CI by @byshiue in #6477
- feat: Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 by @syuoni in #6408
- fix: fix illegal memory access by @dongjiyingdjy in #6437
- test: add accuracy reference by @xinhe-nv in #6479
- Fix e2e test failure for RTX6000 Pro by @farazkh80 in #6420
- doc: add bielik on support-matrix.md by @Wanli-Jiang in #6480
- [infra] Remove auto_assign_reviewers option from .coderabbit.yaml by @venkywonka in #6490
- [TRTLLM-5830][feat] Improve LoRA cache memory control by @amitz-nv in #6220
- [Infra][TRTLLM-5633] - Fix merge waive list by @yiqingy0 in #6504
- [None][infra] Pin the version for triton to 3.3.1 by @EmmaQiaoCh in #6508
- Bugfix/fix nemotron nas lora support by @shaharmor98 in #6380
- [https://nvbugs/5404046][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test by @tomeras91 in #6485
- [nvbug/5374773] chore: Update nanobind with fail_fast_on_attention_window_too_large changes by @moraxu in #6491
- [TRTLLM-6392][feat] Support turning on/off spec decoding dynamically by @ziyixiong-nv in #6363
- [feat] Auto-enable ngram with concurrency <= 32. by @SimengLiu-nv in #6232
- fix: Fix poor generation with FP8 Gemma3 1B checkpoint by @brb-nv in #6499
- chore: Improve the AutoTuner log information. by @hyukn in #6368
- [TRTLLM-6611][feat] Add warnings and stricter validation to LoraManager adapter loading by @venkywonka in #6453
- Deepseek R1 FP8 Support on Blackwell by @zongfeijing in #6486
- fix: remove duplicate layer multiplication in KV cache size calculation by @jaedeok-nvidia in #6481
New Contributors
- @Yuening-wa made their first contribution in #5850
- @zhou-yuxin made their first contribution in #5713
- @ameynaik-hub made their first contribution in #5689
- @lianakoleva made their first contribution in #6340
- @fyf2016 made their first contribution in #6344
- @NVShreyas made their first contribution in #6455
- @vadiklyutiy made their first contribution in #6004
Full Changelog: v1.0.0rc4...v1.0.0rc5