NVIDIA/TensorRT-LLM v1.2.0rc4

Pre-release · 11 hours ago

Announcement Highlights

  • Model Support

    • Optimize DeepSeek FP8 activation kernel for TRT-LLM Gen MoE (#9175)
    • Disable fp8 deep GEMM for EXAONE-4.0-32B-FP8 (#8429)
    • Fix output unpack issues for Llama3/4 NVFP4 models (#8679)
  • API

    • Support out-of-tree models in trtllm-serve (#9269)
  • Feature

    • Make sharing of activation_type across SW layers more robust (#9238)
    • Create communication related classes (#8968)
    • Integrate CuteDSL NVFP4 grouped GEMM with SwiGLU fusion and finalize fusion (#9288)
    • Add PostNorm and multilayer options for Eagle models (#9233)
  • Fix

    • Use fp32 for indexer weight_proj GEMM (#9243)
    • Fix multimodal InputProcessor dummy builder (#8916)
    • Set correct lm_head_tp_size_upper_bound (#9300)
    • Move torch.cuda.Stream out of critical torch computation region (#8494)
    • Fix trtllm-llmapi-launch port conflict (#8582)
    • Rework DisaggPPTerminationHandler to fix hang issue (#8519)
    • Overwrite only if default_max_tokens is legal (#8538)
    • Fix block range index (#8470)
    • Restrict FP8 blockscale MoE case to valid configurations (#8583)
    • Fix L0_backend_trtllm behavior (#9282)
    • Improve beam search request validation (#9228)
    • Avoid incorrectly filling tensors with 0 (#9296)
    • Fallback to greedy sampling in two-model overlap scheduler to improve stability (#9321)
  • Documentation

    • Revise the description of enable_autotuner (#9320)
    • Document the process for C++ dependencies (#9016)
  • Benchmark

    • Set max_batch_size=1 to stabilize accuracy test results (#8609)
  • Test & Infra

    • Use greedy decoding in test_openai_compatible_json_schema (#9305)
    • Enable checking duplicate items in waives.txt in pre-commit (#9265)
    • Fix test case where chunked attention is not supported on sm_120 (#9260)
    • Add NCCL_DEBUG=INFO flag to collect more information on CI failures (#8440)
    • Remove multimodal test cases using TRT backend (#8611)
    • Clean cache for easily hanging test cases (#8619)
    • Enable relaxed acceptance test on Blackwell (#8709)
    • Update linter rules for mass integration (#8918)
    • Upgrade starlette and FastAPI dependencies (#9319)
    • Update goggles_action repository (#9240)
    • Move third-party components to their own list file (#8986)
    • Add fallback when fetching wheel from build stage fails (#9290)
    • Add --waives-file flag in rerun pytest command (#8971)
    • Add periodic JUnit XML path in conftest (#9337)
    • Consume SlurmCluster sshPort for clusters with custom SSH port (#9313)
    • Add one-model and overlap-scheduling to Eagle tests for GPTOSS (#9312)

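Several entries above (the greedy-sampling fallback in the two-model overlap scheduler, #9321, and forcing greedy decoding in the JSON-schema and accuracy tests, #9305 and #8609) share one motivation: stochastic sampling makes results non-reproducible, so stability-sensitive paths decode deterministically by taking the argmax of the logits. A minimal illustrative sketch in plain Python follows; the function and names are hypothetical for illustration, not TensorRT-LLM APIs:

```python
import math
import random

def sample_token(logits, greedy=False, rng=random):
    """Pick a token id from raw logits.

    greedy=True  -> deterministic argmax (stable across runs)
    greedy=False -> draw from the softmax distribution
    """
    if greedy:
        # Deterministic: always return the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with max-subtraction for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    # Multinomial draw over the normalized probabilities.
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(logits) - 1

logits = [0.1, 2.5, -1.0, 0.3]
print(sample_token(logits, greedy=True))  # -> 1, every time
```

Falling back to `greedy=True` trades output diversity for run-to-run determinism, which is why the fixes above apply it in schedulers and tests where flaky outputs were the problem.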
What's Changed

  • [#9316][feat] AutoDeploy: Add the accuracy test for Nemotron MOE models by @nvchenghaoz in #9317
  • [#9096][feature] Auto Deploy: configurable fused MoE backend by @nzmora-nvidia in #9194
  • [None][fix] Use fp32 for indexer weight_proj GEMM by @chang-l in #9243
  • [None][fix] Multimodal InputProcessor dummy builder fix by @yechank-nvidia in #8916
  • [None][ci] waive test_disagg_server_restart by @QiJune in #9326
  • [None][chore] Revise the description of enable_autotuner. by @hyukn in #9320
  • [TRTLLM-9295][fix] use greedy decoding in test_openai_compatible_json_schema by @ixlmar in #9305
  • [TRTLLM-9164][infra] Enable checking duplicate items in waives.txt in pre-commit by @EmmaQiaoCh in #9265
  • [#9236][feature] Make sharing of activation_type across SW layers more robust by @nzmora-nvidia in #9238
  • [https://nvbugs/5667687][fix] Set correct lm_head_tp_size_upper_bound by @lancelly in #9300
  • [https://nvbugs/5667454][test] Fix test case where chunked attention is not supported on sm_120 by @yufeiwu-nv in #9260
  • [None][chore] Weekly mass integration of release/1.1 by @mikeiovine in #8918
  • [None][chore] Upgrade starlette and FastAPI by @tburt-nv in #9319
  • [None][infra] Update goggles_action repository by @karljang in #9240
  • [TRTLLM-9197][infra] Move third-party components to their own list file by @cheshirekow in #8986
  • [TRI-332] [fix] Fix L0_backend_trtllm by @yinggeh in #9282
  • [None][ci] waive test_llm_context_only_timed_out_kv_cache_exhausted by @QiJune in #9351
  • [None][infra] Add fallback when fetching wheel from build stage fails by @ZhanruiSunCh in #9290
  • [TRTLLM-9183][infra] Add --waives-file in rerun pytest command by @yiqingy0 in #8971
  • [TRTLLM-8957][feat] create communication related classes by @xxi-nv in #8968
  • [None][chore] Add periodic JUnit XML path in conftest by @crazydemo in #9337
  • [None][ci] waive a test case of test_ad_build_small_multi.py by @QiJune in #9355
  • [None][infra] Waive failed cases in main post-merge on 11/21 by @EmmaQiaoCh in #9360
  • [None][chore] Bump version to 1.2.0rc4 by @yiqingy0 in #9363
  • [TRTLLM-8650][fix] beam search request validation (#8433) by @ixlmar in #9228
  • [TRTLLM-9191][feat] support out-of-tree models in trtllm-serve by @ixlmar in #9269
  • [https://nvbugs/5629833][fix] Don't fill tensors by @HuiGao-NV in #9296
  • [None][feat] TRT-LLM Gen MoE optimize DeepSeek Fp8 activation kernel by @nekorobov in #9175
  • [https://nvbugs/5590408][fix] Fallback to greedy sampling in two-model overlap scheduler by @ziyixiong-nv in #9321
  • [TRTLLM-9208][infra] Document the process for C++ deps by @cheshirekow in #9016
  • [TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) by @syuoni in #9288
  • [None][feat] Eagle: PostNorm and multilayer options by @IzzyPutterman in #9233
  • [TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT by @nvchenghaoz in #9106
  • [#9388][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation by @nzmora-nvidia in #9339
  • [TRTINFRA-7326][infra] - Consume SlurmCluster sshPort for clusters with custom SSH port by @mlefeb01 in #9313
  • [None][test] Add one-model and overlap-scheduling to eagle tests for GPTOSS by @dongfengy in #9312

Full Changelog: v1.2.0rc3...v1.2.0rc4
