NVIDIA/TensorRT-LLM v1.2.0rc4

Pre-release · 11 hours ago

Announcement Highlights

  • Model Support

    • Optimize DeepSeek FP8 activation kernel for TRT-LLM Gen MoE (#9175)
    • Disable fp8 deep GEMM for EXAONE-4.0-32B-FP8 (#8429)
    • Fix output unpack issues for Llama3/4 NVFP4 models (#8679)
  • API

    • Support out-of-tree models in trtllm-serve (#9269)
  • Feature

    • Make sharing of activation_type across SW layers more robust (#9238)
    • Create communication related classes (#8968)
    • Integrate CuteDSL NVFP4 grouped GEMM with SwiGLU fusion and finalize fusion (#9288)
    • Add PostNorm and multilayer options for Eagle models (#9233)
  • Fix

    • Use fp32 for indexer weight_proj GEMM (#9243)
    • Fix multimodal InputProcessor dummy builder (#8916)
    • Set correct lm_head_tp_size_upper_bound (#9300)
    • Move torch.cuda.Stream out of critical torch computation region (#8494)
    • Fix trtllm-llmapi-launch port conflict (#8582)
    • Rework DisaggPPTerminationHandler to fix hang issue (#8519)
    • Overwrite only if default_max_tokens is legal (#8538)
    • Fix block range index (#8470)
    • Restrict FP8 blockscale MoE case to valid configurations (#8583)
    • Fix L0_backend_trtllm behavior (#9282)
    • Improve beam search request validation (#9228)
    • Avoid incorrectly filling tensors with 0 (#9296)
    • Fallback to greedy sampling in two-model overlap scheduler to improve stability (#9321)
  • Documentation

    • Revise the description of enable_autotuner (#9320)
    • Document the process for C++ dependencies (#9016)
  • Benchmark

    • Set max_batch_size=1 to stabilize accuracy test results (#8609)
  • Test & Infra

    • Use greedy decoding in test_openai_compatible_json_schema (#9305)
    • Enable checking duplicate items in waives.txt in pre-commit (#9265)
    • Fix test case where chunked attention is not supported on sm_120 (#9260)
    • Add NCCL_DEBUG=INFO flag to collect more information on CI failures (#8440)
    • Remove multimodal test cases using TRT backend (#8611)
    • Clean cache for easily hanging test cases (#8619)
    • Enable relaxed acceptance test on Blackwell (#8709)
    • Update linter rules for mass integration (#8918)
    • Upgrade starlette and FastAPI dependencies (#9319)
    • Update goggles_action repository (#9240)
    • Move third-party components to their own list file (#8986)
    • Add fallback when fetching wheel from build stage fails (#9290)
    • Add --waives-file flag in rerun pytest command (#8971)
    • Add periodic JUnit XML path in conftest (#9337)
    • Consume SlurmCluster sshPort for clusters with custom SSH port (#9313)
    • Add one-model and overlap-scheduling to Eagle tests for GPTOSS (#9312)

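Several entries above (the greedy-sampling fallback in the two-model overlap scheduler, #9321, and forcing greedy decoding in the JSON-schema and accuracy tests, #9305 and #8609) share one motivation: stochastic sampling makes results non-reproducible, so stability-sensitive paths decode deterministically by taking the argmax of the logits. A minimal illustrative sketch in plain Python follows; the function and names are hypothetical for illustration, not TensorRT-LLM APIs:

```python
import math
import random

def sample_token(logits, greedy=False, rng=random):
    """Pick a token id from raw logits.

    greedy=True  -> deterministic argmax (stable across runs)
    greedy=False -> draw from the softmax distribution
    """
    if greedy:
        # Deterministic: always return the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax with max-subtraction for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    # Multinomial draw over the normalized probabilities.
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(logits) - 1

logits = [0.1, 2.5, -1.0, 0.3]
print(sample_token(logits, greedy=True))  # -> 1, every time
```

Falling back to `greedy=True` trades output diversity for run-to-run determinism, which is why the fixes above apply it in schedulers and tests where flaky outputs were the problem.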
What's Changed

  • [#9316][feat] AutoDeploy: Add the accuracy test for Nemotron MOE models by @nvchenghaoz in #9317
  • [#9096][feature] Auto Deploy: configurable fused MoE backend by @nzmora-nvidia in #9194
  • [None][fix] Use fp32 for indexer weight_proj GEMM by @chang-l in #9243
  • [None][fix] Multimodal InputProcessor dummy builder fix by @yechank-nvidia in #8916
  • [None][ci] waive test_disagg_server_restart by @QiJune in #9326
  • [None][chore] Revise the description of enable_autotuner. by @hyukn in #9320
  • [TRTLLM-9295][fix] use greedy decoding in test_openai_compatible_json_schema by @ixlmar in #9305
  • [TRTLLM-9164][infra] Enable checking duplicate items in waives.txt in pre-commit by @EmmaQiaoCh in #9265
  • [#9236][feature] Make sharing of activation_type across SW layers more robust by @nzmora-nvidia in #9238
  • [https://nvbugs/5667687][fix] Set correct lm_head_tp_size_upper_bound by @lancelly in #9300
  • [https://nvbugs/5667454][test] Fix test case where chunked attention is not supported on sm_120 by @yufeiwu-nv in #9260
  • [None][chore] Weekly mass integration of release/1.1 by @mikeiovine in #8918
  • [None][chore] Upgrade starlette and FastAPI by @tburt-nv in #9319
  • [None][infra] Update goggles_action repository by @karljang in #9240
  • [TRTLLM-9197][infra] Move third-party components to their own list file by @cheshirekow in #8986
  • [TRI-332] [fix] Fix L0_backend_trtllm by @yinggeh in #9282
  • [None][ci] waive test_llm_context_only_timed_out_kv_cache_exhausted by @QiJune in #9351
  • [None][infra] Add fallback when fetching wheel from build stage fails by @ZhanruiSunCh in #9290
  • [TRTLLM-9183][infra] Add --waives-file in rerun pytest command by @yiqingy0 in #8971
  • [TRTLLM-8957][feat] create communication related classes by @xxi-nv in #8968
  • [None][chore] Add periodic JUnit XML path in conftest by @crazydemo in #9337
  • [None][ci] waive a test case of test_ad_build_small_multi.py by @QiJune in #9355
  • [None][infra] Waive failed cases in main post-merge on 11/21 by @EmmaQiaoCh in #9360
  • [None][chore] Bump version to 1.2.0rc4 by @yiqingy0 in #9363
  • [TRTLLM-8650][fix] beam search request validation (#8433) by @ixlmar in #9228
  • [TRTLLM-9191][feat] support out-of-tree models in trtllm-serve by @ixlmar in #9269
  • [https://nvbugs/5629833][fix] Don't fill tensors by @HuiGao-NV in #9296
  • [None][feat] TRT-LLM Gen MoE optimize DeepSeek Fp8 activation kernel by @nekorobov in #9175
  • [https://nvbugs/5590408][fix] Fallback to greedy sampling in two-model overlap scheduler by @ziyixiong-nv in #9321
  • [TRTLLM-9208][infra] Document the process for C++ deps by @cheshirekow in #9016
  • [TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) by @syuoni in #9288
  • [None][feat] Eagle: PostNorm and multilayer options by @IzzyPutterman in #9233
  • [TRTLLM-9082][feat] AutoDeploy: Move the moe Align kernel to AOT by @nvchenghaoz in #9106
  • [#9388][fix] AutoDeploy: Fix cutlass BF16 MoE kernel invocation by @nzmora-nvidia in #9339
  • [TRTINFRA-7326][infra] - Consume SlurmCluster sshPort for clusters with custom SSH port by @mlefeb01 in #9313
  • [None][test] Add one-model and overlap-scheduling to eagle tests for GPTOSS by @dongfengy in #9312

Full Changelog: v1.2.0rc3...v1.2.0rc4
