NVIDIA/TensorRT-LLM v1.1.0rc2.post2

Pre-release · 16 hours ago

Announcement Highlights

  • Feature
    • Add MNNVL AlltoAll tests to pre-merge (#7465)
    • Support multi-threaded tokenizers for trtllm-serve (#7515)
    • FP8 Context MLA integration (#7581)
    • Support block wise FP8 in wide ep (#7423)
    • Cherry-pick Responses API and multiple postprocess workers support for chat harmony (#7600)
    • Expose low_precision_combine as an LLM arg (#7598)
  • Documentation
    • Update deployment guide and cherry-pick CI test fix from main (#7623)
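
The multi-threaded tokenizer support for trtllm-serve (#7515) follows a common serving pattern: fan incoming requests out across a pool of tokenizer workers so tokenization does not serialize behind one thread. A minimal, generic sketch of that pattern using only the Python standard library (the `toy_tokenize` function is a hypothetical stand-in, not the trtllm-serve implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def toy_tokenize(text):
    # Stand-in tokenizer: whitespace split. A real server would call a fast
    # tokenizer here, which can release the GIL during encoding so the
    # worker threads actually run concurrently.
    return text.split()

def tokenize_batch(texts, num_workers=4):
    # Fan requests out across worker threads; map() preserves input order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(toy_tokenize, texts))

tokens = tokenize_batch(["hello world", "multi threaded tokenizers"])
```

The ordering guarantee of `map()` matters for a server: responses must be matched back to the requests that produced them.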

What's Changed

  • [None] [test] Add MNNVL AlltoAll tests to pre-merge by @kaiyux in #7465
  • [TRTLLM-7292][feat] Support multi-threaded tokenizers for trtllm-serve by @nv-yilinf in #7515
  • [None][fix] trtllm-serve yaml loading by @Superjomn in #7551
  • [None][chore] Bump version to 1.1.0rc2.post2 by @yiqingy0 in #7582
  • [https://nvbugs/5498967][fix] Downgrade NCCL by @yizhang-nv in #7556
  • [TRTLLM-6994][feat] FP8 Context MLA integration. by @yuxianq in #7581
  • [TRTLLM-7831][feat] Support block wise FP8 in wide ep by @xxi-nv in #7423
  • [None][chore] Make use_low_precision_moe_combine an LLM arg by @zongfeijing in #7598
  • [None][fix] Update deployment guide and cherry-pick CI test fix from main by @dongfengy in #7623
  • [None][feat] Cherry-pick Responses API and multiple postprocess workers support for chat harmony by @JunyiXu-nv in #7600
  • [None][chore] Fix kernel launch param and add TRTLLM MoE backend test by @pengbowang-nv in #7524
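
Block-wise FP8 (#7423) scales a tensor per fixed-size block rather than per tensor, so a single outlier only widens the dynamic range of its own block. A minimal pure-Python sketch of the scaling idea, assuming 1-D blocks and the FP8 E4M3 maximum of 448; this is an illustration of the general technique, not the wide-EP kernel:

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def blockwise_scales(x, block=128):
    # Compute one scale per block, mapping each block's absolute maximum
    # onto the FP8 representable range. Assumes len(x) divides evenly.
    assert len(x) % block == 0
    scales = []
    for i in range(0, len(x), block):
        amax = max(abs(v) for v in x[i:i + block])
        scales.append(amax / FP8_E4M3_MAX)
    return scales
```

Quantization then divides each block by its scale before casting to FP8, and dequantization multiplies the scale back in.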

Full Changelog: v1.1.0rc2.post1...v1.1.0rc2.post2
