NVIDIA/TensorRT-LLM v1.0.0rc0

Pre-release · 2 months ago

Announcement Highlights:

  • Model Support
  • Features
    • Add EAGLE3 support for Qwen3 (#5206)
    • Add piecewise CUDA graph support for MLA (#4467)
    • Integrate TRT-LLM Gen FP8 block-scale MoE with the PyTorch workflow kernel autotuner (#5207)
    • Re-implement LlmResponse in Python to reduce host overhead of pybind (#5224)
    • Add no_kv_cache_reuse option and streaming support for trtllm serve bench (#4971)
    • Add LLGuidance Support for PyTorch Backend (#5214)
    • Fuse finalize and allreduce for the Qwen MoE model (#5223)
    • Support stream_interval (#5284)
  • API
    • Add llm args to tune python gc threshold (#5141)
    • Introduce ResourceManagerType enum for resource management (#5246)
    • BREAKING CHANGE: make the PyTorch LLM the default (#5312)
    • Remove TrtGptModelOptionalParams (#5165)
  • Bug Fixes
    • Fix trtllm-llmapi-launch multiple LLM instances (#4727)
    • Fix the deterministic issue in the MTP Eagle path (#5285)
    • Fix missing clientId when serializing and deserializing a response (#5231)
  • Benchmark
  • Performance
    • Optimize MoE supplementary kernels for large-scale EP (#5215)
    • Improve performance of XQA-MLA for sm120 (#5087)
  • Infrastructure
    • Update dependencies with NGC PyTorch 25.05 and TRT 10.11 (#4885)
    • Add Multi-node CI testing support via Slurm (#4771)
  • Documentation
    • Add document of benchmarking for Qwen3 (#5158)
    • Update contributing md for internal developers (#5250)
    • blog: Disaggregated Serving in TensorRT-LLM (#5353)
    • Update MTP documents (#5387)
  • Known Issues
    • multi-GPU model support on RTX Pro 6000
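
The `stream_interval` feature listed above (#5284) controls how often streamed output is flushed: rather than emitting a response for every generated token, the server emits one every N tokens, trading latency for fewer round trips. As a standalone sketch of the idea (a hypothetical helper, not TensorRT-LLM's actual implementation):

```python
def stream_with_interval(tokens, stream_interval):
    """Group a token stream into chunks of `stream_interval` tokens,
    yielding each chunk as soon as it fills, plus a final partial
    chunk for any remainder. Sketch only; TensorRT-LLM applies the
    equivalent batching inside its streaming response path."""
    buf = []
    for i, tok in enumerate(tokens, start=1):
        buf.append(tok)
        if i % stream_interval == 0:
            yield list(buf)  # flush a full chunk to the client
            buf.clear()
    if buf:
        yield list(buf)  # flush the trailing partial chunk

# Five tokens with an interval of 2 → chunks of sizes 2, 2, 1
chunks = list(stream_with_interval(["a", "b", "c", "d", "e"], 2))
```

With `stream_interval=1` this degenerates to ordinary per-token streaming, which is why the option is a pure overhead knob rather than a semantic change.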

What's Changed

Full Changelog: v0.21.0rc2...v1.0.0rc0
