Highlights
Model Support
- Added beta support for K-EXAONE, Nemotron Nano V3, Qwen3-Next and Qwen3-VL.
- Improved GPT-OSS, Nemotron, EXAONE, GLM, Starcoder2, Qwen3, KimiK2, DeepSeek v3.2 and Mistral Large 3 support and validation.
- Expanded Blackwell/Hopper/Ampere enablement including B300/GB200/GB300 and SM120/SM121/SM103 paths.
- Broadened low-precision and MoE capabilities (FP8/NVFP4/MXFP4/INT4-AWQ), including routing and kernels.
Features
- Speculative Decoding:
- Enabled MTP>1 support for DeepSeek v3.2
- Disaggregated Serving:
- Added service discovery mechanism for dynamic scaling
- Added support for cancelling requests
- Added NIXL-LibFabric support
- Added support for Mooncake transfer engine as a cache transceiver backend
- Sampling:
- Implemented batched sampling using FlashInfer sampling
- Added support for returning logprobs incrementally with streaming mode in PyTorch backend
- Added Beam Search support to TorchSampler
- Performance:
- Improved TorchSampler performance
- Enabled PDL by default and added PDL support for indexer TopK and additional kernels.
- Improved trtllm-gen kernels
- Enabled early exit with overlap scheduler
- Added NUMA-aware CPU affinity automatic configuration
- Expert Parallelism:
- Enabled EPLB for trtllm-gen and cutlass backend
- Enabled CuteDSL MoE with large EP
- Added CUDA graph support for DeepEP
- Multiple performance improvements
- Hardware:
- DGX Spark Support (Beta)
- Others:
- Helix parallelism support
- New Ray orchestrator type
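Returning logprobs incrementally in streaming mode means each streamed chunk carries only the logprobs for the tokens generated since the previous chunk, and the client accumulates them. A minimal client-side sketch of that accumulation — the chunk field names here are hypothetical, not the actual TensorRT-LLM response schema:

```python
# Illustrative only: accumulates per-chunk text and logprobs from a
# simulated streaming response. Field names ("delta_text",
# "delta_logprobs") are hypothetical placeholders.
def accumulate_stream(chunks):
    text, logprobs = "", []
    for chunk in chunks:
        # Each chunk carries only the delta since the previous chunk.
        text += chunk["delta_text"]
        logprobs.extend(chunk["delta_logprobs"])
    return text, logprobs

# Simulated stream of three chunks.
chunks = [
    {"delta_text": "Hello", "delta_logprobs": [-0.1]},
    {"delta_text": " world", "delta_logprobs": [-0.3]},
    {"delta_text": "!", "delta_logprobs": [-0.05]},
]
full_text, all_logprobs = accumulate_stream(chunks)
print(full_text)  # -> Hello world!
```

The benefit over non-incremental streaming is that the client never re-receives logprobs for earlier tokens, keeping per-chunk payloads small for long generations.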
Documentation
- Deployment Guides:
- Added comprehensive deployment guides for KimiK2, Qwen3 and Qwen3-Next.
- Added new guide on CPU Affinity configuration.
- Updated GPT-OSS guide.
- Developer Guides:
- Added developer guide about KV Cache Transmission.
- New section on MoE Expert Load Balance Analysis (Perfect Router) in Performance Analysis guide.
- New section on API Change Principles in LLM API Change guide.
- Feature Documentation:
- Created new guides for Additional Outputs, Helix Parallelism, KV Cache Connector, Ray Orchestrator, Sparse Attention and Torch Compile & Piecewise CUDA Graph.
- Updated the Feature Combination Matrix and the Paged Attention, IFB, and Request Scheduling guide.
- Tech Blogs: Published blogs on:
- "Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)"
- "Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs"
- Examples:
- Added new section on disaggregated serving service discovery method.
- Added examples for K-EXAONE, Nemotron Nano V2 VL and Nemotron Nano V3.
- Added RocketKV usage documentation.
Infrastructure Changes
- The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.12-py3.
- The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.12-py3.
- The dependent public PyTorch version is updated to 2.9.1.
- The dependent transformers version is updated to 4.57.3.
- The dependent triton version is updated to 3.5.1.
- The dependent NIXL version is updated to 0.8.0.
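Downstream projects that build on the TensorRT-LLM container can pick up the new base directly. A minimal Dockerfile fragment using the image tag listed above (the rest of the build is project-specific):

```dockerfile
# Base image updated in this release
FROM nvcr.io/nvidia/pytorch:25.12-py3
```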
API Changes
- Breaking Changes:
- FlashInfer sampling is now used by default with the PyTorch backend.
- The sampling strategy has changed in some previously undefined cases.
- OpenAI API:
- Enabled n > 1 with PyTorch backend
- Added support for GET/DELETE v1/responses
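With n > 1 now enabled on the PyTorch backend, an OpenAI-compatible request can ask for multiple candidate completions per prompt via the standard `n` parameter. A sketch of such a request body — the model name is a placeholder, and the exact serving endpoint depends on your deployment:

```python
import json

# Hypothetical request body for an OpenAI-compatible
# /v1/chat/completions endpoint; "n" requests two candidates.
request_body = {
    "model": "my-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Say hi"}],
    "n": 2,               # multiple candidates, now supported on the PyTorch backend
    "max_tokens": 16,
}
payload = json.dumps(request_body)
print(payload)
```

The response follows the usual OpenAI schema, with one entry per candidate in the `choices` array.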
Fixed Issues
Known Issues
- DGX Spark: DGX Spark support is in beta. Only single-node configurations and the models listed above have been validated in this release.
- Disaggregated Serving: A hang may occur in disaggregated serving with context pipeline parallelism and generation tensor parallelism configurations.