Bug Fixes
- Fix a bug in FastNAS pruning (computer vision models) where the model parameters were sorted twice, corrupting the ordering.
- Fix Q/DQ/Cast node placement for 'FP32 required' tensors in custom ops in the ONNX quantization workflow.
New Features
- Add MoE (e.g. Qwen3-30B-A3B, gpt-oss-20b) pruning support for the `num_moe_experts`, `moe_ffn_hidden_size`, and `moe_shared_expert_intermediate_size` parameters in Minitron pruning (`mcore_minitron`). A sketch follows this list.
- Add `specdec_bench` example to benchmark speculative decoding performance. See `examples/specdec_bench/README.md` for more details.
- Add FP8/NVFP4 KV cache quantization support for Megatron Core models (sketched below).
- Add KL Divergence loss-based `auto_quantize` method. See the `auto_quantize` API docs for more details, and the sketch after this list.
- Add support for saving and resuming the `auto_quantize` search state. This speeds up `auto_quantize` by skipping the score estimation step when a saved search state is provided.
- Add `trt_plugins_precision` flag in ONNX autocast to indicate the precision of custom ops, similar to the flag already available in the ONNX quantization workflow. A command-line sketch follows this list.
- Add support for PyTorch Geometric quantization.
- Add per-tensor and per-channel MSE calibrator support (sketched below).
- Add support for PTQ/QAT checkpoint export and loading for running fakequant evaluation in vLLM. See `examples/vllm_serve/README.md` for more details.
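The following is a minimal sketch of the new MoE pruning path. The `mtp.prune` call and the `export_config` constraint follow the existing Minitron pruning interface; `model`, `calib_dataloader`, and the target values are illustrative assumptions.

```python
import modelopt.torch.prune as mtp

# Hypothetical calibration loop: Minitron ranks experts/channels from
# activations, so a few batches are run through the model first.
def forward_loop(model):
    for batch in calib_dataloader:  # calib_dataloader is assumed to exist
        model(**batch)

# `model` is assumed to be a loaded Megatron Core MoE model (e.g. Qwen3-30B-A3B).
pruned_model, _ = mtp.prune(
    model,
    mode="mcore_minitron",
    constraints={
        "export_config": {
            # MoE dimensions newly supported in this release
            # (target values below are illustrative):
            "num_moe_experts": 64,
            "moe_ffn_hidden_size": 1024,
            "moe_shared_expert_intermediate_size": 2048,
        }
    },
    dummy_input=None,  # not required when a forward_loop is given
    config={"forward_loop": forward_loop},
)
```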
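For the FP8/NVFP4 KV cache feature, a hedged sketch using the standard `mtq.quantize` PTQ entry point; the KV cache config constant shown is an assumption, so check `modelopt.torch.quantization.config` for the shipped names.

```python
import modelopt.torch.quantization as mtq

def forward_loop(model):
    for batch in calib_dataloader:  # calib_dataloader is assumed to exist
        model(**batch)

# `model` is assumed to be a Megatron Core model. FP8_KV_CFG is an assumed
# config name for FP8 KV cache quantization; an NVFP4 variant is analogous.
model = mtq.quantize(model, mtq.FP8_KV_CFG, forward_loop)
```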
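The two `auto_quantize` additions combine naturally: run the search with KL-divergence scoring and persist the search state so later runs skip score estimation. The call below follows the existing `auto_quantize` interface, but the argument names selecting KL divergence and the checkpoint path are assumptions; defer to the `auto_quantize` API docs for the exact spellings.

```python
import modelopt.torch.quantization as mtq

def forward_step(model, batch):  # assumed per-batch forward callable
    return model(**batch)

model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},
    # Candidate quantization formats to search over.
    quantization_formats=["FP8_DEFAULT_CFG", "NVFP4_DEFAULT_CFG"],
    data_loader=calib_dataloader,  # calib_dataloader is assumed to exist
    forward_step=forward_step,
    # Assumed spellings for the two new knobs in this release:
    score_func="kl_div",                   # KL Divergence-based scoring
    checkpoint="auto_quantize_state.pth",  # save/resume the search state
)
```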
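For the new autocast flag, a command-line sketch modeled on its existing use in the ONNX quantization workflow; the `op_type:precision` value format and the plugin name are assumptions.

```sh
python -m modelopt.onnx.autocast --onnx_path model.onnx \
    --trt_plugins_precision CustomOpPlugin:fp16  # CustomOpPlugin is hypothetical
```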
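Finally, a sketch of how the MSE calibrator could be selected in a quantization config, per channel for weights and per tensor for activations. The config follows ModelOpt's standard `quant_cfg` structure, but the `"calibrator": "mse"` attribute spelling is an assumption.

```python
import modelopt.torch.quantization as mtq

# Standard quant_cfg layout; "calibrator": "mse" is an assumed spelling
# for selecting the new MSE calibrator.
MSE_INT8_CFG = {
    "quant_cfg": {
        # Per-channel MSE calibration for weights (channel axis 0).
        "*weight_quantizer": {"num_bits": 8, "axis": 0, "calibrator": "mse"},
        # Per-tensor MSE calibration for activations (axis=None).
        "*input_quantizer": {"num_bits": 8, "axis": None, "calibrator": "mse"},
    },
    "algorithm": "max",  # assumed calibration algorithm key
}

model = mtq.quantize(model, MSE_INT8_CFG, forward_loop)
```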
Documentation
- Deprecate `examples/megatron-lm` in favor of more detailed documentation in `Megatron-LM/examples/post_training/modelopt`.
Misc
- NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer. GitHub automatically redirects the old repository path (`NVIDIA/TensorRT-Model-Optimizer`) to the new one (`NVIDIA/Model-Optimizer`). The documentation URL has also changed to nvidia.github.io/Model-Optimizer.
- Bump TensorRT-LLM test docker image to 1.2.0rc4.
- Bump minimum recommended `transformers` version to 4.53.
- Replace the ONNX simplification package `onnxsim` with `onnxslim`.