NVIDIA/Model-Optimizer 0.33.0 on GitHub

Backward Breaking Changes

PyTorch dependencies for modelopt.torch features are no longer optional: pip install nvidia-modelopt is now the same as pip install nvidia-modelopt[torch].

New Features

Upgrade TensorRT-LLM dependency to 0.20.
Add a new CNN QAT example demonstrating how to use ModelOpt for quantization-aware training (QAT).
Add support for ONNX models with custom TensorRT ops in Autocast.
Add quantization-aware distillation (QAD) support in the llm_qat example.
Add support for BF16 in ONNX quantization.
Add per-node calibration support in ONNX quantization.
ModelOpt now supports quantization of tensor-parallel sharded Hugging Face transformer models. This requires transformers>=4.52.0.
Support quantization of FSDP2-wrapped models and add FSDP2 support in the llm_qat example.
Add NeMo 2 Simplified Flow examples for quantization-aware training/distillation (QAT/QAD), speculative decoding, and pruning & distillation.
