NVIDIA/Model-Optimizer 0.27.0 on GitHub

Deprecations

Deprecate real quantization configs. Use the mtq.compress (modelopt.torch.quantization.compress) API for model compression after quantization instead.
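
As a migration aid, the quantize-then-compress flow might look like the minimal sketch below. The helper name quantize_then_compress, the choice of mtq.FP8_DEFAULT_CFG, and calling mtq.compress with only the model are assumptions made for illustration; check the API reference for the exact signature and for which quantization formats compression supports.

```python
def quantize_then_compress(model, forward_loop):
    """Calibrate and quantize `model`, then compress its quantized weights.

    `forward_loop` is a user-supplied callable that runs calibration data
    through the model, as expected by mtq.quantize.
    """
    # Imported lazily so this sketch can be read without modelopt installed.
    import modelopt.torch.quantization as mtq

    # Assumption: FP8_DEFAULT_CFG is only an example; substitute the config
    # that matches your deployment target.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

    # Assumption: compress is called with the model alone and updates it in
    # place; it replaces the deprecated "real quantization" configs.
    mtq.compress(model)
    return model
```

Keeping quantization and compression as two separate calls means accuracy can be evaluated on the simulated-quantized model before its weights are compressed.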

New Features

New model support in the llm_ptq example: OpenAI Whisper.
Blockwise FP8 quantization support in unified model export.
Add quantization support to the Transformer Engine Linear module.
Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
To support resuming distributed checkpoints with expert parallelism (EP), modelopt_state in Megatron Core distributed checkpoints (used in NeMo and Megatron-LM) is now stored differently. Legacy modelopt_state in distributed checkpoints generated by previous modelopt versions can still be loaded in 0.27 and 0.29, but will need to be stored in the new format.
Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
Add a new API, mtq.compress (modelopt.torch.quantization.compress), for compressing model weights after quantization.
Add option to simplify ONNX model before quantization is performed.
(Experimental) Improve support for ONNX models with custom TensorRT op:
- Add support for --calibration_shapes flag.
- Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.
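
To illustrate the idea behind the blockwise FP8 item in the list above, here is a minimal pure-Python sketch of per-block scaling. It is not ModelOpt's implementation: the function names and the block size are hypothetical, and rounding to the FP8 grid is omitted. The only fixed fact used is that the largest finite FP8 E4M3 value is 448.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def blockwise_scales(weight, block_rows, block_cols):
    """Return one scale per (block_rows x block_cols) tile of a 2-D list."""
    n_rows, n_cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, n_rows, block_rows):
        row_scales = []
        for c0 in range(0, n_cols, block_cols):
            amax = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block_rows, n_rows))
                for c in range(c0, min(c0 + block_cols, n_cols))
            )
            # Map the block's largest magnitude onto the FP8 maximum.
            row_scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
        scales.append(row_scales)
    return scales


def scale_to_fp8_range(weight, scales, block_rows, block_cols):
    """Divide each element by its block scale and clamp to the FP8 range.

    Rounding to representable FP8 values is intentionally left out.
    """
    out = []
    for r, row in enumerate(weight):
        out_row = []
        for c, value in enumerate(row):
            scaled = value / scales[r // block_rows][c // block_cols]
            out_row.append(max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, scaled)))
        out.append(out_row)
    return out


# Hypothetical 2x4 weight matrix split into two 2x2 blocks.
weight = [[1.0, -2.0, 0.5, 0.25],
          [4.0, 0.0, -0.5, 8.0]]
scales = blockwise_scales(weight, 2, 2)
scaled = scale_to_fp8_range(weight, scales, 2, 2)
```

Because each block gets its own scale, a single large value (8.0 in the second block here) does not reduce the precision available to the values in other blocks.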

Known Issues

Quantization of T5 models is broken. In the meantime, please use nvidia-modelopt==0.25.0 with transformers<4.50.
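
A quick way to confirm that an environment matches the workaround above is sketched below. The helper names are hypothetical, and the version parsing is deliberately simple: it reads only the leading major.minor digits of the version string.

```python
import re
from importlib import metadata


def _major_minor(version):
    """Return (major, minor) from a version string such as '4.49.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def t5_workaround_satisfied(modelopt_version, transformers_version):
    """True when the versions match nvidia-modelopt==0.25.0, transformers<4.50."""
    return (
        modelopt_version == "0.25.0"
        and _major_minor(transformers_version) < (4, 50)
    )


def installed_versions_ok():
    """Check the current environment; False if either package is missing."""
    try:
        return t5_workaround_satisfied(
            metadata.version("nvidia-modelopt"),
            metadata.version("transformers"),
        )
    except metadata.PackageNotFoundError:
        return False
```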
