GitHub: NVIDIA/Model-Optimizer 0.37.0
ModelOpt 0.37.0 Release


Deprecations

  • Deprecated ModelOpt's custom docker images. Please use the PyTorch, TensorRT-LLM, or TensorRT docker image directly or refer to the installation guide for more details.
  • Deprecated quantize_mode argument in examples/onnx_ptq/evaluate.py to support strong typing. Use engine_precision instead.
  • Deprecated TRT-LLM's TRT backend in examples/llm_ptq and examples/vlm_ptq. Support for the build and benchmark tasks has been removed and replaced with quant. engine_dir has been replaced with checkpoint_dir in examples/llm_ptq and examples/vlm_ptq. For performance evaluation, please use trtllm-bench directly.
  • The --export_fmt flag in examples/llm_ptq is removed. By default, we export to the unified Hugging Face checkpoint format.
  • Deprecated examples/vlm_eval as it depends on the deprecated TRT-LLM's TRT backend.
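
For the evaluate.py change above, migration is a flag rename. The invocation below is illustrative only (model path and precision value are placeholders, and it assumes the argument is passed as a standard CLI flag):

```shell
# Old (deprecated): pass the precision via --quantize_mode
# python examples/onnx_ptq/evaluate.py --quantize_mode ...

# New: use --engine_precision instead
python examples/onnx_ptq/evaluate.py --engine_precision fp16 ...
```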

New Features

  • high_precision_dtype now defaults to fp16 in ONNX quantization, i.e., the weights of the quantized output model are now FP16 by default.
  • Upgraded TensorRT-LLM dependency to 1.1.0rc2.
  • Support for Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in examples/vlm_ptq.
  • Support storing and restoring Minitron pruning activations and scores for re-pruning without running the forward loop again.
  • Added Minitron pruning example for the Megatron-LM framework. See examples/megatron-lm for more details.
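
As a rough sketch of why caching Minitron pruning scores helps: the expensive part of activation-based pruning is the calibration forward loop that ranks components, so persisting its output lets you re-prune to a different target size without repeating it. The snippet below is a generic illustration of that pattern; all names (compute_activation_scores, get_scores, prune, the cache file) are hypothetical, not the actual ModelOpt API — see examples/megatron-lm for real usage.

```python
import os
import pickle

SCORE_CACHE = "minitron_scores.pkl"  # hypothetical cache file name


def compute_activation_scores(model):
    # Stand-in for the expensive calibration forward loop that scores
    # components (layers, heads, MLP channels). Here we just fake a score.
    return {name: float(len(name)) for name in model}


def get_scores(model, cache_path=SCORE_CACHE):
    # Restore previously computed scores if available, so re-pruning with a
    # different target size does not need to re-run the forward loop.
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    scores = compute_activation_scores(model)
    with open(cache_path, "wb") as f:
        pickle.dump(scores, f)
    return scores


def prune(model, scores, keep):
    # Keep the `keep` highest-scoring components (hypothetical criterion).
    ranked = sorted(model, key=lambda name: scores[name], reverse=True)
    return ranked[:keep]
```

On a second run with the cache file present, get_scores returns immediately from disk, so trying several pruning targets costs one calibration pass instead of one per target.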
