Deprecations
- Deprecated the `modelopt.torch._deploy.utils.get_onnx_bytes` API. Please use `modelopt.torch._deploy.utils.get_onnx_bytes_and_metadata` instead to access the ONNX model bytes along with external data. See `examples/onnx_ptq/download_example_onnx.py` for example usage, and the sketch after this list.
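A minimal migration sketch follows. The exact signature and return types are assumptions based on the deprecation note (a torch module plus example inputs in, serialized ONNX bytes plus external-data metadata out); see `examples/onnx_ptq/download_example_onnx.py` for authoritative usage.

```python
import torch

from modelopt.torch._deploy.utils import get_onnx_bytes_and_metadata

model = torch.nn.Linear(16, 8).eval()
dummy_input = torch.randn(1, 16)

# Assumed return: the serialized ONNX model bytes plus a metadata dict
# describing external data (e.g. weights stored outside the protobuf),
# which the deprecated get_onnx_bytes could not surface.
onnx_bytes, metadata = get_onnx_bytes_and_metadata(model, dummy_input)

with open("model.onnx", "wb") as f:
    f.write(onnx_bytes)
```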
New Features
- Added flag `op_types_to_exclude_fp16` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating `'fp32'` precision in `trt_plugins_precision` (sketched below this list).
- Added LoRA mode support for MCore in a new peft submodule: `modelopt.torch.peft.update_model(model, LORA_CFG)` (sketched below this list).
- Added support for PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See `examples/vllm_serve` for more details.
- Added support for `nemotron-post-training-dataset-v2` and `nemotron-post-training-dataset-v1` in `examples/llm_ptq`. If no dataset is specified, it defaults to a mix of `cnn_dailymail` and `nemotron-post-training-dataset-v2` (a gated dataset accessed using the `HF_TOKEN` environment variable).
- Added support for specifying `calib_seq` in `examples/llm_ptq` to set the maximum sequence length for calibration.
- Added support for MCore MoE PTQ/QAT/QAD.
- Added support for multi-node PTQ and export with FSDP2 in `examples/llm_ptq/multinode_ptq.py`. See `examples/llm_ptq/README.md` for more details.
- Added support for Nemotron Nano VL v1 & v2 models in the FP8/NVFP4 PTQ workflow.
- Added flags `nodes_to_include` and `op_types_to_include` in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules (sketched below this list).
- Added support for `torch.compile` and benchmarking in `examples/diffusers/quantization/diffusion_trt.py`.
- Enabled native ModelOpt quantization support for FP8 and NVFP4 formats in SGLang. See the SGLang quantization documentation for more details.
- Added ModelOpt quantized checkpoints in vLLM/SGLang CI/CD pipelines (PRs are under review).
- Added support for exporting QLoRA checkpoints finetuned using ModelOpt.
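A sketch of the new FP16-exclusion flag through the Python ONNX quantization API. Whether the Python entry point exposes the flag under the same name as noted above is an assumption, and the op types listed are placeholders.

```python
from modelopt.onnx.quantization import quantize

# Quantize to INT8 and convert the rest of the graph to FP16, but keep
# the listed op types (placeholders) in full precision. The flag name
# follows the release note; mirroring it in the Python API is assumed.
quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",
    high_precision_dtype="fp16",
    op_types_to_exclude_fp16=["Resize", "Softmax"],
    output_path="model.quant.onnx",
)
```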
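A sketch of the new LoRA mode using the `update_model` call quoted above. The config keys below are illustrative assumptions about the schema, not the exact API; consult the `modelopt.torch.peft` docs for the supported fields.

```python
import modelopt.torch.peft as mtp

# Hypothetical config: key names and the wildcard module matching are
# assumptions, loosely modeled on other ModelOpt mode configs.
LORA_CFG = {
    "adapter_cfg": {
        "*": {"rank": 32},  # apply rank-32 LoRA adapters to matching layers
    },
}


def add_lora(model):
    """Insert LoRA adapters into an existing MCore (Megatron-Core) model."""
    return mtp.update_model(model, LORA_CFG)
```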
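A sketch of force-including nodes in AutoCast, assuming the Python entry point `convert_to_mixed_precision` mirrors the new flags and returns an `onnx.ModelProto`; the node and op-type names are placeholders.

```python
import onnx

from modelopt.onnx.autocast import convert_to_mixed_precision

# Force the named node and op type (placeholders) into FP16 even if
# AutoCast's other rules (e.g. output-magnitude checks) would otherwise
# keep them in FP32.
converted = convert_to_mixed_precision(
    "model.onnx",
    low_precision_type="fp16",
    nodes_to_include=["/decoder/attn/MatMul_1"],
    op_types_to_include=["LayerNormalization"],
)
onnx.save(converted, "model_fp16.onnx")
```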
Documentation
- Added general guidelines for Minitron pruning and distillation. See `examples/pruning/README.md` for more details.
- Added an example for exporting QLoRA checkpoints for vLLM deployment. Refer to `examples/llm_qat/README.md` for more details.
Additional Announcements
- Starting with the next release, ModelOpt will switch its versioning from odd-only minor versions to consecutive versions. This means the next release will be named `0.40.0` instead of `0.41.0`.