Deprecations
- Deprecated the `modelopt.torch._deploy.utils.get_onnx_bytes` API. Please use `modelopt.torch._deploy.utils.get_onnx_bytes_and_metadata` instead to access the ONNX model bytes along with external data. See `examples/onnx_ptq/download_example_onnx.py` for example usage, and the sketch after this list.
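A minimal migration sketch follows. The exact signature and return types are assumptions based on the deprecation note (a torch module plus example inputs in, serialized ONNX bytes plus external-data metadata out); see `examples/onnx_ptq/download_example_onnx.py` for authoritative usage.

```python
import torch

from modelopt.torch._deploy.utils import get_onnx_bytes_and_metadata

model = torch.nn.Linear(16, 8).eval()
dummy_input = torch.randn(1, 16)

# Assumed return: the serialized ONNX model bytes plus a metadata dict
# describing external data (e.g. weights stored outside the protobuf),
# which the deprecated get_onnx_bytes could not surface.
onnx_bytes, metadata = get_onnx_bytes_and_metadata(model, dummy_input)

with open("model.onnx", "wb") as f:
    f.write(onnx_bytes)
```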
New Features
- Added flag `op_types_to_exclude_fp16` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating `'fp32'` precision in `trt_plugins_precision` (sketched below this list).
- Added LoRA mode support for MCore in a new peft submodule: `modelopt.torch.peft.update_model(model, LORA_CFG)` (sketched below this list).
- Added support for PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See `examples/vllm_serve` for more details.
- Added support for `nemotron-post-training-dataset-v2` and `nemotron-post-training-dataset-v1` in `examples/llm_ptq`. If no dataset is specified, it defaults to a mix of `cnn_dailymail` and `nemotron-post-training-dataset-v2` (a gated dataset accessed using the `HF_TOKEN` environment variable).
- Added support for specifying `calib_seq` in `examples/llm_ptq` to set the maximum sequence length for calibration.
- Added support for MCore MoE PTQ/QAT/QAD.
- Added support for multi-node PTQ and export with FSDP2 in `examples/llm_ptq/multinode_ptq.py`. See `examples/llm_ptq/README.md` for more details.
- Added support for Nemotron Nano VL v1 & v2 models in the FP8/NVFP4 PTQ workflow.
- Added flags `nodes_to_include` and `op_types_to_include` in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules (sketched below this list).
- Added support for `torch.compile` and benchmarking in `examples/diffusers/quantization/diffusion_trt.py`.
- Enabled native ModelOpt quantization support for FP8 and NVFP4 formats in SGLang. See the SGLang quantization documentation for more details.
- Added ModelOpt quantized checkpoints in vLLM/SGLang CI/CD pipelines (PRs are under review).
- Added support for exporting QLoRA checkpoints finetuned using ModelOpt.
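A sketch of the new FP16-exclusion flag through the Python ONNX quantization API. Whether the Python entry point exposes the flag under the same name as noted above is an assumption, and the op types listed are placeholders.

```python
from modelopt.onnx.quantization import quantize

# Quantize to INT8 and convert the rest of the graph to FP16, but keep
# the listed op types (placeholders) in full precision. The flag name
# follows the release note; mirroring it in the Python API is assumed.
quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",
    high_precision_dtype="fp16",
    op_types_to_exclude_fp16=["Resize", "Softmax"],
    output_path="model.quant.onnx",
)
```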
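A sketch of the new LoRA mode using the `update_model` call quoted above. The config keys below are illustrative assumptions about the schema, not the exact API; consult the `modelopt.torch.peft` docs for the supported fields.

```python
import modelopt.torch.peft as mtp

# Hypothetical config: key names and the wildcard module matching are
# assumptions, loosely modeled on other ModelOpt mode configs.
LORA_CFG = {
    "adapter_cfg": {
        "*": {"rank": 32},  # apply rank-32 LoRA adapters to matching layers
    },
}


def add_lora(model):
    """Insert LoRA adapters into an existing MCore (Megatron-Core) model."""
    return mtp.update_model(model, LORA_CFG)
```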
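A sketch of force-including nodes in AutoCast, assuming the Python entry point `convert_to_mixed_precision` mirrors the new flags and returns an `onnx.ModelProto`; the node and op-type names are placeholders.

```python
import onnx

from modelopt.onnx.autocast import convert_to_mixed_precision

# Force the named node and op type (placeholders) into FP16 even if
# AutoCast's other rules (e.g. output-magnitude checks) would otherwise
# keep them in FP32.
converted = convert_to_mixed_precision(
    "model.onnx",
    low_precision_type="fp16",
    nodes_to_include=["/decoder/attn/MatMul_1"],
    op_types_to_include=["LayerNormalization"],
)
onnx.save(converted, "model_fp16.onnx")
```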
Documentation
- Added general guidelines for Minitron pruning and distillation. See `examples/pruning/README.md` for more details.
- Added an example for exporting QLoRA checkpoints for vLLM deployment. Refer to `examples/llm_qat/README.md` for more details.
Additional Announcements
- Starting with the next release, ModelOpt will switch its versioning from odd-only minor versions to consecutive versions. This means the next release will be named `0.40.0` instead of `0.41.0`.