github NVIDIA/Model-Optimizer 0.43.0
ModelOpt 0.43.0 Release


Bug Fixes

  • ONNX Runtime dependency upgraded to 1.24 to solve missing graph outputs when using the TensorRT Execution Provider.

Backward Breaking Changes

  • Default --kv_cache_qformat in hf_ptq.py changed from fp8 to fp8_cast. Existing scripts that rely on the default will now skip KV cache calibration and use a constant amax instead. To restore the previous calibrated behavior, explicitly pass --kv_cache_qformat fp8.
  • Removed KV cache scale clamping (clamp_(min=1.0)) in the HF checkpoint export path. Calibrated KV cache scales below 1.0 are now exported as-is. If you observe accuracy degradation with calibrated KV cache (--kv_cache_qformat fp8 or nvfp4), consider using the casting methods (fp8_cast or nvfp4_cast) instead.
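
For scripts affected by the new default, the difference between the calibrated fp8 mode and the constant-amax fp8_cast mode can be sketched in plain Python. This is an illustrative sketch, not ModelOpt's implementation; the 448.0 constant is the FP8 E4M3 maximum used for the constant amax:

```python
# Illustrative sketch of the two KV-cache scaling modes; not ModelOpt code.
FP8_E4M3_MAX = 448.0

def calibrated_scale(observed_values):
    # "fp8": amax comes from data-driven calibration over observed activations.
    amax = max(abs(v) for v in observed_values)
    return amax / FP8_E4M3_MAX

def cast_scale():
    # "fp8_cast": constant amax equal to the FP8 E4M3 max, so the scale is 1.0.
    return FP8_E4M3_MAX / FP8_E4M3_MAX

cal = calibrated_scale([0.5, -3.2, 1.1])  # data-dependent, typically < 1.0
cast = cast_scale()                       # always 1.0, no calibration needed
```

With the clamp_(min=1.0) removed from the export path, a calibrated scale below 1.0 like the one above is now written out unchanged rather than being rounded up to 1.0.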

New Features

  • Add fp8_cast and nvfp4_cast modes for --kv_cache_qformat in hf_ptq.py. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new use_constant_amax field in QuantizerAttributeConfig controls this behavior.
  • MoE modules no longer need to be registered manually to ensure expert calibration coverage in the PTQ workflow.
  • hf_ptq.py now saves the quantization summary and the MoE expert token-count table to the export directory.
  • Add a --moe_calib_experts_ratio flag to hf_ptq.py that specifies the ratio of experts to calibrate during each forward pass, improving expert coverage during calibration. Defaults to None (disabled).
  • Add sparse attention optimization for transformer models (modelopt.torch.sparsity.attention_sparsity). This reduces computational cost by skipping attention computation based on a selected threshold, with calibration support for threshold selection on HuggingFace models. See examples/llm_sparsity/attention_sparsity/README.md for usage.
  • Add support for rotating the input with the Randomized Hadamard Transform (RHT) before quantization.
  • Add support for advanced weight scale search for NVFP4 quantization and its export path.
  • Enable PTQ workflow for Qwen3.5 MoE models.
  • Enable PTQ workflow for the Kimi-K2.5 model.
  • Add nvfp4_omlp_only quantization format for NVFP4 quantization. This is similar to nvfp4_mlp_only but also quantizes the output projection layer in attention.
  • Add nvfp4_experts_only quantization config that targets only MoE routed expert layers (excluding shared) with NVFP4 quantization.
  • pass_through_bwd in the quantization config now defaults to True. Set it to False to use STE with zeroed outlier gradients for potentially better QAT accuracy.
  • Add compute_quantization_mse API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
  • Autotune: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: python -m modelopt.onnx.quantization.autotune. See the Autotune guide in the documentation.
  • Add get_auto_quantize_config API to extract a flat quantization config from auto_quantize search results, enabling re-quantization at different effective bit targets without re-running calibration.
  • Improve auto_quantize checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
  • Add support for Nemotron-3 (NemotronHForCausalLM) model quantization, including NemotronH MoE expert handling in auto_quantize grouping and scoring rules.
  • Add support for block-granular RHT for non-power-of-2 dimensions.
  • Replace modelopt FP8 QDQ nodes with native ONNX QDQ nodes.
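
The metric behind per-quantizer MSE measurement can be illustrated with a minimal symmetric fake-quantizer. The helper names below are hypothetical and this is not the compute_quantization_mse API, just the underlying idea:

```python
# Sketch of mean-squared quantization error for a symmetric integer
# fake-quantizer. Hypothetical helpers; not the ModelOpt API.

def fake_quantize(values, num_bits=8):
    amax = max(abs(v) for v in values) or 1.0
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for int8, 7 for int4
    scale = amax / qmax
    # Quantize to the integer grid, then dequantize back to floats.
    return [round(v / scale) * scale for v in values]

def quantization_mse(values, num_bits=8):
    deq = fake_quantize(values, num_bits)
    return sum((v - q) ** 2 for v, q in zip(values, deq)) / len(values)

weights = [0.8, -0.31, 0.05, 1.0]
err8 = quantization_mse(weights, num_bits=8)
err4 = quantization_mse(weights, num_bits=4)  # coarser grid, larger error
```

Comparing the error at different bit widths in this way is the kind of signal a per-quantizer MSE report makes visible across a whole model.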
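
The rotation idea behind RHT can be illustrated with a plain Hadamard transform. This is a conceptual sketch only; ModelOpt's randomized, block-granular implementation differs:

```python
# Sketch of a Hadamard rotation as used conceptually before quantization:
# an orthonormal rotation spreads outlier energy across all entries,
# which makes the rotated tensor friendlier to low-bit quantization.

def hadamard(n):
    # Sylvester construction; n must be a power of 2.
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def rotate(vec):
    n = len(vec)
    h = hadamard(n)
    s = n ** -0.5                       # orthonormal scaling factor
    return [s * sum(h[i][j] * vec[j] for j in range(n)) for i in range(n)]

x = [10.0, 0.1, 0.1, 0.1]               # one large outlier
y = rotate(x)                            # outlier energy spread out
# The rotation is orthonormal, so the L2 norm of x is preserved in y.
```

The block-granular variant mentioned above applies such rotations per block, which is what allows non-power-of-2 overall dimensions.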

Deprecations

  • Removed MT-Bench (FastChat) support from examples/llm_eval. The run_fastchat.sh and gen_model_answer.py scripts have been deleted, and the mtbench task has been removed from the llm_ptq example scripts.
  • Removed deprecated NeMo 2.0 Framework references.

Misc

  • Migrated project metadata from setup.py to a fully declarative pyproject.toml.
  • Enable experimental Python 3.13 wheel support and unit tests in CI/CD.
