Bug Fixes
- ONNX Runtime dependency upgraded to 1.24 to solve missing graph outputs when using the TensorRT Execution Provider.
Backward Breaking Changes
- Default --kv_cache_qformat in hf_ptq.py changed from fp8 to fp8_cast. Existing scripts that rely on the default will now skip KV cache calibration and use a constant amax instead. To restore the previous calibrated behavior, explicitly pass --kv_cache_qformat fp8.
- Removed KV cache scale clamping (clamp_(min=1.0)) in the HF checkpoint export path. Calibrated KV cache scales below 1.0 are now exported as-is. If you observe accuracy degradation with calibrated KV cache (--kv_cache_qformat fp8 or nvfp4), consider using the casting methods (fp8_cast or nvfp4_cast) instead.
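The effect of the two changes above on exported KV cache scales can be sketched with a little arithmetic. This is a minimal illustration, not the exporter's code: it assumes the common convention scale = amax / 448.0, and the helper name kv_scale and the amax value 112.0 are made up for the example.

```python
from typing import Optional

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value, as cited in these notes


def kv_scale(amax: float, clamp_min: Optional[float] = None) -> float:
    """Hypothetical helper: per-tensor scale, assuming scale = amax / FP8 max."""
    scale = amax / FP8_E4M3_MAX
    if clamp_min is not None:
        # Mirrors the removed clamp_(min=1.0) step: never export below clamp_min.
        scale = max(scale, clamp_min)
    return scale


calibrated_amax = 112.0  # made-up calibration result, for illustration only

old_export = kv_scale(calibrated_amax, clamp_min=1.0)  # clamped up to 1.0
new_export = kv_scale(calibrated_amax)                 # exported as-is: 0.25
cast_export = kv_scale(FP8_E4M3_MAX)                   # constant amax: 1.0

print(old_export, new_export, cast_export)
```

Under this assumed convention, a constant amax of 448.0 always yields a scale of 1.0, which is why the casting modes need no calibration data, while a calibrated amax below 448.0 now produces a scale below 1.0 in the exported checkpoint.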
New Features
- Add fp8_cast and nvfp4_cast modes for --kv_cache_qformat in hf_ptq.py. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new use_constant_amax field in QuantizerAttributeConfig controls this behavior.
- Users no longer need to manually register MoE modules to ensure expert calibration coverage in the PTQ workflow.
- hf_ptq.py now saves the quantization summary and the MoE expert token count table to the export directory.
- Add --moe_calib_experts_ratio flag in hf_ptq.py to specify the ratio of experts to calibrate during the forward pass, improving expert coverage during calibration. Defaults to None (not enabled).
- Add sparse attention optimization for transformer models (modelopt.torch.sparsity.attention_sparsity). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See examples/llm_sparsity/attention_sparsity/README.md for usage.
- Add support for rotating the input before quantization for RHT.
- Add support for advanced weight scale search for NVFP4 quantization and its export path.
- Enable PTQ workflow for Qwen3.5 MoE models.
- Enable PTQ workflow for the Kimi-K2.5 model.
- Add nvfp4_omlp_only quantization format for NVFP4 quantization. This is similar to nvfp4_mlp_only but also quantizes the output projection layer in attention.
- Add nvfp4_experts_only quantization config that targets only MoE routed expert layers (excluding shared experts) with NVFP4 quantization.
- pass_through_bwd in the quantization config now defaults to True. Set it to False if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
- Add compute_quantization_mse API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
- Autotune: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: python -m modelopt.onnx.quantization.autotune. See the Autotune guide in the documentation.
- Add get_auto_quantize_config API to extract a flat quantization config from auto_quantize search results, enabling re-quantization at different effective bit targets without re-running calibration.
- Improve auto_quantize checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and for NemotronH MoE experts in auto_quantize grouping and scoring rules.
- Add support for block-granular RHT for non-power-of-2 dimensions.
- Replace modelopt FP8 QDQ nodes with native ONNX QDQ nodes.
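The idea behind measuring per-quantizer mean-squared quantization error with wildcard or callable filtering, as in the compute_quantization_mse item above, can be sketched in plain Python. This is a conceptual illustration only and does not use or reproduce the modelopt API: the function names, the symmetric INT8 fake quantizer, and the tensor names are all hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Callable, Dict, List, Union

NameFilter = Union[str, Callable[[str], bool]]


def fake_quant_int8(values: List[float], amax: float) -> List[float]:
    """Symmetric INT8 fake quantization: scale, round, clip, dequantize."""
    scale = amax / 127.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def quantization_mse(
    tensors: Dict[str, List[float]], name_filter: NameFilter = "*"
) -> Dict[str, float]:
    """Hypothetical helper: MSE between each selected tensor and its fake-quantized copy."""
    if callable(name_filter):
        matches = name_filter
    else:
        matches = lambda name: fnmatchcase(name, name_filter)

    report = {}
    for name, values in tensors.items():
        if not matches(name):
            continue  # skipped by the wildcard or callable filter
        amax = max(abs(v) for v in values)
        if amax == 0.0:
            report[name] = 0.0  # an all-zero tensor quantizes exactly
            continue
        dequantized = fake_quant_int8(values, amax)
        squared_errors = [(a - b) ** 2 for a, b in zip(values, dequantized)]
        report[name] = sum(squared_errors) / len(values)
    return report


# Made-up tensors keyed by made-up quantizer names.
tensors = {
    "layers.0.mlp.weight_quantizer": [0.11, -0.52, 0.98, 0.03],
    "layers.0.attn.weight_quantizer": [127.0, -127.0, 0.0],
}

mlp_only = quantization_mse(tensors, "*mlp*")
attn_only = quantization_mse(tensors, lambda name: ".attn." in name)
```

A wildcard string is convenient for selecting whole families of quantizers by name, while a callable allows arbitrary selection logic; the sketch accepts either through one parameter.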
Deprecations
- Removed MT-Bench (FastChat) support from examples/llm_eval. The run_fastchat.sh and gen_model_answer.py scripts have been deleted, and the mtbench task has been removed from the llm_ptq example scripts.
- Remove deprecated NeMo-2.0 Framework references.
Misc
- Migrated project metadata from setup.py to a fully declarative pyproject.toml.
- Enable experimental Python 3.13 wheel support and unit tests in CI/CD.