github NVIDIA/Model-Optimizer 0.41.0
ModelOpt 0.41.0 Release

one month ago

Bug Fixes

  • Fix Megatron KV Cache quantization checkpoint restore for QAT/QAD (device placement, amax sync across DP/TP, flash_decode compatibility).
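
One way to see why amax must agree across ranks: in symmetric fake quantization the scale is derived from amax, so ranks holding different amax values would quantize the same tensor differently. The sketch below is a minimal, framework-free illustration of that idea, assuming the sync is a max-reduce over per-rank values. The function names (local_amax, sync_amax, fake_quantize) are hypothetical and are not ModelOpt or Megatron APIs.

```python
# Illustrative only: hypothetical helpers, not ModelOpt's implementation.

def local_amax(values):
    """Largest absolute value observed on one rank."""
    return max(abs(v) for v in values)


def sync_amax(per_rank_amax):
    """Assumed max-reduce: every rank ends up with the same (largest) amax."""
    return max(per_rank_amax)


def fake_quantize(x, amax, num_bits=8):
    """Symmetric fake quantization: snap to an integer grid, then dequantize.

    Assumes amax > 0.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


# Two simulated ranks that saw different activation ranges.
rank_values = [[0.5, -1.0], [2.0, 0.25]]
shared_amax = sync_amax([local_amax(v) for v in rank_values])
```

With a shared amax, fake_quantize uses the same scale for a given input no matter which simulated rank calls it. In a real distributed job the reduction would run over a process group rather than a Python list.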

New Features

  • Add support for Transformer Engine quantization for Megatron Core models.
  • Add support for Qwen3-Next model quantization.
  • Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow.
  • Add support for KV Cache quantization in the vLLM FakeQuant PTQ script. See examples/vllm_serve/README.md for more details.
  • Add support for subgraphs in ONNX autocast.
  • Add support for parallel draft heads in Eagle speculative decoding.
  • Add support for registering a custom emulated quantization backend. See register_quant_backend for details and tests/unit/torch/quantization/test_custom_backend.py for an example.
  • Add examples/llm_qad for QAD training with Megatron-LM.

Deprecations

  • Deprecate the num_query_groups parameter in Minitron pruning (mcore_minitron). If you need to prune num_query_groups, use ModelOpt 0.40.0 or earlier.

Backward Breaking Changes

  • Remove torchprofile as a default dependency of ModelOpt, since it is used only for FLOPs-based FastNAS pruning (computer vision models). It can be installed separately if needed.
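
Code that relies on FLOPs-based FastNAS pruning may want to check for the optional dependency up front. The sketch below uses only the Python standard library. The helper name is_installed is hypothetical, and the sketch assumes the importable module is named torchprofile.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Warn early instead of failing deep inside a pruning run.
if not is_installed("torchprofile"):
    print("torchprofile not found; install it separately before FLOPs-based FastNAS pruning.")
```

Assuming the package is published on PyPI under the same name, pip install torchprofile should restore the previous behavior.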
