Bug Fixes
- Fix Megatron KV Cache quantization checkpoint restore for QAT/QAD (device placement, amax sync across DP/TP, flash_decode compatibility).
New Features
- Add support for Transformer Engine quantization for Megatron Core models.
- Add support for Qwen3-Next model quantization.
- Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow.
- Add support for KV Cache Quantization for vLLM FakeQuant PTQ script. See `examples/vllm_serve/README.md` for more details.
- Add support for subgraphs in ONNX autocast.
- Add support for parallel draft heads in Eagle speculative decoding.
- Add support for enabling a custom emulated quantization backend. See `register_quant_backend` for more details, and see an example in `tests/unit/torch/quantization/test_custom_backend.py`.
- Add `examples/llm_qad` for QAD training with Megatron-LM.
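To illustrate what an emulated (fake) quantization backend computes, the sketch below implements a symmetric quantize-dequantize round trip in plain Python. It is a minimal illustration of the concept only, not the ModelOpt `register_quant_backend` API; the function name, signature, and parameters here are hypothetical.

```python
def fake_quantize(values, num_bits=8, amax=None):
    """Emulated (fake) quantization: snap values to a symmetric integer grid,
    then scale back to floating point. Hypothetical illustration, not the
    ModelOpt API."""
    if amax is None:
        amax = max(abs(v) for v in values)  # per-tensor absolute max
    if amax == 0.0:
        return list(values)  # all-zero input: nothing to quantize
    qmax = 2 ** (num_bits - 1) - 1  # symmetric signed range, e.g. 127 for int8
    scale = amax / qmax
    # Round to the integer grid, clamp to the representable range, rescale.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
```

Because the round trip stays in floating point, such a backend can emulate low-precision numerics for QAT/PTQ experiments without requiring real low-precision kernels.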
Deprecations
- Deprecate the `num_query_groups` parameter in Minitron pruning (`mcore_minitron`). You can use ModelOpt 0.40.0 or earlier if you need to prune it.
Backward Breaking Changes
- Remove `torchprofile` as a default dependency of ModelOpt, since it is used only for FLOPs-based FastNAS pruning (computer vision models). It can be installed separately if needed.