Bug Fixes
- Fix Megatron KV Cache quantization checkpoint restore for QAT/QAD (device placement, amax sync across DP/TP, flash_decode compatibility).
New Features
- Add support for Transformer Engine quantization for Megatron Core models.
- Add support for Qwen3-Next model quantization.
- Add support for dynamically linked TensorRT plugins in the ONNX quantization workflow.
- Add support for KV Cache Quantization for vLLM FakeQuant PTQ script. See `examples/vllm_serve/README.md` for more details.
- Add support for subgraphs in ONNX autocast.
- Add support for parallel draft heads in Eagle speculative decoding.
- Add support for enabling a custom emulated quantization backend. See `register_quant_backend` for more details, and see an example in `tests/unit/torch/quantization/test_custom_backend.py`.
- Add `examples/llm_qad` for QAD training with Megatron-LM.
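To illustrate what an emulated (fake) quantization backend computes, the sketch below implements a symmetric quantize-dequantize round trip in plain Python. It is a minimal illustration of the concept only, not the ModelOpt `register_quant_backend` API; the function name, signature, and parameters here are hypothetical.

```python
def fake_quantize(values, num_bits=8, amax=None):
    """Emulated (fake) quantization: snap values to a symmetric integer grid,
    then scale back to floating point. Hypothetical illustration, not the
    ModelOpt API."""
    if amax is None:
        amax = max(abs(v) for v in values)  # per-tensor absolute max
    if amax == 0.0:
        return list(values)  # all-zero input: nothing to quantize
    qmax = 2 ** (num_bits - 1) - 1  # symmetric signed range, e.g. 127 for int8
    scale = amax / qmax
    # Round to the integer grid, clamp to the representable range, rescale.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]
```

Because the round trip stays in floating point, such a backend can emulate low-precision numerics for QAT/PTQ experiments without requiring real low-precision kernels.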
Deprecations
- Deprecate the `num_query_groups` parameter in Minitron pruning (`mcore_minitron`). You can use ModelOpt 0.40.0 or earlier if you need to prune it.
Backward Breaking Changes
- Remove `torchprofile` as a default dependency of ModelOpt, since it is used only for FLOPs-based FastNAS pruning (computer vision models). It can be installed separately if needed.