New Features
- Support the full Transformer Engine spec for Minitron pruning (mcore_minitron); the custom ModelOpt spec is no longer needed. This does not change how the pruning workflow is used, but it makes pruning slightly faster and may produce a slightly different pruned model due to different kernels and numerics.
- Add an end-to-end tutorial for Minitron pruning + distillation + quantization + evaluation + vLLM deployment for Nemotron-Nano-9B-v2 → Pruned 7B, along with data blend preparation steps (and an ablation study). See examples/pruning/minitron/README.md for details.
- Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See examples/puzzletron/README.md for more details.
- Add an iterator interface using CalibrationDataReader to the ONNX quantization workflow (a minimal sketch appears after this list).
- Add N:M sparse softmax support to the Triton flash attention kernel (modelopt.torch.kernels.common.attention.triton_fa). See examples/llm_sparsity/attention_sparsity/README.md for usage.
- Add skip-softmax support to the Triton flash attention kernel (modelopt.torch.kernels.common.attention.triton_fa). See examples/llm_sparsity/attention_sparsity/README.md for usage.
- Add the Video Sparse Attention (VSA) method for video diffusion models (modelopt.torch.sparsity.attention_sparsity). VSA uses 3D block tiling with a two-branch architecture to speed up attention.
- Enable the PTQ workflow for the Step3.5-Flash MoE model with NVFP4 W4A4 + FP8 KV cache quantization. See modelopt_recipes/models/Step3.5-Flash/nvfp4-mlp-only.yaml for more details.
- Add support for vLLM fakequant reload using ModelOpt state for HF models. See examples/vllm_serve/README.md for more details.
- [Early Testing] Add a Claude Code PTQ skill (.claude/skills/ptq/) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths, and includes handling for unlisted models with custom module patching. This feature is in early testing; use with caution.
- [Early Testing] Polish the Claude Code evaluation skill (.claude/skills/evaluation/) for agent-assisted LLM accuracy benchmarking via NeMo Evaluator Launcher. Adds two companion skills vendored verbatim from NVIDIA-NeMo/Evaluator: launching-evals (run/check/debug/analyze NEL evaluations) and accessing-mlflow (query MLflow runs, compare metrics, fetch artifacts). Re-sync them at a pinned upstream SHA via .claude/scripts/sync-upstream-skills.sh. Also adds a shared skills/common/credentials.md covering the HF / NGC / Docker token setup referenced by multiple skills. This feature is in early testing; use with caution.
- Add performant layerwise calibration for large models that don't fit on a GPU (e.g. DeepSeek-R1, Kimi-K2). See modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml for usage. Layerwise calibration also supports PTQ with intermediate progress saving, which is useful when long PTQ runs hit Slurm timeouts. See modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml for usage.
- Add an implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (modelopt.torch.quantization.src.conv). When NVFP4 quantization is applied to an nn.Conv3d layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. It uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (groups > 1) falls back to the default cuDNN path. Inference only; training mode falls back to cuDNN with a warning.
- Add FP8 MHA quantization support for vision transformers. Adds an attention-aware ONNX post-processing pass (moving the scale Mul / K-transpose before Q and inserting Q→DQ on the softmax output) in FP8QuantExporter (modelopt.onnx.export.fp8_exporter.FP8QuantExporter), per-instance nested-attention-wrapper skipping in the HF plugin, and nn.LayerNorm registration in QuantModuleRegistry so the BMM input quantizers and LayerNorm output quantizers defined in FP8_DEFAULT_CFG are honored end-to-end. See examples/torch_onnx/torch_quant_to_onnx.py for the general timm-model quantize→ONNX workflow.
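As a companion to the CalibrationDataReader item above, here is a minimal sketch of iterator-style calibration for ONNX quantization. The CalibrationDataReader base class and its get_next() contract come from onnxruntime; the keyword arguments passed to modelopt.onnx.quantization.quantize below (in particular calibration_data_reader) and the input name "input" are assumptions for illustration, so check the ONNX quantization API reference for the exact signature.

```python
# Sketch: streaming calibration batches into ONNX PTQ via a CalibrationDataReader,
# instead of preloading all calibration data into a single array.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader

import modelopt.onnx.quantization as moq


class RandomCalibReader(CalibrationDataReader):
    """Yields one feed dict per call until the calibration data is exhausted."""

    def __init__(self, num_batches: int = 32, shape=(1, 3, 224, 224)):
        self._batches = (
            {"input": np.random.rand(*shape).astype(np.float32)}  # "input" is a placeholder name
            for _ in range(num_batches)
        )

    def get_next(self):
        # Return the next feed dict, or None when there is no more data.
        return next(self._batches, None)


# NOTE: `calibration_data_reader` is an assumed keyword spelling for this sketch.
moq.quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",
    calibration_data_reader=RandomCalibReader(),
    output_path="model.quant.onnx",
)
```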
Backward Breaking Changes
- The quant_cfg field in quantization configs is now an ordered list of QuantizerCfgEntry dicts instead of a flat dictionary. Each entry specifies a quantizer_name wildcard, an optional parent_class filter, a cfg dict of quantizer attributes, and/or an enable flag. Entries are applied in list order, with later entries overriding earlier ones. The old dict-based format is still accepted and automatically converted via normalize_quant_cfg_list(), but it now emits a DeprecationWarning; new code should use the list format. All built-in configs (e.g. FP8_DEFAULT_CFG, INT4_AWQ_CFG, NVFP4_DEFAULT_CFG), examples, and YAML recipes have been updated. See the quant-cfg documentation for the new format reference and migration guide. A minimal sketch of the new format appears after this list.
- Deprecated Mllama (Llama 3.2 Vision) support in the llm_ptq and vlm_ptq examples. The model_type == "mllama" branches and MllamaImageProcessor usage have been removed from hf_ptq.py and example_utils.py. For image-text calibration of VLMs, use --calib_with_images with a supported VLM (see the Nemotron VL section in examples/llm_ptq/README.md).
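Below is a minimal sketch of the new list-based quant_cfg format next to the deprecated flat-dict form it replaces. The entry keys (quantizer_name, cfg, enable) follow the description above; the wildcard patterns and quantizer attributes (num_bits, axis) are illustrative and not copied from any built-in config.

```python
# Sketch of the new list-based quant_cfg format: entries are applied in order,
# and later entries override earlier ones. Patterns/attributes are illustrative.
new_style_cfg = {
    "quant_cfg": [
        # 8-bit per-channel weights (axis 0) for every weight quantizer.
        {"quantizer_name": "*weight_quantizer", "cfg": {"num_bits": 8, "axis": 0}},
        # 8-bit per-tensor activations for every input quantizer.
        {"quantizer_name": "*input_quantizer", "cfg": {"num_bits": 8, "axis": None}},
        # Applied last, so it wins: leave lm_head unquantized.
        {"quantizer_name": "*lm_head*", "enable": False},
    ],
    "algorithm": "max",
}

# Deprecated flat-dict form: still accepted and converted via
# normalize_quant_cfg_list(), but it now emits a DeprecationWarning.
old_style_cfg = {
    "quant_cfg": {
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": None},
        "*lm_head*": {"enable": False},
    },
    "algorithm": "max",
}
```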
Bug Fixes
- Fix Megatron utility functions for generation with pipeline parallelism, and speed up MMLU score evaluation by ~10x by batching prefill passes.
- Fix Minitron pruning (mcore_minitron) for MoE models. Importance estimation hooks were previously registered incorrectly for MoE modules, which caused the NAS step to hang.
- Fix TRT support for remote autotuning in ONNX Autotune from 10.16+ to 10.15+, and check the trtexec version instead of the TRT Python API version when using the trtexec backend.
- Exclude MatMul/Gemm nodes with K or N < 16 from ONNX INT8 and FP8 quantization. Such small-dimension GEMMs cannot efficiently use INT8/FP8 Tensor Cores, and the added Q/DQ layers cause perf regressions in TensorRT. The Gemm transB attribute is honored when deriving K.
- Fix the nvfp4_awq export error "AssertionError: Modules have different quantization formats" for MoE models (e.g. Qwen3-30B-A3B) when some experts are not exercised by the calibration data. awq_lite now applies a neutral all-ones pre_quant_scale to any expert that ends up disabled (no cache-pass tokens, NaN scales, or no search-pass tokens) so its format remains nvfp4_awq, consistent with the rest of the MoE block. A warning is emitted whenever this fallback fires.
Misc
- [Security] Changed the default of weights_only to True in torch.load for secure checkpoint loading. If you need to load a checkpoint that requires unpickling arbitrary objects, first register the class via torch.serialization.add_safe_globals([cls]) before loading. Added safe_save (modelopt.torch.utils.serialization.safe_save) and safe_load (modelopt.torch.utils.serialization.safe_load) APIs to save and load checkpoints securely. A minimal usage sketch appears after this list.
- Bump the minimum required PyTorch version to 2.8.
- [Experimental] Add support for transformers>=5.0, including generic PTQ and unified HF checkpoint export for fused MoE expert modules (Mixtral, Qwen2-MoE, Qwen3-MoE, Qwen3.5-MoE, DeepSeek-V3, Jamba, OLMoE, etc.).
- Improve megatron_preprocess_data: add --reasoning_content support for Nemotron v3 datasets, eliminate the intermediate JSONL for HuggingFace datasets, return output file prefixes from the Python API, add gzip input support (.jsonl.gz), add a --strip_newlines flag for plain-text pretraining data, add --hf_streaming for very large datasets (only consumed rows are downloaded), and auto-shuffle when --hf_max_samples_per_split is set to avoid biased sampling.
- Add installation support for Python 3.14. Only basic unit tests are verified for now; production usage still defaults to Python 3.12. Python 3.10 support will be dropped in the next release.
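To illustrate the secure checkpoint loading change noted in the Misc list above, here is a minimal sketch. The class MyMetadata and the file paths are hypothetical, and safe_save / safe_load are assumed to mirror the torch.save / torch.load calling conventions.

```python
# Sketch: loading checkpoints under the weights_only=True default.
# `MyMetadata` and the file paths are hypothetical.
import torch

from modelopt.torch.utils.serialization import safe_load, safe_save


class MyMetadata:
    """Example non-tensor object stored inside a checkpoint."""

    def __init__(self, note: str):
        self.note = note


# Checkpoints that unpickle arbitrary objects must allow-list their classes first,
# since weights_only=True rejects unknown globals during unpickling.
torch.serialization.add_safe_globals([MyMetadata])
ckpt = torch.load("checkpoint_with_metadata.pt")

# The new ModelOpt helpers for secure save/load (signatures assumed here).
safe_save({"meta": MyMetadata("nvfp4 ptq run")}, "checkpoint.pt")
restored = safe_load("checkpoint.pt")
```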