Changelog Details
- Refactor model_provider to model_builder format for ModelOpt examples by @AAnoosheh :: PR: #2107
- Add MoE layer type to hybrid models by @deepakn94 :: PR: #2259
- Various quality-of-life improvements in training loop by @deepakn94 :: PR: #2580
- Rename TensorRT Model Optimizer to Model Optimizer by @AAnoosheh :: PR: #2373
- Changes to support latent MoEs by @deepakn94 :: PR: #2296
- Create separate teacher Layer Spec in KD mode by @AAnoosheh :: PR: #2429
- Simplify parameter sync for checkpoint save by @ananthsub :: PR: #2344
- [Megatron-FSDP] Support both old and new DeviceMesh APIs. by @cspades :: PR: #2575
- Pipeline parallelism fix in RL and sequence packing rewriting by @jalbericiola :: PR: #2632
- [Megatron-FSDP] Build default FSDP DeviceMesh, and remove model arg from fully_shard_optimizer(). by @cspades :: PR: #2471
- Refit prep 3 by @wdykas :: PR: #2708
- Add options to improve data loader initialization time, especially at scale by @asolergi-nv :: PR: #2445
- Hsdp register submesh fix lifuz mirror by @tomlifu :: PR: #2467
- Prep work for migrating to types from ModuleSpec by @nschank :: PR: #2668
- feat(MoE): Refactor cuda_graph_scope by @buptzyb :: PR: #1920
- Reflect the changes made by #1920 in RL by @tdene :: PR: #2780
- Fix 2780 by @tdene :: PR: #2791
- Update PR message by @Phlip79 :: PR: #2778
- Ignore bot for oncall by @Phlip79 :: PR: #2756
- Only assign oncall to main PRs by @Phlip79 :: PR: #2755
- Explicitly zero out padding token outputs when using quantization scales by @santhnm2 :: PR: #2585
- Synchronize total block count across pipeline parallel ranks by @santhnm2 :: PR: #2578
- Optimize TE CUDA Graph capturing time by @buptzyb :: PR: #2482
- Do a pass of typing fixes on transformer/ by @nschank :: PR: #2766
- moe: remove unused variable scale_up by @WineChord :: PR: #1670
- build: Pin down `nvidia-nvshmem-cu13` (#2798) by @ko3n1g :: PR: #2803
- DeepSeek V3 FSDP Fix for Precision-Aware Optimizer by @tomlifu :: PR: #2466
- Minor Fixes on Post-Training ModelOpt Examples by @ChenhanYu :: PR: #2813
- fix(moe): Support HybridEP and reduce memory overhead for 1F1B A2A overlap by @lhb8125 :: PR: #2236
- Inference memory test by @wdykas :: PR: #2724
- Move batch invariance mode init to initialize.py by @santhnm2 :: PR: #2832
- Move full model init to cuda stream to avoid race condition leading to empty parameters in DDP by @jstjohn :: PR: #2652
- [docs] Cleanup homepage by @Phlip79 :: PR: #2823
- [docs] Update oncall doc by @Phlip79 :: PR: #2822
- Make default for rerun_mode=disabled not terminate with non-fatal rer… by @kwyss-nvidia :: PR: #2773
- Bugfix: ensure spawned persistent checkpoint worker sets its CUDA device correctly for CUDA context creation / hypothetical memory allocations by @ankurv-nvidia :: PR: #2710
- Implementation of a more flexible optimizer/scheduler override system by @jstjohn :: PR: #2723
- ci(fix): PyPI upload by @ko3n1g :: PR: #2843
- ci(fix): Don't fail on empty var by @ko3n1g :: PR: #2850
- Add RL support for MOEs by @jon-barker :: PR: #2742
- ci(fix): GH release version tag by @ko3n1g :: PR: #2854
- Reduce the scope of the side stream around DDP initialization by @jstjohn :: PR: #2852
- Manually update first oncall rotation by @Phlip79 :: PR: #2855
- Remove flaky iteration time functional test by @buptzyb :: PR: #2862
- Nccl gloo refit for RL by @wdykas :: PR: #2812
- build: Bump jet-client by @ko3n1g :: PR: #2876
- Dynamic Inference | Evict and re-compute context requests. by @lmcafee-nvidia :: PR: #2738
- Change oncall team name by @Phlip79 :: PR: #2861
- Revert "Dynamic Inference | Evict and re-compute context requests. (#2738)" by @chtruong814 :: PR: #2884
- [main] feat(moe): Support moe shared expert gate for Qwen3-Next (2/4) by @yuzhongw-nvidia :: PR: #2751
- [main] feat(moe): Support attention output gate for Qwen3-Next (3/4) by @yuzhongw-nvidia :: PR: #2752
- [docs] Fix docs and add generation doc by @Phlip79 :: PR: #2882
- FP8 params support for megatron-fsdp (MXFP8/Blockwise) by @kunlunl :: PR: #2239
- docs: fix broken images, links, and typos across documentation by @sbhavani :: PR: #2794
- ci(fix): Release version by @ko3n1g :: PR: #2873
- Assign mcore-oncall instead of user by @Phlip79 :: PR: #2879
- tests: Disable Mamba MOE model test after 43b4471 by @ko3n1g :: PR: #2886
- Fix mamba moe unit test after commit reversion by @jon-barker :: PR: #2888
- [training migration] add RNG config dataclass by @maanug-nv :: PR: #2347
- Fix inference server to make nemogym work. by @yobibyte :: PR: #2887
- Use DynamicInferenceCoordinator for text generation server by @santhnm2 :: PR: #1910
- Improve error messages in mamba moe unit test by @jon-barker :: PR: #2889
- [training migration] Add RerunStateMachineConfig dataclass by @maanug-nv :: PR: #2436
- Add retry loop with exponential backoff in dataloader as a form of in-application fault tolerance by @deepakn94 :: PR: #2836
- [training migration] Add SchedulerConfig dataclass by @maanug-nv :: PR: #2400
- RL: Fix cu_seqlens construction for PackedSeqParams by @mathemakitten :: PR: #2883
- [training migration] Add ProfilingConfig dataclass by @maanug-nv :: PR: #2393
- [MoE] Apply grouped gemm bias before unpadding for FP8 by @cuichenx :: PR: #2817
- Update Slack user group when oncall changes by @Phlip79 :: PR: #2859
- Remove unused FlashAttention3 args by @santhnm2 :: PR: #2898
- Use different token for assign logic by @Phlip79 :: PR: #2893
- chore: Add `--no-container-mount-home` to script by @ko3n1g :: PR: #2906
- build: Bump deps by @ko3n1g :: PR: #2911
- Fix RL sequence packing bin size by @tdene :: PR: #2909
- feat: m4 leftover changes by @yaoyu-33 :: PR: #2506
- Revert "Remove unused FlashAttention3 args (#2898)" by @chtruong814 :: PR: #2916
- ci: Skip broken tests after dependency bump by @chtruong814 :: PR: #2934
- Ko3n1g/build/downgrade flashinfer by @ko3n1g :: PR: #2937
- ci: Skip unit test cleanup by @chtruong814 :: PR: #2940
- build: 26.02 dependency bump main by @ko3n1g :: PR: #2923
- RL refit pipelining support by @wdykas :: PR: #2878
- [MAIN][NVFP4][MOE] 128 Zero Padding for Grouped Quantization kernels and Cuda Graph Support by @zhongbozhu :: PR: #2655
- Support DDP overlap for models with repeated parameters by @deepakn94 :: PR: #2837
- Add muon and layerwise distributed optimizer by @FDecaYed :: PR: #2241
- Revert "[dev] Add assertion for mxfp8 params without dp overlap (#2270)" by @ko3n1g :: PR: #2901
- Unit test for model_provider to model_builder coupling by @AAnoosheh :: PR: #2925
- ci: Onboard GB200 by @ko3n1g :: PR: #2847
- Install slack-sdk using uv by @Phlip79 :: PR: #2948
- Inference | Evict overflow paused requests from context. by @lmcafee-nvidia :: PR: #2926
- Enable training cudagraphs for RL by @mathemakitten :: PR: #2452
- Various fixes to in-job restarter and better time accounting of startup operations by @hexinw-nvidia :: PR: #2698
- feat(moe): Support placing MTP layers into standalone stages by @BestJuly :: PR: #2136
- Fix minor README wording and capitalization by @Deepak-J0shi :: PR: #2928
- ci: Restore grpo tests by @ko3n1g :: PR: #2952
- Fix GitHub GRPO resharding functional test by @tdene :: PR: #2927
- cp: `ci(fix): GB200 race condition (2962)` into `main` by @ko3n1g :: PR: #2963
- Add out-of-SLA link by @Phlip79 :: PR: #2903
- feat(moe): Fine-grained activation offloading by @lhb8125 :: PR: #1913
- Fix broken mamba-moe unit test by @jon-barker :: PR: #2970
- ci: Fix GB200 change by @ko3n1g :: PR: #2969
- Update golden values for reshard test by @tdene :: PR: #2971
- chore: Update golden values by @ko3n1g :: PR: #2973
- Pass through --trust-remote-code and add this to all Nemotron model configs by @ChenhanYu :: PR: #2939
- Cuda 13 UVM by @wdykas :: PR: #2957
- Enable phase transition iterations by @jkamalu :: PR: #2938
- add missing import in rl_utils.py by @jon-barker :: PR: #2915
- Add sequence packing support for hybrid model by @duncanriach :: PR: #2913
- [Main] Partial CUDA Graph support for EP Overlap by @Wohox :: PR: #2184
- docs(megatron-fsdp): add Megatron-FSDP user guide by @xuwchen :: PR: #2396
- DeepSeek V3.2 support by @kunlunl :: PR: #2440
- fully remove zarr support by @dimapihtar :: PR: #2944
- chore: Standardize setuptools version by @ko3n1g :: PR: #2975
- ci: Run functional tests on main by @ko3n1g :: PR: #2983
- ci(fix): CI_COMMIT_BRANCH on forks by @ko3n1g :: PR: #2982
- [main] feat(moe): Support gated delta net for Qwen3-Next (1/4) by @yuzhongw-nvidia :: PR: #1989
- ci: Add more gb200 nightly tests by @ko3n1g :: PR: #2981
- [main] feat(moe): Support apply wd to qk layernorm for Qwen3-Next (4/4) by @yuzhongw-nvidia :: PR: #2753
- Re-submit "Various fixes to in-job restarter and better time accounting of startup operations" by @hexinw-nvidia :: PR: #2954
- Use slack-sdk in a different manner by @Phlip79 :: PR: #2950
- Hybrid Context Parallel Feature by @parthmannan :: PR: #2282
- Inference | Move `assert active_request_count > 0`. by @lmcafee-nvidia :: PR: #2958
- Set `token_dtype_code` init value in `GPTDatasetConfig` to fix CI by @asolergi-nv :: PR: #2912
- [main] ci(moe): Add `--apply-wd-to-qk-layernorm` flag to the gdn test case by @yuzhongw-nvidia :: PR: #2995
- ci: Disable step time on `gpt3_moe_mcore_te_tp2_pp2_ep4_etp1_no_mtp_n…` by @ko3n1g :: PR: #2991
- ci: Fix workflows on main by @ko3n1g :: PR: #2990
- Make Megatron-FSDP torch.compile compatible by @shjwudp :: PR: #2425
- [Megatron-FSDP] Test FP8 activations + parameter sharding with Megatron-FSDP fully-shard. Update README. by @cspades :: PR: #2894
- chore: Escape special chars by @ko3n1g :: PR: #3014
- Improve memory logging by @deepakn94 :: PR: #2839
- Add a wrapper function for FA3 _flash_attn_forward call by @santhnm2 :: PR: #2933
- chore: Set umask 0002 by @ko3n1g :: PR: #3027
- Make attn mask inversion in-place instead of allocating it again by @mathemakitten :: PR: #3019
- [Megatron-FSDP] Fix incorrect gradient scaling target. by @cspades :: PR: #3023
- Replaces ModuleSpec with Protocols for some of the inputs to SelfAttention/CrossAttention by @nschank :: PR: #2761
- Various CUDA graph improvements on capture time, replay time, memory footprint by @jiemingz :: PR: #2572
- Update oncall schedule by @Phlip79 :: PR: #3017
- Ensure that last prefill chunk is handled correctly by Mamba models by @santhnm2 :: PR: #2897
- Add script for batch running CI tests across distinct nodes by @jon-barker :: PR: #3047
- Refit EP support by @wdykas :: PR: #2972
- Catch case of negative tokens to generate by @tdene :: PR: #2985
- Sync GitHub and Slack teams by @Phlip79 :: PR: #3037
- ci: Remove Github transition comment from CI by @chtruong814 :: PR: #2881
- Support custom Router implementations in MoELayer by @nschank :: PR: #2891
- ci: Override N_REPEAT by @ko3n1g :: PR: #3051
- Update type hints and doc strings for moe_utils.py by @JavaZeroo :: PR: #2821
- Supporting inference when called within an asyncio loop by @shanmugamr1992 :: PR: #2816
- Remove calculation of padding token in moe routing loss by @HaochenYuan :: PR: #2142
- Bug fix with --no-use-tokenizer-from-checkpoint-args by @jon-barker :: PR: #3049
- Revert "Bug fix with --no-use-tokenizer-from-checkpoint-args (#3049)" by @thomasdhc :: PR: #3057
- Add health endpoint to dynamic text gen server by @santhnm2 :: PR: #3009
- ci: Skip test_precision_aware_optimizer by @thomasdhc :: PR: #3062
- Support multimodule communication by @yaoyu-33 :: PR: #2031
- Revert "Support multimodule communication (#2031)" by @ko3n1g :: PR: #3068
- Revert "Remove calculation of padding token in moe routing loss (#2142)" by @ko3n1g :: PR: #3069
- Add ability to save wgrads and dgrads by @deepakn94 :: PR: #3032
- ci: Mark test_mode_partial_cudagraph unit tests as flaky by @chtruong814 :: PR: #3064
- Keep FSDP's and DDP's finish_grad_sync API identical by @deepakn94 :: PR: #3070
- (REPLAY) Bug fix with --no-use-tokenizer-from-checkpoint-args by @jon-barker :: PR: #3059
- Optimizing post-processing of requests by @sidsingh-nvidia :: PR: #2920
- Fix broken functional tests in #2920 by @sidsingh-nvidia :: PR: #3071
- fix ep weight gradnorm/num_zero calculation error for muon by @FDecaYed :: PR: #3024
- [training migration] Add LoggerConfig dataclass by @maanug-nv :: PR: #2414
- Added --ft-num-warmup-iters option. by @hexinw-nvidia :: PR: #3052
- Reapply "Various CUDA graph improvements on capture time, replay time, memory footprint (#2572)" by @jiemingz :: PR: #3056
- fix(fsdp): add CLI argument for outer_dp_sharding_strategy by @liuyun7345 :: PR: #3053
- ci: Log node name by @ko3n1g :: PR: #3081
- docs: Release docs by @ko3n1g :: PR: #3055
- Support NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 FP8/NVFP4 PTQ in example by @ChenhanYu :: PR: #3079
- add all_gather process-group for overlapping in fsdp disributed training by @jeffnvidia :: PR: #2663
- Add router replay for MoE models by @litianjian :: PR: #2101
- ci: Disable gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq by @ko3n1g :: PR: #3099
- ci: Repeat func tests, save logs of unit tests and lessen debug output by @ko3n1g :: PR: #3089
- ci: Update improvement of step-time by @ko3n1g :: PR: #3104
- ci: Add GPU health checks by @ko3n1g :: PR: #3100
- Inference functional tests: Write outputs to INFERENCE_OUTPUT_PATH instead of TENSORBOARD_PATH by @mathemakitten :: PR: #3061
- Harden GRPO functional tests by @jon-barker :: PR: #3065
- build: Bump to TE2.12 by @ko3n1g :: PR: #3086
- Update moe readme. by @Victarry :: PR: #2830
- Logging cleanup (only log on rank 0 if possible) by @deepakn94 :: PR: #3036
- Move all bert and t5 tests to nightly by @Phlip79 :: PR: #3106
- Create greptile.json by @Phlip79 :: PR: #3087
- Fix bug of reuse_grad_buf_for_mxfp8_param_ag by @kunlunl :: PR: #2802
- Fix for Hybrid CP by @parthmannan :: PR: #3091
- Fix GRPO re-fit functional test by @jon-barker :: PR: #3113
- Minimize README contents by @megnvidia :: PR: #3020
- Add end-to-end tests for M-FSDP and ND-Parallel by @shjwudp :: PR: #3031
- [M-FSDP] Fix double buffering not working with activation recompute by @shjwudp :: PR: #2689
- Fix Multimodal Dockerfile by @faradawn :: PR: #3006
- [training migration] Add CheckpointConfig dataclass by @maanug-nv :: PR: #2431
- [training migration] Add StragglerDetectionConfig dataclass by @maanug-nv :: PR: #2435
- Standardize RL unit tests by @tdene :: PR: #3088
- remove retro by @dimapihtar :: PR: #3001
- Use the latest hybrid-ep by @Autumn1998 :: PR: #3093
- ci: Mark test_compatible_with_nd_parallel as flaky by @ko3n1g :: PR: #3122
- build: Use merge-commit-sha for container by @ko3n1g :: PR: #3123
- Refactor `rl_offload_kv_cache_during_training` to offload KV cache to CPU while retaining fixed virtual address by @mathemakitten :: PR: #3048
- Disable Greptile status comments by @Phlip79 :: PR: #3127
- ci: Add unit tests to merge queue by @ko3n1g :: PR: #3125
- Create CodeRabbit config by @Phlip79 :: PR: #3131
- Fix RL optimizer offload by @jon-barker :: PR: #3112
- build: Explicitly set minimum torch version to >= 2.6.0 by @chtruong814 :: PR: #3085
- Move kitchen extension file to private kitchen repository by @kwyss-nvidia :: PR: #2779
- Revert "Fix RL optimizer offload (#3112)" by @ko3n1g :: PR: #3141
- Revise and move KD docs by @AAnoosheh :: PR: #3108
- build: Bump FLA by @ko3n1g :: PR: #3139
- ci: Add job timeouts by @ko3n1g :: PR: #3142
- Multiturn rollout support prep by @yobibyte :: PR: #2966
- ci: Set NODE_RANK by @ko3n1g :: PR: #3143
- Reapply 3955c49 by @jon-barker :: PR: #3146
- Revert "Multiturn rollout support prep (#2966)" by @ko3n1g :: PR: #3153
- Force input ids generated by mock dataset are < vocab_size by @asolergi-nv :: PR: #2945
- Fix coderabbit instructions error by @Phlip79 :: PR: #3150
- Add a check to make sure we are distributing all the layers when using `--decoder-first-pipeline-num-layers` & `--decoder-last-pipeline-num-layers` by @asolergi-nv :: PR: #2947
- Automatically choose available ports in ZMQ by @tdene :: PR: #2278
- Generate arguments from TransformerConfig by @maanug-nv :: PR: #2896
- Fix for PR-2142 by @HaochenYuan :: PR: #3165
- ci: Onboard more GB200 tests by @ko3n1g :: PR: #3145
- ci(hotfix): Alert for GB200 by @ko3n1g :: PR: #3168
- Fix SFTDataset truncation bug by @duncanriach :: PR: #3158
- Vitalyk/multiturn v2 by @yobibyte :: PR: #3167
- ci: Disable the api check for now by @chtruong814 :: PR: #3157
- ci: Add DSv3 proxy by @ko3n1g :: PR: #3169
- Nvshmem refit by @wdykas :: PR: #2696
- [Community][Main] fix(moe): Fix theoretical memory calculation of layernorm. by @1195343015 :: PR: #2434
- fix: Set --refit-method default to gloo by @wdykas :: PR: #3172
- [fix] Bug fix for offloading in evaluate() by @lhb8125 :: PR: #3043
- cp: `Fix: nccl-ub in ddp path (3181)` into `main` by @ko3n1g :: PR: #3182
- Miscellaneous inference cleanup by @santhnm2 :: PR: #2955
- ci: Fix DSv3 by @ko3n1g :: PR: #3188
- Fix missing argument in MoELayer.forward() by @jiemingz :: PR: #3133
- Fix H2D stream synchronization in optimizer offload by @tgkyrie :: PR: #3140
- Add MTP support for hybrid models by @rkarimimahab :: PR: #2363
- docs: improve Megatron-LM and Megatron Core descriptions by @sbhavani :: PR: #3115
- Handle `step` key correctly in checkpoint save with `--optimizer-cpu-offload` by @ahmadki :: PR: #2874
- cp: `ci: Checkpoint retention (3205)` into `core_r0.16.0` by @ko3n1g :: PR: #3222
- cp: `Fix uv install for GH actions (#3259)` by @ko3n1g :: PR: #3275
- cp: `Fix missing PackedSeqParams import (3214)` into `core_r0.16.0` by @ko3n1g :: PR: #3236
- cp: `fix: numpy overflow (3306)` into `core_r0.16.0` by @ko3n1g :: PR: #3328
- Missing import fix (#3241) by @parthmannan :: PR: #3298
- cp: `fix: T5 dataset (#3307)` by @ko3n1g :: PR: #3329
- cp: `build: Bump TE on 2.12` by @ko3n1g :: PR: #3372
- cp: `Improved parallel logging of learning rate` by @ko3n1g :: PR: #3367
- cp: `ci: Update release-docs workflow to use FW-CI-templates v0.72.0 (3438)` into `core_r0.16.0` by @ko3n1g :: PR: #3454
- cp: `ci: Remove environments (3462)` into `core_r0.16.0` by @ko3n1g :: PR: #3481
- cp: Update release workflow to include changelog and publish docs (#3472) by @chtruong814 :: PR: #3480
- chore(beep boop 🤖): Bump `uv.lock` (core_r0.16.0) (2026-02-19) by @svcnvidia-nemo-ci :: PR: #3502
- docs: Update docs for 0.16.0 by @chtruong814 :: PR: #3505
- chore(beep boop 🤖): Bump `uv.lock` (core_r0.16.0) (2026-02-23) by @svcnvidia-nemo-ci :: PR: #3533
- docs: Update docs version picker for 0.16.0 to include nightly by @chtruong814 :: PR: #3547
- cp: `ci: Test docs build (#3583)` by @ko3n1g :: PR: #3593
- cp: Changes of CICD workflow by @ko3n1g :: PR: #3603