github NVIDIA/Megatron-LM core_v0.18.0
NVIDIA Megatron Core 0.18.0

Changelog Details
  • fix(ci): replace actions/setup-python with apt-get to avoid 429 rate limits by @ko3n1g :: PR: #4072
  • ci: Fix package name for code-freeze workflow by @ko3n1g :: PR: #4077
  • chore: bump _code_freeze workflow to v0.86.0 by @ko3n1g :: PR: #4078
  • Fix checkpoint inspector by @janEbert :: PR: #4079
  • Update docs to conform to NVIDIA style guides by @megnvidia :: PR: #4068
  • Miscellaneous inference fixes by @santhnm2 :: PR: #4030
  • fix fine_grained_callables with fused rmsnorm residual by @CarlosGomes98 :: PR: #4026
  • [Main][feat] Support overlapping A2A Combine backprop with wgrad GEMM by @Wohox :: PR: #3795
  • Modify mfsdp default data-parallel-sharding-strategy by @wplf :: PR: #3691
  • Fix fsdp_dtensor conversion for pretrained-only checkpoints by @DAISY-gh :: PR: #3912
  • Guard NVshmem issues by @wdykas :: PR: #4093
  • m-fsdp: wire use_precision_aware_optimizer from ddp_config to ParamAn… by @rapatel :: PR: #4024
  • Megatron-FSDP: Add MXFP8 transpose helper buffer for Hybrid FSDP by @shjwudp :: PR: #3918
  • feat(fsdp): use TE general_gemm for mixed-precision wgrad in FSDP path by @Victarry :: PR: #3822
  • Megatron-FSDP: Fix insufficient double buffers during gradient reduce by @shjwudp :: PR: #4054
  • Fix M-FSDP MXFP8 related BUGs by @shjwudp :: PR: #3991
  • Megatron-FSDP: Make _pre_forward_param_unshard and _register_post_backward_hook formal by @shjwudp :: PR: #4029
  • FIX: Use decoupled gradients for precision-aware M-FSDP grad norm by @XueSongTap :: PR: #3746
  • Align chat completions endpoint with vLLM by @santhnm2 :: PR: #4063
  • [Megatron-FSDP] Fix compatibility with frozen parameters and add unit tests by @shjwudp :: PR: #3287
  • [M-FSDP] Refactor uneven dtensor to full tensor and add UT by @shjwudp :: PR: #3190
  • Add agent instruction files by @Phlip79 :: PR: #4102
  • Bump eopt version by @skyw :: PR: #4100
  • Refactor emerging optimizer integration by @skyw :: PR: #4113
  • Fix over provisioning of Mamba state memory when max_requests is set by @santhnm2 :: PR: #4114
  • base strategy simplification by @dimapihtar :: PR: #4001
  • add support for DCP and FSDP async save by @dimapihtar :: PR: #4027
  • Add more emerging optimizers (#3907) by @skyw :: PR: #4119
  • Fix FSDP checkpoint conversion and loading for Qwen3.5-VL by @DAISY-gh :: PR: #3936
  • docs: update mcore optimizer docstrings to google style by @Akshat8510 :: PR: #2799
  • Set tensor-parallel attributes irrespective of perform_initialization by @ilml :: PR: #4084
  • docs: add developer-guide skill with CI/CD and failure navigation guidance by @ko3n1g :: PR: #4035
  • chore: Move skills by @ko3n1g :: PR: #4136
  • ci: Let Claude react to comment by @ko3n1g :: PR: #4135
  • Nemotron3 Super GB200 release config by @maanug-nv :: PR: #4118
  • Enable CUDA graph for ADAM optimizer by @vasunvidia :: PR: #3429
  • Claude review should recommend testing by @Phlip79 :: PR: #4137
  • cleanup: remove unused scatter_gather_tensors_in_pipeline argument by @Phlip79 :: PR: #4140
  • fix: Remove fail-fast (-x) and guard distributed teardown against deadlock by @ko3n1g :: PR: #4139
  • Claude: add respond-to-issue skill by @Phlip79 :: PR: #4141
  • Fix muon getter backward compatibility by @skyw :: PR: #4157
  • Audit of user guide by @megnvidia :: PR: #4098
  • Fix RerunStateMachine crash (TypeError: 'NoneType' object is not subscriptable) by not saving a checkpoint after a transient NaN / Inf by @yezhengmao1 :: PR: #3981
  • Preserve type of decorated methods/classes by @nschank :: PR: #4062
  • update muon test case to use new interface by @skyw :: PR: #4163
  • [M-FSDP] Fix Tensor Parallel mode detection by @shjwudp :: PR: #3191
  • fix: remove weights_only=False for multimodal example by @faradawn :: PR: #4104
  • Cudagraphs: Fix sequence packing segfault more generally by @mathemakitten :: PR: #4162
  • Make MTP work with materialize_only_last_token_logits by @santhnm2 :: PR: #4166
  • Add unit test for Mamba EP inference (eager fallback with mixed CUDA graphs) by @santhnm2 :: PR: #4085
  • update docs with respect to async changes by @dimapihtar :: PR: #4177
  • update checkpointing docs with respect to async changes by @dimapihtar :: PR: #4208
  • chore: improve build-and-test skill with trigger rules and dependency workflow by @ko3n1g :: PR: #4199
  • Fix layerwise optimizer with expt_dp_size=1 and contention with element-wise distributed optimizer by @skyw :: PR: #4138
  • ci: add --cluster-a100/h100/gb200 args to trigger_internal_ci.py by @ko3n1g :: PR: #4195
  • ci: Update golden values for nightly tests by @chtruong814 :: PR: #4215
  • rename async_allgather to overlap_param_gather by @skyw :: PR: #4217
  • Fix Slack sync for users with GitHub email privacy enabled by @Phlip79 :: PR: #4220
  • Miscellaneous MTP inference fixes by @santhnm2 :: PR: #4191
  • Move inference guards out of arguments.py by @mathemakitten :: PR: #4210
  • Fix: enable fine-grained activation offloading for Mamba model. by @fanshiqing :: PR: #4173
  • bump NVRx by @dimapihtar :: PR: #4178
  • Update tokenizer args for Nemotron3 release config by @maanug-nv :: PR: #4239
  • build: add dynamic git-versioning and drop rc0 pre-release tag by @ko3n1g :: PR: #4212
  • Fix unnecessary permute padding for non-quantized MoE dispatch by @xiaoxi-wangfj :: PR: #4038
  • Fix split state dict main by @kunlunl :: PR: #3676
  • Add /split-pr Claude Code command for splitting PRs by CODEOWNERS by @Phlip79 :: PR: #4160
  • Enable FP8 DPA for MXFP8 recipe by @vasunvidia :: PR: #4066
  • Enable AG/RS overlap with explicit process group passing by @jeffnvidia :: PR: #3249
  • Enable cpu_offloading with Full iteration CUDA graph by @vasunvidia :: PR: #3969
  • Fix TransformerConfig validation for mixed dense/MoE upcycling by @rkteddy :: PR: #3647
  • Remove cross-rank synchronization during checkpoint load & deprecate torch.distributed.checkpoint.state_dict_loader.load_state_dict by @asolergi-nv :: PR: #2864
  • Fix incorrectly set decoupled_grad and DistOpt mechanics for MFSDP. by @cspades :: PR: #4133
  • Refit Miscellaneous by @wdykas :: PR: #3973
  • Add conditions_embeddings argument to TransformerBlock, TransformerLayer for DiT (diffusion transformer) by @huvunvidia :: PR: #4134
  • Fix build_sequences_per_dataset output path arg usage by @DhineshPonnarasan :: PR: #4144
  • ci: Flush pending CUDA work before the barrier in destroy_model_parallel by @chtruong814 :: PR: #4259
  • Update oncall schedule by @Phlip79 :: PR: #4257
  • docs(moe): Update MoE README by @sbhavani :: PR: #3664
  • Revert "Add conditions_embeddings argument to TransformerBlock, Trans… by @ko3n1g :: PR: #4270
  • reduce the number of shared expert streams by @yangbofun :: PR: #3752
  • remove legacy Bert code by @dimapihtar :: PR: #4204
  • [Main] Feat(moe): Gated delta net context parallel (CP) by @yuzhongw-nvidia :: PR: #2642
  • remove t5 legacy code by @dimapihtar :: PR: #4203
  • fix: handle list-typed process groups in ProcessGroupCollection.repr by @cluster2600 :: PR: #3753
  • Fix Context Parallelism documentation link by @liangxs :: PR: #4149
  • [MLA] fix: Pad V when Q/V head dims differ for THD by @HollowMan6 :: PR: #3003
  • fix(megatron-fsdp): build expt_device_mesh only for MoE models by @xuwchen :: PR: #3831
  • Allow the evaluation batch size to differ from the training batch size by @michal2409 :: PR: #4014
  • Add @NVIDIA/transformer review group to megatron/core/transformer/ by @Phlip79 :: PR: #4281
  • Reset AG_pipeline bucket status after validation step. by @vasunvidia :: PR: #3155
  • Enhance and fix NVTX for training by @yaox12 :: PR: #3642
  • NVFP4 native weights for DDP by @WanZzzzzz :: PR: #4005
  • Remove unnecessary arguments for layerwise distributed optimizer by @FDecaYed :: PR: #4272
  • reuse grad buffer for layer-wise param allgather by @FDecaYed :: PR: #3751
  • feat(ci): add strict review mode to Claude review workflow by @Victarry :: PR: #4197
  • Fix stale approvals by @Phlip79 :: PR: #4280
  • [MoE] Add a new score function to the router by @yaox12 :: PR: #3673
  • [MoE] Improvement of shared expert overlap, support shared expert overlap for FlexDispatcher by @Victarry :: PR: #2207
  • build: bump DeepEP to 34152ae by @ko3n1g :: PR: #4228
  • ci: mark test_fused_indexer_loss_gradient_tp_consistency as flaky_in_dev by @ko3n1g :: PR: #4299
  • Fix typo in PR4133. by @cspades :: PR: #4277
  • ci: add retry loop to apt-get update to handle transient mirror sync failures by @ko3n1g :: PR: #4209
  • fix: enforce correct pass thresholds for deterministic and approximate tests by @ko3n1g :: PR: #4238
  • remove legacy biencoder and realm models by @dimapihtar :: PR: #4205
  • ci: add configurable launcher support for functional tests (ft_launcher / torchrun) by @ko3n1g :: PR: #4298
  • chore: document --target main for local Docker builds by @ko3n1g :: PR: #4307
  • Extract args init to launch scripts by @maanug-nv :: PR: #4225
  • [Main] Fix TE version check for retain_pinned_cpu_buffers in cpu offload by @BestJuly :: PR: #4267
  • Fix documented shape by @janEbert :: PR: #3486
  • ci: add sync-skills workflow, rename CLAUDE.md → AGENTS.md, move .claude/skills → skills/ by @ko3n1g :: PR: #4303
  • Get device correctly when module returns a dict instead of individual tensor by @shifangx :: PR: #4265
  • remove vision legacy code by @dimapihtar :: PR: #4202
  • feat: long convergence resiliency for release tests by @ko3n1g :: PR: #4335
  • ci(action): improve GitHub Actions output UX by @ko3n1g :: PR: #4337
  • build: bump TransformerEngine to release_v2.14 by @ko3n1g :: PR: #4331
  • feat: add create-issue skill by @ko3n1g :: PR: #4338
  • M4 leftover for TE cuda graph by @shifangx :: PR: #3137
  • fix: wait for async P2P send before deallocating output tensor by @ZhiyuLi-Nvidia :: PR: #4047
  • ci(gb200): add 1-node mr-github functional test variants by @ko3n1g :: PR: #4334
  • Fix potential coredump issue that occurs when saving a checkpoint by @ezioliao :: PR: #1871
  • docs: bump versions1.json to 0.17.0 (latest) by @ko3n1g :: PR: #4360
  • Port DeepSeek Sparse Attention to MambaModel by @janEbert :: PR: #3553
  • Add tables and histogram for RL staleness by @tdene :: PR: #4097
  • [docs] ci: use parent-relative json_url for version picker by @ko3n1g :: PR: #4367
  • Fix bug with non-partial rollouts by @tdene :: PR: #3964
  • Add QK layernorm support for dot-product attention in MambaModel by @Phlip79 :: PR: #4067
  • Docs: improve docstrings and comments in example training loop by @DhineshPonnarasan :: PR: #4041
  • feat(ckpt): add --async-ckpt-use-cpu-shm argument by @sbak5 :: PR: #4355
  • cp: Fix UT timeout (#4310) by @chtruong814 :: PR: #4373
  • Fix RL reward due to stop token by @tdene :: PR: #4096
  • FA4 Inference by @wdykas :: PR: #4186
  • Make param_index_map always use unpacked (full numel) offsets by @deepakn94 :: PR: #4328
  • Add activation logging and tokens per expert logging by @Mellonta :: PR: #3842
  • Fix RL to once again work with --skip-train by @tdene :: PR: #4249
  • Fix Megatron initialization with extra_args_provider by @santhnm2 :: PR: #4327
  • Rename MambaModel/MambaStack to HybridModel/HybridStack by @Phlip79 :: PR: #4099
  • fix(ci): wrap uv install in retry block by @ko3n1g :: PR: #4387
  • Call save_checkpoint_and_time() when saving checkpoint and compute elapsed duration for saving checkpoint before logging timer by @awsankur :: PR: #4263
  • refactor(tests): move NCCL env vars from docker launcher to shell training script by @ko3n1g :: PR: #4390
  • Remove packed_attention_mask unused parameter by @tdene :: PR: #3859
  • Second batch of audit edits by @megnvidia :: PR: #4115
  • revert: replace rampup batch size scheduler with custom step batch size schedules (#3779) by @ko3n1g :: PR: #4404
  • Replace rampup batch size scheduler with custom step batch size schedules by @deepakn94 :: PR: #4411
  • Megatron-FSDP: log mcore detection only after imports succeed by @wujingyue :: PR: #4400
  • ci(gb200): re-enable tunable_overlap 1-node mr-github test by @ko3n1g :: PR: #4405
  • Fix local docs building by @Phlip79 :: PR: #4416
  • RL: Onload optimizer after logprobs computation by @tdene :: PR: #4235
  • Add RL token throughput and packing metrics by @tdene :: PR: #3877
  • ci: remove publish:merge_into_dev job by @ko3n1g :: PR: #4421
  • docs: add data loading best practices for large-scale training by @sbhavani :: PR: #4236
  • Fix: Auto enable manual registration and enhance the documentation by @youngeunkwon0405 :: PR: #3295
  • Fix nvtx_decorator to check _nvtx_enabled at call time by @minitu :: PR: #4184
  • fix merges_file typo in megatron_hf_tokenizer by @chelseajohn :: PR: #4392
  • Enable NullTokenizer for pretraining to reduce I/O access by @asolergi-nv :: PR: #4057
  • docs: Add SECURITY.md by @chtruong814 :: PR: #4431
  • Mamba inference opt by @wdykas :: PR: #4414
  • DDP refactoring: Extract parameter layout computation into optimizer classmethod by @deepakn94 :: PR: #3812
  • Update PR template with explicit request for issue by @Phlip79 :: PR: #4409
  • Misc inference fixes by @sidsingh-nvidia :: PR: #4397
  • Rename Mamba to Hybrid outside megatron/core by @Phlip79 :: PR: #4159
  • Include mtp layers in token per expert logging by @Mellonta :: PR: #4412
  • fix: NVRx async compatibility and defer resiliency import by @sbak5 :: PR: #4420
  • fix(checkpoint_inspector): allow empty --param-to-param-group-map-json by @DAISY-gh :: PR: #4403
  • Add YARN support for hybrid_model by @guihong-nv :: PR: #4244
  • [training migration] Add container class for config dataclasses by @maanug-nv :: PR: #4227
  • Inference: Fix broken functional tests on gitlab by @sidsingh-nvidia :: PR: #4454
  • SafeUnpickler class for safe pickle usage by @dimapihtar :: PR: #4319
  • get rid of weights_only=False by @dimapihtar :: PR: #4434
  • Inference | Per-block MoE routing storage for prefix caching by @lmcafee-nvidia :: PR: #4301
  • Add troubleshooting tip for 'access forbidden' by @balasaajay :: PR: #4449
  • Fix checkpoint loading with rerun state machine by @YangFei1990 :: PR: #4448
  • Add misc CUDA graph sugar to CudaGraphManager by @tdene :: PR: #4425
  • Inference: Add the embedding and output layer in the full_iteration_inference cuda graph scope for hybrid models by @sidsingh-nvidia :: PR: #4440
  • Important bugfixes in local CG implementation that were leading to loss curve gaps for latent MoE models by @jiemingz :: PR: #4433
  • fix: Replace polynomial rolling hash with SHA-256 for prefix caching by @lmcafee-nvidia :: PR: #4158
  • feat(ckpt): expose validate_access_integrity knob on dist-ckpt load by @asolergi-nv :: PR: #4422
  • Fix multivalidation by @RPrenger :: PR: #3388
  • Add missing knob for reduce_scatter_with_fp32_accumulation by @WanZzzzzz :: PR: #4410
  • Enable CUDA graphs for MTP inference by @santhnm2 :: PR: #4260
  • checkpoint integrity verification by @dimapihtar :: PR: #4305
  • Fix cache gating by @wdykas :: PR: #4455
  • [Main] Fix FusedAdam.use_decoupled_grad mis-set for Megatron-FSDP. by @cspades :: PR: #4427
  • add permute fusion into hybrid ep by @Autumn1998 :: PR: #4089
  • Add ColocatedBridgeCommunicator for heterogeneous TP/DP MIMO training (NMFW-17) by @yashaswikarnati :: PR: #4368
  • Fix incorrect bias display in extra_repr of Column/RowParallelLinear by @HelloWorldBeginner :: PR: #4330
  • Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining by @joapolarbear :: PR: #4276
  • ci: Fix event name reference in CI workflow condition for merge group by @balasaajay :: PR: #4462
  • Add nightly sync workflow from main to dev by @Phlip79 :: PR: #4165
  • fix: handle list-format quant_cfg from ModelOpt PR #1094 by @ChenhanYu :: PR: #4187
  • ci: also add Run MBridge tests label in nightly sync workflow by @Phlip79 :: PR: #4499
  • [training migration] Add serialization features to config container by @maanug-nv :: PR: #4309
  • Fix conflict with inference graphs by @tdene :: PR: #4504
  • Add tools/prepare_cache.py for offline GPT dataset cache preparation by @asolergi-nv :: PR: #4080
  • [build] fix: move mamba-ssm and causal-conv1d to optional [ssm] extra by @ko3n1g :: PR: #4517
  • mamba: avoid redundant HBM reloads in causal_conv1d_update shift loop by @wdykas :: PR: #4460
  • Standardize misc graph interface by @tdene :: PR: #4485
  • Fix inference graph override in RL flow by @tdene :: PR: #4323
  • Unify and refactor Megatron-FSDP documentation. by @cspades :: PR: #4418
  • Revert "ci: add base_sha to codecov/codecov-action upload step (#4445)" by @chtruong814 :: PR: #4526
  • Skills for running unit tests and working with slurm by @yashaswikarnati :: PR: #4502
  • Reorganize order of operations in inference context and text generation controller by @tdene :: PR: #2929
  • ci: Update CI workflow conditions to include merge group handling by @balasaajay :: PR: #4532
  • ci: add base_sha to codecov/codecov-action upload step by @chtruong814 :: PR: #4540
  • Fix release tests: remove --global-batch-size conflicting with --step-batch-size-schedule by @deepakn94 :: PR: #4545
  • docs: use @file-path notation for file references in skills by @ko3n1g :: PR: #4542
  • Support YAML quant recipe in PTQ and remove first/last layer modifier code by @jenchen13 :: PR: #4503
  • Avoid nsys profile crash with CUDA graphs by @tdene :: PR: #4541
  • fix(ci): add retry with backoff to approve-test-queue bot by @ko3n1g :: PR: #4559
  • New allgathervdispatcher for inference and simplify old dispatcher. by @sidsingh-nvidia :: PR: #4258
  • Fixes for modelopt examples and SFTTokenizer for transformers v5 by @jenchen13 :: PR: #4450
  • Adding code for Flextron by @sheliang-nv :: PR: #4429
  • Fix partial cudagraphs + HybridEP not properly triggering DDP hook by @jiemingz :: PR: #4500
  • Ignore pytorch link anchors by @maanug-nv :: PR: #4582
  • MoE dispatcher fixes: size NVLS dispatcher buffers from actual tensor sizes by @mathemakitten :: PR: #4576
  • Finalize all builders in preprocess_data, not just the last key by @sayalinvidia :: PR: #4573
  • refactor(skills): add when_to_use frontmatter, split ci-test-system, enforce skill workflow by @ko3n1g :: PR: #4574
  • Make last_token_logits graphable by @tdene :: PR: #4552
  • fix(ci): correct off-by-one in total_steps_evaluated formula by @ko3n1g :: PR: #4591
  • Add fault injection support via nvidia_resiliency_ext. by @hexinw-nvidia :: PR: #4370
  • Guard vocab reduce_scatter on TP > 1 by @mathemakitten :: PR: #4565
  • Move inference context bookkeeping to CPU with ContextGPUView by @lmcafee-nvidia :: PR: #4306
  • Enable InJob restart on failures. by @hexinw-nvidia :: PR: #4594
  • Enable shared expert overlap with allgatherv in inference by @sidsingh-nvidia :: PR: #4570
  • Add vLLM grouped gemm backend for MoE inference by @santhnm2 :: PR: #4566
  • Move KD teacher loading to after Float16Module by @AAnoosheh :: PR: #4394
  • ci: update gpt3_7b_tp4_pp1_memory_speed gb200 golden values by @ko3n1g :: PR: #4601
  • Fix inference unit test by @maanug-nv :: PR: #4589
  • Checkpoint conversion between GPT_model and Hybrid_model by @guihong-nv :: PR: #4482
  • ci: add cadence input for test filtering in CI workflows by @balasaajay :: PR: #4561
  • Fix mtp_use_repeated_layer behavior for GPT models by @rkarimimahab :: PR: #3965
  • Handle SSM sharded tensor merge OOM with CPU fallback by @returnL :: PR: #4442
  • FlashInfer sampling by @tdene :: PR: #2456
  • Fix main2dev workflow by @Phlip79 :: PR: #4610
  • Add logic to enable chunked MLP during training by @pengdurice :: PR: #3656
  • Inference bug-fixes: Re-enable EP syncs for the legacy A2A dispatcher and re-simplify ep_sync accidentally reverted by #4306 by @sidsingh-nvidia :: PR: #4587
  • Remove invalid timeout argument for dist.barrier by @zhaoyinglia :: PR: #4512
  • Fix buffers in refit by @wdykas :: PR: #4580
  • Named validation sets by @RPrenger :: PR: #4578
  • Fix Hang in tests by @wdykas :: PR: #4575
  • Single commit for main2dev nightly by @Phlip79 :: PR: #4614
  • convert tokenizer args to config by @dimapihtar :: PR: #4406
  • Siddharth/fix ep sync by @wdykas :: PR: #4607
  • mmiranda working on another set of broken links by @megnvidia :: PR: #4534
  • Fix gradient corruption with layerwise param all-gather overlap by @deepakn94 :: PR: #4609
  • test: mark TestFusedApplyMLARope::test_forward_backward_for_q flaky_in_dev by @ko3n1g :: PR: #4639
  • remove legacy GPT code by @dimapihtar :: PR: #4322
  • ci: introduce L-tier scope vocabulary via parser by @balasaajay :: PR: #4625
  • Inference: Tune vLLM grouped gemm, moe_sum kernel, and enable shared expert overlap in latent MoEs by @sidsingh-nvidia :: PR: #4603
  • Fix crash involving evicted requests and tpot by @tdene :: PR: #4645
  • remove legacy transformer and modules by @dimapihtar :: PR: #4207
  • chore: Update Docker image version to 26.04-py3 by @balasaajay :: PR: #4611
  • Propagate errors for failed inference requests by @mathemakitten :: PR: #4679
  • Inference: Cache input + position ID views by @mathemakitten :: PR: #4634
  • ci: Update Gitlab base image to 26.04 pytorch by @chtruong814 :: PR: #4688
  • Add periodic GPU sniff tests to detect hardware stragglers by @deepakn94 :: PR: #4662
  • ci: Bump GHA versions by @chtruong814 :: PR: #4606
  • build: widen flashinfer-python pin to <0.7.0 by @ko3n1g :: PR: #4700
  • Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len by @Shreyas-S-809 :: PR: #4094
  • Switch oncall by @janEbert :: PR: #4702
  • Update golden values for various functional tests by @balasaajay :: PR: #4703
  • chore: Update golden values for various functional tests by @balasaajay :: PR: #4706
  • build: upgrade mamba-ssm to 2.3.2.post1, causal-conv1d to 1.6.2.post1 by @ko3n1g :: PR: #4712
  • ci: replace uuidgen with /proc/sys/kernel/random/uuid by @ko3n1g :: PR: #4714
  • chore(codeowners): add megatron/inference/ ownership by @ko3n1g :: PR: #4704
  • Create a Protocol for the MLP layer of TransformerLayer by @nschank :: PR: #3435
  • Revert "Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094)" by @ko3n1g :: PR: #4718
  • Add Python-side guardrail for DeepEP IB limits by @janEbert :: PR: #4719
  • ci: revert bad uv.lock bump and label future bumps with Run functional tests by @ko3n1g :: PR: #4730
  • [ci] fix: treat cancelled run-main-script step as failure by @ko3n1g :: PR: #4727
  • ci: Major refactor of release-workflows by @ko3n1g :: PR: #4602
  • build(deps): bump nvidia-modelopt to 0.43 by @ko3n1g :: PR: #4723
  • fix(fsdp): recognize legacy GDN TP metadata by @Glitchfix :: PR: #4664
  • Fixes for Nemotron3 Super release test config by @maanug-nv :: PR: #4544
  • feat(gpt): add output postprocess hook by @Glitchfix :: PR: #4686
  • Add bump-base-image skill and update golden value comparison by @balasaajay :: PR: #4733
  • Guard omegaconf imports by @maanug-nv :: PR: #4685
  • Fix a regression introduced by #4625 for nightly runs by @balasaajay :: PR: #4734
  • Add LLaVA audio (sound) model support by @cuichenx :: PR: #4402
  • Support transformers 5.x.x for text generation server by @tdene :: PR: #4732
  • Update transformer-engine dependency to version 2.15.0 by @balasaajay :: PR: #4682
  • Increase CG cover from max_requests to max_tokens by @tdene :: PR: #4214
  • fully remove legacy code by @dimapihtar :: PR: #4759
  • fix legacy torch save when tensor_model_parallel_size > expert_model_parallel_size * expert_tensor_parallel_size by @dimapihtar :: PR: #4678
  • Wire --rl-inference-parsers into MRL by @tdene :: PR: #4768
  • Integrate LayerWiseDistributedOptimizer with DDP buffer infrastructure by @deepakn94 :: PR: #4509
  • [training migration] Migrate mamba builder by @maanug-nv :: PR: #4550
  • NCCL UB fix: reduce memory cost and correctly deregister NCCL mem pool by @xrennvidia :: PR: #4492
  • fix: use no_mask in local ViT layer spec by @Phlip79 :: PR: #4395
  • refit clean up and refactoring by @wdykas :: PR: #4762
  • Support recomputing in HybridModel by @xuantengh :: PR: #4496
  • Make weight and optimizer memory estimation take into account expert parallelism correctly by @YangFei1990 :: PR: #4687
  • One single flag that determines if we are in inference by @tdene :: PR: #4617
  • [main] feat(moe): Support packed sequence for gated delta net (GDN) by @yuzhongw-nvidia :: PR: #2645
  • remove dead manual_release_grads code path in 1F1B overlap schedule by @Wohox :: PR: #4511
  • Fix recompute checkpointing + training CGs by @tdene :: PR: #3919
  • Use Protocols to type-check linear_proj submodules of Attention by @nschank :: PR: #3434
  • fix tokenizers with respect to newer transformers by @dimapihtar :: PR: #4608
  • Bump nvidia-modelopt>=0.44.0 by @kevalmorabia97 :: PR: #4803
  • Update owners by @Phlip79 :: PR: #4794
  • ci: Update workflow to use same commit for building docker image and running tests by @balasaajay :: PR: #4787
  • chore: Update nightly tests golden values by @balasaajay :: PR: #4805
  • Inference: Optimize Prefill Engine Steps for Nemotron by @sidsingh-nvidia :: PR: #4764
  • ci: tolerate git-gc race in /home/runner chown after checkout by @balasaajay :: PR: #4808
  • additional tests for nvrx by @dimapihtar :: PR: #4522
  • Disable MSC by default; opt in via --enable-msc by @asolergi-nv :: PR: #4629
  • Strengthen test_checkpoint to verify distributed checkpoint behavior by @lichenlu :: PR: #4711
  • Combine GEMM + SwiGLU fused MLP PRs (3890, 4071, 4095, 4219, 4311, 4324) → main by @Connor-XY :: PR: #4636
  • [fix] Use MSC for checking checkpoint existence by @pavelgein :: PR: #4251
  • [Main][feat] Support A2A Overlap for Megatron-FSDP by @Wohox :: PR: #3797
  • Reorder mtp_post_process after attention backward in 1F1B schedule plan by @gdengk :: PR: #4695
  • add is_torch_min_version in fsdp src by @xrennvidia :: PR: #4812
  • Add high-priority A2A stream and HybridEP preprocessing SMs by @gdengk :: PR: #4694
  • Refactor CUDA graph API: decompose cuda_graph_scope into full_iteration impl, inference scope, and per-layer capture modules by @buptzyb :: PR: #4292
  • Tokenizers updates by @dimapihtar :: PR: #4780
  • Fix no nvrx tests by @dimapihtar :: PR: #4847
  • Thread custom process groups through MoE grad finalization by @yashaswikarnati :: PR: #4782
  • Fix unit tests by @shanmugamr1992 :: PR: #4689
  • Tests/dynamic inference functional coverage by @shanmugamr1992 :: PR: #4761
  • Fix oncall references by @janEbert :: PR: #4722
  • Update golden values for nightly functional tests by @balasaajay :: PR: #4850
  • fix(inference): size DynamicInferenceContext KV layer_map for non-uniform PP by @athitten :: PR: #4775
  • Modernize post-training modelopt example scripts by @kevalmorabia97 :: PR: #4807
  • test: add inference performance test harness for GPT 583M, hybrid 2B,… by @shanmugamr1992 :: PR: #4806
  • ci: Gate optional CI jobs with repository variables by @chtruong814 :: PR: #4907
  • Fix tokenizers bug in nightly by @Phlip79 :: PR: #4833
  • ci: Prevent shell trace in parts of _run_training.sh by @chtruong814 :: PR: #4884
  • Ignore Vim swap files by @wujingyue :: PR: #4860
  • M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs by @shjwudp :: PR: #4181
  • MimoOptimizer: fix distributed checkpoint save and load for non-colocated MIMO by @kamran-nvidia :: PR: #4801
  • Route non-Muon params through DistributedOptimizer by @deepakn94 :: PR: #4771
  • Allow optimizer CG to share the same pool as full-iter CG by @nanz-nv :: PR: #4698
  • Use sharded_state_dict_default in MLP.sharded_state_dict by @gdengk :: PR: #4693
  • Update PR template by @Phlip79 :: PR: #4904
  • Fix MTP recompute crash with packed sequences by @BestJuly :: PR: #4593
  • ci: Update perf test to output logs for tests to pass by @chtruong814 :: PR: #4906
  • Also persist asymmetrical units for the MXFP8 transpose weight buffer. by @cspades :: PR: #4852
  • fix no_shard training convergence and add unit test for no_shard by @wplf :: PR: #3754
  • Move policy epoch stats to the message object by @ArEsKay3 :: PR: #4533
  • Add a knob to throttle the max allowed inflight offload in fine grained offloading by @nanz-nv :: PR: #4692
  • refactor(data): consolidate get_batch and enable PP for SFT THD by @asolergi-nv :: PR: #4103
  • Allow YAML MoE configs to use model specs by @chawkins-nvidia :: PR: #4822
  • Move bert and t5 pretrain files by @Phlip79 :: PR: #4820
  • Paged Stashing by @nanz-nv :: PR: #4247
  • make FP4 param gather work with the mixed precisions in NVFP4 recipe by @xrennvidia :: PR: #4358
  • fix: Fix multi-node functional test phase sync by @chtruong814 :: PR: #4924
  • Perf tests by @shanmugamr1992 :: PR: #4917
  • fix(cuda_graphs): handle TE 2.15 removal of FP8GlobalStateManager.set_skip_fp8_weight_update_tensor by @balasaajay :: PR: #4874
  • Fix paged stashing test submodules lookup by @Phlip79 :: PR: #4925
  • Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (#4318) by @sraman-rgb :: PR: #4786
  • Fix mxfp8 param gather numerical issue when DP overlap is off by @WanZzzzzz :: PR: #4800
  • [MXFP8/FP4-param-gather] Post processing after forced param AG in eval by @WanZzzzzz :: PR: #4562
  • ci: Update training script paths in BERT and T5 by @balasaajay :: PR: #4939
  • Various training utils by @maanug-nv :: PR: #4872
  • ci: restore perf test torchrun logs by @chtruong814 :: PR: #4951
  • Fix get_batch return order to ignore BlendedDataset provenance fields by @deepakn94 :: PR: #4952
  • test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval by @ko3n1g :: PR: #4932
  • test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture by @ko3n1g :: PR: #4931
  • Avoid offsetting functional test master port by @chtruong814 :: PR: #4973
  • Fix elastification unwrap_model import by @Devil1716 :: PR: #4972
  • test: re-enable paged stashing MoE tests by @ko3n1g :: PR: #4978
  • test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node by @ko3n1g :: PR: #4984
  • ci: Add support for MBridge job gating based on PR labels by @balasaajay :: PR: #4926
  • test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ by @ko3n1g :: PR: #4985
  • fix(tests): initialize num_microbatches calculator in vision cudagraph tests by @ko3n1g :: PR: #4986
  • ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies by @balasaajay :: PR: #4905
  • Drain predecessor reduce-scatter at dispatch time by @deepakn94 :: PR: #4940
  • nightly(ci): Update golden values for functional t5 tests by @balasaajay :: PR: #4995
  • [main] Refactor and Improve MoE Logging init commit by @yanring :: PR: #3431
  • ci: validate release branch-rules by @ko3n1g :: PR: #4929
  • [Megatron-FSDP] Add conditional param.grad dereferencing logic to support full-iteration (FWD-BWD) CUDA graphability. by @cspades :: PR: #4663
  • test: restrict iter-time comparison to steady-state window by @ko3n1g :: PR: #5010
  • fix(test): pin eval-global-batch-size on 15b gb200 release configs by @ko3n1g :: PR: #5022
  • [fix] Release MTP assertion when EP overlap with PP=1 by @Wohox :: PR: #4796
  • fix(test): widen iter-time steady-state window for short tests by @ko3n1g :: PR: #5023
  • Perf fix by @shanmugamr1992 :: PR: #4996
  • Add dev-feature preservation gate and change schedule by @Phlip79 :: PR: #4773
  • chore(test): remove orphan nemotron3_super_release_g200 dir by @ko3n1g :: PR: #5024
  • Ignore Claude worktree directory by @Phlip79 :: PR: #5020
  • ci: update CI workflow conditions for integration tests by @balasaajay :: PR: #4658
  • Add NVSkills CI request workflow by @Phlip79 :: PR: #5033
  • DDP wrap pg size fixes by @maanug-nv :: PR: #5006
  • fix(layer_wise): tag MTP-stage word_embeddings as is_embedding_or_output_parameter by @Wohox :: PR: #5034
  • Move LTS dependencies from pyproject.toml to Dockerfile.ci.lts by @balasaajay :: PR: #4877
  • Use shared ModelOpt calibration loop on 0.45+ with 0.44 fallback fix by @kevalmorabia97 :: PR: #4881
  • test(release): skip golden comparison on intermediate resume windows by @ko3n1g :: PR: #5040
  • [mimo] Thread position_ids through MimoModel for multimodal RoPE by @liding-nv :: PR: #4938
  • build: Switch DSv3 on H100 to HybridEP by @ko3n1g :: PR: #5039
  • Fix: Import unwrap_model from megatron.core.utils in modelopt examples by @kevalmorabia97 :: PR: #5045
  • Simple and stable Inference APIs by @YangFei1990 :: PR: #4697
  • ci: Add notification step for MBridge downstream test results by @balasaajay :: PR: #5028
  • Delete output tensor early by @Phlip79 :: PR: #4742
  • Support ScaledSReLU in TE grouped MLP fuser by @sraman-rgb :: PR: #4859
  • Skip gradient updates when grad norm exceeds threshold by @yfw :: PR: #3460
  • Add 9 user skills by @Phlip79 :: PR: #5066
  • test(nemotron): align nemotron3 super GB200 goldens with exit-interval 4768 by @ko3n1g :: PR: #5069
  • chore: Update transformer-engine dependency to version 2.16.0 by @balasaajay :: PR: #4992
  • Update energon version requirement by @maanug-nv :: PR: #4572
  • Fix test failures for new inference APIs by @YangFei1990 :: PR: #5068
  • fix(ci): set PYTHONUNBUFFERED=1 in JET workload env by @ko3n1g :: PR: #5072
  • Preserve non-FSDP-unit buckets across AllGatherPipeline reset by @wujingyue :: PR: #4717
  • Add opt-in MXFP8 LM-head output projection by @gdengk :: PR: #4825
  • fix(ci): bound JET pipeline polling with a watchdog to prevent indefinite hangs by @ko3n1g :: PR: #5076
  • ci: prune old artifacts on cluster lustre during weekly/release runs by @ko3n1g :: PR: #5084
  • ci(test): isolate ckpt-resume tensorboard per phase by @ko3n1g :: PR: #5074
  • test: unmark EP A2A activation offload test flaky by @lhb8125 :: PR: #5009
  • Change ownership groups by @Phlip79 :: PR: #5021
  • test: skip mfsdp_fully_shard cases when world_size < mesh size by @wujingyue :: PR: #4487
  • fix mimo optimizer checkpoint metadata restore by @liding-nv :: PR: #4791
  • [mimo] Support bridge fan-out for variable modality tokens by @liding-nv :: PR: #5062
  • cp: Remove DeepEP hardware limit check (4846) into core_r0.18.0 by @ko3n1g :: PR: #5126
  • chore: Update transformer-engine dependency to revision 4220403 (#5112) by @balasaajay :: PR: #5137
  • cp: fix(optimizer): gate ChainedOptimizer MXFP8 defer-sync on DDP-level overlap_param_gather (4982) into core_r0.18.0 by @ko3n1g :: PR: #5146
  • cp: build: Switch DSv3 on H100 to HybridEP (5164) into core_r0.18.0 by @ko3n1g :: PR: #5165
  • beep boop 🤖: Bumping Megatron Core to v0.18.2 by @nvidia-megatron-lm-release-bot[bot] :: PR: #5434
  • docs: Fix docs version for 0.18.0 release by @chtruong814 :: PR: #5435
  • beep boop 🤖: Bumping Megatron Core to v0.18.1 by @nvidia-megatron-lm-release-bot[bot] :: PR: #5436
  • Resetting Megatron Core and FSDP patch versions to 0 by @balasaajay :: PR: #5437
  • chore: Bump versions by @ko3n1g
  • fix(ci): replace actions/setup-python with apt-get to avoid 429 rate limits (#4072) by @ko3n1g
  • ci: Fix package name for code-freeze workflow (#4077) by @ko3n1g
  • chore: bump _code_freeze workflow to v0.86.0 (#4078) by @ko3n1g
  • Fix checkpoint inspector (#4079) by @janEbert
  • Update docs to conform to NVIDIA style guides (#4068) by @megnvidia
  • Miscellaneous inference fixes (#4030) by @santhnm2
  • fix fine_grained_callables with fused rmsnorm residual (#4026) by @CarlosGomes98
  • [Main][feat] Support overlapping A2A Combine backprop with wgrad GEMM (#3795) by @Wohox
  • Modify mfsdp default data-parallel-sharding-strategy (#3691) by @wplf
  • Fix fsdp_dtensor conversion for pretrained-only checkpoints (#3912) by @DAISY-gh
  • Guard NVshmem issues (#4093) by @wdykas
  • m-fsdp: wire use_precision_aware_optimizer from ddp_config to ParamAn… (#4024) by @rapatel
  • Megatron-FSDP: Add MXFP8 transpose helper buffer for Hybrid FSDP (#3918) by @shjwudp
  • feat(fsdp): use TE general_gemm for mixed-precision wgrad in FSDP path (#3822) by @Victarry
  • Megatron-FSDP: Make _pre_forward_param_unshard and _register_post_backward_hook formal (#4029) by @shjwudp
  • Megatron-FSDP: Fix insufficient double buffers during gradient reduce (#4054) by @shjwudp
  • Fix M-FSDP MXFP8 related BUGs (#3991) by @shjwudp
  • FIX: Use decoupled gradients for precision-aware M-FSDP grad norm (#3746) by @XueSongTap
  • [Megatron-FSDP] Fix compatibility with frozen parameters and add unit tests (#3287) by @shjwudp
  • Align chat completions endpoint with vLLM (#4063) by @santhnm2
  • [M-FSDP] Refactor uneven dtensor to full tensor and add UT (#3190) by @shjwudp
  • Add agent instruction files (#4102) by @Phlip79
  • Bump eopt version (#4100) by @skyw
  • Refactor emerging optimizer integration (#4113) by @skyw
  • Fix over provisioning of Mamba state memory when max_requests is set (#4114) by @santhnm2
  • base strategy simplification (#4001) by @dimapihtar
  • add support for DCP and FSDP async save (#4027) by @dimapihtar
  • Add more emerging optimizers (#3907) (#4119) by @skyw
  • Fix FSDP checkpoint conversion and loading for Qwen3.5-VL (#3936) by @DAISY-gh
  • docs: update mcore optimizer docstrings to google style (#2799) by @Akshat8510
  • Update oncall schedule (#4117) by @Phlip79
  • Set tensor-parallel attributes irrespective of perform_initialization (#4084) by @ilml
  • docs: add developer-guide skill with CI/CD and failure navigation guidance (#4035) by @ko3n1g
  • chore: Move skills (#4136) by @ko3n1g
  • ci: Let Claude react to comment (#4135) by @ko3n1g
  • Nemotron3 Super GB200 release config (#4118) by @maanug-nv
  • Enable CUDA graph for ADAM optimizer (#3429) by @vasunvidia
  • Claude review should recommend testing (#4137) by @Phlip79
  • cleanup: remove unused scatter_gather_tensors_in_pipeline argument (#4140) by @Phlip79
  • fix: Remove fail-fast (-x) and guard distributed teardown against deadlock (#4139) by @ko3n1g
  • chore(beep boop 🤖): Bump (main) (2026-04-06) by @github-actions[bot]
  • Claude: add respond-to-issue skill (#4141) by @Phlip79
  • Fix muon getter backward compatibility (#4157) by @skyw
  • Audit of user guide (#4098) by @megnvidia
  • Fix RerunStateMachine crash (TypeError: 'NoneType' object is not subscriptable) by not saving a checkpoint after a transient NaN / Inf (#3981) by @yezhengmao1
  • Preserve type of decorated methods/classes (#4062) by @nschank
  • update muon test case to use new interface (#4163) by @skyw
  • [M-FSDP] Fix Tensor Parallel mode detection (#3191) by @shjwudp
  • fix: remove weights_only=False for multimodal example (#4104) by @faradawn
  • Cudagraphs: Fix sequence packing segfault more generally (#4162) by @mathemakitten
  • Make MTP work with materialize_only_last_token_logits (#4166) by @santhnm2
  • Add unit test for Mamba EP inference (eager fallback with mixed CUDA graphs) (#4085) by @santhnm2
  • update docs in respect to async changes (#4177) by @dimapihtar
  • update checkpointing docs in respect to async changes (#4208) by @dimapihtar
  • chore: improve build-and-test skill with trigger rules and dependency workflow (#4199) by @ko3n1g
  • Fix layerwise optimizer with expt_dp_size=1 and contention with element-wise distributed optimizer (#4138) by @skyw
  • ci: add --cluster-a100/h100/gb200 args to trigger_internal_ci.py (#4195) by @ko3n1g
  • ci: Update golden values for nightly tests (#4215) by @chtruong814
  • rename async_allgather to overlap_param_gather (#4217) by @skyw
  • Fix Slack sync for users with GitHub email privacy enabled (#4220) by @Phlip79
  • Miscellaneous MTP inference fixes (#4191) by @santhnm2
  • Move inference guards out of arguments.py (#4210) by @mathemakitten
  • Fix: enable fine-grained activation offloading for Mamba model. (#4173) by @fanshiqing
  • bump NVRx (#4178) by @dimapihtar
  • Update tokenizer args for Nemotron3 release config (#4239) by @maanug-nv
  • build: add dynamic git-versioning and drop rc0 pre-release tag (#4212) by @ko3n1g
  • Fix unnecessary permute padding for non-quantized MoE dispatch (#4038) by @xiaoxi-wangfj
  • Fix split state dict main (#3676) by @kunlunl
  • Enable FP8 DPA for MXFP8 recipe (#4066) by @vasunvidia
  • Add /split-pr Claude Code command for splitting PRs by CODEOWNERS (#4160) by @Phlip79
  • Enable AG/RS overlap with explicit process group passing (#3249) by @jeffnvidia
  • Enable cpu_offloading with Full iteration CUDA graph (#3969) by @vasunvidia
  • Fix TransformerConfig validation for mixed dense/MoE upcycling (#3647) by @rkteddy
  • Remove cross-rank synchronization during checkpoint load & deprecate torch.distributed.checkpoint.state_dict_loader.load_state_dict (#2864) by @asolergi-nv
  • Fix incorrectly set decoupled_grad and DistOpt mechanics for MFSDP. (#4133) by @cspades
  • Refit Miscellaneous (#3973) by @wdykas
  • Add conditions_embeddings argument to TransformerBlock, TransformerLayer for DiT (diffusion transformer) (#4134) by @huvunvidia
  • Fix build_sequences_per_dataset output path arg usage (#4144) by @DhineshPonnarasan
  • ci: Flush pending CUDA work before the barrier in destroy_model_parallel (#4259) by @chtruong814
  • Update oncall schedule (#4257) by @Phlip79
  • docs(moe): Update MoE README (#3664) by @sbhavani
  • Revert "Add conditions_embeddings argument to TransformerBlock, Trans… (#4270) by @ko3n1g
  • reduce the number of shared expert streams (#3752) by @yangbofun
  • remove legacy Bert code (#4204) by @dimapihtar
  • [Main] Feat(moe): Gated delta net context parallel (CP) (#2642) by @yuzhongw-nvidia
  • remove t5 legacy code (#4203) by @dimapihtar
  • fix: handle list-typed process groups in ProcessGroupCollection.repr (#3753) by @cluster2600
  • Fix Context Parallelism documentation link (#4149) by @liangxs
  • [MLA] fix: Pad V when Q/V head dims differ for THD (#3003) by @HollowMan6
  • Allow the evaluation batch size to differ from the training batch size (#4014) by @michal2409
  • fix(megatron-fsdp): build expt_device_mesh only for MoE models (#3831) by @xuwchen
  • Add @NVIDIA/transformer review group to megatron/core/transformer/ (#4281) by @Phlip79
  • Reset AG_pipeline bucket status after validation step. (#3155) by @vasunvidia
  • Enhance and fix NVTX for training (#3642) by @yaox12
  • NVFP4 native weights for DDP (#4005) by @WanZzzzzz
  • Remove unnecessary arguments for layerwise distributed optimizer (#4272) by @FDecaYed
  • reuse grad buffer for layer-wise param allgather (#3751) by @FDecaYed
  • feat(ci): add strict review mode to Claude review workflow (#4197) by @Victarry
  • Fix stale approvals (#4280) by @Phlip79
  • [MoE] Add a new score function to the router (#3673) by @yaox12
  • [MoE] Improvement of shared expert overlap, support shared expert overlap for FlexDispatcher (#2207) by @Victarry
  • build: bump DeepEP to 34152ae (#4228) by @ko3n1g
  • ci: mark test_fused_indexer_loss_gradient_tp_consistency as flaky_in_dev (#4299) by @ko3n1g
  • Fix typo in PR4133. (#4277) by @cspades
  • ci: add retry loop to apt-get update to handle transient mirror sync failures (#4209) by @ko3n1g
  • remove legacy biencoder and realm models (#4205) by @dimapihtar
  • fix: enforce correct pass thresholds for deterministic and approximate tests (#4238) by @ko3n1g
  • ci: add configurable launcher support for functional tests (ft_launcher / torchrun) (#4298) by @ko3n1g
  • chore: document --target main for local Docker builds (#4307) by @ko3n1g
  • Extract args init to launch scripts (#4225) by @maanug-nv
  • [Main] Fix TE version check for retain_pinned_cpu_buffers in cpu offload (#4267) by @BestJuly
  • Fix documented shape (#3486) by @janEbert
  • ci: add sync-skills workflow, rename CLAUDE.md → AGENTS.md, move .claude/skills → skills/ (#4303) by @ko3n1g
  • chore(beep boop 🤖): symlink skills/ → .claude/skills, .agents/skills and AGENTS.md → CLAUDE.md by @github-actions[bot]
  • Get device correctly when module returns a dict instead of individual tensor (#4265) by @shifangx
  • remove vision legacy code (#4202) by @dimapihtar
  • feat: long convergence resiliency for release tests (#4335) by @ko3n1g
  • ci(action): improve GitHub Actions output UX (#4337) by @ko3n1g
  • build: bump TransformerEngine to release_v2.14 (#4331) by @ko3n1g
  • M4 leftover for TE cuda graph (#3137) by @shifangx
  • feat: add create-issue skill (#4338) by @ko3n1g
  • Set megatron-fsdp to 0.5.0 by @ko3n1g
  • fix: wait for async P2P send before deallocating output tensor (#4047) by @ZhiyuLi-Nvidia
  • ci(gb200): add 1-node mr-github functional test variants (#4334) by @ko3n1g
  • Fix potential coredump issue that occurs when saving a checkpoint (#1871) by @ezioliao
  • Port DeepSeek Sparse Attention to MambaModel (#3553) by @janEbert
  • docs: bump versions1.json to 0.17.0 (latest) (#4360) by @ko3n1g
  • Add tables and histogram for RL staleness (#4097) by @tdene
  • [docs] ci: use parent-relative json_url for version picker (#4367) by @ko3n1g
  • Fix bug with non-partial rollouts (#3964) by @tdene
  • Add QK layernorm support for dot-product attention in MambaModel (#4067) by @Phlip79
  • Docs: improve docstrings and comments in example training loop (#4041) by @DhineshPonnarasan
  • feat(ckpt): add --async-ckpt-use-cpu-shm argument (#4355) by @sbak5
  • cp: Fix UT timeout (#4310) (#4373) by @chtruong814
  • Fix RL reward due to stop token (#4096) by @tdene
  • FA4 Inference (#4186) by @wdykas
  • Make param_index_map always use unpacked (full numel) offsets (#4328) by @deepakn94
  • Add activation logging and tokens per expert logging (#3842) by @Mellonta
  • Fix RL to once again work with --skip-train (#4249) by @tdene
  • Fix Megatron initialization with extra_args_provider (#4327) by @santhnm2
  • Rename MambaModel/MambaStack to HybridModel/HybridStack (#4099) by @Phlip79
  • chore(beep boop 🤖): Bump (main) (2026-04-20) by @github-actions[bot]
  • fix(ci): wrap uv install in retry block (#4387) by @ko3n1g
  • Call save_checkpoint_and_time() when saving checkpoint and compute elapsed duration for saving checkpoint before logging timer (#4263) by @awsankur
  • refactor(tests): move NCCL env vars from docker launcher to shell training script (#4390) by @ko3n1g
  • Remove packed_attention_mask unused parameter (#3859) by @tdene
  • Second batch of audit edits (#4115) by @megnvidia
  • Replace rampup batch size scheduler with custom step batch size schedules (#3779) by @mkhona-nvidia
  • revert: replace rampup batch size scheduler with custom step batch size schedules (#3779) (#4404) by @ko3n1g
  • Megatron-FSDP: log mcore detection only after imports succeed (#4400) by @wujingyue
  • Replace rampup batch size scheduler with custom step batch size schedules (#4411) by @deepakn94
  • ci(gb200): re-enable tunable_overlap 1-node mr-github test (#4405) by @ko3n1g
  • Fix local docs building (#4416) by @Phlip79
  • RL: Onload optimizer after logprobs computation (#4235) by @tdene
  • Add RL token throughput and packing metrics (#3877) by @tdene
  • ci: remove publish:merge_into_dev job (#4421) by @ko3n1g
  • docs: add data loading best practices for large-scale training (#4236) by @sbhavani
  • Fix: Auto enable manual registration and enhance the documentation (#3295) by @youngeunkwon0405
  • Fix nvtx_decorator to check _nvtx_enabled at call time (#4184) by @minitu
  • fix merges_file typo in megatron_hf_tokenizer (#4392) by @chelseajohn
  • Enable NullTokenizer for pretraining to reduce I/O access (#4057) by @asolergi-nv
  • docs: Add SECURITY.md (#4431) by @chtruong814
  • Mamba inference opt (#4414) by @wdykas
  • DDP refactoring: Extract parameter layout computation into optimizer classmethod (#3812) by @deepakn94
  • Update PR template with explicit request for issue (#4409) by @Phlip79
  • Misc inference fixes (#4397) by @sidsingh-nvidia
  • Rename Mamba to Hybrid outside megatron/core (#4159) by @Phlip79
  • Include mtp layers in token per expert logging (#4412) by @Mellonta
  • fix: NVRx async compatibility and defer resiliency import (#4420) by @sbak5
  • ci: add base_sha to codecov/codecov-action upload step (#4445) by @ko3n1g
  • fix(checkpoint_inspector): allow empty --param-to-param-group-map-json (#4403) by @DAISY-gh
  • Add the YARN support for hybrid_model (#4244) by @guihong-nv
  • [training migration] Add container class for config dataclasses (#4227) by @maanug-nv
  • Inference: Fix broken functional tests on gitlab (#4454) by @sidsingh-nvidia
  • SafeUnpickler class for safe pickle usage (#4319) by @dimapihtar
  • get rid of weights_only=False (#4434) by @dimapihtar
  • Inference | Per-block MoE routing storage for prefix caching (#4301) by @lmcafee-nvidia
  • Add troubleshooting tip for 'access forbidden' (#4449) by @balasaajay
  • Fix checkpoint loading with rerun state machine (#4448) by @YangFei1990
  • Add misc CUDA graph sugar to CudaGraphManager (#4425) by @tdene
  • Inference: Add the embedding and output layer in the full_iteration_inference cuda graph scope for hybrid models (#4440) by @sidsingh-nvidia
  • Important bugfixes in local CG implementation that were leading to loss curve gaps for latent MoE models (#4433) by @jiemingz
  • fix: Replace polynomial rolling hash with SHA-256 for prefix caching (#4158) by @lmcafee-nvidia
  • feat(ckpt): expose validate_access_integrity knob on dist-ckpt load (#4422) by @asolergi-nv
  • Fix multivalidation (#3388) by @RPrenger
  • Add missing knob for reduce_scatter_with_fp32_accumulation (#4410) by @WanZzzzzz
  • Enable CUDA graphs for MTP inference (#4260) by @santhnm2
  • chore(beep boop 🤖): Bump (main) (2026-04-27) by @github-actions[bot]
  • checkpoint integrity verification (#4305) by @dimapihtar
  • Fix cache gating (#4455) by @wdykas
  • [Main] Fix FusedAdam.use_decoupled_grad mis-set for Megatron-FSDP. (#4427) by @cspades
  • add permute fusion into hybrid ep (#4089) by @Autumn1998
  • Add ColocatedBridgeCommunicator for heterogeneous TP/DP MIMO training (NMFW-17) (#4368) by @yashaswikarnati
  • Fix incorrect bias display in extra_repr of Column/RowParallelLinear (#4330) by @HelloWorldBeginner
  • Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining (#4276) by @joapolarbear
  • ci: Fix event name reference in CI workflow condition for merge group (#4462) by @balasaajay
  • Add manual sync workflow from main to dev (#4165) by @Phlip79
  • fix: handle list-format quant_cfg from ModelOpt PR #1094 (#4187) by @ChenhanYu
  • ci: also add Run MBridge tests label in nightly sync workflow (#4499) by @Phlip79
  • [training migration] Add serialization features to config container (#4309) by @maanug-nv
  • Fix conflict with inference graphs (#4504) by @tdene
  • Add tools/prepare_cache.py for offline GPT dataset cache preparation (#4080) by @asolergi-nv
  • [build] fix: move mamba-ssm and causal-conv1d to optional [ssm] extra (#4517) by @ko3n1g
  • mamba: avoid redundant HBM reloads in causal_conv1d_update shift loop (#4460) by @wdykas
  • Standardize misc graph interface (#4485) by @tdene
  • Fix inference graph override in RL flow (#4323) by @tdene
  • Unify and refactor Megatron-FSDP documentation. (#4418) by @cspades
  • Skills for running unit tests and working with slurm (#4502) by @yashaswikarnati
  • Revert "ci: add base_sha to codecov/codecov-action upload step (#4445)" (#4526) by @chtruong814
  • Reorganize order of operations in inference context and text generation controller (#2929) by @tdene
  • ci: Update CI workflow conditions to include merge group handling (#4532) by @balasaajay
  • ci: add base_sha to codecov/codecov-action upload step (#4540) by @chtruong814
  • Fix release tests: remove --global-batch-size conflicting with --step-batch-size-schedule (#4545) by @deepakn94
  • docs: use @file-path notation for file references in skills (#4542) by @ko3n1g
  • Support YAML quant recipe in PTQ and remove first/last layer modifier code (#4503) by @jenchen13
  • Avoid nsys profile crash with CUDA graphs (#4541) by @tdene
  • fix(ci): add retry with backoff to approve-test-queue bot (#4559) by @ko3n1g
  • New allgathervdispatcher for inference and simplify old dispatcher. (#4258) by @sidsingh-nvidia
  • Fixes for modelopt examples and SFTTokenizer for transformers v5 (#4450) by @jenchen13
  • Adding code for Flextron (#4429) by @sheliang-nv
  • Fix partial cudagraphs + HybridEP not properly triggering DDP hook (#4500) by @jiemingz
  • Ignore pytorch link anchors (#4582) by @maanug-nv
  • MoE dispatcher fixes: size NVLS dispatcher buffers from actual tensor sizes (#4576) by @mathemakitten
  • Finalize all builders in preprocess_data, not just the last key (#4573) by @sayalinvidia
  • refactor(skills): add when_to_use frontmatter, split ci-test-system, enforce skill workflow (#4574) by @ko3n1g
  • Make last_token_logits graphable (#4552) by @tdene
  • fix(ci): correct off-by-one in total_steps_evaluated formula (#4591) by @ko3n1g
  • Add fault injection support via nvidia_resiliency_ext. (#4370) by @hexinw-nvidia
  • Guard vocab reduce_scatter on TP > 1 (#4565) by @mathemakitten
  • Move inference context bookkeeping to CPU with ContextGPUView (#4306) by @lmcafee-nvidia
  • Enable InJob restart on failures. (#4594) by @hexinw-nvidia
  • Enable shared expert overlap with allgatherv in inference (#4570) by @sidsingh-nvidia
  • Add vLLM grouped gemm backend for MoE inference (#4566) by @santhnm2
  • Move KD teacher loading to after Float16Module (#4394) by @AAnoosheh
  • ci: update gpt3_7b_tp4_pp1_memory_speed gb200 golden values (#4601) by @ko3n1g
  • Fix inference unit test (#4589) by @maanug-nv
  • Checkpoint conversion between GPT_model and Hybrid_model (#4482) by @guihong-nv
  • ci: add cadence input for test filtering in CI workflows (#4561) by @balasaajay
  • Handle SSM sharded tensor merge OOM with CPU fallback (#4442) by @returnL
  • Fix mtp_use_repeated_layer behavior for GPT models (#3965) by @rkarimimahab
  • FlashInfer sampling (#2456) by @tdene
  • Fix main2dev workflow (#4610) by @Phlip79
  • Add logic to enable chunked MLP during training (#3656) by @pengdurice
  • Inference bug-fixes: Re-enable EP syncs for the legacy A2A dispatcher and re-simplify ep_sync accidentally reverted by #4306 (#4587) by @sidsingh-nvidia
  • Remove invalid timeout argument for dist.barrier (#4512) by @zhaoyinglia
  • Fix buffers in refit (#4580) by @wdykas
  • Named validation sets (#4578) by @RPrenger
  • Fix Hang in tests (#4575) by @wdykas
  • Single commit for main2dev nightly (#4614) by @Phlip79
  • convert tokenizer args to config (#4406) by @dimapihtar
  • Siddharth/fix ep sync (#4607) by @wdykas
  • mmiranda working on another set of broken links (#4534) by @megnvidia
  • Fix gradient corruption with layerwise param all-gather overlap (#4609) by @deepakn94
  • test: mark TestFusedApplyMLARope::test_forward_backward_for_q flaky_in_dev (#4639) by @ko3n1g
  • remove legacy GPT code (#4322) by @dimapihtar
  • ci: introduce L-tier scope vocabulary via parser (#4625) by @balasaajay
  • Inference: Tune vLLM grouped gemm, moe_sum kernel, and enable shared expert overlap in latent MoEs (#4603) by @sidsingh-nvidia
  • Fix crash involving evicted requests and tpot (#4645) by @tdene
  • remove legacy transformer and modules (#4207) by @dimapihtar
  • chore: Update Docker image version to 26.04-py3 (#4611) by @balasaajay
  • Propagate errors for failed inference requests (#4679) by @mathemakitten
  • Inference: Cache input + position ID views (#4634) by @mathemakitten
  • ci: Update Gitlab base image to 26.04 pytorch (#4688) by @chtruong814
  • Add periodic GPU sniff tests to detect hardware stragglers (#4662) by @deepakn94
  • ci: Bump GHA versions (#4606) by @chtruong814
  • build: widen flashinfer-python pin to <0.7.0 (#4700) by @ko3n1g
  • Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094) by @Shreyas-S-809
  • Switch oncall (#4702) by @janEbert
  • Update golden values for various functional tests (#4703) by @balasaajay
  • chore: Update golden values for various functional tests (#4706) by @balasaajay
  • build: upgrade mamba-ssm to 2.3.2.post1, causal-conv1d to 1.6.2.post1 (#4712) by @ko3n1g
  • ci: replace uuidgen with /proc/sys/kernel/random/uuid (#4714) by @ko3n1g
  • chore(codeowners): add megatron/inference/ ownership (#4704) by @ko3n1g
  • Create a Protocol for the MLP layer of TransformerLayer (#3435) by @nschank
  • Revert "Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094)" (#4718) by @ko3n1g
  • chore(beep boop 🤖): Bump (main) (2026-05-11) by @github-actions[bot]
  • Add Python-side guardrail for DeepEP IB limits (#4719) by @janEbert
  • ci: revert bad uv.lock bump and label future bumps with Run functional tests (#4730) by @ko3n1g
  • [ci] fix: treat cancelled run-main-script step as failure (#4727) by @ko3n1g
  • ci: Major refactor of release-workflows (#4602) by @ko3n1g
  • fix(fsdp): recognize legacy GDN TP metadata (#4664) by @Glitchfix
  • build(deps): bump nvidia-modelopt to 0.43 (#4723) by @ko3n1g
  • Fixes for Nemotron3 Super release test config (#4544) by @maanug-nv
  • feat(gpt): add output postprocess hook (#4686) by @Glitchfix
  • Add bump-base-image skill and update golden value comparison (#4733) by @balasaajay
  • Guard omegaconf imports (#4685) by @maanug-nv
  • Fix a regression introduced by #4625 for nightly runs (#4734) by @balasaajay
  • Add LLaVA audio (sound) model support (#4402) by @cuichenx
  • Support transformers 5.x.x for text generation server (#4732) by @tdene
  • Update transformer-engine dependency to version 2.15.0 (#4682) by @balasaajay
  • Increase CG cover from max_requests to max_tokens (#4214) by @tdene
  • fully remove legacy code (#4759) by @dimapihtar
  • fix legacy torch save when tensor_model_parallel_size > expert_model_parallel_size * expert_tensor_parallel_size (#4678) by @dimapihtar
  • Wire --rl-inference-parsers into MRL (#4768) by @tdene
  • Integrate LayerWiseDistributedOptimizer with DDP buffer infrastructure (#4509) by @deepakn94
  • [training migration] Migrate mamba builder (#4550) by @maanug-nv
  • NCCL UB fix: reduce memory cost and correctly deregister NCCL mem pool (#4492) by @xrennvidia
  • fix: use no_mask in local ViT layer spec (#4395) by @Phlip79
  • refit clean up and refactoring (#4762) by @wdykas
  • Make weight and optimizer memory estimation take into account expert parallelism correctly (#4687) by @YangFei1990
  • Support recomputing in HybridModel (#4496) by @xuantengh
  • One single flag that determines if we are in inference (#4617) by @tdene
  • [main] feat(moe): Support packed sequence for gated delta net (GDN) (#2645) by @yuzhongw-nvidia
  • remove dead manual_release_grads code path in 1F1B overlap schedule (#4511) by @Wohox
  • Fix recompute checkpointing + training CGs (#3919) by @tdene
  • Use Protocols to type-check linear_proj submodules of Attention (#3434) by @nschank
  • fix tokenizers in respect to newer transformers (#4608) by @dimapihtar
  • Bump nvidia-modelopt>=0.44.0 (#4803) by @kevalmorabia97
  • Update owners (#4794) by @Phlip79
  • ci: Update workflow to use same commit for building docker image and running tests (#4787) by @balasaajay
  • chore: Update nightly tests golden values (#4805) by @balasaajay
  • Inference: Optimize Prefill Engine Steps for Nemotron (#4764) by @sidsingh-nvidia
  • Disable MSC by default; opt in via --enable-msc (#4629) by @asolergi-nv
  • Strengthen test_checkpoint to verify distributed checkpoint behavior (#4711) by @lichenlu
  • Combine GEMM + SwiGLU fused MLP PRs (3890, 4071, 4095, 4219, 4311, 4324) → main (#4636) by @Connor-XY
  • additional tests for nvrx (#4522) by @dimapihtar
  • [fix] Use MSC for checking checkpoint existence (#4251) by @pavelgein
  • ci: tolerate git-gc race in /home/runner chown after checkout (#4808) by @balasaajay
  • Reorder mtp_post_process after attention backward in 1F1B schedule plan (#4695) by @gdengk
  • [Main][feat] Support A2A Overlap for Megatron-FSDP (#3797) by @Wohox
  • add is_torch_min_version in fsdp src (#4812) by @xrennvidia
  • Add high-priority A2A stream and HybridEP preprocessing SMs (#4694) by @gdengk
  • Refactor CUDA graph API: decompose cuda_graph_scope into full_iteration impl, inference scope, and per-layer capture modules (#4292) by @buptzyb
  • chore(beep boop 🤖): Bump (main) (2026-05-18) by @github-actions[bot]
  • Tokenizers updates (#4780) by @dimapihtar
  • Fix no nvrx tests (#4847) by @dimapihtar
  • Thread custom process groups through MoE grad finalization (#4782) by @yashaswikarnati
  • Fix unit tests (#4689) by @shanmugamr1992
  • Tests/dynamic inference functional coverage (#4761) by @shanmugamr1992
  • Fix oncall references (#4722) by @janEbert
  • Update golden values for nightly functional tests (#4850) by @balasaajay
  • fix(inference): size DynamicInferenceContext KV layer_map for non-uniform PP (#4775) by @athitten
  • Modernize post-training modelopt example scripts (#4807) by @kevalmorabia97
  • test: add inference performance test harness for GPT 583M, hybrid 2B,… (#4806) by @shanmugamr1992
  • Fix tokenizers bug in nightly (#4833) by @Phlip79
  • ci: Prevent shell trace in parts of _run_training.sh (#4884) by @chtruong814
  • Ignore Vim swap files (#4860) by @wujingyue
  • M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs (#4181) by @shjwudp
  • MimoOptimizer: fix distributed checkpoint save and load for non-colocated MIMO (#4801) by @kamran-nvidia
  • Route non-Muon params through DistributedOptimizer (#4771) by @deepakn94
  • ci: Gate optional CI jobs with repository variables (#4907) by @chtruong814
  • Allow optimizer CG to share the same pool as full-iter CG (#4698) by @nanz-nv
  • Use sharded_state_dict_default in MLP.sharded_state_dict (#4693) by @gdengk
  • Fix MTP recompute crash with packed sequences (#4593) by @BestJuly
  • Update PR template (#4904) by @Phlip79
  • ci: Update perf test to output logs for tests to pass (#4906) by @chtruong814
  • Also persist asymmetrical units for the MXFP8 transpose weight buffer. (#4852) by @cspades
  • fix no_shard training convergence and add unittest for no_shard (#3754) by @wplf
  • Move policy epoch stats to the message object (#4533) by @ArEsKay3
  • Add a knob to throttle the max allowed inflight offload in fine grained offloading (#4692) by @nanz-nv
  • refactor(data): consolidate get_batch and enable PP for SFT THD (#4103) by @asolergi-nv
  • Allow YAML MoE configs to use model specs (#4822) by @chawkins-nvidia
  • Move bert and t5 pretrain files (#4820) by @Phlip79
  • Paged Stashing (#4247) by @nanz-nv
  • make FP4 param gather work with the mixed precisions in NVFP4 recipe (#4358) by @xrennvidia
  • fix: Fix multi-node functional test phase sync (#4924) by @chtruong814
  • Perf tests (#4917) by @shanmugamr1992
  • fix(cuda_graphs): handle TE 2.15 removal of FP8GlobalStateManager.set_skip_fp8_weight_update_tensor (#4874) by @balasaajay
  • Fix paged stashing test submodules lookup (#4925) by @Phlip79
  • Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (#4318) (#4786) by @sraman-rgb
  • Fix mxfp8 param gather numerical issue when DP overlap is off (#4800) by @WanZzzzzz
  • [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (#4562) by @WanZzzzzz
  • ci: Update training script paths in BERT and T5 (#4939) by @balasaajay
  • Various training utils (#4872) by @maanug-nv
  • ci: restore perf test torchrun logs (#4951) by @chtruong814
  • Fix get_batch return order to ignore BlendedDataset provenance fields (#4952) by @deepakn94
  • test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (#4932) by @ko3n1g
  • test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (#4931) by @ko3n1g
  • chore(beep boop 🤖): Bump (main) (2026-05-25) by @github-actions[bot]
  • Avoid offsetting functional test master port (#4973) by @chtruong814
  • Fix elastification unwrap_model import (#4972) by @Devil1716
  • test: re-enable paged stashing MoE tests (#4978) by @ko3n1g
  • test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (#4984) by @ko3n1g
  • ci: Add support for MBridge job gating based on PR labels (#4926) by @balasaajay
  • test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (#4985) by @ko3n1g
  • fix(tests): initialize num_microbatches calculator in vision cudagraph tests (#4986) by @ko3n1g
  • ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (#4905) by @balasaajay
  • Drain predecessor reduce-scatter at dispatch time (#4940) by @deepakn94
  • nightly(ci): Update golden values for functional t5 tests (#4995) by @balasaajay
  • chore: rotate oncall schedule by @github-actions[bot]
  • [main] Refactor and Improve MoE Logginginit commit (#3431) by @yanring
  • ci: validate release branch-rules (#4929) by @ko3n1g
  • [Megatron-FSDP] Add conditional param.grad dereferencing logic to support full-iteration (FWD-BWD) CUDA graphability. (#4663) by @cspades
  • test: restrict iter-time comparison to steady-state window (#5010) by @ko3n1g
  • [fix] Release MTP assertion when EP overlap with PP=1 (#4796) by @Wohox
  • fix(test): pin eval-global-batch-size on 15b gb200 release configs (#5022) by @ko3n1g
  • fix(test): widen iter-time steady-state window for short tests (#5023) by @ko3n1g
  • Perf fix (#4996) by @shanmugamr1992
  • Add dev-feature preservation gate and change schedule (#4773) by @Phlip79
  • chore(test): remove orphan nemotron3_super_release_g200 dir (#5024) by @ko3n1g
  • Ignore Claude worktree directory (#5020) by @Phlip79
  • Update copy-pr-bot.yaml [skip ci] by @github-actions[bot]
  • ci: update CI workflow conditions for integration tests (#4658) by @balasaajay
  • Add NVSkills CI request workflow (#5033) by @Phlip79
  • DDP wrap pg size fixes (#5006) by @maanug-nv
  • fix(layer_wise): tag MTP-stage word_embeddings as is_embedding_or_output_parameter (#5034) by @Wohox
  • Move LTS dependencies from pyproject.toml to Dockerfile.ci.lts (#4877) by @balasaajay
  • Use shared ModelOpt calibration loop on 0.45+ with 0.44 fallback fix (#4881) by @kevalmorabia97
  • test(release): skip golden comparison on intermediate resume windows (#5040) by @ko3n1g
  • [mimo] Thread position_ids through MimoModel for multimodal RoPE (#4938) by @liding-nv
  • build: Switch DSv3 on H100 to HybridEP (#5039) by @ko3n1g
  • Fix: Import unwrap_model from megatron.core.utils in modelopt examples (#5045) by @kevalmorabia97
  • Simple and stable Inference APIs (#4697) by @YangFei1990
  • ci: Add notification step for MBridge downstream test results (#5028) by @balasaajay
  • Delete output tensor early (#4742) by @Phlip79
  • Support ScaledSReLU in TE grouped MLP fuser (#4859) by @sraman-rgb
  • Skip gradient updates when grad norm exceeds threshold (#3460) by @yfw
  • Add 9 user skills (#5066) by @Phlip79
  • test(nemotron): align nemotron3 super GB200 goldens with exit-interval 4768 (#5069) by @ko3n1g
  • chore: Update transformer-engine dependency to version 2.16.0 (#4992) by @balasaajay
  • Update energon version requirement (#4572) by @maanug-nv
  • Fix test failures for new inference APIs (#5068) by @YangFei1990
  • fix(ci): set PYTHONUNBUFFERED=1 in JET workload env (#5072) by @ko3n1g
  • Preserve non-FSDP-unit buckets across AllGatherPipeline reset (#4717) by @wujingyue
  • Add opt-in MXFP8 LM-head output projection (#4825) by @gdengk
  • chore(beep boop 🤖): Bump (main) (2026-06-01) by @github-actions[bot]
  • fix(ci): bound JET pipeline polling with a watchdog to prevent indefinite hangs (#5076) by @ko3n1g
  • test: unmark EP A2A activation offload test flaky (#5009) by @lhb8125
  • ci: prune old artifacts on cluster lustre during weekly/release runs (#5084) by @ko3n1g
  • ci(test): isolate ckpt-resume tensorboard per phase (#5074) by @ko3n1g
  • Change ownership groups (#5021) by @Phlip79
  • test: skip mfsdp_fully_shard cases when world_size < mesh size (#4487) by @wujingyue
  • fix mimo optimizer checkpoint metadata restore (#4791) by @liding-nv
  • [mimo] Support bridge fan-out for variable modality tokens (#5062) by @liding-nv
  • cp: Remove DeepEP hardware limit check (4846) into core_r0.18.0 (#5126) by @ko3n1g
  • chore: Update transformer-engine dependency to revision 4220403 (#5112) (#5137) by @balasaajay
  • cp: fix(optimizer): gate ChainedOptimizer MXFP8 defer-sync on DDP-level overlap_param_gather (4982) into core_r0.18.0 (#5146) by @ko3n1g
  • cp: build: Switch DSv3 on H100 to HybridEP (5164) into core_r0.18.0 (#5165) by @ko3n1g
  • chore(beep boop 🤖): Bump (core_r0.18.0) (2026-06-22) by @github-actions[bot]
  • beep boop 🤖: Bumping Megatron Core to v0.18.2 [skip ci] by @github-actions[bot]
  • Resetting Megatron Core version to v0.18.0 and Megatron FSDP to rc0 by @balasaajay
  • make fsdp a release package by @balasaajay
  • docs: Fix docs version for 0.18.0 release (#5435) by @chtruong814
  • beep boop 🤖: Bumping Megatron Core to v0.18.1 [skip ci] by @github-actions[bot]
  • Resetting Megatron Core and FSDP patch versions to 0 (#5437) by @balasaajay
