Changelog Details
- fix(ci): replace actions/setup-python with apt-get to avoid 429 rate limits by @ko3n1g :: PR: #4072
- ci: Fix package name for code-freeze workflow by @ko3n1g :: PR: #4077
- chore: bump _code_freeze workflow to v0.86.0 by @ko3n1g :: PR: #4078
- Fix checkpoint inspector by @janEbert :: PR: #4079
- Update docs to conform to NVIDIA style guides by @megnvidia :: PR: #4068
- Miscellaneous inference fixes by @santhnm2 :: PR: #4030
- fix fine_grained_callables with fused rmsnorm residual by @CarlosGomes98 :: PR: #4026
- [Main][feat] Support overlapping A2A Combine backprop with wgrad GEMM by @Wohox :: PR: #3795
- Modify mfsdp default data-parallel-sharding-strategy by @wplf :: PR: #3691
- Fix fsdp_dtensor conversion for pretrained-only checkpoints by @DAISY-gh :: PR: #3912
- Guard NVshmem issues by @wdykas :: PR: #4093
- m-fsdp: wire use_precision_aware_optimizer from ddp_config to ParamAn… by @rapatel :: PR: #4024
- Megatron-FSDP: Add MXFP8 transpose helper buffer for Hybrid FSDP by @shjwudp :: PR: #3918
- feat(fsdp): use TE general_gemm for mixed-precision wgrad in FSDP path by @Victarry :: PR: #3822
- Megatron-FSDP: Fix insufficient double buffers during gradient reduce by @shjwudp :: PR: #4054
- Fix M-FSDP MXFP8 related BUGs by @shjwudp :: PR: #3991
- Megatron-FSDP: Make _pre_forward_param_unshard and _register_post_backward_hook formal by @shjwudp :: PR: #4029
- FIX: Use decoupled gradients for precision-aware M-FSDP grad norm by @XueSongTap :: PR: #3746
- Align chat completions endpoint with vLLM by @santhnm2 :: PR: #4063
- [Megatron-FSDP] Fix compatibility with frozen parameters and add unit tests by @shjwudp :: PR: #3287
- [M-FSDP] Refactor uneven dtensor to full tensor and add UT by @shjwudp :: PR: #3190
- Add agent instruction files by @Phlip79 :: PR: #4102
- Bump eopt version by @skyw :: PR: #4100
- Refactor emerging optimizer integration by @skyw :: PR: #4113
- Fix over provisioning of Mamba state memory when max_requests is set by @santhnm2 :: PR: #4114
- base strategy simplification by @dimapihtar :: PR: #4001
- add support for DCP and FSDP async save by @dimapihtar :: PR: #4027
- Add more emerging optimizers (#3907) by @skyw :: PR: #4119
- Fix FSDP checkpoint conversion and loading for Qwen3.5-VL by @DAISY-gh :: PR: #3936
- docs: update mcore optimizer docstrings to google style by @Akshat8510 :: PR: #2799
- Set tensor-parallel attributes irrespective of perform_initialization by @ilml :: PR: #4084
- docs: add developer-guide skill with CI/CD and failure navigation guidance by @ko3n1g :: PR: #4035
- chore: Move skills by @ko3n1g :: PR: #4136
- ci: Let Claude react to comment by @ko3n1g :: PR: #4135
- Nemotron3 Super GB200 release config by @maanug-nv :: PR: #4118
- Enable CUDA graph for ADAM optimizer by @vasunvidia :: PR: #3429
- Claude review should recommend testing by @Phlip79 :: PR: #4137
- cleanup: remove unused scatter_gather_tensors_in_pipeline argument by @Phlip79 :: PR: #4140
- fix: Remove fail-fast (-x) and guard distributed teardown against deadlock by @ko3n1g :: PR: #4139
- Claude: add respond-to-issue skill by @Phlip79 :: PR: #4141
- Fix muon getter backward compatibility by @skyw :: PR: #4157
- Audit of user guide by @megnvidia :: PR: #4098
- Fix RerunStateMachine crash (TypeError: 'NoneType' object is not subscriptable) by not saving a checkpoint after a transient NaN / Inf by @yezhengmao1 :: PR: #3981
- Preserve type of decorated methods/classes by @nschank :: PR: #4062
- update muon test case to use new interface by @skyw :: PR: #4163
- [M-FSDP] Fix Tensor Parallel mode detection by @shjwudp :: PR: #3191
- fix: remove weights_only=False for multimodal example by @faradawn :: PR: #4104
- Cudagraphs: Fix sequence packing segfault more generally by @mathemakitten :: PR: #4162
- Make MTP work with materialize_only_last_token_logits by @santhnm2 :: PR: #4166
- Add unit test for Mamba EP inference (eager fallback with mixed CUDA graphs) by @santhnm2 :: PR: #4085
- update docs in respect to async changes by @dimapihtar :: PR: #4177
- update checkpointing docs in respect to async changes by @dimapihtar :: PR: #4208
- chore: improve build-and-test skill with trigger rules and dependency workflow by @ko3n1g :: PR: #4199
- Fix layerwise optimizer with expt_dp_size=1 and contention with element-wise distributed optimizer by @skyw :: PR: #4138
- ci: add --cluster-a100/h100/gb200 args to trigger_internal_ci.py by @ko3n1g :: PR: #4195
- ci: Update golden values for nightly tests by @chtruong814 :: PR: #4215
- rename async_allgather to overlap_param_gather by @skyw :: PR: #4217
- Fix Slack sync for users with GitHub email privacy enabled by @Phlip79 :: PR: #4220
- Miscellaneous MTP inference fixes by @santhnm2 :: PR: #4191
- Move inference guards out of arguments.py by @mathemakitten :: PR: #4210
- Fix: enable fine-grained activation offloading for Mamba model. by @fanshiqing :: PR: #4173
- bump NVRx by @dimapihtar :: PR: #4178
- Update tokenizer args for Nemotron3 release config by @maanug-nv :: PR: #4239
- build: add dynamic git-versioning and drop rc0 pre-release tag by @ko3n1g :: PR: #4212
- Fix unnecessary permute padding for non-quantized MoE dispatch by @xiaoxi-wangfj :: PR: #4038
- Fix split state dict main by @kunlunl :: PR: #3676
- Add /split-pr Claude Code command for splitting PRs by CODEOWNERS by @Phlip79 :: PR: #4160
- Enable FP8 DPA for MXFP8 recipe by @vasunvidia :: PR: #4066
- Enable AG/RS overlap with explicit process group passing by @jeffnvidia :: PR: #3249
- Enable cpu_offloading with Full iteration CUDA graph by @vasunvidia :: PR: #3969
- Fix TransformerConfig validation for mixed dense/MoE upcycling by @rkteddy :: PR: #3647
- Remove cross-rank synchronization during checkpoint load & deprecate torch.distributed.checkpoint.state_dict_loader.load_state_dict by @asolergi-nv :: PR: #2864
- Fix incorrectly set decoupled_grad and DistOpt mechanics for MFSDP. by @cspades :: PR: #4133
- Refit Miscellaneous by @wdykas :: PR: #3973
- Add conditions_embeddings argument to TransformerBlock, TransformerLayer for DiT (diffusion transformer) by @huvunvidia :: PR: #4134
- Fix build_sequences_per_dataset output path arg usage by @DhineshPonnarasan :: PR: #4144
- ci: Flush pending CUDA work before the barrier in destroy_model_parallel by @chtruong814 :: PR: #4259
- Update oncall schedule by @Phlip79 :: PR: #4257
- docs(moe): Update MoE README by @sbhavani :: PR: #3664
- Revert "Add conditions_embeddings argument to TransformerBlock, Trans… by @ko3n1g :: PR: #4270
- reduce the number of shared expert streams by @yangbofun :: PR: #3752
- remove legacy Bert code by @dimapihtar :: PR: #4204
- [Main] Feat(moe): Gated delta net context parallel (CP) by @yuzhongw-nvidia :: PR: #2642
- remove t5 legacy code by @dimapihtar :: PR: #4203
- fix: handle list-typed process groups in ProcessGroupCollection.__repr__ by @cluster2600 :: PR: #3753
- Fix Context Parallelism documentation link by @liangxs :: PR: #4149
- [MLA] fix: Pad V when Q/V head dims differ for THD by @HollowMan6 :: PR: #3003
- fix(megatron-fsdp): build expt_device_mesh only for MoE models by @xuwchen :: PR: #3831
- Allow the evaluation batch size to differ from the training batch size by @michal2409 :: PR: #4014
- Add @NVIDIA/transformer review group to megatron/core/transformer/ by @Phlip79 :: PR: #4281
- Reset AG_pipeline bucket status after validation step. by @vasunvidia :: PR: #3155
- Enhance and fix NVTX for training by @yaox12 :: PR: #3642
- NVFP4 native weights for DDP by @WanZzzzzz :: PR: #4005
- Remove unnecessary arguments for layerwise distributed optimizer by @FDecaYed :: PR: #4272
- reuse grad buffer for layer-wise param allgather by @FDecaYed :: PR: #3751
- feat(ci): add strict review mode to Claude review workflow by @Victarry :: PR: #4197
- Fix stale approvals by @Phlip79 :: PR: #4280
- [MoE] Add a new score function to the router by @yaox12 :: PR: #3673
- [MoE] Improvement of shared expert overlap, support shared expert overlap for FlexDispatcher by @Victarry :: PR: #2207
- build: bump DeepEP to 34152ae by @ko3n1g :: PR: #4228
- ci: mark test_fused_indexer_loss_gradient_tp_consistency as flaky_in_dev by @ko3n1g :: PR: #4299
- Fix typo in PR4133. by @cspades :: PR: #4277
- ci: add retry loop to apt-get update to handle transient mirror sync failures by @ko3n1g :: PR: #4209
- fix: enforce correct pass thresholds for deterministic and approximate tests by @ko3n1g :: PR: #4238
- remove legacy biencoder and realm models by @dimapihtar :: PR: #4205
- ci: add configurable launcher support for functional tests (ft_launcher / torchrun) by @ko3n1g :: PR: #4298
- chore: document --target main for local Docker builds by @ko3n1g :: PR: #4307
- Extract args init to launch scripts by @maanug-nv :: PR: #4225
- [Main] Fix TE version check for retain_pinned_cpu_buffers in cpu offload by @BestJuly :: PR: #4267
- Fix documented shape by @janEbert :: PR: #3486
- ci: add sync-skills workflow, rename CLAUDE.md → AGENTS.md, move .claude/skills → skills/ by @ko3n1g :: PR: #4303
- Get device correctly when module returns a dict instead of individual tensor by @shifangx :: PR: #4265
- remove vision legacy code by @dimapihtar :: PR: #4202
- feat: long convergence resiliency for release tests by @ko3n1g :: PR: #4335
- ci(action): improve GitHub Actions output UX by @ko3n1g :: PR: #4337
- build: bump TransformerEngine to release_v2.14 by @ko3n1g :: PR: #4331
- feat: add create-issue skill by @ko3n1g :: PR: #4338
- M4 leftover for TE cuda graph by @shifangx :: PR: #3137
- fix: wait for async P2P send before deallocating output tensor by @ZhiyuLi-Nvidia :: PR: #4047
- ci(gb200): add 1-node mr-github functional test variants by @ko3n1g :: PR: #4334
- Fix potential coredump issue that occurs when saving a checkpoint by @ezioliao :: PR: #1871
- docs: bump versions1.json to 0.17.0 (latest) by @ko3n1g :: PR: #4360
- Port DeepSeek Sparse Attention to MambaModel by @janEbert :: PR: #3553
- Add tables and histogram for RL staleness by @tdene :: PR: #4097
- [docs] ci: use parent-relative json_url for version picker by @ko3n1g :: PR: #4367
- Fix bug with non-partial rollouts by @tdene :: PR: #3964
- Add QK layernorm support for dot-product attention in MambaModel by @Phlip79 :: PR: #4067
- Docs: improve docstrings and comments in example training loop by @DhineshPonnarasan :: PR: #4041
- feat(ckpt): add --async-ckpt-use-cpu-shm argument by @sbak5 :: PR: #4355
- cp: Fix UT timeout (#4310) by @chtruong814 :: PR: #4373
- Fix RL reward due to stop token by @tdene :: PR: #4096
- FA4 Inference by @wdykas :: PR: #4186
- Make param_index_map always use unpacked (full numel) offsets by @deepakn94 :: PR: #4328
- Add activation logging and tokens per expert logging by @Mellonta :: PR: #3842
- Fix RL to once again work with --skip-train by @tdene :: PR: #4249
- Fix Megatron initialization with extra_args_provider by @santhnm2 :: PR: #4327
- Rename MambaModel/MambaStack to HybridModel/HybridStack by @Phlip79 :: PR: #4099
- fix(ci): wrap uv install in retry block by @ko3n1g :: PR: #4387
- Call save_checkpoint_and_time() when saving checkpoint and compute elapsed duration for saving checkpoint before logging timer by @awsankur :: PR: #4263
- refactor(tests): move NCCL env vars from docker launcher to shell training script by @ko3n1g :: PR: #4390
- Remove packed_attention_mask unused parameter by @tdene :: PR: #3859
- Second batch of audit edits by @megnvidia :: PR: #4115
- revert: replace rampup batch size scheduler with custom step batch size schedules (#3779) by @ko3n1g :: PR: #4404
- Replace rampup batch size scheduler with custom step batch size schedules by @deepakn94 :: PR: #4411
- Megatron-FSDP: log mcore detection only after imports succeed by @wujingyue :: PR: #4400
- ci(gb200): re-enable tunable_overlap 1-node mr-github test by @ko3n1g :: PR: #4405
- Fix local docs building by @Phlip79 :: PR: #4416
- RL: Onload optimizer after logprobs computation by @tdene :: PR: #4235
- Add RL token throughput and packing metrics by @tdene :: PR: #3877
- ci: remove publish:merge_into_dev job by @ko3n1g :: PR: #4421
- docs: add data loading best practices for large-scale training by @sbhavani :: PR: #4236
- Fix: Auto enable manual registration and enhance the documentation by @youngeunkwon0405 :: PR: #3295
- Fix nvtx_decorator to check _nvtx_enabled at call time by @minitu :: PR: #4184
- fix merges_file typo in megatron_hf_tokenizer by @chelseajohn :: PR: #4392
- Enable NullTokenizer for pretraining to reduce I/O access by @asolergi-nv :: PR: #4057
- docs: Add SECURITY.md by @chtruong814 :: PR: #4431
- Mamba inference opt by @wdykas :: PR: #4414
- DDP refactoring: Extract parameter layout computation into optimizer classmethod by @deepakn94 :: PR: #3812
- Update PR template with explicit request for issue by @Phlip79 :: PR: #4409
- Misc inference fixes by @sidsingh-nvidia :: PR: #4397
- Rename Mamba to Hybrid outside megatron/core by @Phlip79 :: PR: #4159
- Include mtp layers in token per expert logging by @Mellonta :: PR: #4412
- fix: NVRx async compatibility and defer resiliency import by @sbak5 :: PR: #4420
- fix(checkpoint_inspector): allow empty --param-to-param-group-map-json by @DAISY-gh :: PR: #4403
- Add the YARN support for hybrid_model by @guihong-nv :: PR: #4244
- [training migration] Add container class for config dataclasses by @maanug-nv :: PR: #4227
- Inference: Fix broken functional tests on gitlab by @sidsingh-nvidia :: PR: #4454
- SafeUnpickler class for safe pickle usage by @dimapihtar :: PR: #4319
- get rid of weights_only=False by @dimapihtar :: PR: #4434
- Inference | Per-block MoE routing storage for prefix caching by @lmcafee-nvidia :: PR: #4301
- Add troubleshooting tip for 'access forbidden' by @balasaajay :: PR: #4449
- Fix checkpoint loading with rerun state machine by @YangFei1990 :: PR: #4448
- Add misc CUDA graph sugar to CudaGraphManager by @tdene :: PR: #4425
- Inference: Add the embedding and output layer in the full_iteration_inference cuda graph scope for hybrid models by @sidsingh-nvidia :: PR: #4440
- Important bugfixes in local CG implementation that were leading to loss curve gaps for latent MoE models by @jiemingz :: PR: #4433
- fix: Replace polynomial rolling hash with SHA-256 for prefix caching by @lmcafee-nvidia :: PR: #4158
- feat(ckpt): expose validate_access_integrity knob on dist-ckpt load by @asolergi-nv :: PR: #4422
- Fix multivalidation by @RPrenger :: PR: #3388
- Add missing knob for reduce_scatter_with_fp32_accumulation by @WanZzzzzz :: PR: #4410
- Enable CUDA graphs for MTP inference by @santhnm2 :: PR: #4260
- checkpoint integrity verification by @dimapihtar :: PR: #4305
- Fix cache gating by @wdykas :: PR: #4455
- [Main] Fix FusedAdam.use_decoupled_grad mis-set for Megatron-FSDP. by @cspades :: PR: #4427
- add permute fusion into hybrid ep by @Autumn1998 :: PR: #4089
- Add ColocatedBridgeCommunicator for heterogeneous TP/DP MIMO training (NMFW-17) by @yashaswikarnati :: PR: #4368
- Fix incorrect bias display in extra_repr of Column/RowParallelLinear by @HelloWorldBeginner :: PR: #4330
- Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining by @joapolarbear :: PR: #4276
- ci: Fix event name reference in CI workflow condition for merge group by @balasaajay :: PR: #4462
- Add nightly sync workflow from main to dev by @Phlip79 :: PR: #4165
- fix: handle list-format quant_cfg from ModelOpt PR #1094 by @ChenhanYu :: PR: #4187
- ci: also add Run MBridge tests label in nightly sync workflow by @Phlip79 :: PR: #4499
- [training migration] Add serialization features to config container by @maanug-nv :: PR: #4309
- Fix conflict with inference graphs by @tdene :: PR: #4504
- Add tools/prepare_cache.py for offline GPT dataset cache preparation by @asolergi-nv :: PR: #4080
- [build] fix: move mamba-ssm and causal-conv1d to optional [ssm] extra by @ko3n1g :: PR: #4517
- mamba: avoid redundant HBM reloads in causal_conv1d_update shift loop by @wdykas :: PR: #4460
- Standardize misc graph interface by @tdene :: PR: #4485
- Fix inference graph override in RL flow by @tdene :: PR: #4323
- Unify and refactor Megatron-FSDP documentation. by @cspades :: PR: #4418
- Revert "ci: add base_sha to codecov/codecov-action upload step (#4445)" by @chtruong814 :: PR: #4526
- Skills for running unit tests and working with slurm by @yashaswikarnati :: PR: #4502
- Reorganize order of operations in inference context and text generation controller by @tdene :: PR: #2929
- ci: Update CI workflow conditions to include merge group handling by @balasaajay :: PR: #4532
- ci: add base_sha to codecov/codecov-action upload step by @chtruong814 :: PR: #4540
- Fix release tests: remove --global-batch-size conflicting with --step-batch-size-schedule by @deepakn94 :: PR: #4545
- docs: use @file-path notation for file references in skills by @ko3n1g :: PR: #4542
- Support YAML quant recipe in PTQ and remove first/last layer modifier code by @jenchen13 :: PR: #4503
- Avoid nsys profile crash with CUDA graphs by @tdene :: PR: #4541
- fix(ci): add retry with backoff to approve-test-queue bot by @ko3n1g :: PR: #4559
- New allgathervdispatcher for inference and simplify old dispatcher. by @sidsingh-nvidia :: PR: #4258
- Fixes for modelopt examples and SFTTokenizer for transformers v5 by @jenchen13 :: PR: #4450
- Adding code for Flextron by @sheliang-nv :: PR: #4429
- Fix partial cudagraphs + HybridEP not properly triggering DDP hook by @jiemingz :: PR: #4500
- Ignore pytorch link anchors by @maanug-nv :: PR: #4582
- MoE dispatcher fixes: size NVLS dispatcher buffers from actual tensor sizes by @mathemakitten :: PR: #4576
- Finalize all builders in preprocess_data, not just the last key by @sayalinvidia :: PR: #4573
- refactor(skills): add when_to_use frontmatter, split ci-test-system, enforce skill workflow by @ko3n1g :: PR: #4574
- Make last_token_logits graphable by @tdene :: PR: #4552
- fix(ci): correct off-by-one in total_steps_evaluated formula by @ko3n1g :: PR: #4591
- Add fault injection support via nvidia_resiliency_ext. by @hexinw-nvidia :: PR: #4370
- Guard vocab reduce_scatter on TP > 1 by @mathemakitten :: PR: #4565
- Move inference context bookkeeping to CPU with ContextGPUView by @lmcafee-nvidia :: PR: #4306
- Enable InJob restart on failures. by @hexinw-nvidia :: PR: #4594
- Enable shared expert overlap with allgatherv in inference by @sidsingh-nvidia :: PR: #4570
- Add vLLM grouped gemm backend for MoE inference by @santhnm2 :: PR: #4566
- Move KD teacher loading to after Float16Module by @AAnoosheh :: PR: #4394
- ci: update gpt3_7b_tp4_pp1_memory_speed gb200 golden values by @ko3n1g :: PR: #4601
- Fix inference unit test by @maanug-nv :: PR: #4589
- Checkpoint conversion between GPT_model and Hybrid_model by @guihong-nv :: PR: #4482
- ci: add cadence input for test filtering in CI workflows by @balasaajay :: PR: #4561
- Fix mtp_use_repeated_layer behavior for GPT models by @rkarimimahab :: PR: #3965
- Handle SSM sharded tensor merge OOM with CPU fallback by @returnL :: PR: #4442
- FlashInfer sampling by @tdene :: PR: #2456
- Fix main2dev workflow by @Phlip79 :: PR: #4610
- Add logic to enable chunked MLP during training by @pengdurice :: PR: #3656
- Inference bug-fixes: Re-enable EP syncs for the legacy A2A dispatcher and re-simplify ep_sync accidentally reverted by #4306 by @sidsingh-nvidia :: PR: #4587
- Remove invalid timeout argument for dist.barrier by @zhaoyinglia :: PR: #4512
- Fix buffers in refit by @wdykas :: PR: #4580
- Named validation sets by @RPrenger :: PR: #4578
- Fix Hang in tests by @wdykas :: PR: #4575
- Single commit for main2dev nightly by @Phlip79 :: PR: #4614
- convert tokenizer args to config by @dimapihtar :: PR: #4406
- Siddharth/fix ep sync by @wdykas :: PR: #4607
- mmiranda working on another set of broken links by @megnvidia :: PR: #4534
- Fix gradient corruption with layerwise param all-gather overlap by @deepakn94 :: PR: #4609
- test: mark TestFusedApplyMLARope::test_forward_backward_for_q flaky_in_dev by @ko3n1g :: PR: #4639
- remove legacy GPT code by @dimapihtar :: PR: #4322
- ci: introduce L-tier scope vocabulary via parser by @balasaajay :: PR: #4625
- Inference: Tune vLLM grouped gemm, moe_sum kernel, and enable shared expert overlap in latent MoEs by @sidsingh-nvidia :: PR: #4603
- Fix crash involving evicted requests and tpot by @tdene :: PR: #4645
- remove legacy transformer and modules by @dimapihtar :: PR: #4207
- chore: Update Docker image version to 26.04-py3 by @balasaajay :: PR: #4611
- Propagate errors for failed inference requests by @mathemakitten :: PR: #4679
- Inference: Cache input + position ID views by @mathemakitten :: PR: #4634
- ci: Update Gitlab base image to 26.04 pytorch by @chtruong814 :: PR: #4688
- Add periodic GPU sniff tests to detect hardware stragglers by @deepakn94 :: PR: #4662
- ci: Bump GHA versions by @chtruong814 :: PR: #4606
- build: widen flashinfer-python pin to <0.7.0 by @ko3n1g :: PR: #4700
- Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len by @Shreyas-S-809 :: PR: #4094
- Switch oncall by @janEbert :: PR: #4702
- Update golden values for various functional tests by @balasaajay :: PR: #4703
- chore: Update golden values for various functional tests by @balasaajay :: PR: #4706
- build: upgrade mamba-ssm to 2.3.2.post1, causal-conv1d to 1.6.2.post1 by @ko3n1g :: PR: #4712
- ci: replace uuidgen with /proc/sys/kernel/random/uuid by @ko3n1g :: PR: #4714
- chore(codeowners): add megatron/inference/ ownership by @ko3n1g :: PR: #4704
- Create a Protocol for the MLP layer of TransformerLayer by @nschank :: PR: #3435
- Revert "Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094)" by @ko3n1g :: PR: #4718
- Add Python-side guardrail for DeepEP IB limits by @janEbert :: PR: #4719
- ci: revert bad uv.lock bump and label future bumps with Run functional tests by @ko3n1g :: PR: #4730
- [ci] fix: treat cancelled run-main-script step as failure by @ko3n1g :: PR: #4727
- ci: Major refactor of release-workflows by @ko3n1g :: PR: #4602
- build(deps): bump nvidia-modelopt to 0.43 by @ko3n1g :: PR: #4723
- fix(fsdp): recognize legacy GDN TP metadata by @Glitchfix :: PR: #4664
- Fixes for Nemotron3 Super release test config by @maanug-nv :: PR: #4544
- feat(gpt): add output postprocess hook by @Glitchfix :: PR: #4686
- Add bump-base-image skill and update golden value comparison by @balasaajay :: PR: #4733
- Guard omegaconf imports by @maanug-nv :: PR: #4685
- Fix a regression introduced by #4625 for nightly runs by @balasaajay :: PR: #4734
- Add LLaVA audio (sound) model support by @cuichenx :: PR: #4402
- Support transformers 5.x.x for text generation server by @tdene :: PR: #4732
- Update transformer-engine dependency to version 2.15.0 by @balasaajay :: PR: #4682
- Increase CG cover from max_requests to max_tokens by @tdene :: PR: #4214
- fully remove legacy code by @dimapihtar :: PR: #4759
- fix legacy torch save when tensor_model_parallel_size > expert_model_parallel_size * expert_tensor_parallel_size by @dimapihtar :: PR: #4678
- Wire --rl-inference-parsers into MRL by @tdene :: PR: #4768
- Integrate LayerWiseDistributedOptimizer with DDP buffer infrastructure by @deepakn94 :: PR: #4509
- [training migration] Migrate mamba builder by @maanug-nv :: PR: #4550
- NCCL UB fix: reduce memory cost and correctly deregister NCCL mem pool by @xrennvidia :: PR: #4492
- fix: use no_mask in local ViT layer spec by @Phlip79 :: PR: #4395
- refit clean up and refactoring by @wdykas :: PR: #4762
- Support recomputing in HybridModel by @xuantengh :: PR: #4496
- Make weight and optimizer memory estimation take into account expert parallelism correctly by @YangFei1990 :: PR: #4687
- One single flag that determines if we are in inference by @tdene :: PR: #4617
- [main] feat(moe): Support packed sequence for gated delta net (GDN) by @yuzhongw-nvidia :: PR: #2645
- remove dead manual_release_grads code path in 1F1B overlap schedule by @Wohox :: PR: #4511
- Fix recompute checkpointing + training CGs by @tdene :: PR: #3919
- Use Protocols to type-check linear_proj submodules of Attention by @nschank :: PR: #3434
- fix tokenizers in respect to newer transformers by @dimapihtar :: PR: #4608
- Bump nvidia-modelopt>=0.44.0 by @kevalmorabia97 :: PR: #4803
- Update owners by @Phlip79 :: PR: #4794
- ci: Update workflow to use same commit for building docker image and running tests by @balasaajay :: PR: #4787
- chore: Update nightly tests golden values by @balasaajay :: PR: #4805
- Inference: Optimize Prefill Engine Steps for Nemotron by @sidsingh-nvidia :: PR: #4764
- ci: tolerate git-gc race in /home/runner chown after checkout by @balasaajay :: PR: #4808
- additional tests for nvrx by @dimapihtar :: PR: #4522
- Disable MSC by default; opt in via --enable-msc by @asolergi-nv :: PR: #4629
- Strengthen test_checkpoint to verify distributed checkpoint behavior by @lichenlu :: PR: #4711
- Combine GEMM + SwiGLU fused MLP PRs (3890, 4071, 4095, 4219, 4311, 4324) → main by @Connor-XY :: PR: #4636
- [fix] Use MSC for checking checkpoint existence by @pavelgein :: PR: #4251
- [Main][feat] Support A2A Overlap for Megatron-FSDP by @Wohox :: PR: #3797
- Reorder mtp_post_process after attention backward in 1F1B schedule plan by @gdengk :: PR: #4695
- add is_torch_min_version in fsdp src by @xrennvidia :: PR: #4812
- Add high-priority A2A stream and HybridEP preprocessing SMs by @gdengk :: PR: #4694
- Refactor CUDA graph API: decompose cuda_graph_scope into full_iteration impl, inference scope, and per-layer capture modules by @buptzyb :: PR: #4292
- Tokenizers updates by @dimapihtar :: PR: #4780
- Fix no nvrx tests by @dimapihtar :: PR: #4847
- Thread custom process groups through MoE grad finalization by @yashaswikarnati :: PR: #4782
- Fix unit tests by @shanmugamr1992 :: PR: #4689
- Tests/dynamic inference functional coverage by @shanmugamr1992 :: PR: #4761
- Fix oncall references by @janEbert :: PR: #4722
- Update golden values for nightly functional tests by @balasaajay :: PR: #4850
- fix(inference): size DynamicInferenceContext KV layer_map for non-uniform PP by @athitten :: PR: #4775
- Modernize post-training modelopt example scripts by @kevalmorabia97 :: PR: #4807
- test: add inference performance test harness for GPT 583M, hybrid 2B,… by @shanmugamr1992 :: PR: #4806
- ci: Gate optional CI jobs with repository variables by @chtruong814 :: PR: #4907
- Fix tokenizers bug in nightly by @Phlip79 :: PR: #4833
- ci: Prevent shell trace in parts of _run_training.sh by @chtruong814 :: PR: #4884
- Ignore Vim swap files by @wujingyue :: PR: #4860
- M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs by @shjwudp :: PR: #4181
- MimoOptimizer: fix distributed checkpoint save and load for non-colocated MIMO by @kamran-nvidia :: PR: #4801
- Route non-Muon params through DistributedOptimizer by @deepakn94 :: PR: #4771
- Allow optimizer CG to share the same pool as full-iter CG by @nanz-nv :: PR: #4698
- Use sharded_state_dict_default in MLP.sharded_state_dict by @gdengk :: PR: #4693
- Update PR template by @Phlip79 :: PR: #4904
- Fix MTP recompute crash with packed sequences by @BestJuly :: PR: #4593
- ci: Update perf test to output logs for tests to pass by @chtruong814 :: PR: #4906
- Also persist asymmetrical units for the MXFP8 transpose weight buffer. by @cspades :: PR: #4852
- fix no_shard training convergence and add unit test for no_shard by @wplf :: PR: #3754
- Move policy epoch stats to the message object by @ArEsKay3 :: PR: #4533
- Add a knob to throttle the max allowed inflight offload in fine grained offloading by @nanz-nv :: PR: #4692
- refactor(data): consolidate get_batch and enable PP for SFT THD by @asolergi-nv :: PR: #4103
- Allow YAML MoE configs to use model specs by @chawkins-nvidia :: PR: #4822
- Move bert and t5 pretrain files by @Phlip79 :: PR: #4820
- Paged Stashing by @nanz-nv :: PR: #4247
- make FP4 param gather work with the mixed precisions in NVFP4 recipe by @xrennvidia :: PR: #4358
- fix: Fix multi-node functional test phase sync by @chtruong814 :: PR: #4924
- Perf tests by @shanmugamr1992 :: PR: #4917
- fix(cuda_graphs): handle TE 2.15 removal of FP8GlobalStateManager.set_skip_fp8_weight_update_tensor by @balasaajay :: PR: #4874
- Fix paged stashing test submodules lookup by @Phlip79 :: PR: #4925
- Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (#4318) by @sraman-rgb :: PR: #4786
- Fix mxfp8 param gather numerical issue when DP overlap is off by @WanZzzzzz :: PR: #4800
- [MXFP8/FP4-param-gather] Post processing after forced param AG in eval by @WanZzzzzz :: PR: #4562
- ci: Update training script paths in BERT and T5 by @balasaajay :: PR: #4939
- Various training utils by @maanug-nv :: PR: #4872
- ci: restore perf test torchrun logs by @chtruong814 :: PR: #4951
- Fix get_batch return order to ignore BlendedDataset provenance fields by @deepakn94 :: PR: #4952
- test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval by @ko3n1g :: PR: #4932
- test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture by @ko3n1g :: PR: #4931
- Avoid offsetting functional test master port by @chtruong814 :: PR: #4973
- Fix elastification unwrap_model import by @Devil1716 :: PR: #4972
- test: re-enable paged stashing MoE tests by @ko3n1g :: PR: #4978
- test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node by @ko3n1g :: PR: #4984
- ci: Add support for MBridge job gating based on PR labels by @balasaajay :: PR: #4926
- test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ by @ko3n1g :: PR: #4985
- fix(tests): initialize num_microbatches calculator in vision cudagraph tests by @ko3n1g :: PR: #4986
- ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies by @balasaajay :: PR: #4905
- Drain predecessor reduce-scatter at dispatch time by @deepakn94 :: PR: #4940
- nightly(ci): Update golden values for functional t5 tests by @balasaajay :: PR: #4995
- [main] Refactor and Improve MoE Logging init commit by @yanring :: PR: #3431
- ci: validate release branch-rules by @ko3n1g :: PR: #4929
- [Megatron-FSDP] Add conditional param.grad dereferencing logic to support full-iteration (FWD-BWD) CUDA graphability. by @cspades :: PR: #4663
- test: restrict iter-time comparison to steady-state window by @ko3n1g :: PR: #5010
- fix(test): pin eval-global-batch-size on 15b gb200 release configs by @ko3n1g :: PR: #5022
- [fix] Release MTP assertion when EP overlap with PP=1 by @Wohox :: PR: #4796
- fix(test): widen iter-time steady-state window for short tests by @ko3n1g :: PR: #5023
- Perf fix by @shanmugamr1992 :: PR: #4996
- Add dev-feature preservation gate and change schedule by @Phlip79 :: PR: #4773
- chore(test): remove orphan nemotron3_super_release_g200 dir by @ko3n1g :: PR: #5024
- Ignore Claude worktree directory by @Phlip79 :: PR: #5020
- ci: update CI workflow conditions for integration tests by @balasaajay :: PR: #4658
- Add NVSkills CI request workflow by @Phlip79 :: PR: #5033
- DDP wrap pg size fixes by @maanug-nv :: PR: #5006
- fix(layer_wise): tag MTP-stage word_embeddings as is_embedding_or_output_parameter by @Wohox :: PR: #5034
- Move LTS dependencies from pyproject.toml to Dockerfile.ci.lts by @balasaajay :: PR: #4877
- Use shared ModelOpt calibration loop on 0.45+ with 0.44 fallback fix by @kevalmorabia97 :: PR: #4881
- test(release): skip golden comparison on intermediate resume windows by @ko3n1g :: PR: #5040
- [mimo] Thread position_ids through MimoModel for multimodal RoPE by @liding-nv :: PR: #4938
- build: Switch DSv3 on H100 to HybridEP by @ko3n1g :: PR: #5039
- Fix: Import unwrap_model from megatron.core.utils in modelopt examples by @kevalmorabia97 :: PR: #5045
- Simple and stable Inference APIs by @YangFei1990 :: PR: #4697
- ci: Add notification step for MBridge downstream test results by @balasaajay :: PR: #5028
- Delete output tensor early by @Phlip79 :: PR: #4742
- Support ScaledSReLU in TE grouped MLP fuser by @sraman-rgb :: PR: #4859
- Skip gradient updates when grad norm exceeds threshold by @yfw :: PR: #3460
- Add 9 user skills by @Phlip79 :: PR: #5066
- test(nemotron): align nemotron3 super GB200 goldens with exit-interval 4768 by @ko3n1g :: PR: #5069
- chore: Update transformer-engine dependency to version 2.16.0 by @balasaajay :: PR: #4992
- Update energon version requirement by @maanug-nv :: PR: #4572
- Fix test failures for new inference APIs by @YangFei1990 :: PR: #5068
- fix(ci): set PYTHONUNBUFFERED=1 in JET workload env by @ko3n1g :: PR: #5072
- Preserve non-FSDP-unit buckets across AllGatherPipeline reset by @wujingyue :: PR: #4717
- Add opt-in MXFP8 LM-head output projection by @gdengk :: PR: #4825
- fix(ci): bound JET pipeline polling with a watchdog to prevent indefinite hangs by @ko3n1g :: PR: #5076
- ci: prune old artifacts on cluster lustre during weekly/release runs by @ko3n1g :: PR: #5084
- ci(test): isolate ckpt-resume tensorboard per phase by @ko3n1g :: PR: #5074
- test: unmark EP A2A activation offload test flaky by @lhb8125 :: PR: #5009
- Change ownership groups by @Phlip79 :: PR: #5021
- test: skip mfsdp_fully_shard cases when world_size < mesh size by @wujingyue :: PR: #4487
- fix mimo optimizer checkpoint metadata restore by @liding-nv :: PR: #4791
- [mimo] Support bridge fan-out for variable modality tokens by @liding-nv :: PR: #5062
- cp: `Remove DeepEP hardware limit check (4846)` into `core_r0.18.0` by @ko3n1g :: PR: #5126
- chore: Update transformer-engine dependency to revision 4220403 (#5112) by @balasaajay :: PR: #5137
- cp: `fix(optimizer): gate ChainedOptimizer MXFP8 defer-sync on DDP-level overlap_param_gather (4982)` into `core_r0.18.0` by @ko3n1g :: PR: #5146
- cp: `build: Switch DSv3 on H100 to HybridEP (5164)` into `core_r0.18.0` by @ko3n1g :: PR: #5165
- beep boop 🤖: Bumping Megatron Core to v0.18.2 by @nvidia-megatron-lm-release-bot[bot] :: PR: #5434
- docs: Fix docs version for 0.18.0 release by @chtruong814 :: PR: #5435
- beep boop 🤖: Bumping Megatron Core to v0.18.1 by @nvidia-megatron-lm-release-bot[bot] :: PR: #5436
- Resetting Megatron Core and FSDP patch versions to 0 by @balasaajay :: PR: #5437
- chore: Bump versions by @ko3n1g
- fix(ci): replace actions/setup-python with apt-get to avoid 429 rate limits (#4072) by @ko3n1g
- ci: Fix package name for code-freeze workflow (#4077) by @ko3n1g
- chore: bump `_code_freeze` workflow to `v0.86.0` (#4078) by @ko3n1g
- Fix checkpoint inspector (#4079) by @janEbert
- Update docs to conform to NVIDIA style guides (#4068) by @megnvidia
- Miscellaneous inference fixes (#4030) by @santhnm2
- fix fine_grained_callables with fused rmsnorm residual (#4026) by @CarlosGomes98
- [Main][feat] Support overlapping A2A Combine backprop with wgrad GEMM (#3795) by @Wohox
- Modify mfsdp default data-parallel-sharding-strategy (#3691) by @wplf
- Fix fsdp_dtensor conversion for pretrained-only checkpoints (#3912) by @DAISY-gh
- Guard NVshmem issues (#4093) by @wdykas
- m-fsdp: wire use_precision_aware_optimizer from ddp_config to ParamAn… (#4024) by @rapatel
- Megatron-FSDP: Add MXFP8 transpose helper buffer for Hybrid FSDP (#3918) by @shjwudp
- feat(fsdp): use TE general_gemm for mixed-precision wgrad in FSDP path (#3822) by @Victarry
- Megatron-FSDP: Make _pre_forward_param_unshard and _register_post_backward_hook formal (#4029) by @shjwudp
- Megatron-FSDP: Fix insufficient double buffers during gradient reduce (#4054) by @shjwudp
- Fix M-FSDP MXFP8 related BUGs (#3991) by @shjwudp
- FIX: Use decoupled gradients for precision-aware M-FSDP grad norm (#3746) by @XueSongTap
- [Megatron-FSDP] Fix compatibility with frozen parameters and add unit tests (#3287) by @shjwudp
- Align chat completions endpoint with vLLM (#4063) by @santhnm2
- [M-FSDP] Refactor uneven dtensor to full tensor and add UT (#3190) by @shjwudp
- Add agent instruction files (#4102) by @Phlip79
- Bump eopt version (#4100) by @skyw
- Refactor emerging optimizer integration (#4113) by @skyw
- Fix over provisioning of Mamba state memory when max_requests is set (#4114) by @santhnm2
- base strategy simplification (#4001) by @dimapihtar
- add support for DCP and FSDP async save (#4027) by @dimapihtar
- Add more emerging optimizers (#3907) (#4119) by @skyw
- Fix FSDP checkpoint conversion and loading for Qwen3.5-VL (#3936) by @DAISY-gh
- docs: update mcore optimizer docstrings to google style (#2799) by @Akshat8510
- Update oncall schedule (#4117) by @Phlip79
- Set tensor-parallel attributes irrespective of perform_initialization (#4084) by @ilml
- docs: add developer-guide skill with CI/CD and failure navigation guidance (#4035) by @ko3n1g
- chore: Move skills (#4136) by @ko3n1g
- ci: Let Claude react to comment (#4135) by @ko3n1g
- Nemotron3 Super GB200 release config (#4118) by @maanug-nv
- Enable CUDA graph for ADAM optimizer (#3429) by @vasunvidia
- Claude review should recommend testing (#4137) by @Phlip79
- cleanup: remove unused `scatter_gather_tensors_in_pipeline` argument (#4140) by @Phlip79
- fix: Remove fail-fast (-x) and guard distributed teardown against deadlock (#4139) by @ko3n1g
- chore(beep boop 🤖): Bump (main) (2026-04-06) by @github-actions[bot]
- Claude: add respond-to-issue skill (#4141) by @Phlip79
- Fix muon getter backward compatibility (#4157) by @skyw
- Audit of user guide (#4098) by @megnvidia
- Fix `RerunStateMachine` crash (TypeError: 'NoneType' object is not subscriptable) by not saving a checkpoint after a transient NaN / Inf (#3981) by @yezhengmao1
- Preserve type of decorated methods/classes (#4062) by @nschank
- update muon test case to use new interface (#4163) by @skyw
- [M-FSDP] Fix Tensor Parallel mode detection (#3191) by @shjwudp
- fix: remove weights_only=False for multimodal example (#4104) by @faradawn
- Cudagraphs: Fix sequence packing segfault more generally (#4162) by @mathemakitten
- Make MTP work with materialize_only_last_token_logits (#4166) by @santhnm2
- Add unit test for Mamba EP inference (eager fallback with mixed CUDA graphs) (#4085) by @santhnm2
- update docs in respect to async changes (#4177) by @dimapihtar
- update checkpointing docs in respect to async changes (#4208) by @dimapihtar
- chore: improve build-and-test skill with trigger rules and dependency workflow (#4199) by @ko3n1g
- Fix layerwise optimizer with `expt_dp_size=1` and contention with element-wise distributed optimizer (#4138) by @skyw
- ci: add --cluster-a100/h100/gb200 args to trigger_internal_ci.py (#4195) by @ko3n1g
- ci: Update golden values for nightly tests (#4215) by @chtruong814
- rename async_allgather to overlap_param_gather (#4217) by @skyw
- Fix Slack sync for users with GitHub email privacy enabled (#4220) by @Phlip79
- Miscellaneous MTP inference fixes (#4191) by @santhnm2
- Move inference guards out of arguments.py (#4210) by @mathemakitten
- Fix: enable fine-grained activation offloading for Mamba model. (#4173) by @fanshiqing
- bump NVRx (#4178) by @dimapihtar
- Update tokenizer args for Nemotron3 release config (#4239) by @maanug-nv
- build: add dynamic git-versioning and drop rc0 pre-release tag (#4212) by @ko3n1g
- Fix unnecessary permute padding for non-quantized MoE dispatch (#4038) by @xiaoxi-wangfj
- Fix split state dict main (#3676) by @kunlunl
- Enable FP8 DPA for MXFP8 recipe (#4066) by @vasunvidia
- Add /split-pr Claude Code command for splitting PRs by CODEOWNERS (#4160) by @Phlip79
- Enable AG/RS overlap with explicit process group passing (#3249) by @jeffnvidia
- Enable cpu_offloading with Full iteration CUDA graph (#3969) by @vasunvidia
- Fix TransformerConfig validation for mixed dense/MoE upcycling (#3647) by @rkteddy
- Remove cross-rank synchronization during checkpoint load & deprecate torch.distributed.checkpoint.state_dict_loader.load_state_dict (#2864) by @asolergi-nv
- Fix incorrectly set decoupled_grad and DistOpt mechanics for MFSDP. (#4133) by @cspades
- Refit Miscellaneous (#3973) by @wdykas
- Add conditions_embeddings argument to TransformerBlock, TransformerLayer for DiT (diffusion transformer) (#4134) by @huvunvidia
- Fix build_sequences_per_dataset output path arg usage (#4144) by @DhineshPonnarasan
- ci: Flush pending CUDA work before the barrier in destroy_model_parallel (#4259) by @chtruong814
- Update oncall schedule (#4257) by @Phlip79
- docs(moe): Update MoE README (#3664) by @sbhavani
- Revert "Add conditions_embeddings argument to TransformerBlock, Trans… (#4270) by @ko3n1g
- reduce the number of shared expert streams (#3752) by @yangbofun
- remove legacy Bert code (#4204) by @dimapihtar
- [Main] Feat(moe): Gated delta net context parallel (CP) (#2642) by @yuzhongw-nvidia
- remove t5 legacy code (#4203) by @dimapihtar
- fix: handle list-typed process groups in ProcessGroupCollection.repr (#3753) by @cluster2600
- Fix Context Parallelism documentation link (#4149) by @liangxs
- [MLA] fix: Pad V when Q/V head dims differ for THD (#3003) by @HollowMan6
- Allow the evaluation batch size to differ from the training batch size (#4014) by @michal2409
- fix(megatron-fsdp): build expt_device_mesh only for MoE models (#3831) by @xuwchen
- Add @NVIDIA/transformer review group to megatron/core/transformer/ (#4281) by @Phlip79
- Reset AG_pipeline bucket status after validation step. (#3155) by @vasunvidia
- Enhance and fix NVTX for training (#3642) by @yaox12
- NVFP4 native weights for DDP (#4005) by @WanZzzzzz
- Remove unnecessary arguments for layerwise distributed optimizer (#4272) by @FDecaYed
- reuse grad buffer for layer-wise param allgather (#3751) by @FDecaYed
- feat(ci): add strict review mode to Claude review workflow (#4197) by @Victarry
- Fix stale approvals (#4280) by @Phlip79
- [MoE] Add a new score function to the router (#3673) by @yaox12
- [MoE] Improvement of shared expert overlap, support shared expert overlap for FlexDispatcher (#2207) by @Victarry
- build: bump DeepEP to 34152ae (#4228) by @ko3n1g
- ci: mark test_fused_indexer_loss_gradient_tp_consistency as flaky_in_dev (#4299) by @ko3n1g
- Fix typo in PR4133. (#4277) by @cspades
- ci: add retry loop to apt-get update to handle transient mirror sync failures (#4209) by @ko3n1g
- remove legacy biencoder and realm models (#4205) by @dimapihtar
- fix: enforce correct pass thresholds for deterministic and approximate tests (#4238) by @ko3n1g
- ci: add configurable launcher support for functional tests (ft_launcher / torchrun) (#4298) by @ko3n1g
- chore: document --target main for local Docker builds (#4307) by @ko3n1g
- Extract args init to launch scripts (#4225) by @maanug-nv
- [Main] Fix TE version check for retain_pinned_cpu_buffers in cpu offload (#4267) by @BestJuly
- Fix documented shape (#3486) by @janEbert
- ci: add sync-skills workflow, rename CLAUDE.md → AGENTS.md, move .claude/skills → skills/ (#4303) by @ko3n1g
- chore(beep boop 🤖): symlink skills/ → .claude/skills, .agents/skills and AGENTS.md → CLAUDE.md by @github-actions[bot]
- Get `device` correctly when module returns a dict instead of individual tensor (#4265) by @shifangx
- remove vision legacy code (#4202) by @dimapihtar
- feat: long convergence resiliency for release tests (#4335) by @ko3n1g
- ci(action): improve GitHub Actions output UX (#4337) by @ko3n1g
- build: bump TransformerEngine to release_v2.14 (#4331) by @ko3n1g
- M4 leftover for TE cuda graph (#3137) by @shifangx
- feat: add create-issue skill (#4338) by @ko3n1g
- Set megatron-fsdp to 0.5.0 by @ko3n1g
- fix: wait for async P2P send before deallocating output tensor (#4047) by @ZhiyuLi-Nvidia
- ci(gb200): add 1-node mr-github functional test variants (#4334) by @ko3n1g
- Fix potential coredump issue that occurs when saving a checkpoint (#1871) by @ezioliao
- Port DeepSeek Sparse Attention to `MambaModel` (#3553) by @janEbert
- docs: bump versions1.json to 0.17.0 (latest) (#4360) by @ko3n1g
- Add tables and histogram for RL staleness (#4097) by @tdene
- [docs] ci: use parent-relative json_url for version picker (#4367) by @ko3n1g
- Fix bug with non-partial rollouts (#3964) by @tdene
- Add QK layernorm support for dot-product attention in MambaModel (#4067) by @Phlip79
- Docs: improve docstrings and comments in example training loop (#4041) by @DhineshPonnarasan
- feat(ckpt): add --async-ckpt-use-cpu-shm argument (#4355) by @sbak5
- cp: Fix UT timeout (#4310) (#4373) by @chtruong814
- Fix RL reward due to stop token (#4096) by @tdene
- FA4 Inference (#4186) by @wdykas
- Make param_index_map always use unpacked (full numel) offsets (#4328) by @deepakn94
- Add activation logging and tokens per expert logging (#3842) by @Mellonta
- Fix RL to once again work with --skip-train (#4249) by @tdene
- Fix Megatron initialization with extra_args_provider (#4327) by @santhnm2
- Rename MambaModel/MambaStack to HybridModel/HybridStack (#4099) by @Phlip79
- chore(beep boop 🤖): Bump (main) (2026-04-20) by @github-actions[bot]
- fix(ci): wrap uv install in retry block (#4387) by @ko3n1g
- Call save_checkpoint_and_time() when saving checkpoint and compute elapsed duration for saving checkpoint before logging timer (#4263) by @awsankur
- refactor(tests): move NCCL env vars from docker launcher to shell training script (#4390) by @ko3n1g
- Remove packed_attention_mask unused parameter (#3859) by @tdene
- Second batch of audit edits (#4115) by @megnvidia
- Replace rampup batch size scheduler with custom step batch size schedules (#3779) by @mkhona-nvidia
- revert: replace rampup batch size scheduler with custom step batch size schedules (#3779) (#4404) by @ko3n1g
- Megatron-FSDP: log mcore detection only after imports succeed (#4400) by @wujingyue
- Replace rampup batch size scheduler with custom step batch size schedules (#4411) by @deepakn94
- ci(gb200): re-enable tunable_overlap 1-node mr-github test (#4405) by @ko3n1g
- Fix local docs building (#4416) by @Phlip79
- RL: Onload optimizer after logprobs computation (#4235) by @tdene
- Add RL token throughput and packing metrics (#3877) by @tdene
- ci: remove publish:merge_into_dev job (#4421) by @ko3n1g
- docs: add data loading best practices for large-scale training (#4236) by @sbhavani
- Fix: Auto enable manual registration and enhance the documentation (#3295) by @youngeunkwon0405
- Fix nvtx_decorator to check _nvtx_enabled at call time (#4184) by @minitu
- fix merges_file typo in megatron_hf_tokenizer (#4392) by @chelseajohn
- Enable NullTokenizer for pretraining to reduce I/O access (#4057) by @asolergi-nv
- docs: Add SECURITY.md (#4431) by @chtruong814
- Mamba inference opt (#4414) by @wdykas
- DDP refactoring: Extract parameter layout computation into optimizer classmethod (#3812) by @deepakn94
- Update PR template with explicit request for issue (#4409) by @Phlip79
- Misc inference fixes (#4397) by @sidsingh-nvidia
- Rename Mamba to Hybrid outside megatron/core (#4159) by @Phlip79
- Include mtp layers in token per expert logging (#4412) by @Mellonta
- fix: NVRx async compatibility and defer resiliency import (#4420) by @sbak5
- ci: add base_sha to codecov/codecov-action upload step (#4445) by @ko3n1g
- fix(checkpoint_inspector): allow empty --param-to-param-group-map-json (#4403) by @DAISY-gh
- Add the YARN support for hybrid_model (#4244) by @guihong-nv
- [training migration] Add container class for config dataclasses (#4227) by @maanug-nv
- Inference: Fix broken functional tests on gitlab (#4454) by @sidsingh-nvidia
- SafeUnpickler class for safe pickle usage (#4319) by @dimapihtar
- get rid of weights_only=False (#4434) by @dimapihtar
- Inference | Per-block MoE routing storage for prefix caching (#4301) by @lmcafee-nvidia
- Add troubleshooting tip for 'access forbidden' (#4449) by @balasaajay
- Fix checkpoint loading with rerun state machine (#4448) by @YangFei1990
- Add misc CUDA graph sugar to CudaGraphManager (#4425) by @tdene
- Inference: Add the embedding and output layer in the full_iteration_inference cuda graph scope for hybrid models (#4440) by @sidsingh-nvidia
- Important bugfixes in local CG implementation that were leading to loss curve gaps for latent MoE models (#4433) by @jiemingz
- fix: Replace polynomial rolling hash with SHA-256 for prefix caching (#4158) by @lmcafee-nvidia
- feat(ckpt): expose validate_access_integrity knob on dist-ckpt load (#4422) by @asolergi-nv
- Fix multivalidation (#3388) by @RPrenger
- Add missing knob for reduce_scatter_with_fp32_accumulation (#4410) by @WanZzzzzz
- Enable CUDA graphs for MTP inference (#4260) by @santhnm2
- chore(beep boop 🤖): Bump (main) (2026-04-27) by @github-actions[bot]
- checkpoint integrity verification (#4305) by @dimapihtar
- Fix cache gating (#4455) by @wdykas
- [Main] Fix FusedAdam.use_decoupled_grad mis-set for Megatron-FSDP. (#4427) by @cspades
- add permute fusion into hybrid ep (#4089) by @Autumn1998
- Add ColocatedBridgeCommunicator for heterogeneous TP/DP MIMO training (NMFW-17) (#4368) by @yashaswikarnati
- Fix incorrect bias display in extra_repr of Column/RowParallelLinear (#4330) by @HelloWorldBeginner
- Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining (#4276) by @joapolarbear
- ci: Fix event name reference in CI workflow condition for merge group (#4462) by @balasaajay
- Add manual sync workflow from main to dev (#4165) by @Phlip79
- fix: handle list-format quant_cfg from ModelOpt PR #1094 (#4187) by @ChenhanYu
- ci: also add Run MBridge tests label in nightly sync workflow (#4499) by @Phlip79
- [training migration] Add serialization features to config container (#4309) by @maanug-nv
- Fix conflict with inference graphs (#4504) by @tdene
- Add tools/prepare_cache.py for offline GPT dataset cache preparation (#4080) by @asolergi-nv
- [build] fix: move mamba-ssm and causal-conv1d to optional [ssm] extra (#4517) by @ko3n1g
- mamba: avoid redundant HBM reloads in causal_conv1d_update shift loop (#4460) by @wdykas
- Standardize misc graph interface (#4485) by @tdene
- Fix inference graph override in RL flow (#4323) by @tdene
- Unify and refactor Megatron-FSDP documentation. (#4418) by @cspades
- Skills for running unit tests and working with slurm (#4502) by @yashaswikarnati
- Revert "ci: add base_sha to codecov/codecov-action upload step (#4445)" (#4526) by @chtruong814
- Reorganize order of operations in inference context and text generation controller (#2929) by @tdene
- ci: Update CI workflow conditions to include merge group handling (#4532) by @balasaajay
- ci: add base_sha to codecov/codecov-action upload step (#4540) by @chtruong814
- Fix release tests: remove --global-batch-size conflicting with --step-batch-size-schedule (#4545) by @deepakn94
- docs: use @file-path notation for file references in skills (#4542) by @ko3n1g
- Support YAML quant recipe in PTQ and remove first/last layer modifier code (#4503) by @jenchen13
- Avoid nsys profile crash with CUDA graphs (#4541) by @tdene
- fix(ci): add retry with backoff to approve-test-queue bot (#4559) by @ko3n1g
- New allgathervdispatcher for inference and simplify old dispatcher. (#4258) by @sidsingh-nvidia
- Fixes for modelopt examples and SFTTokenizer for transformers v5 (#4450) by @jenchen13
- Adding code for Flextron (#4429) by @sheliang-nv
- Fix partial cudagraphs + HybridEP not properly triggering DDP hook (#4500) by @jiemingz
- Ignore pytorch link anchors (#4582) by @maanug-nv
- MoE dispatcher fixes: size NVLS dispatcher buffers from actual tensor sizes (#4576) by @mathemakitten
- Finalize all builders in preprocess_data, not just the last key (#4573) by @sayalinvidia
- refactor(skills): add when_to_use frontmatter, split ci-test-system, enforce skill workflow (#4574) by @ko3n1g
- Make last_token_logits graphable (#4552) by @tdene
- fix(ci): correct off-by-one in total_steps_evaluated formula (#4591) by @ko3n1g
- Add fault injection support via nvidia_resiliency_ext. (#4370) by @hexinw-nvidia
- Guard vocab reduce_scatter on TP > 1 (#4565) by @mathemakitten
- Move inference context bookkeeping to CPU with ContextGPUView (#4306) by @lmcafee-nvidia
- Enable InJob restart on failures. (#4594) by @hexinw-nvidia
- Enable shared expert overlap with allgatherv in inference (#4570) by @sidsingh-nvidia
- Add vLLM grouped gemm backend for MoE inference (#4566) by @santhnm2
- Move KD teacher loading to after Float16Module (#4394) by @AAnoosheh
- ci: update gpt3_7b_tp4_pp1_memory_speed gb200 golden values (#4601) by @ko3n1g
- Fix inference unit test (#4589) by @maanug-nv
- Checkpoint conversion between GPT_model and Hybrid_model (#4482) by @guihong-nv
- ci: add cadence input for test filtering in CI workflows (#4561) by @balasaajay
- Handle SSM sharded tensor merge OOM with CPU fallback (#4442) by @returnL
- Fix `mtp_use_repeated_layer` behavior for GPT models (#3965) by @rkarimimahab
- FlashInfer sampling (#2456) by @tdene
- Fix main2dev workflow (#4610) by @Phlip79
- Add logic to enable chunked MLP during training (#3656) by @pengdurice
- Inference bug-fixes: Re-enable EP syncs for the legacy A2A dispatcher and re-simplify ep_sync accidentally reverted by #4306 (#4587) by @sidsingh-nvidia
- Remove invalid `timeout` argument for dist.barrier (#4512) by @zhaoyinglia
- Fix buffers in refit (#4580) by @wdykas
- Named validation sets (#4578) by @RPrenger
- Fix Hang in tests (#4575) by @wdykas
- Single commit for main2dev nightly (#4614) by @Phlip79
- convert tokenizer args to config (#4406) by @dimapihtar
- Siddharth/fix ep sync (#4607) by @wdykas
- mmiranda working on another set of broken links (#4534) by @megnvidia
- Fix gradient corruption with layerwise param all-gather overlap (#4609) by @deepakn94
- test: mark TestFusedApplyMLARope::test_forward_backward_for_q flaky_in_dev (#4639) by @ko3n1g
- remove legacy GPT code (#4322) by @dimapihtar
- ci: introduce L-tier scope vocabulary via parser (#4625) by @balasaajay
- Inference: Tune vLLM grouped gemm, moe_sum kernel, and enable shared expert overlap in latent MoEs (#4603) by @sidsingh-nvidia
- Fix crash involving evicted requests and tpot (#4645) by @tdene
- remove legacy transformer and modules (#4207) by @dimapihtar
- chore: Update Docker image version to 26.04-py3 (#4611) by @balasaajay
- Propagate errors for failed inference requests (#4679) by @mathemakitten
- Inference: Cache input + position ID views (#4634) by @mathemakitten
- ci: Update Gitlab base image to 26.04 pytorch (#4688) by @chtruong814
- Add periodic GPU sniff tests to detect hardware stragglers (#4662) by @deepakn94
- ci: Bump GHA versions (#4606) by @chtruong814
- build: widen flashinfer-python pin to <0.7.0 (#4700) by @ko3n1g
- Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094) by @Shreyas-S-809
- Switch oncall (#4702) by @janEbert
- Update golden values for various functional tests (#4703) by @balasaajay
- chore: Update golden values for various functional tests (#4706) by @balasaajay
- build: upgrade mamba-ssm to 2.3.2.post1, causal-conv1d to 1.6.2.post1 (#4712) by @ko3n1g
- ci: replace uuidgen with /proc/sys/kernel/random/uuid (#4714) by @ko3n1g
- chore(codeowners): add megatron/inference/ ownership (#4704) by @ko3n1g
- Create a Protocol for the MLP layer of TransformerLayer (#3435) by @nschank
- Revert "Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094)" (#4718) by @ko3n1g
- chore(beep boop 🤖): Bump (main) (2026-05-11) by @github-actions[bot]
- Add Python-side guardrail for DeepEP IB limits (#4719) by @janEbert
- ci: revert bad uv.lock bump and label future bumps with `Run functional tests` (#4730) by @ko3n1g
- [ci] fix: treat cancelled run-main-script step as failure (#4727) by @ko3n1g
- ci: Major refactor of release-workflows (#4602) by @ko3n1g
- fix(fsdp): recognize legacy GDN TP metadata (#4664) by @Glitchfix
- build(deps): bump nvidia-modelopt to 0.43 (#4723) by @ko3n1g
- Fixes for Nemotron3 Super release test config (#4544) by @maanug-nv
- feat(gpt): add output postprocess hook (#4686) by @Glitchfix
- Add bump-base-image skill and update golden value comparison (#4733) by @balasaajay
- Guard omegaconf imports (#4685) by @maanug-nv
- Fix a regression introduced by #4625 for nightly runs (#4734) by @balasaajay
- Add LLaVA audio (sound) model support (#4402) by @cuichenx
- Support transformers 5.x.x for text generation server (#4732) by @tdene
- Update transformer-engine dependency to version 2.15.0 (#4682) by @balasaajay
- Increase CG cover from max_requests to max_tokens (#4214) by @tdene
- fully remove legacy code (#4759) by @dimapihtar
- fix legacy torch save when tensor_model_parallel_size > expert_model_parallel_size * expert_tensor_parallel_size (#4678) by @dimapihtar
- Wire --rl-inference-parsers into MRL (#4768) by @tdene
- Integrate LayerWiseDistributedOptimizer with DDP buffer infrastructure (#4509) by @deepakn94
- [training migration] Migrate mamba builder (#4550) by @maanug-nv
- NCCL UB fix: reduce memory cost and correctly deregister NCCL mem pool (#4492) by @xrennvidia
- fix: use no_mask in local ViT layer spec (#4395) by @Phlip79
- refit clean up and refactoring (#4762) by @wdykas
- Make weight and optimizer memory estimation take into account expert parallelism correctly (#4687) by @YangFei1990
- Support recomputing in HybridModel (#4496) by @xuantengh
- One single flag that determines if we are in inference (#4617) by @tdene
- [main] feat(moe): Support packed sequence for gated delta net (GDN) (#2645) by @yuzhongw-nvidia
- remove dead manual_release_grads code path in 1F1B overlap schedule (#4511) by @Wohox
- Fix recompute checkpointing + training CGs (#3919) by @tdene
- Use Protocols to type-check linear_proj submodules of Attention (#3434) by @nschank
- fix tokenizers in respect to newer transformers (#4608) by @dimapihtar
- Bump nvidia-modelopt>=0.44.0 (#4803) by @kevalmorabia97
- Update owners (#4794) by @Phlip79
- ci: Update workflow to use same commit for building docker image and running tests (#4787) by @balasaajay
- chore: Update nightly tests golden values (#4805) by @balasaajay
- Inference: Optimize Prefill Engine Steps for Nemotron (#4764) by @sidsingh-nvidia
- Disable MSC by default; opt in via --enable-msc (#4629) by @asolergi-nv
- Strengthen test_checkpoint to verify distributed checkpoint behavior (#4711) by @lichenlu
- Combine GEMM + SwiGLU fused MLP PRs (3890, 4071, 4095, 4219, 4311, 4324) → main (#4636) by @Connor-XY
- additional tests for nvrx (#4522) by @dimapihtar
- [fix] Use MSC for checking checkpoint existence (#4251) by @pavelgein
- ci: tolerate git-gc race in /home/runner chown after checkout (#4808) by @balasaajay
- Reorder mtp_post_process after attention backward in 1F1B schedule plan (#4695) by @gdengk
- [Main][feat] Support A2A Overlap for Megatron-FSDP (#3797) by @Wohox
- add is_torch_min_version in fsdp src (#4812) by @xrennvidia
- Add high-priority A2A stream and HybridEP preprocessing SMs (#4694) by @gdengk
- Refactor CUDA graph API: decompose cuda_graph_scope into full_iteration impl, inference scope, and per-layer capture modules (#4292) by @buptzyb
- chore(beep boop 🤖): Bump (main) (2026-05-18) by @github-actions[bot]
- Tokenizers updates (#4780) by @dimapihtar
- Fix no nvrx tests (#4847) by @dimapihtar
- Thread custom process groups through MoE grad finalization (#4782) by @yashaswikarnati
- Fix unit tests (#4689) by @shanmugamr1992
- Tests/dynamic inference functional coverage (#4761) by @shanmugamr1992
- Fix oncall references (#4722) by @janEbert
- Update golden values for nightly functional tests (#4850) by @balasaajay
- fix(inference): size DynamicInferenceContext KV layer_map for non-uniform PP (#4775) by @athitten
- Modernize post-training modelopt example scripts (#4807) by @kevalmorabia97
- test: add inference performance test harness for GPT 583M, hybrid 2B,… (#4806) by @shanmugamr1992
- Fix tokenizers bug in nightly (#4833) by @Phlip79
- ci: Prevent shell trace in parts of _run_training.sh (#4884) by @chtruong814
- Ignore Vim swap files (#4860) by @wujingyue
- M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs (#4181) by @shjwudp
- MimoOptimizer: fix distributed checkpoint save and load for non-colocated MIMO (#4801) by @kamran-nvidia
- Route non-Muon params through DistributedOptimizer (#4771) by @deepakn94
- ci: Gate optional CI jobs with repository variables (#4907) by @chtruong814
- Allow optimizer CG to share the same pool as full-iter CG (#4698) by @nanz-nv
- Use sharded_state_dict_default in MLP.sharded_state_dict (#4693) by @gdengk
- Fix MTP recompute crash with packed sequences (#4593) by @BestJuly
- Update PR template (#4904) by @Phlip79
- ci: Update perf test to output logs for tests to pass (#4906) by @chtruong814
- Also persist asymmetrical units for the MXFP8 transpose weight buffer. (#4852) by @cspades
- fix no_shard training convergency and add unittest for no_shard (#3754) by @wplf
- Move policy epoch stats to the message object (#4533) by @ArEsKay3
- Add a knob to throttle the max allowed inflight offload in fine grained offloading (#4692) by @nanz-nv
- refactor(data): consolidate get_batch and enable PP for SFT THD (#4103) by @asolergi-nv
- Allow YAML MoE configs to use model specs (#4822) by @chawkins-nvidia
- Move bert and t5 pretrain files (#4820) by @Phlip79
- Paged Stashing (#4247) by @nanz-nv
- make FP4 param gather work with the mixed precisions in NVFP4 recipe (#4358) by @xrennvidia
- fix: Fix multi-node functional test phase sync (#4924) by @chtruong814
- Perf tests (#4917) by @shanmugamr1992
- fix(cuda_graphs): handle TE 2.15 removal of FP8GlobalStateManager.set_skip_fp8_weight_update_tensor (#4874) by @balasaajay
- Fix paged stashing test submodules lookup (#4925) by @Phlip79
- Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (#4318) (#4786) by @sraman-rgb
- Fix mxfp8 param gather numerical issue when DP overlap is off (#4800) by @WanZzzzzz
- [MXFP8/FP4-param-gather] Post processing after forced param AG in eval (#4562) by @WanZzzzzz
- ci: Update training script paths in BERT and T5 (#4939) by @balasaajay
- Various training utils (#4872) by @maanug-nv
- ci: restore perf test torchrun logs (#4951) by @chtruong814
- Fix `get_batch` return order to ignore BlendedDataset provenance fields (#4952) by @deepakn94
- test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (#4932) by @ko3n1g
- test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (#4931) by @ko3n1g
- chore(beep boop 🤖): Bump (main) (2026-05-25) by @github-actions[bot]
- Avoid offsetting functional test master port (#4973) by @chtruong814
- Fix elastification unwrap_model import (#4972) by @Devil1716
- test: re-enable paged stashing MoE tests (#4978) by @ko3n1g
- test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (#4984) by @ko3n1g
- ci: Add support for MBridge job gating based on PR labels (#4926) by @balasaajay
- test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (#4985) by @ko3n1g
- fix(tests): initialize num_microbatches calculator in vision cudagraph tests (#4986) by @ko3n1g
- ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (#4905) by @balasaajay
- Drain predecessor reduce-scatter at dispatch time (#4940) by @deepakn94
- nightly(ci): Update golden values for functional t5 tests (#4995) by @balasaajay
- chore: rotate oncall schedule by @github-actions[bot]
- [main] Refactor and Improve MoE Logginginit commit (#3431) by @yanring
- ci: validate release branch-rules (#4929) by @ko3n1g
- [Megatron-FSDP] Add conditional param.grad dereferencing logic to support full-iteration (FWD-BWD) CUDA graphability. (#4663) by @cspades
- test: restrict iter-time comparison to steady-state window (#5010) by @ko3n1g
- [fix] Release MTP assertion when EP overlap with PP=1 (#4796) by @Wohox
- fix(test): pin eval-global-batch-size on 15b gb200 release configs (#5022) by @ko3n1g
- fix(test): widen iter-time steady-state window for short tests (#5023) by @ko3n1g
- Perf fix (#4996) by @shanmugamr1992
- Add dev-feature preservation gate and change schedule (#4773) by @Phlip79
- chore(test): remove orphan nemotron3_super_release_g200 dir (#5024) by @ko3n1g
- Ignore Claude worktree directory (#5020) by @Phlip79
- Update copy-pr-bot.yaml [skip ci] by @github-actions[bot]
- ci: update CI workflow conditions for integration tests (#4658) by @balasaajay
- Add NVSkills CI request workflow (#5033) by @Phlip79
- DDP wrap pg size fixes (#5006) by @maanug-nv
- fix(layer_wise): tag MTP-stage word_embeddings as is_embedding_or_output_parameter (#5034) by @Wohox
- Move LTS dependencies from pyproject.toml to Dockerfile.ci.lts (#4877) by @balasaajay
- Use shared ModelOpt calibration loop on 0.45+ with 0.44 fallback fix (#4881) by @kevalmorabia97
- test(release): skip golden comparison on intermediate resume windows (#5040) by @ko3n1g
- [mimo] Thread position_ids through MimoModel for multimodal RoPE (#4938) by @liding-nv
- build: Switch DSv3 on H100 to HybridEP (#5039) by @ko3n1g
- Fix: Import unwrap_model from megatron.core.utils in modelopt examples (#5045) by @kevalmorabia97
- Simple and stable Inference APIs (#4697) by @YangFei1990
- ci: Add notification step for MBridge downstream test results (#5028) by @balasaajay
- Delete output tensor early (#4742) by @Phlip79
- Support ScaledSReLU in TE grouped MLP fuser (#4859) by @sraman-rgb
- Skip gradient updates when grad norm exceeds threshold (#3460) by @yfw
- Add 9 user skills (#5066) by @Phlip79
- test(nemotron): align nemotron3 super GB200 goldens with exit-interval 4768 (#5069) by @ko3n1g
- chore: Update transformer-engine dependency to version 2.16.0 (#4992) by @balasaajay
- Update energon version requirement (#4572) by @maanug-nv
- Fix test failures for new inference APIs (#5068) by @YangFei1990
- fix(ci): set PYTHONUNBUFFERED=1 in JET workload env (#5072) by @ko3n1g
- Preserve non-FSDP-unit buckets across AllGatherPipeline reset (#4717) by @wujingyue
- Add opt-in MXFP8 LM-head output projection (#4825) by @gdengk
- chore(beep boop 🤖): Bump (main) (2026-06-01) by @github-actions[bot]
- fix(ci): bound JET pipeline polling with a watchdog to prevent indefinite hangs (#5076) by @ko3n1g
- test: unmark EP A2A activation offload test flaky (#5009) by @lhb8125
- ci: prune old artifacts on cluster lustre during weekly/release runs (#5084) by @ko3n1g
- ci(test): isolate ckpt-resume tensorboard per phase (#5074) by @ko3n1g
- Change ownership groups (#5021) by @Phlip79
- test: skip mfsdp_fully_shard cases when world_size < mesh size (#4487) by @wujingyue
- fix mimo optimizer checkpoint metadata restore (#4791) by @liding-nv
- [mimo] Support bridge fan-out for variable modality tokens (#5062) by @liding-nv
- cp: Remove DeepEP hardware limit check (4846) into core_r0.18.0 (#5126) by @ko3n1g
- chore: Update transformer-engine dependency to revision 4220403 (#5112) (#5137) by @balasaajay
- cp: fix(optimizer): gate ChainedOptimizer MXFP8 defer-sync on DDP-level overlap_param_gather (4982) into core_r0.18.0 (#5146) by @ko3n1g
- cp: build: Switch DSv3 on H100 to HybridEP (5164) into core_r0.18.0 (#5165) by @ko3n1g
- chore(beep boop 🤖): Bump (core_r0.18.0) (2026-06-22) by @github-actions[bot]
- beep boop 🤖: Bumping Megatron Core to v0.18.2 [skip ci] by @github-actions[bot]
- Resetting Megatron Core version to v0.18.0 and Megatron FSDP to rc0 by @balasaajay
- make fsdp a release package by @balasaajay
- docs: Fix docs version for 0.18.0 release (#5435) by @chtruong814
- beep boop 🤖: Bumping Megatron Core to v0.18.1 [skip ci] by @github-actions[bot]
- Resetting Megatron Core and FSDP patch versions to 0 (#5437) by @balasaajay