NVIDIA/Megatron-LM core_v0.18.0 on GitHub

Changelog Details

fix(ci): replace actions/setup-python with apt-get to avoid 429 rate limits by @ko3n1g :: PR: #4072
ci: Fix package name for code-freeze workflow by @ko3n1g :: PR: #4077
chore: bump _code_freeze workflow to v0.86.0 by @ko3n1g :: PR: #4078
Fix checkpoint inspector by @janEbert :: PR: #4079
Update docs to conform to NVIDIA style guides by @megnvidia :: PR: #4068
Miscellaneous inference fixes by @santhnm2 :: PR: #4030
fix fine_grained_callables with fused rmsnorm residual by @CarlosGomes98 :: PR: #4026
[Main][feat] Support overlapping A2A Combine backprop with wgrad GEMM by @Wohox :: PR: #3795
Modify mfsdp default data-parallel-sharding-strategy by @wplf :: PR: #3691
Fix fsdp_dtensor conversion for pretrained-only checkpoints by @DAISY-gh :: PR: #3912
Guard NVshmem issues by @wdykas :: PR: #4093
m-fsdp: wire use_precision_aware_optimizer from ddp_config to ParamAn… by @rapatel :: PR: #4024
Megatron-FSDP: Add MXFP8 transpose helper buffer for Hybrid FSDP by @shjwudp :: PR: #3918
feat(fsdp): use TE general_gemm for mixed-precision wgrad in FSDP path by @Victarry :: PR: #3822
Megatron-FSDP: Fix insufficient double buffers during gradient reduce by @shjwudp :: PR: #4054
Fix M-FSDP MXFP8-related bugs by @shjwudp :: PR: #3991
Megatron-FSDP: Make _pre_forward_param_unshard and _register_post_backward_hook formal by @shjwudp :: PR: #4029
FIX: Use decoupled gradients for precision-aware M-FSDP grad norm by @XueSongTap :: PR: #3746
Align chat completions endpoint with vLLM by @santhnm2 :: PR: #4063
[Megatron-FSDP] Fix compatibility with frozen parameters and add unit tests by @shjwudp :: PR: #3287
[M-FSDP] Refactor uneven dtensor to full tensor and add UT by @shjwudp :: PR: #3190
Add agent instruction files by @Phlip79 :: PR: #4102
Bump eopt version by @skyw :: PR: #4100
Refactor emerging optimizer integration by @skyw :: PR: #4113
Fix over provisioning of Mamba state memory when max_requests is set by @santhnm2 :: PR: #4114
base strategy simplification by @dimapihtar :: PR: #4001
add support for DCP and FSDP async save by @dimapihtar :: PR: #4027
Add more emerging optimizers (#3907) by @skyw :: PR: #4119
Fix FSDP checkpoint conversion and loading for Qwen3.5-VL by @DAISY-gh :: PR: #3936
docs: update mcore optimizer docstrings to google style by @Akshat8510 :: PR: #2799
Set tensor-parallel attributes irrespective of perform_initialization by @ilml :: PR: #4084
docs: add developer-guide skill with CI/CD and failure navigation guidance by @ko3n1g :: PR: #4035
chore: Move skills by @ko3n1g :: PR: #4136
ci: Let Claude react to comment by @ko3n1g :: PR: #4135
Nemotron3 Super GB200 release config by @maanug-nv :: PR: #4118
Enable CUDA graph for ADAM optimizer by @vasunvidia :: PR: #3429
Claude review should recommend testing by @Phlip79 :: PR: #4137
cleanup: remove unused scatter_gather_tensors_in_pipeline argument by @Phlip79 :: PR: #4140
fix: Remove fail-fast (-x) and guard distributed teardown against deadlock by @ko3n1g :: PR: #4139
Claude: add respond-to-issue skill by @Phlip79 :: PR: #4141
Fix muon getter backward compatibility by @skyw :: PR: #4157
Audit of user guide by @megnvidia :: PR: #4098
Fix RerunStateMachine crash (TypeError: 'NoneType' object is not subscriptable) by not saving a checkpoint after a transient NaN / Inf by @yezhengmao1 :: PR: #3981
Preserve type of decorated methods/classes by @nschank :: PR: #4062
update muon test case to use new interface by @skyw :: PR: #4163
[M-FSDP] Fix Tensor Parallel mode detection by @shjwudp :: PR: #3191
fix: remove weights_only=False for multimodal example by @faradawn :: PR: #4104
Cudagraphs: Fix sequence packing segfault more generally by @mathemakitten :: PR: #4162
Make MTP work with materialize_only_last_token_logits by @santhnm2 :: PR: #4166
Add unit test for Mamba EP inference (eager fallback with mixed CUDA graphs) by @santhnm2 :: PR: #4085
update docs with respect to async changes by @dimapihtar :: PR: #4177
update checkpointing docs with respect to async changes by @dimapihtar :: PR: #4208
chore: improve build-and-test skill with trigger rules and dependency workflow by @ko3n1g :: PR: #4199
Fix layerwise optimizer with expt_dp_size=1 and contention with element-wise distributed optimizer by @skyw :: PR: #4138
ci: add --cluster-a100/h100/gb200 args to trigger_internal_ci.py by @ko3n1g :: PR: #4195
ci: Update golden values for nightly tests by @chtruong814 :: PR: #4215
rename async_allgather to overlap_param_gather by @skyw :: PR: #4217
Fix Slack sync for users with GitHub email privacy enabled by @Phlip79 :: PR: #4220
Miscellaneous MTP inference fixes by @santhnm2 :: PR: #4191
Move inference guards out of arguments.py by @mathemakitten :: PR: #4210
Fix: enable fine-grained activation offloading for Mamba model. by @fanshiqing :: PR: #4173
bump NVRx by @dimapihtar :: PR: #4178
Update tokenizer args for Nemotron3 release config by @maanug-nv :: PR: #4239
build: add dynamic git-versioning and drop rc0 pre-release tag by @ko3n1g :: PR: #4212
Fix unnecessary permute padding for non-quantized MoE dispatch by @xiaoxi-wangfj :: PR: #4038
Fix split state dict main by @kunlunl :: PR: #3676
Add /split-pr Claude Code command for splitting PRs by CODEOWNERS by @Phlip79 :: PR: #4160
Enable FP8 DPA for MXFP8 recipe by @vasunvidia :: PR: #4066
Enable AG/RS overlap with explicit process group passing by @jeffnvidia :: PR: #3249
Enable cpu_offloading with Full iteration CUDA graph by @vasunvidia :: PR: #3969
Fix TransformerConfig validation for mixed dense/MoE upcycling by @rkteddy :: PR: #3647
Remove cross-rank synchronization during checkpoint load & deprecate torch.distributed.checkpoint.state_dict_loader.load_state_dict by @asolergi-nv :: PR: #2864
Fix incorrectly set decoupled_grad and DistOpt mechanics for MFSDP. by @cspades :: PR: #4133
Refit Miscellaneous by @wdykas :: PR: #3973
Add conditions_embeddings argument to TransformerBlock, TransformerLayer for DiT (diffusion transformer) by @huvunvidia :: PR: #4134
Fix build_sequences_per_dataset output path arg usage by @DhineshPonnarasan :: PR: #4144
ci: Flush pending CUDA work before the barrier in destroy_model_parallel by @chtruong814 :: PR: #4259
Update oncall schedule by @Phlip79 :: PR: #4257
docs(moe): Update MoE README by @sbhavani :: PR: #3664
Revert "Add conditions_embeddings argument to TransformerBlock, Trans… by @ko3n1g :: PR: #4270
reduce the number of shared expert streams by @yangbofun :: PR: #3752
remove legacy Bert code by @dimapihtar :: PR: #4204
[Main] Feat(moe): Gated delta net context parallel (CP) by @yuzhongw-nvidia :: PR: #2642
remove t5 legacy code by @dimapihtar :: PR: #4203
fix: handle list-typed process groups in ProcessGroupCollection.__repr__ by @cluster2600 :: PR: #3753
Fix Context Parallelism documentation link by @liangxs :: PR: #4149
[MLA] fix: Pad V when Q/V head dims differ for THD by @HollowMan6 :: PR: #3003
fix(megatron-fsdp): build expt_device_mesh only for MoE models by @xuwchen :: PR: #3831
Allow the evaluation batch size to differ from the training batch size by @michal2409 :: PR: #4014
Add @NVIDIA/transformer review group to megatron/core/transformer/ by @Phlip79 :: PR: #4281
Reset AG_pipeline bucket status after validation step. by @vasunvidia :: PR: #3155
Enhance and fix NVTX for training by @yaox12 :: PR: #3642
NVFP4 native weights for DDP by @WanZzzzzz :: PR: #4005
Remove unnecessary arguments for layerwise distributed optimizer by @FDecaYed :: PR: #4272
reuse grad buffer for layer-wise param allgather by @FDecaYed :: PR: #3751
feat(ci): add strict review mode to Claude review workflow by @Victarry :: PR: #4197
Fix stale approvals by @Phlip79 :: PR: #4280
[MoE] Add a new score function to the router by @yaox12 :: PR: #3673
[MoE] Improvement of shared expert overlap, support shared expert overlap for FlexDispatcher by @Victarry :: PR: #2207
build: bump DeepEP to 34152ae by @ko3n1g :: PR: #4228
ci: mark test_fused_indexer_loss_gradient_tp_consistency as flaky_in_dev by @ko3n1g :: PR: #4299
Fix typo in PR4133. by @cspades :: PR: #4277
ci: add retry loop to apt-get update to handle transient mirror sync failures by @ko3n1g :: PR: #4209
fix: enforce correct pass thresholds for deterministic and approximate tests by @ko3n1g :: PR: #4238
remove legacy biencoder and realm models by @dimapihtar :: PR: #4205
ci: add configurable launcher support for functional tests (ft_launcher / torchrun) by @ko3n1g :: PR: #4298
chore: document --target main for local Docker builds by @ko3n1g :: PR: #4307
Extract args init to launch scripts by @maanug-nv :: PR: #4225
[Main] Fix TE version check for retain_pinned_cpu_buffers in cpu offload by @BestJuly :: PR: #4267
Fix documented shape by @janEbert :: PR: #3486
ci: add sync-skills workflow, rename CLAUDE.md → AGENTS.md, move .claude/skills → skills/ by @ko3n1g :: PR: #4303
Get device correctly when module returns a dict instead of individual tensor by @shifangx :: PR: #4265
remove vision legacy code by @dimapihtar :: PR: #4202
feat: long convergence resiliency for release tests by @ko3n1g :: PR: #4335
ci(action): improve GitHub Actions output UX by @ko3n1g :: PR: #4337
build: bump TransformerEngine to release_v2.14 by @ko3n1g :: PR: #4331
feat: add create-issue skill by @ko3n1g :: PR: #4338
M4 leftover for TE cuda graph by @shifangx :: PR: #3137
fix: wait for async P2P send before deallocating output tensor by @ZhiyuLi-Nvidia :: PR: #4047
ci(gb200): add 1-node mr-github functional test variants by @ko3n1g :: PR: #4334
Fix potential coredump issue that occurs when saving a checkpoint by @ezioliao :: PR: #1871
docs: bump versions1.json to 0.17.0 (latest) by @ko3n1g :: PR: #4360
Port DeepSeek Sparse Attention to MambaModel by @janEbert :: PR: #3553
Add tables and histogram for RL staleness by @tdene :: PR: #4097
[docs] ci: use parent-relative json_url for version picker by @ko3n1g :: PR: #4367
Fix bug with non-partial rollouts by @tdene :: PR: #3964
Add QK layernorm support for dot-product attention in MambaModel by @Phlip79 :: PR: #4067
Docs: improve docstrings and comments in example training loop by @DhineshPonnarasan :: PR: #4041
feat(ckpt): add --async-ckpt-use-cpu-shm argument by @sbak5 :: PR: #4355
cp: Fix UT timeout (#4310) by @chtruong814 :: PR: #4373
Fix RL reward due to stop token by @tdene :: PR: #4096
FA4 Inference by @wdykas :: PR: #4186
Make param_index_map always use unpacked (full numel) offsets by @deepakn94 :: PR: #4328
Add activation logging and tokens per expert logging by @Mellonta :: PR: #3842
Fix RL to once again work with --skip-train by @tdene :: PR: #4249
Fix Megatron initialization with extra_args_provider by @santhnm2 :: PR: #4327
Rename MambaModel/MambaStack to HybridModel/HybridStack by @Phlip79 :: PR: #4099
fix(ci): wrap uv install in retry block by @ko3n1g :: PR: #4387
Call save_checkpoint_and_time() when saving checkpoint and compute elapsed duration for saving checkpoint before logging timer by @awsankur :: PR: #4263
refactor(tests): move NCCL env vars from docker launcher to shell training script by @ko3n1g :: PR: #4390
Remove packed_attention_mask unused parameter by @tdene :: PR: #3859
Second batch of audit edits by @megnvidia :: PR: #4115
revert: replace rampup batch size scheduler with custom step batch size schedules (#3779) by @ko3n1g :: PR: #4404
Replace rampup batch size scheduler with custom step batch size schedules by @deepakn94 :: PR: #4411
Megatron-FSDP: log mcore detection only after imports succeed by @wujingyue :: PR: #4400
ci(gb200): re-enable tunable_overlap 1-node mr-github test by @ko3n1g :: PR: #4405
Fix local docs building by @Phlip79 :: PR: #4416
RL: Onload optimizer after logprobs computation by @tdene :: PR: #4235
Add RL token throughput and packing metrics by @tdene :: PR: #3877
ci: remove publish:merge_into_dev job by @ko3n1g :: PR: #4421
docs: add data loading best practices for large-scale training by @sbhavani :: PR: #4236
Fix: Auto enable manual registration and enhance the documentation by @youngeunkwon0405 :: PR: #3295
Fix nvtx_decorator to check _nvtx_enabled at call time by @minitu :: PR: #4184
fix merges_file typo in megatron_hf_tokenizer by @chelseajohn :: PR: #4392
Enable NullTokenizer for pretraining to reduce I/O access by @asolergi-nv :: PR: #4057
docs: Add SECURITY.md by @chtruong814 :: PR: #4431
Mamba inference opt by @wdykas :: PR: #4414
DDP refactoring: Extract parameter layout computation into optimizer classmethod by @deepakn94 :: PR: #3812
Update PR template with explicit request for issue by @Phlip79 :: PR: #4409
Misc inference fixes by @sidsingh-nvidia :: PR: #4397
Rename Mamba to Hybrid outside megatron/core by @Phlip79 :: PR: #4159
Include mtp layers in token per expert logging by @Mellonta :: PR: #4412
fix: NVRx async compatibility and defer resiliency import by @sbak5 :: PR: #4420
fix(checkpoint_inspector): allow empty --param-to-param-group-map-json by @DAISY-gh :: PR: #4403
Add the YARN support for hybrid_model by @guihong-nv :: PR: #4244
[training migration] Add container class for config dataclasses by @maanug-nv :: PR: #4227
Inference: Fix broken functional tests on gitlab by @sidsingh-nvidia :: PR: #4454
SafeUnpickler class for safe pickle usage by @dimapihtar :: PR: #4319
get rid of weights_only=False by @dimapihtar :: PR: #4434
Inference | Per-block MoE routing storage for prefix caching by @lmcafee-nvidia :: PR: #4301
Add troubleshooting tip for 'access forbidden' by @balasaajay :: PR: #4449
Fix checkpoint loading with rerun state machine by @YangFei1990 :: PR: #4448
Add misc CUDA graph sugar to CudaGraphManager by @tdene :: PR: #4425
Inference: Add the embedding and output layer in the full_iteration_inference cuda graph scope for hybrid models by @sidsingh-nvidia :: PR: #4440
Important bugfixes in local CG implementation that were leading to loss curve gaps for latent MoE models by @jiemingz :: PR: #4433
fix: Replace polynomial rolling hash with SHA-256 for prefix caching by @lmcafee-nvidia :: PR: #4158
feat(ckpt): expose validate_access_integrity knob on dist-ckpt load by @asolergi-nv :: PR: #4422
Fix multivalidation by @RPrenger :: PR: #3388
Add missing knob for reduce_scatter_with_fp32_accumulation by @WanZzzzzz :: PR: #4410
Enable CUDA graphs for MTP inference by @santhnm2 :: PR: #4260
checkpoint integrity verification by @dimapihtar :: PR: #4305
Fix cache gating by @wdykas :: PR: #4455
[Main] Fix FusedAdam.use_decoupled_grad mis-set for Megatron-FSDP. by @cspades :: PR: #4427
add permute fusion into hybrid ep by @Autumn1998 :: PR: #4089
Add ColocatedBridgeCommunicator for heterogeneous TP/DP MIMO training (NMFW-17) by @yashaswikarnati :: PR: #4368
Fix incorrect bias display in extra_repr of Column/RowParallelLinear by @HelloWorldBeginner :: PR: #4330
Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining by @joapolarbear :: PR: #4276
ci: Fix event name reference in CI workflow condition for merge group by @balasaajay :: PR: #4462
Add nightly sync workflow from main to dev by @Phlip79 :: PR: #4165
fix: handle list-format quant_cfg from ModelOpt PR #1094 by @ChenhanYu :: PR: #4187
ci: also add Run MBridge tests label in nightly sync workflow by @Phlip79 :: PR: #4499
[training migration] Add serialization features to config container by @maanug-nv :: PR: #4309
Fix conflict with inference graphs by @tdene :: PR: #4504
Add tools/prepare_cache.py for offline GPT dataset cache preparation by @asolergi-nv :: PR: #4080
[build] fix: move mamba-ssm and causal-conv1d to optional [ssm] extra by @ko3n1g :: PR: #4517
mamba: avoid redundant HBM reloads in causal_conv1d_update shift loop by @wdykas :: PR: #4460
Standardize misc graph interface by @tdene :: PR: #4485
Fix inference graph override in RL flow by @tdene :: PR: #4323
Unify and refactor Megatron-FSDP documentation. by @cspades :: PR: #4418
Revert "ci: add base_sha to codecov/codecov-action upload step (#4445)" by @chtruong814 :: PR: #4526
Skills for running unit tests and working with slurm by @yashaswikarnati :: PR: #4502
Reorganize order of operations in inference context and text generation controller by @tdene :: PR: #2929
ci: Update CI workflow conditions to include merge group handling by @balasaajay :: PR: #4532
ci: add base_sha to codecov/codecov-action upload step by @chtruong814 :: PR: #4540
Fix release tests: remove --global-batch-size conflicting with --step-batch-size-schedule by @deepakn94 :: PR: #4545
docs: use @file-path notation for file references in skills by @ko3n1g :: PR: #4542
Support YAML quant recipe in PTQ and remove first/last layer modifier code by @jenchen13 :: PR: #4503
Avoid nsys profile crash with CUDA graphs by @tdene :: PR: #4541
fix(ci): add retry with backoff to approve-test-queue bot by @ko3n1g :: PR: #4559
New allgathervdispatcher for inference and simplify old dispatcher. by @sidsingh-nvidia :: PR: #4258
Fixes for modelopt examples and SFTTokenizer for transformers v5 by @jenchen13 :: PR: #4450
Adding code for Flextron by @sheliang-nv :: PR: #4429
Fix partial cudagraphs + HybridEP not properly triggering DDP hook by @jiemingz :: PR: #4500
Ignore pytorch link anchors by @maanug-nv :: PR: #4582
MoE dispatcher fixes: size NVLS dispatcher buffers from actual tensor sizes by @mathemakitten :: PR: #4576
Finalize all builders in preprocess_data, not just the last key by @sayalinvidia :: PR: #4573
refactor(skills): add when_to_use frontmatter, split ci-test-system, enforce skill workflow by @ko3n1g :: PR: #4574
Make last_token_logits graphable by @tdene :: PR: #4552
fix(ci): correct off-by-one in total_steps_evaluated formula by @ko3n1g :: PR: #4591
Add fault injection support via nvidia_resiliency_ext. by @hexinw-nvidia :: PR: #4370
Guard vocab reduce_scatter on TP > 1 by @mathemakitten :: PR: #4565
Move inference context bookkeeping to CPU with ContextGPUView by @lmcafee-nvidia :: PR: #4306
Enable InJob restart on failures. by @hexinw-nvidia :: PR: #4594
Enable shared expert overlap with allgatherv in inference by @sidsingh-nvidia :: PR: #4570
Add vLLM grouped gemm backend for MoE inference by @santhnm2 :: PR: #4566
Move KD teacher loading to after Float16Module by @AAnoosheh :: PR: #4394
ci: update gpt3_7b_tp4_pp1_memory_speed gb200 golden values by @ko3n1g :: PR: #4601
Fix inference unit test by @maanug-nv :: PR: #4589
Checkpoint conversion between GPT_model and Hybrid_model by @guihong-nv :: PR: #4482
ci: add cadence input for test filtering in CI workflows by @balasaajay :: PR: #4561
Fix mtp_use_repeated_layer behavior for GPT models by @rkarimimahab :: PR: #3965
Handle SSM sharded tensor merge OOM with CPU fallback by @returnL :: PR: #4442
FlashInfer sampling by @tdene :: PR: #2456
Fix main2dev workflow by @Phlip79 :: PR: #4610
Add logic to enable chunked MLP during training by @pengdurice :: PR: #3656
Inference bug-fixes: Re-enable EP syncs for the legacy A2A dispatcher and re-simplify ep_sync accidentally reverted by #4306 by @sidsingh-nvidia :: PR: #4587
Remove invalid timeout argument for dist.barrier by @zhaoyinglia :: PR: #4512
Fix buffers in refit by @wdykas :: PR: #4580
Named validation sets by @RPrenger :: PR: #4578
Fix Hang in tests by @wdykas :: PR: #4575
Single commit for main2dev nightly by @Phlip79 :: PR: #4614
convert tokenizer args to config by @dimapihtar :: PR: #4406
Siddharth/fix ep sync by @wdykas :: PR: #4607
mmiranda working on another set of broken links by @megnvidia :: PR: #4534
Fix gradient corruption with layerwise param all-gather overlap by @deepakn94 :: PR: #4609
test: mark TestFusedApplyMLARope::test_forward_backward_for_q flaky_in_dev by @ko3n1g :: PR: #4639
remove legacy GPT code by @dimapihtar :: PR: #4322
ci: introduce L-tier scope vocabulary via parser by @balasaajay :: PR: #4625
Inference: Tune vLLM grouped gemm, moe_sum kernel, and enable shared expert overlap in latent MoEs by @sidsingh-nvidia :: PR: #4603
Fix crash involving evicted requests and tpot by @tdene :: PR: #4645
remove legacy transformer and modules by @dimapihtar :: PR: #4207
chore: Update Docker image version to 26.04-py3 by @balasaajay :: PR: #4611
Propagate errors for failed inference requests by @mathemakitten :: PR: #4679
Inference: Cache input + position ID views by @mathemakitten :: PR: #4634
ci: Update Gitlab base image to 26.04 pytorch by @chtruong814 :: PR: #4688
Add periodic GPU sniff tests to detect hardware stragglers by @deepakn94 :: PR: #4662
ci: Bump GHA versions by @chtruong814 :: PR: #4606
build: widen flashinfer-python pin to <0.7.0 by @ko3n1g :: PR: #4700
Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len by @Shreyas-S-809 :: PR: #4094
Switch oncall by @janEbert :: PR: #4702
Update golden values for various functional tests by @balasaajay :: PR: #4703
chore: Update golden values for various functional tests by @balasaajay :: PR: #4706
build: upgrade mamba-ssm to 2.3.2.post1, causal-conv1d to 1.6.2.post1 by @ko3n1g :: PR: #4712
ci: replace uuidgen with /proc/sys/kernel/random/uuid by @ko3n1g :: PR: #4714
chore(codeowners): add megatron/inference/ ownership by @ko3n1g :: PR: #4704
Create a Protocol for the MLP layer of TransformerLayer by @nschank :: PR: #3435
Revert "Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094)" by @ko3n1g :: PR: #4718
Add Python-side guardrail for DeepEP IB limits by @janEbert :: PR: #4719
ci: revert bad uv.lock bump and label future bumps with Run functional tests by @ko3n1g :: PR: #4730
[ci] fix: treat cancelled run-main-script step as failure by @ko3n1g :: PR: #4727
ci: Major refactor of release-workflows by @ko3n1g :: PR: #4602
build(deps): bump nvidia-modelopt to 0.43 by @ko3n1g :: PR: #4723
fix(fsdp): recognize legacy GDN TP metadata by @Glitchfix :: PR: #4664
Fixes for Nemotron3 Super release test config by @maanug-nv :: PR: #4544
feat(gpt): add output postprocess hook by @Glitchfix :: PR: #4686
Add bump-base-image skill and update golden value comparison by @balasaajay :: PR: #4733
Guard omegaconf imports by @maanug-nv :: PR: #4685
Fix a regression introduced by #4625 for nightly runs by @balasaajay :: PR: #4734
Add LLaVA audio (sound) model support by @cuichenx :: PR: #4402
Support transformers 5.x.x for text generation server by @tdene :: PR: #4732
Update transformer-engine dependency to version 2.15.0 by @balasaajay :: PR: #4682
Increase CG cover from max_requests to max_tokens by @tdene :: PR: #4214
fully remove legacy code by @dimapihtar :: PR: #4759
fix legacy torch save when tensor_model_parallel_size > expert_model_parallel_size * expert_tensor_parallel_size by @dimapihtar :: PR: #4678
Wire --rl-inference-parsers into MRL by @tdene :: PR: #4768
Integrate LayerWiseDistributedOptimizer with DDP buffer infrastructure by @deepakn94 :: PR: #4509
[training migration] Migrate mamba builder by @maanug-nv :: PR: #4550
NCCL UB fix: reduce memory cost and correctly deregister NCCL mem pool by @xrennvidia :: PR: #4492
fix: use no_mask in local ViT layer spec by @Phlip79 :: PR: #4395
refit clean up and refactoring by @wdykas :: PR: #4762
Support recomputing in HybridModel by @xuantengh :: PR: #4496
Make weight and optimizer memory estimation take into account expert parallelism correctly by @YangFei1990 :: PR: #4687
One single flag that determines if we are in inference by @tdene :: PR: #4617
[main] feat(moe): Support packed sequence for gated delta net (GDN) by @yuzhongw-nvidia :: PR: #2645
remove dead manual_release_grads code path in 1F1B overlap schedule by @Wohox :: PR: #4511
Fix recompute checkpointing + training CGs by @tdene :: PR: #3919
Use Protocols to type-check linear_proj submodules of Attention by @nschank :: PR: #3434
fix tokenizers with respect to newer transformers by @dimapihtar :: PR: #4608
Bump nvidia-modelopt>=0.44.0 by @kevalmorabia97 :: PR: #4803
Update owners by @Phlip79 :: PR: #4794
ci: Update workflow to use same commit for building docker image and running tests by @balasaajay :: PR: #4787
chore: Update nightly tests golden values by @balasaajay :: PR: #4805
Inference: Optimize Prefill Engine Steps for Nemotron by @sidsingh-nvidia :: PR: #4764
ci: tolerate git-gc race in /home/runner chown after checkout by @balasaajay :: PR: #4808
additional tests for nvrx by @dimapihtar :: PR: #4522
Disable MSC by default; opt in via --enable-msc by @asolergi-nv :: PR: #4629
Strengthen test_checkpoint to verify distributed checkpoint behavior by @lichenlu :: PR: #4711
Combine GEMM + SwiGLU fused MLP PRs (3890, 4071, 4095, 4219, 4311, 4324) → main by @Connor-XY :: PR: #4636
[fix] Use MSC for checking checkpoint existence by @pavelgein :: PR: #4251
[Main][feat] Support A2A Overlap for Megatron-FSDP by @Wohox :: PR: #3797
Reorder mtp_post_process after attention backward in 1F1B schedule plan by @gdengk :: PR: #4695
add is_torch_min_version in fsdp src by @xrennvidia :: PR: #4812
Add high-priority A2A stream and HybridEP preprocessing SMs by @gdengk :: PR: #4694
Refactor CUDA graph API: decompose cuda_graph_scope into full_iteration impl, inference scope, and per-layer capture modules by @buptzyb :: PR: #4292
Tokenizers updates by @dimapihtar :: PR: #4780
Fix no nvrx tests by @dimapihtar :: PR: #4847
Thread custom process groups through MoE grad finalization by @yashaswikarnati :: PR: #4782
Fix unit tests by @shanmugamr1992 :: PR: #4689
Tests/dynamic inference functional coverage by @shanmugamr1992 :: PR: #4761
Fix oncall references by @janEbert :: PR: #4722
Update golden values for nightly functional tests by @balasaajay :: PR: #4850
fix(inference): size DynamicInferenceContext KV layer_map for non-uniform PP by @athitten :: PR: #4775
Modernize post-training modelopt example scripts by @kevalmorabia97 :: PR: #4807
test: add inference performance test harness for GPT 583M, hybrid 2B,… by @shanmugamr1992 :: PR: #4806
ci: Gate optional CI jobs with repository variables by @chtruong814 :: PR: #4907
Fix tokenizers bug in nightly by @Phlip79 :: PR: #4833
ci: Prevent shell trace in parts of _run_training.sh by @chtruong814 :: PR: #4884
Ignore Vim swap files by @wujingyue :: PR: #4860
M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs by @shjwudp :: PR: #4181
MimoOptimizer: fix distributed checkpoint save and load for non-colocated MIMO by @kamran-nvidia :: PR: #4801
Route non-Muon params through DistributedOptimizer by @deepakn94 :: PR: #4771
Allow optimizer CG to share the same pool as full-iter CG by @nanz-nv :: PR: #4698
Use sharded_state_dict_default in MLP.sharded_state_dict by @gdengk :: PR: #4693
Update PR template by @Phlip79 :: PR: #4904
Fix MTP recompute crash with packed sequences by @BestJuly :: PR: #4593
ci: Update perf test to output logs for tests to pass by @chtruong814 :: PR: #4906
Also persist asymmetrical units for the MXFP8 transpose weight buffer. by @cspades :: PR: #4852
fix no_shard training convergence and add unit test for no_shard by @wplf :: PR: #3754
Move policy epoch stats to the message object by @ArEsKay3 :: PR: #4533
Add a knob to throttle the max allowed inflight offload in fine grained offloading by @nanz-nv :: PR: #4692
refactor(data): consolidate get_batch and enable PP for SFT THD by @asolergi-nv :: PR: #4103
Allow YAML MoE configs to use model specs by @chawkins-nvidia :: PR: #4822
Move bert and t5 pretrain files by @Phlip79 :: PR: #4820
Paged Stashing by @nanz-nv :: PR: #4247
make FP4 param gather work with the mixed precisions in NVFP4 recipe by @xrennvidia :: PR: #4358
fix: Fix multi-node functional test phase sync by @chtruong814 :: PR: #4924
Perf tests by @shanmugamr1992 :: PR: #4917
fix(cuda_graphs): handle TE 2.15 removal of FP8GlobalStateManager.set_skip_fp8_weight_update_tensor by @balasaajay :: PR: #4874
Fix paged stashing test submodules lookup by @Phlip79 :: PR: #4925
Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (#4318) by @sraman-rgb :: PR: #4786
Fix mxfp8 param gather numerical issue when DP overlap is off by @WanZzzzzz :: PR: #4800
[MXFP8/FP4-param-gather] Post processing after forced param AG in eval by @WanZzzzzz :: PR: #4562
ci: Update training script paths in BERT and T5 by @balasaajay :: PR: #4939
Various training utils by @maanug-nv :: PR: #4872
ci: restore perf test torchrun logs by @chtruong814 :: PR: #4951
Fix get_batch return order to ignore BlendedDataset provenance fields by @deepakn94 :: PR: #4952
test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval by @ko3n1g :: PR: #4932
test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture by @ko3n1g :: PR: #4931
Avoid offsetting functional test master port by @chtruong814 :: PR: #4973
Fix elastification unwrap_model import by @Devil1716 :: PR: #4972
test: re-enable paged stashing MoE tests by @ko3n1g :: PR: #4978
test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node by @ko3n1g :: PR: #4984
ci: Add support for MBridge job gating based on PR labels by @balasaajay :: PR: #4926
test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ by @ko3n1g :: PR: #4985
fix(tests): initialize num_microbatches calculator in vision cudagraph tests by @ko3n1g :: PR: #4986
ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies by @balasaajay :: PR: #4905
Drain predecessor reduce-scatter at dispatch time by @deepakn94 :: PR: #4940
nightly(ci): Update golden values for functional t5 tests by @balasaajay :: PR: #4995
[main] Refactor and Improve MoE Logging init commit by @yanring :: PR: #3431
ci: validate release branch-rules by @ko3n1g :: PR: #4929
[Megatron-FSDP] Add conditional param.grad dereferencing logic to support full-iteration (FWD-BWD) CUDA graphability. by @cspades :: PR: #4663
test: restrict iter-time comparison to steady-state window by @ko3n1g :: PR: #5010
fix(test): pin eval-global-batch-size on 15b gb200 release configs by @ko3n1g :: PR: #5022
[fix] Release MTP assertion when EP overlap with PP=1 by @Wohox :: PR: #4796
fix(test): widen iter-time steady-state window for short tests by @ko3n1g :: PR: #5023
Perf fix by @shanmugamr1992 :: PR: #4996
Add dev-feature preservation gate and change schedule by @Phlip79 :: PR: #4773
chore(test): remove orphan nemotron3_super_release_g200 dir by @ko3n1g :: PR: #5024
Ignore Claude worktree directory by @Phlip79 :: PR: #5020
ci: update CI workflow conditions for integration tests by @balasaajay :: PR: #4658
Add NVSkills CI request workflow by @Phlip79 :: PR: #5033
DDP wrap pg size fixes by @maanug-nv :: PR: #5006
fix(layer_wise): tag MTP-stage word_embeddings as is_embedding_or_output_parameter by @Wohox :: PR: #5034
Move LTS dependencies from pyproject.toml to Dockerfile.ci.lts by @balasaajay :: PR: #4877
Use shared ModelOpt calibration loop on 0.45+ with 0.44 fallback fix by @kevalmorabia97 :: PR: #4881
test(release): skip golden comparison on intermediate resume windows by @ko3n1g :: PR: #5040
[mimo] Thread position_ids through MimoModel for multimodal RoPE by @liding-nv :: PR: #4938
build: Switch DSv3 on H100 to HybridEP by @ko3n1g :: PR: #5039
Fix: Import unwrap_model from megatron.core.utils in modelopt examples by @kevalmorabia97 :: PR: #5045
Simple and stable Inference APIs by @YangFei1990 :: PR: #4697
ci: Add notification step for MBridge downstream test results by @balasaajay :: PR: #5028
Delete output tensor early by @Phlip79 :: PR: #4742
Support ScaledSReLU in TE grouped MLP fuser by @sraman-rgb :: PR: #4859
Skip gradient updates when grad norm exceeds threshold by @yfw :: PR: #3460
Add 9 user skills by @Phlip79 :: PR: #5066
test(nemotron): align nemotron3 super GB200 goldens with exit-interval 4768 by @ko3n1g :: PR: #5069
chore: Update transformer-engine dependency to version 2.16.0 by @balasaajay :: PR: #4992
Update energon version requirement by @maanug-nv :: PR: #4572
Fix test failures for new inference APIs by @YangFei1990 :: PR: #5068
fix(ci): set PYTHONUNBUFFERED=1 in JET workload env by @ko3n1g :: PR: #5072
Preserve non-FSDP-unit buckets across AllGatherPipeline reset by @wujingyue :: PR: #4717
Add opt-in MXFP8 LM-head output projection by @gdengk :: PR: #4825
fix(ci): bound JET pipeline polling with a watchdog to prevent indefinite hangs by @ko3n1g :: PR: #5076
ci: prune old artifacts on cluster lustre during weekly/release runs by @ko3n1g :: PR: #5084
ci(test): isolate ckpt-resume tensorboard per phase by @ko3n1g :: PR: #5074
test: unmark EP A2A activation offload test flaky by @lhb8125 :: PR: #5009
Change ownership groups by @Phlip79 :: PR: #5021
test: skip mfsdp_fully_shard cases when world_size < mesh size by @wujingyue :: PR: #4487
fix mimo optimizer checkpoint metadata restore by @liding-nv :: PR: #4791
[mimo] Support bridge fan-out for variable modality tokens by @liding-nv :: PR: #5062
cp: Remove DeepEP hardware limit check (4846) into core_r0.18.0 by @ko3n1g :: PR: #5126
chore: Update transformer-engine dependency to revision 4220403 (#5112) by @balasaajay :: PR: #5137
cp: fix(optimizer): gate ChainedOptimizer MXFP8 defer-sync on DDP-level overlap_param_gather (4982) into core_r0.18.0 by @ko3n1g :: PR: #5146
cp: build: Switch DSv3 on H100 to HybridEP (5164) into core_r0.18.0 by @ko3n1g :: PR: #5165
beep boop 🤖: Bumping Megatron Core to v0.18.2 by @nvidia-megatron-lm-release-bot[bot] :: PR: #5434
docs: Fix docs version for 0.18.0 release by @chtruong814 :: PR: #5435
beep boop 🤖: Bumping Megatron Core to v0.18.1 by @nvidia-megatron-lm-release-bot[bot] :: PR: #5436
Resetting Megatron Core and FSDP patch versions to 0 by @balasaajay :: PR: #5437
chore: Bump versions by @ko3n1g
fix(ci): replace actions/setup-python with apt-get to avoid 429 rate limits (#4072) by @ko3n1g
ci: Fix package name for code-freeze workflow (#4077) by @ko3n1g
chore: bump _code_freeze workflow to v0.86.0 (#4078) by @ko3n1g
Fix checkpoint inspector (#4079) by @janEbert
Update docs to conform to NVIDIA style guides (#4068) by @megnvidia
Miscellaneous inference fixes (#4030) by @santhnm2
fix fine_grained_callables with fused rmsnorm residual (#4026) by @CarlosGomes98
[Main][feat] Support overlapping A2A Combine backprop with wgrad GEMM (#3795) by @Wohox
Modify mfsdp default data-parallel-sharding-strategy (#3691) by @wplf
Fix fsdp_dtensor conversion for pretrained-only checkpoints (#3912) by @DAISY-gh
Guard NVshmem issues (#4093) by @wdykas
m-fsdp: wire use_precision_aware_optimizer from ddp_config to ParamAn… (#4024) by @rapatel
Megatron-FSDP: Add MXFP8 transpose helper buffer for Hybrid FSDP (#3918) by @shjwudp
feat(fsdp): use TE general_gemm for mixed-precision wgrad in FSDP path (#3822) by @Victarry
Megatron-FSDP: Make _pre_forward_param_unshard and _register_post_backward_hook formal (#4029) by @shjwudp
Megatron-FSDP: Fix insufficient double buffers during gradient reduce (#4054) by @shjwudp
Fix M-FSDP MXFP8 related BUGs (#3991) by @shjwudp
FIX: Use decoupled gradients for precision-aware M-FSDP grad norm (#3746) by @XueSongTap
[Megatron-FSDP] Fix compatibility with frozen parameters and add unit tests (#3287) by @shjwudp
Align chat completions endpoint with vLLM (#4063) by @santhnm2
[M-FSDP] Refactor uneven dtensor to full tensor and add UT (#3190) by @shjwudp
Add agent instruction files (#4102) by @Phlip79
Bump eopt version (#4100) by @skyw
Refactor emerging optimizer integration (#4113) by @skyw
Fix over provisioning of Mamba state memory when max_requests is set (#4114) by @santhnm2
base strategy simplification (#4001) by @dimapihtar
add support for DCP and FSDP async save (#4027) by @dimapihtar
Add more emerging optimizers (#3907) (#4119) by @skyw
Fix FSDP checkpoint conversion and loading for Qwen3.5-VL (#3936) by @DAISY-gh
docs: update mcore optimizer docstrings to google style (#2799) by @Akshat8510
Update oncall schedule (#4117) by @Phlip79
Set tensor-parallel attributes irrespective of perform_initialization (#4084) by @ilml
docs: add developer-guide skill with CI/CD and failure navigation guidance (#4035) by @ko3n1g
chore: Move skills (#4136) by @ko3n1g
ci: Let Claude react to comment (#4135) by @ko3n1g
Nemotron3 Super GB200 release config (#4118) by @maanug-nv
Enable CUDA graph for ADAM optimizer (#3429) by @vasunvidia
Claude review should recommend testing (#4137) by @Phlip79
cleanup: remove unused scatter_gather_tensors_in_pipeline argument (#4140) by @Phlip79
fix: Remove fail-fast (-x) and guard distributed teardown against deadlock (#4139) by @ko3n1g
chore(beep boop 🤖): Bump (main) (2026-04-06) by @github-actions[bot]
Claude: add respond-to-issue skill (#4141) by @Phlip79
Fix muon getter backward compatibility (#4157) by @skyw
Audit of user guide (#4098) by @megnvidia
Fix RerunStateMachine crash (TypeError: 'NoneType' object is not subscriptable) by not saving a checkpoint after a transient NaN / Inf (#3981) by @yezhengmao1
Preserve type of decorated methods/classes (#4062) by @nschank
update muon test case to use new interface (#4163) by @skyw
[M-FSDP] Fix Tensor Parallel mode detection (#3191) by @shjwudp
fix: remove weights_only=False for multimodal example (#4104) by @faradawn
Cudagraphs: Fix sequence packing segfault more generally (#4162) by @mathemakitten
Make MTP work with materialize_only_last_token_logits (#4166) by @santhnm2
Add unit test for Mamba EP inference (eager fallback with mixed CUDA graphs) (#4085) by @santhnm2
update docs in respect to async changes (#4177) by @dimapihtar
update checkpointing docs in respect to async changes (#4208) by @dimapihtar
chore: improve build-and-test skill with trigger rules and dependency workflow (#4199) by @ko3n1g
Fix layerwise optimizer with expt_dp_size=1 and contention with element-wise distributed optimizer (#4138) by @skyw
ci: add --cluster-a100/h100/gb200 args to trigger_internal_ci.py (#4195) by @ko3n1g
ci: Update golden values for nightly tests (#4215) by @chtruong814
rename async_allgather to overlap_param_gather (#4217) by @skyw
Fix Slack sync for users with GitHub email privacy enabled (#4220) by @Phlip79
Miscellaneous MTP inference fixes (#4191) by @santhnm2
Move inference guards out of arguments.py (#4210) by @mathemakitten
Fix: enable fine-grained activation offloading for Mamba model. (#4173) by @fanshiqing
bump NVRx (#4178) by @dimapihtar
Update tokenizer args for Nemotron3 release config (#4239) by @maanug-nv
build: add dynamic git-versioning and drop rc0 pre-release tag (#4212) by @ko3n1g
Fix unnecessary permute padding for non-quantized MoE dispatch (#4038) by @xiaoxi-wangfj
Fix split state dict main (#3676) by @kunlunl
Enable FP8 DPA for MXFP8 recipe (#4066) by @vasunvidia
Add /split-pr Claude Code command for splitting PRs by CODEOWNERS (#4160) by @Phlip79
Enable AG/RS overlap with explicit process group passing (#3249) by @jeffnvidia
Enable cpu_offloading with Full iteration CUDA graph (#3969) by @vasunvidia
Fix TransformerConfig validation for mixed dense/MoE upcycling (#3647) by @rkteddy
Remove cross-rank synchronization during checkpoint load & deprecate torch.distributed.checkpoint.state_dict_loader.load_state_dict (#2864) by @asolergi-nv
Fix incorrectly set decoupled_grad and DistOpt mechanics for MFSDP. (#4133) by @cspades
Refit Miscellaneous (#3973) by @wdykas
Add conditions_embeddings argument to TransformerBlock, TransformerLayer for DiT (diffusion transformer) (#4134) by @huvunvidia
Fix build_sequences_per_dataset output path arg usage (#4144) by @DhineshPonnarasan
ci: Flush pending CUDA work before the barrier in destroy_model_parallel (#4259) by @chtruong814
Update oncall schedule (#4257) by @Phlip79
docs(moe): Update MoE README (#3664) by @sbhavani
Revert "Add conditions_embeddings argument to TransformerBlock, Trans… (#4270) by @ko3n1g
reduce the number of shared expert streams (#3752) by @yangbofun
remove legacy Bert code (#4204) by @dimapihtar
[Main] Feat(moe): Gated delta net context parallel (CP) (#2642) by @yuzhongw-nvidia
remove t5 legacy code (#4203) by @dimapihtar
fix: handle list-typed process groups in ProcessGroupCollection.repr (#3753) by @cluster2600
Fix Context Parallelism documentation link (#4149) by @liangxs
[MLA] fix: Pad V when Q/V head dims differ for THD (#3003) by @HollowMan6
Allow the evaluation batch size to differ from the training batch size (#4014) by @michal2409
fix(megatron-fsdp): build expt_device_mesh only for MoE models (#3831) by @xuwchen
Add @NVIDIA/transformer review group to megatron/core/transformer/ (#4281) by @Phlip79
Reset AG_pipeline bucket status after validation step. (#3155) by @vasunvidia
Enhance and fix NVTX for training (#3642) by @yaox12
NVFP4 native weights for DDP (#4005) by @WanZzzzzz
Remove unnecessary arguments for layerwise distributed optimizer (#4272) by @FDecaYed
reuse grad buffer for layer-wise param allgather (#3751) by @FDecaYed
feat(ci): add strict review mode to Claude review workflow (#4197) by @Victarry
Fix stale approvals (#4280) by @Phlip79
[MoE] Add a new score function to the router (#3673) by @yaox12
[MoE] Improvement of shared expert overlap, support shared expert overlap for FlexDispatcher (#2207) by @Victarry
build: bump DeepEP to 34152ae (#4228) by @ko3n1g
ci: mark test_fused_indexer_loss_gradient_tp_consistency as flaky_in_dev (#4299) by @ko3n1g
Fix typo in PR4133. (#4277) by @cspades
ci: add retry loop to apt-get update to handle transient mirror sync failures (#4209) by @ko3n1g
remove legacy biencoder and realm models (#4205) by @dimapihtar
fix: enforce correct pass thresholds for deterministic and approximate tests (#4238) by @ko3n1g
ci: add configurable launcher support for functional tests (ft_launcher / torchrun) (#4298) by @ko3n1g
chore: document --target main for local Docker builds (#4307) by @ko3n1g
Extract args init to launch scripts (#4225) by @maanug-nv
[Main] Fix TE version check for retain_pinned_cpu_buffers in cpu offload (#4267) by @BestJuly
Fix documented shape (#3486) by @janEbert
ci: add sync-skills workflow, rename CLAUDE.md → AGENTS.md, move .claude/skills → skills/ (#4303) by @ko3n1g
chore(beep boop 🤖): symlink skills/ → .claude/skills, .agents/skills and AGENTS.md → CLAUDE.md by @github-actions[bot]
Get device correctly when module returns a dict instead of individual tensor (#4265) by @shifangx
remove vision legacy code (#4202) by @dimapihtar
feat: long convergence resiliency for release tests (#4335) by @ko3n1g
ci(action): improve GitHub Actions output UX (#4337) by @ko3n1g
build: bump TransformerEngine to release_v2.14 (#4331) by @ko3n1g
M4 leftover for TE cuda graph (#3137) by @shifangx
feat: add create-issue skill (#4338) by @ko3n1g
Set megatron-fsdp to 0.5.0 by @ko3n1g
fix: wait for async P2P send before deallocating output tensor (#4047) by @ZhiyuLi-Nvidia
ci(gb200): add 1-node mr-github functional test variants (#4334) by @ko3n1g
Fix potential coredump issue that occurs when saving a checkpoint (#1871) by @ezioliao
Port DeepSeek Sparse Attention to MambaModel (#3553) by @janEbert
docs: bump versions1.json to 0.17.0 (latest) (#4360) by @ko3n1g
Add tables and histogram for RL staleness (#4097) by @tdene
[docs] ci: use parent-relative json_url for version picker (#4367) by @ko3n1g
Fix bug with non-partial rollouts (#3964) by @tdene
Add QK layernorm support for dot-product attention in MambaModel (#4067) by @Phlip79
Docs: improve docstrings and comments in example training loop (#4041) by @DhineshPonnarasan
feat(ckpt): add --async-ckpt-use-cpu-shm argument (#4355) by @sbak5
cp: Fix UT timeout (#4310) (#4373) by @chtruong814
Fix RL reward due to stop token (#4096) by @tdene
FA4 Inference (#4186) by @wdykas
Make param_index_map always use unpacked (full numel) offsets (#4328) by @deepakn94
Add activation logging and tokens per expert logging (#3842) by @Mellonta
Fix RL to once again work with --skip-train (#4249) by @tdene
Fix Megatron initialization with extra_args_provider (#4327) by @santhnm2
Rename MambaModel/MambaStack to HybridModel/HybridStack (#4099) by @Phlip79
chore(beep boop 🤖): Bump (main) (2026-04-20) by @github-actions[bot]
fix(ci): wrap uv install in retry block (#4387) by @ko3n1g
Call save_checkpoint_and_time() when saving checkpoint and compute elapsed duration for saving checkpoint before logging timer (#4263) by @awsankur
refactor(tests): move NCCL env vars from docker launcher to shell training script (#4390) by @ko3n1g
Remove packed_attention_mask unused parameter (#3859) by @tdene
Second batch of audit edits (#4115) by @megnvidia
Replace rampup batch size scheduler with custom step batch size schedules (#3779) by @mkhona-nvidia
revert: replace rampup batch size scheduler with custom step batch size schedules (#3779) (#4404) by @ko3n1g
Megatron-FSDP: log mcore detection only after imports succeed (#4400) by @wujingyue
Replace rampup batch size scheduler with custom step batch size schedules (#4411) by @deepakn94
ci(gb200): re-enable tunable_overlap 1-node mr-github test (#4405) by @ko3n1g
Fix local docs building (#4416) by @Phlip79
RL: Onload optimizer after logprobs computation (#4235) by @tdene
Add RL token throughput and packing metrics (#3877) by @tdene
ci: remove publish:merge_into_dev job (#4421) by @ko3n1g
docs: add data loading best practices for large-scale training (#4236) by @sbhavani
Fix: Auto enable manual registration and enhance the documentation (#3295) by @youngeunkwon0405
Fix nvtx_decorator to check _nvtx_enabled at call time (#4184) by @minitu
fix merges_file typo in megatron_hf_tokenizer (#4392) by @chelseajohn
Enable NullTokenizer for pretraining to reduce I/O access (#4057) by @asolergi-nv
docs: Add SECURITY.md (#4431) by @chtruong814
Mamba inference opt (#4414) by @wdykas
DDP refactoring: Extract parameter layout computation into optimizer classmethod (#3812) by @deepakn94
Update PR template with explicit request for issue (#4409) by @Phlip79
Misc inference fixes (#4397) by @sidsingh-nvidia
Rename Mamba to Hybrid outside megatron/core (#4159) by @Phlip79
Include mtp layers in token per expert logging (#4412) by @Mellonta
fix: NVRx async compatibility and defer resiliency import (#4420) by @sbak5
ci: add base_sha to codecov/codecov-action upload step (#4445) by @ko3n1g
fix(checkpoint_inspector): allow empty --param-to-param-group-map-json (#4403) by @DAISY-gh
Add the YARN support for hybrid_model (#4244) by @guihong-nv
[training migration] Add container class for config dataclasses (#4227) by @maanug-nv
Inference: Fix broken functional tests on gitlab (#4454) by @sidsingh-nvidia
SafeUnpickler class for safe pickle usage (#4319) by @dimapihtar
get rid of weights_only=False (#4434) by @dimapihtar
Inference | Per-block MoE routing storage for prefix caching (#4301) by @lmcafee-nvidia
Add troubleshooting tip for 'access forbidden' (#4449) by @balasaajay
Fix checkpoint loading with rerun state machine (#4448) by @YangFei1990
Add misc CUDA graph sugar to CudaGraphManager (#4425) by @tdene
Inference: Add the embedding and output layer in the full_iteration_inference cuda graph scope for hybrid models (#4440) by @sidsingh-nvidia
Important bugfixes in local CG implementation that were leading to loss curve gaps for latent MoE models (#4433) by @jiemingz
fix: Replace polynomial rolling hash with SHA-256 for prefix caching (#4158) by @lmcafee-nvidia
feat(ckpt): expose validate_access_integrity knob on dist-ckpt load (#4422) by @asolergi-nv
Fix multivalidation (#3388) by @RPrenger
Add missing knob for reduce_scatter_with_fp32_accumulation (#4410) by @WanZzzzzz
Enable CUDA graphs for MTP inference (#4260) by @santhnm2
chore(beep boop 🤖): Bump (main) (2026-04-27) by @github-actions[bot]
checkpoint integrity verification (#4305) by @dimapihtar
Fix cache gating (#4455) by @wdykas
[Main] Fix FusedAdam.use_decoupled_grad mis-set for Megatron-FSDP. (#4427) by @cspades
add permute fusion into hybrid ep (#4089) by @Autumn1998
Add ColocatedBridgeCommunicator for heterogeneous TP/DP MIMO training (NMFW-17) (#4368) by @yashaswikarnati
Fix incorrect bias display in extra_repr of Column/RowParallelLinear (#4330) by @HelloWorldBeginner
Fix assertion logic in combined_1f1b_schedule_for_interleaved_pipelining (#4276) by @joapolarbear
ci: Fix event name reference in CI workflow condition for merge group (#4462) by @balasaajay
Add manual sync workflow from main to dev (#4165) by @Phlip79
fix: handle list-format quant_cfg from ModelOpt PR #1094 (#4187) by @ChenhanYu
ci: also add Run MBridge tests label in nightly sync workflow (#4499) by @Phlip79
[training migration] Add serialization features to config container (#4309) by @maanug-nv
Fix conflict with inference graphs (#4504) by @tdene
Add tools/prepare_cache.py for offline GPT dataset cache preparation (#4080) by @asolergi-nv
[build] fix: move mamba-ssm and causal-conv1d to optional [ssm] extra (#4517) by @ko3n1g
mamba: avoid redundant HBM reloads in causal_conv1d_update shift loop (#4460) by @wdykas
Standardize misc graph interface (#4485) by @tdene
Fix inference graph override in RL flow (#4323) by @tdene
Unify and refactor Megatron-FSDP documentation. (#4418) by @cspades
Skills for running unit tests and working with slurm (#4502) by @yashaswikarnati
Revert "ci: add base_sha to codecov/codecov-action upload step (#4445)" (#4526) by @chtruong814
Reorganize order of operations in inference context and text generation controller (#2929) by @tdene
ci: Update CI workflow conditions to include merge group handling (#4532) by @balasaajay
ci: add base_sha to codecov/codecov-action upload step (#4540) by @chtruong814
Fix release tests: remove --global-batch-size conflicting with --step-batch-size-schedule (#4545) by @deepakn94
docs: use @file-path notation for file references in skills (#4542) by @ko3n1g
Support YAML quant recipe in PTQ and remove first/last layer modifier code (#4503) by @jenchen13
Avoid nsys profile crash with CUDA graphs (#4541) by @tdene
fix(ci): add retry with backoff to approve-test-queue bot (#4559) by @ko3n1g
New allgathervdispatcher for inference and simplify old dispatcher. (#4258) by @sidsingh-nvidia
Fixes for modelopt examples and SFTTokenizer for transformers v5 (#4450) by @jenchen13
Adding code for Flextron (#4429) by @sheliang-nv
Fix partial cudagraphs + HybridEP not properly triggering DDP hook (#4500) by @jiemingz
Ignore pytorch link anchors (#4582) by @maanug-nv
MoE dispatcher fixes: size NVLS dispatcher buffers from actual tensor sizes (#4576) by @mathemakitten
Finalize all builders in preprocess_data, not just the last key (#4573) by @sayalinvidia
refactor(skills): add when_to_use frontmatter, split ci-test-system, enforce skill workflow (#4574) by @ko3n1g
Make last_token_logits graphable (#4552) by @tdene
fix(ci): correct off-by-one in total_steps_evaluated formula (#4591) by @ko3n1g
Add fault injection support via nvidia_resiliency_ext. (#4370) by @hexinw-nvidia
Guard vocab reduce_scatter on TP > 1 (#4565) by @mathemakitten
Move inference context bookkeeping to CPU with ContextGPUView (#4306) by @lmcafee-nvidia
Enable InJob restart on failures. (#4594) by @hexinw-nvidia
Enable shared expert overlap with allgatherv in inference (#4570) by @sidsingh-nvidia
Add vLLM grouped gemm backend for MoE inference (#4566) by @santhnm2
Move KD teacher loading to after Float16Module (#4394) by @AAnoosheh
ci: update gpt3_7b_tp4_pp1_memory_speed gb200 golden values (#4601) by @ko3n1g
Fix inference unit test (#4589) by @maanug-nv
Checkpoint conversion between GPT_model and Hybrid_model (#4482) by @guihong-nv
ci: add cadence input for test filtering in CI workflows (#4561) by @balasaajay
Handle SSM sharded tensor merge OOM with CPU fallback (#4442) by @returnL
Fix mtp_use_repeated_layer behavior for GPT models (#3965) by @rkarimimahab
FlashInfer sampling (#2456) by @tdene
Fix main2dev workflow (#4610) by @Phlip79
Add logic to enable chunked MLP during training (#3656) by @pengdurice
Inference bug-fixes: Re-enable EP syncs for the legacy A2A dispatcher and re-simplify ep_sync accidentally reverted by #4306 (#4587) by @sidsingh-nvidia
Remove invalid timeout argument for dist.barrier (#4512) by @zhaoyinglia
Fix buffers in refit (#4580) by @wdykas
Named validation sets (#4578) by @RPrenger
Fix Hang in tests (#4575) by @wdykas
Single commit for main2dev nightly (#4614) by @Phlip79
convert tokenizer args to config (#4406) by @dimapihtar
Siddharth/fix ep sync (#4607) by @wdykas
mmiranda working on another set of broken links (#4534) by @megnvidia
Fix gradient corruption with layerwise param all-gather overlap (#4609) by @deepakn94
test: mark TestFusedApplyMLARope::test_forward_backward_for_q flaky_in_dev (#4639) by @ko3n1g
remove legacy GPT code (#4322) by @dimapihtar
ci: introduce L-tier scope vocabulary via parser (#4625) by @balasaajay
Inference: Tune vLLM grouped gemm, moe_sum kernel, and enable shared expert overlap in latent MoEs (#4603) by @sidsingh-nvidia
Fix crash involving evicted requests and tpot (#4645) by @tdene
remove legacy transformer and modules (#4207) by @dimapihtar
chore: Update Docker image version to 26.04-py3 (#4611) by @balasaajay
Propagate errors for failed inference requests (#4679) by @mathemakitten
Inference: Cache input + position ID views (#4634) by @mathemakitten
ci: Update Gitlab base image to 26.04 pytorch (#4688) by @chtruong814
Add periodic GPU sniff tests to detect hardware stragglers (#4662) by @deepakn94
ci: Bump GHA versions (#4606) by @chtruong814
build: widen flashinfer-python pin to <0.7.0 (#4700) by @ko3n1g
Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094) by @Shreyas-S-809
Switch oncall (#4702) by @janEbert
Update golden values for various functional tests (#4703) by @balasaajay
chore: Update golden values for various functional tests (#4706) by @balasaajay
build: upgrade mamba-ssm to 2.3.2.post1, causal-conv1d to 1.6.2.post1 (#4712) by @ko3n1g
ci: replace uuidgen with /proc/sys/kernel/random/uuid (#4714) by @ko3n1g
chore(codeowners): add megatron/inference/ ownership (#4704) by @ko3n1g
Create a Protocol for the MLP layer of TransformerLayer (#3435) by @nschank
Revert "Add Python-side guardrail for HybridEP InfiniBand limit and rename seq_len (#4094)" (#4718) by @ko3n1g
chore(beep boop 🤖): Bump (main) (2026-05-11) by @github-actions[bot]
Add Python-side guardrail for DeepEP IB limits (#4719) by @janEbert
ci: revert bad uv.lock bump and label future bumps with Run functional tests (#4730) by @ko3n1g
[ci] fix: treat cancelled run-main-script step as failure (#4727) by @ko3n1g
ci: Major refactor of release-workflows (#4602) by @ko3n1g
fix(fsdp): recognize legacy GDN TP metadata (#4664) by @Glitchfix
build(deps): bump nvidia-modelopt to 0.43 (#4723) by @ko3n1g
Fixes for Nemotron3 Super release test config (#4544) by @maanug-nv
feat(gpt): add output postprocess hook (#4686) by @Glitchfix
Add bump-base-image skill and update golden value comparison (#4733) by @balasaajay
Guard omegaconf imports (#4685) by @maanug-nv
Fix a regression introduced by #4625 for nightly runs (#4734) by @balasaajay
Add LLaVA audio (sound) model support (#4402) by @cuichenx
Support transformers 5.x.x for text generation server (#4732) by @tdene
Update transformer-engine dependency to version 2.15.0 (#4682) by @balasaajay
Increase CG cover from max_requests to max_tokens (#4214) by @tdene
fully remove legacy code (#4759) by @dimapihtar
fix legacy torch save when tensor_model_parallel_size > expert_model_parallel_size * expert_tensor_parallel_size (#4678) by @dimapihtar
Wire --rl-inference-parsers into MRL (#4768) by @tdene
Integrate LayerWiseDistributedOptimizer with DDP buffer infrastructure (#4509) by @deepakn94
[training migration] Migrate mamba builder (#4550) by @maanug-nv
NCCL UB fix: reduce memory cost and correctly deregister NCCL mem pool (#4492) by @xrennvidia
fix: use no_mask in local ViT layer spec (#4395) by @Phlip79
refit clean up and refactoring (#4762) by @wdykas
Make weight and optimizer memory estimation take into account expert parallelism correctly (#4687) by @YangFei1990
Support recomputing in HybridModel (#4496) by @xuantengh
One single flag that determines if we are in inference (#4617) by @tdene
[main] feat(moe): Support packed sequence for gated delta net (GDN) (#2645) by @yuzhongw-nvidia
remove dead manual_release_grads code path in 1F1B overlap schedule (#4511) by @Wohox
Fix recompute checkpointing + training CGs (#3919) by @tdene
Use Protocols to type-check linear_proj submodules of Attention (#3434) by @nschank
fix tokenizers in respect to newer transformers (#4608) by @dimapihtar
Bump nvidia-modelopt>=0.44.0 (#4803) by @kevalmorabia97
Update owners (#4794) by @Phlip79
ci: Update workflow to use same commit for building docker image and running tests (#4787) by @balasaajay
chore: Update nightly tests golden values (#4805) by @balasaajay
Inference: Optimize Prefill Engine Steps for Nemotron (#4764) by @sidsingh-nvidia
Disable MSC by default; opt in via --enable-msc (#4629) by @asolergi-nv
Strengthen test_checkpoint to verify distributed checkpoint behavior (#4711) by @lichenlu
Combine GEMM + SwiGLU fused MLP PRs (3890, 4071, 4095, 4219, 4311, 4324) → main (#4636) by @Connor-XY
additional tests for nvrx (#4522) by @dimapihtar
[fix] Use MSC for checking checkpoint existence (#4251) by @pavelgein
ci: tolerate git-gc race in /home/runner chown after checkout (#4808) by @balasaajay
Reorder mtp_post_process after attention backward in 1F1B schedule plan (#4695) by @gdengk
[Main][feat] Support A2A Overlap for Megatron-FSDP (#3797) by @Wohox
add is_torch_min_version in fsdp src (#4812) by @xrennvidia
Add high-priority A2A stream and HybridEP preprocessing SMs (#4694) by @gdengk
Refactor CUDA graph API: decompose cuda_graph_scope into full_iteration impl, inference scope, and per-layer capture modules (#4292) by @buptzyb
chore(beep boop 🤖): Bump (main) (2026-05-18) by @github-actions[bot]
Tokenizers updates (#4780) by @dimapihtar
Fix no nvrx tests (#4847) by @dimapihtar
Thread custom process groups through MoE grad finalization (#4782) by @yashaswikarnati
Fix unit tests (#4689) by @shanmugamr1992
Tests/dynamic inference functional coverage (#4761) by @shanmugamr1992
Fix oncall references (#4722) by @janEbert
Update golden values for nightly functional tests (#4850) by @balasaajay
fix(inference): size DynamicInferenceContext KV layer_map for non-uniform PP (#4775) by @athitten
Modernize post-training modelopt example scripts (#4807) by @kevalmorabia97
test: add inference performance test harness for GPT 583M, hybrid 2B,… (#4806) by @shanmugamr1992
Fix tokenizers bug in nightly (#4833) by @Phlip79
ci: Prevent shell trace in parts of _run_training.sh (#4884) by @chtruong814
Ignore Vim swap files (#4860) by @wujingyue
M-FSDP: Make fine_grained_param_gather configurable for MXFP8 to enable performance–memory trade-offs (#4181) by @shjwudp
MimoOptimizer: fix distributed checkpoint save and load for non-colocated MIMO (#4801) by @kamran-nvidia
Route non-Muon params through DistributedOptimizer (#4771) by @deepakn94
ci: Gate optional CI jobs with repository variables (#4907) by @chtruong814
Allow optimizer CG to share the same pool as full-iter CG (#4698) by @nanz-nv
Use sharded_state_dict_default in MLP.sharded_state_dict (#4693) by @gdengk
Fix MTP recompute crash with packed sequences (#4593) by @BestJuly
Update PR template (#4904) by @Phlip79
ci: Update perf test to output logs for tests to pass (#4906) by @chtruong814
Also persist asymmetrical units for the MXFP8 transpose weight buffer. (#4852) by @cspades
fix no_shard training convergency and add unittest for no_shard (#3754) by @wplf
Move policy epoch stats to the message object (#4533) by @ArEsKay3
Add a knob to throttle the max allowed inflight offload in fine grained offloading (#4692) by @nanz-nv
refactor(data): consolidate get_batch and enable PP for SFT THD (#4103) by @asolergi-nv
Allow YAML MoE configs to use model specs (#4822) by @chawkins-nvidia
Move bert and t5 pretrain files (#4820) by @Phlip79
Paged Stashing (#4247) by @nanz-nv
make FP4 param gather work with the mixed precisions in NVFP4 recipe (#4358) by @xrennvidia
fix: Fix multi-node functional test phase sync (#4924) by @chtruong814
Perf tests (#4917) by @shanmugamr1992
fix(cuda_graphs): handle TE 2.15 removal of FP8GlobalStateManager.set_skip_fp8_weight_update_tensor (#4874) by @balasaajay
Fix paged stashing test submodules lookup (#4925) by @Phlip79
Add TEFusedDenseMLP for Dense+Grouped GEMM fusion on SM100+ (#4318) (#4786) by @sraman-rgb
Fix mxfp8 param gather numerical issue when DP overlap is off (#4800) by @WanZzzzzz
[MXFP8/FP4-param-gather] Post processing after forced param AG in eval (#4562) by @WanZzzzzz
ci: Update training script paths in BERT and T5 (#4939) by @balasaajay
Various training utils (#4872) by @maanug-nv
ci: restore perf test torchrun logs (#4951) by @chtruong814
Fix get_batch return order to ignore BlendedDataset provenance fields (#4952) by @deepakn94
test(release): add release goldens for deepseekv3/nemotron3 and set tp2pp2 exit-interval (#4932) by @ko3n1g
test: enable NVTE_CUTEDSL_FUSED_GROUPED_MLP via pytest fixture (#4931) by @ko3n1g
chore(beep boop 🤖): Bump (main) (2026-05-25) by @github-actions[bot]
Avoid offsetting functional test master port (#4973) by @chtruong814
Fix elastification unwrap_model import (#4972) by @Devil1716
test: re-enable paged stashing MoE tests (#4978) by @ko3n1g
test(ci): re-enable 8experts2parallel_multi_dist_optimizer_instances_1node (#4984) by @ko3n1g
ci: Add support for MBridge job gating based on PR labels (#4926) by @balasaajay
test: re-enable test_pp2_create_cudagraphs_first_stage on TE 2.15+ (#4985) by @ko3n1g
fix(tests): initialize num_microbatches calculator in vision cudagraph tests (#4986) by @ko3n1g
ci: Add allow_failure flag to gpt and moe recipes that are failing in nightlies (#4905) by @balasaajay
Drain predecessor reduce-scatter at dispatch time (#4940) by @deepakn94
nightly(ci): Update golden values for functional t5 tests (#4995) by @balasaajay
chore: rotate oncall schedule by @github-actions[bot]
[main] Refactor and Improve MoE Logging init commit (#3431) by @yanring
ci: validate release branch-rules (#4929) by @ko3n1g
[Megatron-FSDP] Add conditional param.grad dereferencing logic to support full-iteration (FWD-BWD) CUDA graphability. (#4663) by @cspades
test: restrict iter-time comparison to steady-state window (#5010) by @ko3n1g
[fix] Release MTP assertion when EP overlap with PP=1 (#4796) by @Wohox
fix(test): pin eval-global-batch-size on 15b gb200 release configs (#5022) by @ko3n1g
fix(test): widen iter-time steady-state window for short tests (#5023) by @ko3n1g
Perf fix (#4996) by @shanmugamr1992
Add dev-feature preservation gate and change schedule (#4773) by @Phlip79
chore(test): remove orphan nemotron3_super_release_g200 dir (#5024) by @ko3n1g
Ignore Claude worktree directory (#5020) by @Phlip79
Update copy-pr-bot.yaml [skip ci] by @github-actions[bot]
ci: update CI workflow conditions for integration tests (#4658) by @balasaajay
Add NVSkills CI request workflow (#5033) by @Phlip79
DDP wrap pg size fixes (#5006) by @maanug-nv
fix(layer_wise): tag MTP-stage word_embeddings as is_embedding_or_output_parameter (#5034) by @Wohox
Move LTS dependencies from pyproject.toml to Dockerfile.ci.lts (#4877) by @balasaajay
Use shared ModelOpt calibration loop on 0.45+ with 0.44 fallback fix (#4881) by @kevalmorabia97
test(release): skip golden comparison on intermediate resume windows (#5040) by @ko3n1g
[mimo] Thread position_ids through MimoModel for multimodal RoPE (#4938) by @liding-nv
build: Switch DSv3 on H100 to HybridEP (#5039) by @ko3n1g
Fix: Import unwrap_model from megatron.core.utils in modelopt examples (#5045) by @kevalmorabia97
Simple and stable Inference APIs (#4697) by @YangFei1990
ci: Add notification step for MBridge downstream test results (#5028) by @balasaajay
Delete output tensor early (#4742) by @Phlip79
Support ScaledSReLU in TE grouped MLP fuser (#4859) by @sraman-rgb
Skip gradient updates when grad norm exceeds threshold (#3460) by @yfw
Add 9 user skills (#5066) by @Phlip79
test(nemotron): align nemotron3 super GB200 goldens with exit-interval 4768 (#5069) by @ko3n1g
chore: Update transformer-engine dependency to version 2.16.0 (#4992) by @balasaajay
Update energon version requirement (#4572) by @maanug-nv
Fix test failures for new inference APIs (#5068) by @YangFei1990
fix(ci): set PYTHONUNBUFFERED=1 in JET workload env (#5072) by @ko3n1g
Preserve non-FSDP-unit buckets across AllGatherPipeline reset (#4717) by @wujingyue
Add opt-in MXFP8 LM-head output projection (#4825) by @gdengk
chore(beep boop 🤖): Bump (main) (2026-06-01) by @github-actions[bot]
fix(ci): bound JET pipeline polling with a watchdog to prevent indefinite hangs (#5076) by @ko3n1g
test: unmark EP A2A activation offload test flaky (#5009) by @lhb8125
ci: prune old artifacts on cluster lustre during weekly/release runs (#5084) by @ko3n1g
ci(test): isolate ckpt-resume tensorboard per phase (#5074) by @ko3n1g
Change ownership groups (#5021) by @Phlip79
test: skip mfsdp_fully_shard cases when world_size < mesh size (#4487) by @wujingyue
fix mimo optimizer checkpoint metadata restore (#4791) by @liding-nv
[mimo] Support bridge fan-out for variable modality tokens (#5062) by @liding-nv
cp: Remove DeepEP hardware limit check (4846) into core_r0.18.0 (#5126) by @ko3n1g
chore: Update transformer-engine dependency to revision 4220403 (#5112) (#5137) by @balasaajay
cp: fix(optimizer): gate ChainedOptimizer MXFP8 defer-sync on DDP-level overlap_param_gather (4982) into core_r0.18.0 (#5146) by @ko3n1g
cp: build: Switch DSv3 on H100 to HybridEP (5164) into core_r0.18.0 (#5165) by @ko3n1g
chore(beep boop 🤖): Bump (core_r0.18.0) (2026-06-22) by @github-actions[bot]
beep boop 🤖: Bumping Megatron Core to v0.18.2 [skip ci] by @github-actions[bot]
Resetting Megatron Core version to v0.18.0 and Megatron FSDP to rc0 by @balasaajay
make fsdp a release package by @balasaajay
docs: Fix docs version for 0.18.0 release (#5435) by @chtruong814
beep boop 🤖: Bumping Megatron Core to v0.18.1 [skip ci] by @github-actions[bot]
Resetting Megatron Core and FSDP patch versions to 0 (#5437) by @balasaajay
