Changelog Details
- Fix two minor bugs in MTP implementation for hybrid models by @deepakn94 :: PR: #3194
- Update README.md by @mvirts :: PR: #2111
- mRoPE for MTP by @BestJuly :: PR: #3114
- Fix bug in SFTDataset by @duncanriach :: PR: #3185
- Fix several syntax error by @HollowMan6 :: PR: #3004
- Fix for RL Test by @wdykas :: PR: #3148
- Fix latent moe flops and backward_dw by @buptzyb :: PR: #2977
- Use global user buffer when the bucket size does not fit FixedPoolAllocator by @shengf-nv :: PR: #2857
- ci: Checkpoint retention by @ko3n1g :: PR: #3205
- Add unit test for LatentMoE by @venmugil :: PR: #2892
- ci: Enable unit tests on merge-queue by @ko3n1g :: PR: #3186
- Fix seq pack flag in `get_logprobs` by @mathemakitten :: PR: #3206
- ci(fix): Parse unit tests in merge-queue by @ko3n1g :: PR: #3224
- Fix TE 2.12 AllGather CI failure by @BestJuly :: PR: #3101
- ci(hotfix): Pin uv by @ko3n1g :: PR: #3233
- Add a unit test to check that RL `get_logprobs` will reuse training cudagraphed forward pass by @mathemakitten :: PR: #3209
- Do not offload grad buffers when training graphs are enabled by @mathemakitten :: PR: #3231
- Fix missing PackedSeqParams import by @parthmannan :: PR: #3214
- Synchronize the request counts for EP inference with strict matching by @santhnm2 :: PR: #3033
- Fix coordinator address collision check in flask by @tdene :: PR: #3208
- Do not let requests fail silently inside inference engine by @tdene :: PR: #3228
- torch saver inference model offload by @wdykas :: PR: #3170
- enable cuda graph ut by @Autumn1998 :: PR: #3197
- Support EP with HSDP by @wplf :: PR: #2840
- [Main] Add the missing part to support 1F1B overlap for Qwen3-Next by @BestJuly :: PR: #2997
- Missing import fix by @parthmannan :: PR: #3241
- Miscellaneous inference cleanup (Replay of !2955) by @santhnm2 :: PR: #3232
- Add DistributedInitConfig by @maanug-nv :: PR: #3173
- Fix checkpoint converter missing parallel group initialization by @yashaswikarnati :: PR: #3217
- Skip empty sequences and chunks in MTP tensor roll by @BestJuly :: PR: #3035
- Implement get_parameters for ChainedOptimizer by @nschank :: PR: #3201
- ci(fix): Create main/dev image tags by @ko3n1g :: PR: #3252
- Reapply "Add MTP support for hybrid models (#2363)" by @sancha :: PR: #3207
- Fix uv install for GH actions by @Phlip79 :: PR: #3259
- Update the project structure in README by @janEbert :: PR: #3251
- Cherry-pick: Fix mtp_num_layers and clip_qk issues (#2581, #2776) by @BestJuly :: PR: #3075
- RL: training cudagraphs functional test by @mathemakitten :: PR: #3235
- [Main] fix cg missing wgrad hook by @Wohox :: PR: #3074
- Avoid .cuda call on meta device in LanguageModel by @nschank :: PR: #3202
- fix checkpointing error message by @dimapihtar :: PR: #3203
- Nano QAT/D fix with sft tokenizer and datasets by @ChenhanYu :: PR: #3254
- Revert "fix checkpointing error message (#3203)" by @ko3n1g :: PR: #3283
- Reapply "fix checkpointing error message (#3203)" (#3283) by @ko3n1g :: PR: #3285
- docs: Add changelog for 0.15.3 by @ko3n1g :: PR: #3286
- ci: Set throughput tests as flaky by @chtruong814 :: PR: #3301
- chore: Move GB200 tests to nightly by @ko3n1g :: PR: #3302
- Ensure type-checker understands use of Submodules in bert_model by @nschank :: PR: #3256
- Override extra_repr instead of repr by @nschank :: PR: #3200
- Replace ModuleSpec with Protocols for LayerNorm submodules by @nschank :: PR: #3090
- Non colocated refit by @wdykas :: PR: #3213
- Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training by @xiaoxi-wangfj :: PR: #2763
- Add check to prevent MFSDP from numeric issue in gradient accumulate fusion by @shjwudp :: PR: #2904
- update get_embedding_ranks and get_position_embedding_ranks docstrings by @c1lovez1 :: PR: #3223
- Param offset in _ParamAndGradBucket should be aligned by @skydoorkai :: PR: #3007
- ci: Add secrets detector by @chtruong814 :: PR: #3180
- Ensure type-checker understands use of Submodules in llava_model by @nschank :: PR: #3257
- updates to support modelopt EAGLE training with CP by @yeyu-nvidia :: PR: #3147
- fully remove legacy tokenizer system by @dimapihtar :: PR: #2946
- M-FSDP: Remove redundant stream waits in HSDP to prevent CG fail by @shjwudp :: PR: #2941
- General README and pyproject fixes by @ahmadki :: PR: #2907
- chore: More aggressive checkpointing by @ko3n1g :: PR: #3315
- ci: Pin down setuptools to lt 82 by @ko3n1g :: PR: #3313
- fix: numpy overflow by @ko3n1g :: PR: #3306
- fix: T5 dataset by @ko3n1g :: PR: #3307
- ci: Revert "ci: Add secrets detector (#3180)" by @chtruong814 :: PR: #3330
- ci: Add more tests, run on merge-queue by @ko3n1g :: PR: #3317
- ci: Remove merge-gate environment check by @chtruong814 :: PR: #3331
- Use FP4 context for mamba by @kwyss-nvidia :: PR: #2604
- ci: Ensure we run all functional tests in merge group by @chtruong814 :: PR: #3332
- Replace ModuleSpec with Protocols for inputs to MLP by @nschank :: PR: #3084
- ci: Fix merge queue functional tests by @chtruong814 :: PR: #3337
- ci: skip queue in merge-gate by @ko3n1g :: PR: #3343
- ci: Timeout for functional tests by @ko3n1g :: PR: #3346
- update checkpointing documentation by @dimapihtar :: PR: #3347
- Update golden values to reflect improvements by @tdene :: PR: #3350
- BUGFIX: gpt vs hybrid model mtp naming mismatch by @sancha :: PR: #3334
- Disable flaky test by @tdene :: PR: #3354
- re-enable gpt grpo tests by @jon-barker :: PR: #3348
- Fix SFT Pipeline when TP>1 by @asolergi-nv :: PR: #3268
- Fixes for KD mode by @AAnoosheh :: PR: #3342
- chore: Update codeowners file by @ko3n1g :: PR: #3365
- Siddharth/fix inference functional tests by @sidsingh-nvidia :: PR: #3357
- Switch oncall by @janEbert :: PR: #3360
- Add missing RMSNorm to llama train script by @AAnoosheh :: PR: #3314
- Fix inference for MTP models by @tdene :: PR: #3297
- Add a logprobs test with real gpt model. by @yobibyte :: PR: #2870
- Add simple GRPO functional test by @tdene :: PR: #3323
- ci: Concurrency control for merge-queue by @ko3n1g :: PR: #3353
- ci: Update golden value download script to work with Github by @chtruong814 :: PR: #3335
- fix: correct typos 'seperated' and 'recieved' by @thecaptain789 :: PR: #3305
- Improved PyTorch profiler and added PyTorch execution trace by @shengf-nv :: PR: #3273
- Removing etc from main index page, shifted name of discussions by @megnvidia :: PR: #3271
- build: Bump TE on 2.12 by @ko3n1g :: PR: #3371
- ci(hotfix): job conditions by @ko3n1g :: PR: #3376
- Record moe routing decisions during inference. by @sidsingh-nvidia :: PR: #3034
- [Main] Fix EP Overlap Bugs for Full-Iter CG by @Wohox :: PR: #3164
- Avoid direct pickle import by @maanug-nv :: PR: #3375
- Delete old pretrain_* files by @Phlip79 :: PR: #3359
- Add Qwen3-VL support with Megatron-FSDP by @xuwchen :: PR: #2841
- Refactor Mamba chunked prefill by @santhnm2 :: PR: #3265
- Improved parallel logging of learning rate by @jstjohn :: PR: #3319
- Add enhanced event tracking with TTFT measurement and compact serialization. by @lmcafee-nvidia :: PR: #3253
- Add assertion that max_requests is divisible by tp_size by @santhnm2 :: PR: #3304
- Move to using the Inference OpenAI API server by @ArEsKay3 :: PR: #3107
- Update moe github test cases. by @Victarry :: PR: #3077
- Split layer_specs to return Submodules instead of ModuleSpecs by @nschank :: PR: #3255
- ci: Remove gpu sanity check by @chtruong814 :: PR: #3420
- [Critical-Bug] Fix Uneven PP for Mamba models (Nemotron3-nano) by @kevalmorabia97 :: PR: #3399
- Fix for rl by @shanmugamr1992 :: PR: #3390
- Add check for full_iteration scope before instantiating CudaGraphManager by @vasunvidia :: PR: #3362
- Fix broken links throughout by @megnvidia :: PR: #3230
- Decouple topk and loss from DSA Indexer by @kunlunl :: PR: #3248
- Extract intermediate embeddings of transformer block by @sajadn :: PR: #3060
- Move to using the Inference OpenAI API server (bis) by @tdene :: PR: #3395
- Make Mamba inference state memory ratio configurable by @santhnm2 :: PR: #3322
- Fix configs for RL model environments by @tdene :: PR: #3441
- Replace pickle with json in rl_utils by @tdene :: PR: #3351
- fix: correct typo in demo training example by @dndnda :: PR: #3428
- Clean up logging inside inference flask server by @tdene :: PR: #3437
- ci: Update release-docs workflow to use FW-CI-templates v0.72.0 by @chtruong814 :: PR: #3438
- Fix --tokenizer-hf-include-special-tokens by @jon-barker :: PR: #3422
- Update num_tokens_to_generate default for Gym by @tdene :: PR: #3453
- Fix slowdown in inference flask server by @tdene :: PR: #3445
- Add a normalized scale for MTP per token loss by @BestJuly :: PR: #3159
- [Bugfix] Fix nan loss caused by zero token in MTP by @BestJuly :: PR: #3396
- Log RL metrics per environment by @yobibyte :: PR: #3446
- Move tensor offload/onload out of RL code by @tdene :: PR: #3029
- Fix another inference flask / Gym interaction by @tdene :: PR: #3467
- Add Engine event to the follow up requests after checkpointing by @ArEsKay3 :: PR: #3473
- adding in copyright blurb at the top of md file by @megnvidia :: PR: #3394
- [Megatron-FSDP] Add fsdp_all_gather_in_start_param_sync option in DDP Config by @shjwudp :: PR: #3095
- ci: Update release workflow to include changelog and publish docs by @chtruong814 :: PR: #3472
- ci(fix): Weekly GPT tests by @ko3n1g :: PR: #3443
- ci: Remove environments by @ko3n1g :: PR: #3462
- update HF tokenizer defaults by @dimapihtar :: PR: #3440
- ci: Bump preflight to detect our svc by @ko3n1g :: PR: #3494
- build: Drop Python 3.10 support and pip install one-logger by @ko3n1g :: PR: #3485
- PTQ changes for upcoming QAD by @AAnoosheh :: PR: #3124
- ci: Bump pre-flight for Bot SSO by @ko3n1g :: PR: #3497
- Revert "build: Drop Python 3.10 support and pip install one-logger (#… by @ko3n1g :: PR: #3500
- Fix chunked prefill edge cases by @santhnm2 :: PR: #3404
- ci: Enable MBridge downstream testing via PR by @ko3n1g :: PR: #3483
- ci: Remove gitlab docs build job and set LTS integration and functional tests to allow failure by @chtruong814 :: PR: #3349
- [OMNIML-3232] ModelOpt: add full TE spec option and wire Mamba stack / post-training scripts by @yueshen2016 :: PR: #3393
- Track off-policyness across RL steps by @tdene :: PR: #3030
- ci: MBridge testing branch name during merge-queues by @ko3n1g :: PR: #3513
- ci: Enable Dependabot Automerge by @ko3n1g :: PR: #3487
- ci: Also sync direct teams by @ko3n1g :: PR: #3484
- Multimodal: fix argument checking by @faradawn :: PR: #3449
- Fix Megatron-FSDP fully_shard() optimizer state DCP checkpointing, and fix DTensor deepcopy bug from PyTorch 26.01. by @cspades :: PR: #3510
- Renable full_iteration cuda graphs for inference. Add them for the mamba block. by @sidsingh-nvidia :: PR: #3250
- do not add EoD by @arendu :: PR: #3526
- Do not Slack notify for draft PRs by @Phlip79 :: PR: #3536
- remove deprecated SampleListWebdataset by @dimapihtar :: PR: #3407
- remove deprecated get_te_version by @dimapihtar :: PR: #3413
- remove deprecated async_grad_allreduce param by @dimapihtar :: PR: #3412
- remove deprecated mamba params by @dimapihtar :: PR: #3411
- remove deprecated params from model parallel config by @dimapihtar :: PR: #3408
- Remove redundant CUDA calls in the LLaVA dataloader by @duncanriach :: PR: #3476
- Inference: Create finer grained cuda-graphs with better coverage of smaller batch sizes by @sidsingh-nvidia :: PR: #3527
- fix: skip non-tensor optimizer state entries in distrib_optimizer sav… by @ahmadki :: PR: #3537
- remove encoder_and_decoder from enums by @dimapihtar :: PR: #3406
- remove is_unitialized & get_data_modulo_expert_parallel_group by @dimapihtar :: PR: #3414
- remove deprecated TE module by @dimapihtar :: PR: #3409
- Add knobs to choose process groups for fully-parallel-save / load and load-exchange-algo by @sbak5 :: PR: #2161
- Fix off-by-2 error in RL sequence packing by @tdene :: PR: #3551
- Skip unnecessary flattening for Save / Load Planner by @sbak5 :: PR: #3263
- Multimodal: fix model provider by @faradawn :: PR: #3508
- docs: Enable nightly docs publish by @chtruong814 :: PR: #3546
- Ensure type-checker understands use of Submodules in unit tests by @nschank :: PR: #3425
- Use copy_signature to preserve typing of pass-through methods by @nschank :: PR: #3419
- Ensure type-checker understands use of Submodules in MTP by @nschank :: PR: #3308
- Add mxfp8 quantization for inference linear layers by @santhnm2 :: PR: #3447
- Fixed fp32 residuals by @mkhona-nvidia :: PR: #3504
- Move config src files into a dedicated dir by @maanug-nv :: PR: #3570
- Revert "remove encoder_and_decoder from enums (#3406)" by @ko3n1g :: PR: #3579
- Fix default cuda graph persist arg. Add persist to rl common.sh. by @yobibyte :: PR: #3584
- Optimize away add request overheads in dummy ep cuda-graphed forward passes by @sidsingh-nvidia :: PR: #3525
- ci: Test docs build by @ko3n1g :: PR: #3583
- ci: Fix docs build for release by @ko3n1g :: PR: #3597
- ci: Remove secrets by @ko3n1g :: PR: #3598
- ci: Define secrets by @ko3n1g :: PR: #3599
- ci: gh-release-from-tag by @ko3n1g :: PR: #3600
- Ko3n1g/ci/remove twine username by @ko3n1g :: PR: #3601
- Add training code to MCore wheel by @maanug-nv :: PR: #3573
- FP8 attention knob for nvFP4 recipe by @vasunvidia :: PR: #3363
- Fix error with --load-main-params-from-ckpt by @guyueh1 :: PR: #3569
- ci: Create comment by @ko3n1g :: PR: #3610
- ci: Skip cleanup-taint-node jobs during deployments by @ko3n1g :: PR: #3612
- ci: No comment for release workflow by @ko3n1g :: PR: #3615
- ci: Re-add release tag prefix by @ko3n1g :: PR: #3619
- docs: Fix version picker urls by @chtruong814 :: PR: #3621
- ci: Increase changelog generation max PRs fetched by @chtruong814 :: PR: #3620
- Add debug info to an assert. by @yobibyte :: PR: #3588
- fix: async_utils: explicit GC in persistent checkpoint worker loop by @sbak5 :: PR: #3591
- Fix: Perform sigmoid calculation in fp32 for aux loss stability by @CodersAcademy006 :: PR: #2765
- fix: forward use_te_activation_func flag in non-MoE GPT layer spec by @saakshigupta2002 :: PR: #3300
- Revert "Add single-process checkpoint save to avoid forked multiproce… by @ko3n1g :: PR: #3630
- Track and plot per-token off-policy in RL by @tdene :: PR: #3515
- Multimodal: fix VQA dataset selection by @faradawn :: PR: #3464
- Multimodal: Fix multimodal training example - tokenizer, Triton Cache Manager patch, docs by @faradawn :: PR: #3507
- Multimodal: Limit transformer version in Dockerfile by @faradawn :: PR: #3448
- Support TP > GQA for inference by @santhnm2 :: PR: #3627
- μP: Maximal Update Parameterization by @plugyawn :: PR: #3058
- Add flexible virtual pipeline parallel (fVPP) to hybrid model by @duncanriach :: PR: #3377
- Explicitly close and join Pool in preprocess_data.py by @weijiac0619 :: PR: #3592
- remove indexer by @dimapihtar :: PR: #3416
- Multimodal: add load weights only by @faradawn :: PR: #3452
- Add single-process checkpoint save to avoid forked multiprocessing by @sbak5 :: PR: #3633
- Update oncall schedule by @Phlip79 :: PR: #3632
- M-FSDP: Cancel erroneous grad accumulation check by @shjwudp :: PR: #3629
- Fix MoE aux loss tracker hang with MTP enabled by @Victarry :: PR: #3401
- Fix test data preparation by @janEbert :: PR: #3652
- Add GPTOSS Example with Megatron-LM + Megatron Bridge by @faradawn :: PR: #3018
- Add thd unit test main by @kunlunl :: PR: #3617
- Inference | KV prefix caching. by @lmcafee-nvidia :: PR: #3063
- [Megatron-FSDP] Add dtype customization to Megatron-FSDP. by @cspades :: PR: #3067
- CachedMetadataFileSystemReader: shared cache by @sbak5 :: PR: #3326
- Inference Optimized MoEs by @sidsingh-nvidia :: PR: #3496
- Log torch_memory_saver offload/onload by @tdene :: PR: #3567
- Prefix caching | Mamba memory only. by @lmcafee-nvidia :: PR: #3657
- Prefix caching | Coordinator scheduling. by @lmcafee-nvidia :: PR: #3665
- Adding manual Claude reviewer by @Phlip79 :: PR: #3679
- Nemo-RL Refit by @wdykas :: PR: #3520
- Add extra permissions and make other changes by @Phlip79 :: PR: #3683
- Claude should always comment something by @Phlip79 :: PR: #3685
- [Cleanup] Remove the deprecated GroupedMLP by @dimapihtar :: PR: #3410
- Fix illegal memory access with mamba inference by @tdene :: PR: #3631
- Fix illegal memory access with mamba inference (bis) by @tdene :: PR: #3696
- remove duplicate rerun_state_machine.set_mode(rerun_mode) by @YangWang92 :: PR: #3279
- Correct indexing when cp_comms_type is a list by @jeromeku :: PR: #3389
- Fix optional chat_completions returnables by @tdene :: PR: #3519
- ci: Claude code review by @ko3n1g :: PR: #3704
- ci: Fix event payload by @ko3n1g :: PR: #3705
- ci: Use issue number by @ko3n1g :: PR: #3706
- ci: Finalize Claude review by @ko3n1g :: PR: #3707
- ci: Add codecov yml by @thomasdhc :: PR: #3455
- adding public_docs_features: True to get proper legal footer… by @megnvidia :: PR: #3681
- Robust signaling for coordinator inference by @tdene :: PR: #3563
- Fix memory issue in mxfp8 model init by @WanZzzzzz :: PR: #3461
- add --overlap-param-gather support for layer-wise optimizer. lots of unit tests. by @mchrzanowski :: PR: #3524
- ci: Mount and enforce HF_HOME by @ko3n1g :: PR: #3700
- Add flags for changing Mamba inference state tensor dtype by @santhnm2 :: PR: #3660
- chore: CLI launch internal CI by @ko3n1g :: PR: #3695
- Change Review Process by @Phlip79 :: PR: #3659
- ci: Separate queues for internal/external contributors by @ko3n1g :: PR: #3718
- Update to correct token by @Phlip79 :: PR: #3724
- build: Bump to NGC PyTorch 26.02 by @ko3n1g :: PR: #3474
- Claude: use Opus 4.6 and auto-review on ready by @Phlip79 :: PR: #3727
- Claude to add complexity label by @Phlip79 :: PR: #3709
- Offload Flask frontend to separate process by @santhnm2 :: PR: #3648
- chore: Use PAT for CLI Launcher by @ko3n1g :: PR: #3734
- ci: Add missing gitlab rule by @ko3n1g :: PR: #3735
- [main] Add TE CUDA Graph Support for Vision Encoder by @tomlifu :: PR: #3293
- fix(moe): fix TE general_gemm API change by @hxbai :: PR: #3582
- Review process fixes by @Phlip79 :: PR: #3728
- Print more verbose error message about incorrect `model_parallel_size`. by @rj42 :: PR: #2639
- ci: Update golden values after PyT bump by @ko3n1g :: PR: #3733
- Optimize process management and delete operations for async save by @sbak5 :: PR: #3262
- Align gpt-oss window-size with 128-token sliding window by @returnL :: PR: #2771
- fix: temperature validation error message 1000.0 -> 100.0 by @CreeperLKF :: PR: #2688
- RL: Hybrid MoE training cudagraphs and fix training <-> inference transition by @mathemakitten :: PR: #3373
- Fix dynamic inference and GRPO functional tests by @santhnm2 :: PR: #3740
- Swap oncall by @janEbert :: PR: #3585
- [bugfix] fix the bug that loss: 0 will not be printed by @leisuzz :: PR: #1555
- Fused dLN + add in backwards pass by @CarlosGomes98 :: PR: #3384
- Claude: run actions on target branch by @Phlip79 :: PR: #3745
- revert of #2658 by @dimapihtar :: PR: #3736
- Update README Quick Start by @ilml :: PR: #3596
- Re-enable tests which were failing on #3373 by @mathemakitten :: PR: #3757
- Check reviews properly by @Phlip79 :: PR: #3756
- Add CP + Sequence Packing support for Mimo by @mehraakash :: PR: #2135
- MXFP8 refit by @wdykas :: PR: #3742
- Claude: update token usage by @Phlip79 :: PR: #3760
- Handle Tool Call Argument Parsing by @sancha :: PR: #3662
- RL support for nanov3 sft checkpoint by @jon-barker :: PR: #3741
- add mix_hidden_states option in conversion by @yeyu-nvidia :: PR: #3655
- ci: Optimize release-configs for GB200 by @ko3n1g :: PR: #3541
- Add absorbed-mla by @kunlunl :: PR: #3198
- feat(checkpoint): zero-copy storage sharing in CheckpointWithoutOutput by @Victarry :: PR: #3649
- Fuse MLA DOWN projection GEMMs by @cjld :: PR: #3039
- fix: skip FSDP DTensor boundary validation under fake process group by @Victarry :: PR: #3686
- [main] fix(moe): fix the bug where gate was not sliced when kv_head < tp_size. by @yuzhongw-nvidia :: PR: #3575
- fix(offload): reset activation offload manager after eval as well as … by @rapatel :: PR: #3739
- Improve error logging when invalid number of tokens is requested. by @yobibyte :: PR: #3680
- Add NVIDIA-Nemotron-3-Super-120B-A12B-BF16 to ModelOpt examples by @jenchen13 :: PR: #3805
- build: Bump TE2.13 by @ko3n1g :: PR: #3800
- Ensure dummy_forward does not attempt to run cudagraphs by @jalbericiola :: PR: #3789
- Add speculative decoding support with MTP layers by @santhnm2 :: PR: #3594
- Shanmugamr1992/megatron inference ultra by @shanmugamr1992 :: PR: #3784
- Fix backward compatibility issue with MFSDP `--grad-reduce-in-bf16` by @shjwudp :: PR: #3799
- feat: add NCCL flight recorder configuration support by @sbak5 :: PR: #3806
- Revert "Ensure dummy_forward does not attempt to run cudagraphs (#3789)" by @ko3n1g :: PR: #3834
- Fix if statement in main by @tdene :: PR: #3833
- Update golden values of weekly tests by @ko3n1g :: PR: #3829
- build: Loosen TE restriction by @ko3n1g :: PR: #3827
- Do not let chunked prefill generate decode logprobs by @tdene :: PR: #3777
- Prevent double serialization inside Flask server by @tdene :: PR: #3653
- Allow RL to run inference-only via skip-train by @tdene :: PR: #3744
- Announce Python 3.12 migration by @ko3n1g :: PR: #3825
- ci: Skip test_wrong_cuda_graph_impl_returns_false in LTS by @chtruong814 :: PR: #3847
- ci: Mark TestCoordinator.test_throughput as flaky by @chtruong814 :: PR: #3849
- find optimal number of workers by @dimapihtar :: PR: #3699
- remove encoder_and_decoder by @dimapihtar :: PR: #3836
- ci: Skip more tests in test_vision_cuda_graphs for LTS by @chtruong814 :: PR: #3860
- Ensure that inference dummy_forward does not try to match on a cudagraph when running eager by @mathemakitten :: PR: #3815
- Fix flakiness due to timing between shutdowns by @tdene :: PR: #3857
- Add unit tests for speculative decoding by @santhnm2 :: PR: #3817
- Exposing interleave argument for fused_apply_rotary_pos_emb_thd by @huvunvidia :: PR: #3794
- ci: install nvidia-resiliency-ext from source by @ko3n1g :: PR: #3861
- Miscellaneous inference bug fixes by @santhnm2 :: PR: #3840
- Nemo-RL integration bugfixes for --transformer-impl inference_optimized by @sidsingh-nvidia :: PR: #3851
- remove legacy mpu by @dimapihtar :: PR: #3854
- enable async save for functional tests by @dimapihtar :: PR: #3855
- remove legacy data by @dimapihtar :: PR: #3853
- docs: Document python-gitlab dependency by @ko3n1g :: PR: #3863
- Fsdp dsv3 proxy by @gautham-kollu :: PR: #3844
- Fix token dispatched cudagraph_attrs by @asolergi-nv :: PR: #3625
- Fix slowdown in serialization by @tdene :: PR: #3872
- Establish reviewers for training code by @maanug-nv :: PR: #3765
- Fix quantize.py script and support packed sequences in pretrain_gpt.py by @AAnoosheh :: PR: #3564
- Use fp32 state dtypes for Mamba inference functional test by @santhnm2 :: PR: #3888
- [Megatron-FSDP] Support 'auto' argument which defaults to pre-MixedPrecisionPolicy be… by @cspades :: PR: #3810
- Bug fix: add missing packages to Multimodal Dockerfile by @faradawn :: PR: #3417
- Reverse polarity of the off-policy measurement by @tdene :: PR: #3580
- Update nightly golden values after TE2.13 by @ko3n1g :: PR: #3886
- enable use_persistent_ckpt_worker for ci tests by @dimapihtar :: PR: #3898
- Correctly generate state dict in MultiTokenPredictionBlock by @asolergi-nv :: PR: #3624
- Add torch grouped gemm bf16 and mxfp8 support w/ cuda graphed + inference_optimized MoEs by @sidsingh-nvidia :: PR: #3858
- ci: Fix build-test-publish summary job always passing by @ko3n1g :: PR: #3905
- ci: Skip gpt3_mcore_te_tp1_pp4_vp1 for now by @chtruong814 :: PR: #3908
- ci: Fix build-and-test-wheels jobs for arm by @chtruong814 :: PR: #3910
- Add Lion optimizer support by @mchrzanowski :: PR: #3813
- Support multimodule pipelining in 1F1B schedule by @yashaswikarnati :: PR: #3129
- Add a config parameter for retaining pinned cpu buffers for cpu offloading by @rapatel :: PR: #3151
- Inference | Hybrid prefix caching. by @lmcafee-nvidia :: PR: #3225
- Hotfix for eviction issue by @tdene :: PR: #3914
- Parity with VLLM over the reasoning field by @tdene :: PR: #3873
- CI: add parallel GB200 integration test track by @ko3n1g :: PR: #3901
- Track errors through the inference return path by @tdene :: PR: #3776
- Fix: Defensively close GPU device FDs in dataloader worker processes by @hexinw-nvidia :: PR: #3684
- Fix hybrid dynamic inference functional tests by @santhnm2 :: PR: #3924
- Patch EOD out of inference results by @tdene :: PR: #3866
- ci: Add mr-github-slim label by @ko3n1g :: PR: #3934
- Revert "ci: Skip gpt3_mcore_te_tp1_pp4_vp1 for now (#3908)" by @chtruong814 :: PR: #3926
- Exclude arguments.py from training review by @maanug-nv :: PR: #3906
- ci: Fix sso users check by @chtruong814 :: PR: #3938
- move router replay doc to advanced feature part by @ilml :: PR: #3929
- Fix DDP bug with --overlap-grad-reduce and --num-distributed-optimizer-instances > 1 by @wplf :: PR: #3693
- Fix incorrect HAVE_TE detection in multiple modules by @returnL :: PR: #3763
- Implement forced lag in RL by @tdene :: PR: #3517
- Refactor VisionTECudaGraphHelper to minimize overrides and clarify state tracking by @buptzyb :: PR: #3748
- Fix external contributor concurrency to be global across all branches by @ko3n1g :: PR: #3951
- Fix 3-way merge issue that broke main by @tdene :: PR: #3949
- Fix Nemo_CICD_Test not catching cancelled/skipped functional tests by @ko3n1g :: PR: #3947
- Guard cudagraph input copy on whether data pointers have actually changed by @mathemakitten :: PR: #3948
- Enforce that flashinfer cache has been installed for inference-optimized MoE layers by @santhnm2 :: PR: #3941
- chore: remove nv-grouped-gemm dependency by @liuyun7345 :: PR: #3770
- Prevent failures due to prevent_retokenization by @tdene :: PR: #3958
- ultra refit by @wdykas :: PR: #3904
- [Fix][Main] Missing Assertion for moe layer recomptue in A2A Overlap by @Wohox :: PR: #3917
- Move Megatron-FSDP MixedPrecisionPolicy arguments from FSDP adapter t… by @cspades :: PR: #3903
- chore: bump FW-CI-templates to v0.80.2 by @ko3n1g :: PR: #3961
- Rename RL timers to be consistent by @tdene :: PR: #3878
- ci: centralize run configuration in a single configure job by @ko3n1g :: PR: #3962
- ci: Split unit tests into smaller groups by @ko3n1g :: PR: #3966
- Refit optimization by @wdykas :: PR: #3933
- common strategy simplification by @dimapihtar :: PR: #3229
- Cudagraphs: Remove fwd_graph_input_surface weakref by @mathemakitten :: PR: #3970
- fix: interpolate version correctly in release Slack notification by @ko3n1g :: PR: #3977
- Make args and kwargs optional positional arguments for the Module hooks. by @cspades :: PR: #3976
- ci: Add core-adlr and core-nemo to megatron/training codeowners by @chtruong814 :: PR: #3979
- Small quality-of-life improvements in `megatron/training` by @deepakn94 :: PR: #3957
- Update throughput golden values to reflect speedup by @tdene :: PR: #3983
- Revert "ci: Add core-adlr and core-nemo to megatron/training codeowners (#3979)" by @chtruong814 :: PR: #3982
- ci: Add --repo flag to gh pr view in configure job by @ko3n1g :: PR: #3989
- Add common pile scripts by @Phlip79 :: PR: #3902
- Introduce GDN to Mamba by @Phlip79 :: PR: #3535
- Fix IndexError in uniform activation recompute when num_layers not divisible by recompute_num_layers by @saakshigupta2002 :: PR: #3562
- Scaling for MuP over Muon optimizer. by @plugyawn :: PR: #3715
- Pass Megatron-FSDP MixedPrecision args to DDPConfig. by @cspades :: PR: #3992
- [OMNIML-3721] Fix tokenizer unwrapping for nested Megatron-Core tokenizer by @jenchen13 :: PR: #3967
- Forced load imbalance by @nanz-nv :: PR: #3380
- Add `/claude copy` command by @Phlip79 :: PR: #3978
- Add multi-module heterogeneous parallelism support for MIMO model by @yashaswikarnati :: PR: #3211
- added vllm fakequant export support by @kinjalpatel27 :: PR: #3050
- fix(modelopt): use bash array for MLM_EXTRA_ARGS to preserve quoting by @jenchen13 :: PR: #4002
- fix: use dump file prefix for NCCL flight recorder temp files by @sbak5 :: PR: #3955
- Fix PersistentAsyncCaller.del crash during Python shutdown by @cluster2600 :: PR: #3781
- ci: Run L1 MBridge tests in merge queue by @chtruong814 :: PR: #4009
- Update Claude review by @Phlip79 :: PR: #3980
- Migrate MoeLayer submodules from ModuleSpec to Protocols by @nschank :: PR: #3426
- Guard non-core imports by @maanug-nv :: PR: #3993
- Fix config compatibility with Megatron-Core by @maanug-nv :: PR: #3995
- Add MimoOptimizer for heterogeneous parallelism by @yashaswikarnati :: PR: #4019
- [Main] Support EP Overlap's Dynamic Computation Stream For Full-Iter CUDA Graph by @Wohox :: PR: #3820
- fix: Handle quantized CUDA tensors in async checkpoint writer by @sbak5 :: PR: #3845
- accept hooks marked with with_kwargs when using te.ops.sequential by @CarlosGomes98 :: PR: #4000
- Use GroupedMLPSubmodules for InferenceGroupedMLP by @nschank :: PR: #3743
- Fix 2D tensor communication for asymmetric DP in Bridge Communicator by @yashaswikarnati :: PR: #4021
- Add distributed checkpoint support for non-colocated MiMo by @yashaswikarnati :: PR: #4020
- CUDA graph support for prefix caching on hybrid models by @lmcafee-nvidia :: PR: #3922
- Add ability to perform local gradient accumulation in FP32 for a subset of parameters in the model by @deepakn94 :: PR: #4028
- Miscellaneous MXFP8 inference fixes by @santhnm2 :: PR: #4017
- Use `torch.int64` for grad_num_zero accumulation by @WanZzzzzz :: PR: #4015
- Make text generation server hostname configurable by @santhnm2 :: PR: #3935
- Add --muon-coefficient-type argument for Muon optimizer by @mchrzanowski :: PR: #3927
- Pass gracefully if token_id not found in message by @i-riyad :: PR: #3862
- Improve load balancing behavior for prefix cache-aware routing by @santhnm2 :: PR: #3930
- Refactor setup.py to use get_pybind_include by @sakgoyal :: PR: #3658
- build: Bump TE to 2.14 by @ko3n1g :: PR: #4025
- fix traceback when interrupting run by @dimapihtar :: PR: #3439
- chore: update goldenvalues by @ko3n1g :: PR: #4059
- Fix TemporalAsyncCaller pin_memory lifetime in async checkpointing by @lvdunlin :: PR: #2288
- chore: Move to Py3.12 by @ko3n1g :: PR: #3826
- Adding NVRx as a dependency and keeping the current code base optionally by @dimapihtar :: PR: #3899
- build: Set `ENV NVTE_BUILD_NUM_PHILOX_ROUNDS=3` by @ko3n1g :: PR: #4074
- fix checkpointing conversion by @dimapihtar :: PR: #4058
- cp: `Megatron-FSDP: Add MXFP8 transpose helper buffer for Hybrid FSDP (3918)` into `core_r0.17.0` by @ko3n1g :: PR: #4105
- cp: `Enable CUDA graph for ADAM optimizer (3429)` into `core_r0.17.0` by @ko3n1g :: PR: #4142
- Release testing/0170 by @ko3n1g :: PR: #4147
- cp: `fix: remove weights_only=False for multimodal example (4104)` into `core_r0.17.0` by @ko3n1g :: PR: #4188
- cp: `Bump nvrx` by @ko3n1g :: PR: #4237
- cp: `build: bump DeepEP to 34152ae (#4228)` into `core_r0.17.0` by @ko3n1g :: PR: #4297