github NVIDIA/Megatron-LM core_v0.17.0
NVIDIA Megatron Core 0.17.0

Changelog Details
  • Fix two minor bugs in MTP implementation for hybrid models by @deepakn94 :: PR: #3194
  • Update README.md by @mvirts :: PR: #2111
  • mRoPE for MTP by @BestJuly :: PR: #3114
  • Fix bug in SFTDataset by @duncanriach :: PR: #3185
  • Fix several syntax errors by @HollowMan6 :: PR: #3004
  • Fix for RL Test by @wdykas :: PR: #3148
  • Fix latent moe flops and backward_dw by @buptzyb :: PR: #2977
  • Use global user buffer when the bucket size does not fit FixedPoolAllocator by @shengf-nv :: PR: #2857
  • ci: Checkpoint retention by @ko3n1g :: PR: #3205
  • Add unit test for LatentMoE by @venmugil :: PR: #2892
  • ci: Enable unit tests on merge-queue by @ko3n1g :: PR: #3186
  • Fix seq pack flag in get_logprobs by @mathemakitten :: PR: #3206
  • ci(fix): Parse unit tests in merge-queue by @ko3n1g :: PR: #3224
  • Fix TE 2.12 AllGather CI failure by @BestJuly :: PR: #3101
  • ci(hotfix): Pin uv by @ko3n1g :: PR: #3233
  • Add a unit test to check that RL get_logprobs will reuse training cudagraphed forward pass by @mathemakitten :: PR: #3209
  • Do not offload grad buffers when training graphs are enabled by @mathemakitten :: PR: #3231
  • Fix missing PackedSeqParams import by @parthmannan :: PR: #3214
  • Synchronize the request counts for EP inference with strict matching by @santhnm2 :: PR: #3033
  • Fix coordinator address collision check in flask by @tdene :: PR: #3208
  • Do not let requests fail silently inside inference engine by @tdene :: PR: #3228
  • torch saver inference model offload by @wdykas :: PR: #3170
  • enable cuda graph ut by @Autumn1998 :: PR: #3197
  • Support EP with HSDP by @wplf :: PR: #2840
  • [Main] Add the missing part to support 1F1B overlap for Qwen3-Next by @BestJuly :: PR: #2997
  • Missing import fix by @parthmannan :: PR: #3241
  • Miscellaneous inference cleanup (Replay of !2955) by @santhnm2 :: PR: #3232
  • Add DistributedInitConfig by @maanug-nv :: PR: #3173
  • Fix checkpoint converter missing parallel group initialization by @yashaswikarnati :: PR: #3217
  • Skip empty sequences and chunks in MTP tensor roll by @BestJuly :: PR: #3035
  • Implement get_parameters for ChainedOptimizer by @nschank :: PR: #3201
  • ci(fix): Create main/dev image tags by @ko3n1g :: PR: #3252
  • Reapply "Add MTP support for hybrid models (#2363)" by @sancha :: PR: #3207
  • Fix uv install for GH actions by @Phlip79 :: PR: #3259
  • Update the project structure in README by @janEbert :: PR: #3251
  • Cherry-pick: Fix mtp_num_layers and clip_qk issues (#2581, #2776) by @BestJuly :: PR: #3075
  • RL: training cudagraphs functional test by @mathemakitten :: PR: #3235
  • [Main] fix cg missing wgrad hook by @Wohox :: PR: #3074
  • Avoid .cuda call on meta device in LanguageModel by @nschank :: PR: #3202
  • fix checkpointing error message by @dimapihtar :: PR: #3203
  • Nano QAT/D fix with sft tokenizer and datasets by @ChenhanYu :: PR: #3254
  • Revert "fix checkpointing error message (#3203)" by @ko3n1g :: PR: #3283
  • Reapply "fix checkpointing error message (#3203)" (#3283) by @ko3n1g :: PR: #3285
  • docs: Add changelog for 0.15.3 by @ko3n1g :: PR: #3286
  • ci: Set throughput tests as flaky by @chtruong814 :: PR: #3301
  • chore: Move GB200 tests to nightly by @ko3n1g :: PR: #3302
  • Ensure type-checker understands use of Submodules in bert_model by @nschank :: PR: #3256
  • Override extra_repr instead of repr by @nschank :: PR: #3200
  • Replace ModuleSpec with Protocols for LayerNorm submodules by @nschank :: PR: #3090
  • Non colocated refit by @wdykas :: PR: #3213
  • Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training by @xiaoxi-wangfj :: PR: #2763
  • Add check to prevent MFSDP from numeric issue in gradient accumulate fusion by @shjwudp :: PR: #2904
  • update get_embedding_ranks and get_position_embedding_ranks docstrings by @c1lovez1 :: PR: #3223
  • Param offset in _ParamAndGradBucket should be aligned by @skydoorkai :: PR: #3007
  • ci: Add secrets detector by @chtruong814 :: PR: #3180
  • Ensure type-checker understands use of Submodules in llava_model by @nschank :: PR: #3257
  • updates to support modelopt EAGLE training with CP by @yeyu-nvidia :: PR: #3147
  • fully remove legacy tokenizer system by @dimapihtar :: PR: #2946
  • M-FSDP: Remove redundant stream waits in HSDP to prevent CG fail by @shjwudp :: PR: #2941
  • General README and pyproject fixes by @ahmadki :: PR: #2907
  • chore: More aggressive checkpointing by @ko3n1g :: PR: #3315
  • ci: Pin down setuptools to lt 82 by @ko3n1g :: PR: #3313
  • fix: numpy overflow by @ko3n1g :: PR: #3306
  • fix: T5 dataset by @ko3n1g :: PR: #3307
  • ci: Revert "ci: Add secrets detector (#3180)" by @chtruong814 :: PR: #3330
  • ci: Add more tests, run on merge-queue by @ko3n1g :: PR: #3317
  • ci: Remove merge-gate environment check by @chtruong814 :: PR: #3331
  • Use FP4 context for mamba by @kwyss-nvidia :: PR: #2604
  • ci: Ensure we run all functional tests in merge group by @chtruong814 :: PR: #3332
  • Replace ModuleSpec with Protocols for inputs to MLP by @nschank :: PR: #3084
  • ci: Fix merge queue functional tests by @chtruong814 :: PR: #3337
  • ci: skip queue in merge-gate by @ko3n1g :: PR: #3343
  • ci: Timeout for functional tests by @ko3n1g :: PR: #3346
  • update checkpointing documentation by @dimapihtar :: PR: #3347
  • Update golden values to reflect improvements by @tdene :: PR: #3350
  • BUGFIX: gpt vs hybrid model mtp naming mismatch by @sancha :: PR: #3334
  • Disable flaky test by @tdene :: PR: #3354
  • re-enable gpt grpo tests by @jon-barker :: PR: #3348
  • Fix SFT Pipeline when TP>1 by @asolergi-nv :: PR: #3268
  • Fixes for KD mode by @AAnoosheh :: PR: #3342
  • chore: Update codeowners file by @ko3n1g :: PR: #3365
  • Siddharth/fix inference functional tests by @sidsingh-nvidia :: PR: #3357
  • Switch oncall by @janEbert :: PR: #3360
  • Add missing RMSNorm to llama train script by @AAnoosheh :: PR: #3314
  • Fix inference for MTP models by @tdene :: PR: #3297
  • Add a logprobs test with real gpt model. by @yobibyte :: PR: #2870
  • Add simple GRPO functional test by @tdene :: PR: #3323
  • ci: Concurrency control for merge-queue by @ko3n1g :: PR: #3353
  • ci: Update golden value download script to work with Github by @chtruong814 :: PR: #3335
  • fix: correct typos 'seperated' and 'recieved' by @thecaptain789 :: PR: #3305
  • Improved PyTorch profiler and added PyTorch execution trace by @shengf-nv :: PR: #3273
  • Removing etc from main index page, shifted name of discussions by @megnvidia :: PR: #3271
  • build: Bump TE on 2.12 by @ko3n1g :: PR: #3371
  • ci(hotfix): job conditions by @ko3n1g :: PR: #3376
  • Record moe routing decisions during inference. by @sidsingh-nvidia :: PR: #3034
  • [Main] Fix EP Overlap Bugs for Full-Iter CG by @Wohox :: PR: #3164
  • Avoid direct pickle import by @maanug-nv :: PR: #3375
  • Delete old pretrain_* files by @Phlip79 :: PR: #3359
  • Add Qwen3-VL support with Megatron-FSDP by @xuwchen :: PR: #2841
  • Refactor Mamba chunked prefill by @santhnm2 :: PR: #3265
  • Improved parallel logging of learning rate by @jstjohn :: PR: #3319
  • Add enhanced event tracking with TTFT measurement and compact serialization. by @lmcafee-nvidia :: PR: #3253
  • Add assertion that max_requests is divisible by tp_size by @santhnm2 :: PR: #3304
  • Move to using the Inference OpenAI API server by @ArEsKay3 :: PR: #3107
  • Update moe github test cases. by @Victarry :: PR: #3077
  • Split layer_specs to return Submodules instead of ModuleSpecs by @nschank :: PR: #3255
  • ci: Remove gpu sanity check by @chtruong814 :: PR: #3420
  • [Critical-Bug] Fix Uneven PP for Mamba models (Nemotron3-nano) by @kevalmorabia97 :: PR: #3399
  • Fix for rl by @shanmugamr1992 :: PR: #3390
  • Add check for full_iteration scope before instantiating CudaGraphManager by @vasunvidia :: PR: #3362
  • Fix broken links throughout by @megnvidia :: PR: #3230
  • Decouple topk and loss from DSA Indexer by @kunlunl :: PR: #3248
  • Extract intermediate embeddings of transformer block by @sajadn :: PR: #3060
  • Move to using the Inference OpenAI API server (bis) by @tdene :: PR: #3395
  • Make Mamba inference state memory ratio configurable by @santhnm2 :: PR: #3322
  • Fix configs for RL model environments by @tdene :: PR: #3441
  • Replace pickle with json in rl_utils by @tdene :: PR: #3351
  • fix: correct typo in demo training example by @dndnda :: PR: #3428
  • Clean up logging inside inference flask server by @tdene :: PR: #3437
  • ci: Update release-docs workflow to use FW-CI-templates v0.72.0 by @chtruong814 :: PR: #3438
  • Fix --tokenizer-hf-include-special-tokens by @jon-barker :: PR: #3422
  • Update num_tokens_to_generate default for Gym by @tdene :: PR: #3453
  • Fix slowdown in inference flask server by @tdene :: PR: #3445
  • Add a normalized scale for MTP per token loss by @BestJuly :: PR: #3159
  • [Bugfix] Fix nan loss caused by zero token in MTP by @BestJuly :: PR: #3396
  • Log RL metrics per environment by @yobibyte :: PR: #3446
  • Move tensor offload/onload out of RL code by @tdene :: PR: #3029
  • Fix another inference flask / Gym interaction by @tdene :: PR: #3467
  • Add Engine event to the follow up requests after checkpointing by @ArEsKay3 :: PR: #3473
  • adding in copyright blurb at the top of md file by @megnvidia :: PR: #3394
  • [Megatron-FSDP] Add fsdp_all_gather_in_start_param_sync option in DDP Config by @shjwudp :: PR: #3095
  • ci: Update release workflow to include changelog and publish docs by @chtruong814 :: PR: #3472
  • ci(fix): Weekly GPT tests by @ko3n1g :: PR: #3443
  • ci: Remove environments by @ko3n1g :: PR: #3462
  • update HF tokenizer defaults by @dimapihtar :: PR: #3440
  • ci: Bump preflight to detect our svc by @ko3n1g :: PR: #3494
  • build: Drop Python 3.10 support and pip install one-logger by @ko3n1g :: PR: #3485
  • PTQ changes for upcoming QAD by @AAnoosheh :: PR: #3124
  • ci: Bump pre-flight for Bot SSO by @ko3n1g :: PR: #3497
  • Revert "build: Drop Python 3.10 support and pip install one-logger (#… by @ko3n1g :: PR: #3500
  • Fix chunked prefill edge cases by @santhnm2 :: PR: #3404
  • ci: Enable MBridge downstream testing via PR by @ko3n1g :: PR: #3483
  • ci: Remove gitlab docs build job and set LTS integration and functional tests to allow failure by @chtruong814 :: PR: #3349
  • [OMNIML-3232] ModelOpt: add full TE spec option and wire Mamba stack / post-training scripts by @yueshen2016 :: PR: #3393
  • Track off-policyness across RL steps by @tdene :: PR: #3030
  • ci: MBridge testing branch name during merge-queues by @ko3n1g :: PR: #3513
  • ci: Enable Dependabot Automerge by @ko3n1g :: PR: #3487
  • ci: Also sync direct teams by @ko3n1g :: PR: #3484
  • Multimodal: fix argument checking by @faradawn :: PR: #3449
  • Fix Megatron-FSDP fully_shard() optimizer state DCP checkpointing, and fix DTensor deepcopy bug from PyTorch 26.01. by @cspades :: PR: #3510
  • Re-enable full_iteration cuda graphs for inference. Add them for the mamba block. by @sidsingh-nvidia :: PR: #3250
  • do not add EoD by @arendu :: PR: #3526
  • Do not Slack notify for draft PRs by @Phlip79 :: PR: #3536
  • remove deprecated SampleListWebdataset by @dimapihtar :: PR: #3407
  • remove deprecated get_te_version by @dimapihtar :: PR: #3413
  • remove deprecated async_grad_allreduce param by @dimapihtar :: PR: #3412
  • remove deprecated mamba params by @dimapihtar :: PR: #3411
  • remove deprecated params from model parallel config by @dimapihtar :: PR: #3408
  • Remove redundant CUDA calls in the LLaVA dataloader by @duncanriach :: PR: #3476
  • Inference: Create finer grained cuda-graphs with better coverage of smaller batch sizes by @sidsingh-nvidia :: PR: #3527
  • fix: skip non-tensor optimizer state entries in distrib_optimizer sav… by @ahmadki :: PR: #3537
  • remove encoder_and_decoder from enums by @dimapihtar :: PR: #3406
  • remove is_unitialized & get_data_modulo_expert_parallel_group by @dimapihtar :: PR: #3414
  • remove deprecated TE module by @dimapihtar :: PR: #3409
  • Add knobs to choose process groups for fully-parallel-save / load and load-exchange-algo by @sbak5 :: PR: #2161
  • Fix off-by-2 error in RL sequence packing by @tdene :: PR: #3551
  • Skip unnecessary flattening for Save / Load Planner by @sbak5 :: PR: #3263
  • Multimodal: fix model provider by @faradawn :: PR: #3508
  • docs: Enable nightly docs publish by @chtruong814 :: PR: #3546
  • Ensure type-checker understands use of Submodules in unit tests by @nschank :: PR: #3425
  • Use copy_signature to preserve typing of pass-through methods by @nschank :: PR: #3419
  • Ensure type-checker understands use of Submodules in MTP by @nschank :: PR: #3308
  • Add mxfp8 quantization for inference linear layers by @santhnm2 :: PR: #3447
  • Fixed fp32 residuals by @mkhona-nvidia :: PR: #3504
  • Move config src files into a dedicated dir by @maanug-nv :: PR: #3570
  • Revert "remove encoder_and_decoder from enums (#3406)" by @ko3n1g :: PR: #3579
  • Fix default cuda graph persist arg. Add persist to rl common.sh. by @yobibyte :: PR: #3584
  • Optimize away add request overheads in dummy ep cuda-graphed forward passes by @sidsingh-nvidia :: PR: #3525
  • ci: Test docs build by @ko3n1g :: PR: #3583
  • ci: Fix docs build for release by @ko3n1g :: PR: #3597
  • ci: Remove secrets by @ko3n1g :: PR: #3598
  • ci: Define secrets by @ko3n1g :: PR: #3599
  • ci: gh-release-from-tag by @ko3n1g :: PR: #3600
  • Ko3n1g/ci/remove twine username by @ko3n1g :: PR: #3601
  • Add training code to MCore wheel by @maanug-nv :: PR: #3573
  • FP8 attention knob for nvFP4 recipe by @vasunvidia :: PR: #3363
  • Fix error with --load-main-params-from-ckpt by @guyueh1 :: PR: #3569
  • ci: Create comment by @ko3n1g :: PR: #3610
  • ci: Skip cleanup-taint-node jobs during deployments by @ko3n1g :: PR: #3612
  • ci: No comment for release workflow by @ko3n1g :: PR: #3615
  • ci: Re-add release tag prefix by @ko3n1g :: PR: #3619
  • docs: Fix version picker urls by @chtruong814 :: PR: #3621
  • ci: Increase changelog generation max PRs fetched by @chtruong814 :: PR: #3620
  • Add debug info to an assert. by @yobibyte :: PR: #3588
  • fix: async_utils: explicit GC in persistent checkpoint worker loop by @sbak5 :: PR: #3591
  • Fix: Perform sigmoid calculation in fp32 for aux loss stability by @CodersAcademy006 :: PR: #2765
  • fix: forward use_te_activation_func flag in non-MoE GPT layer spec by @saakshigupta2002 :: PR: #3300
  • Revert "Add single-process checkpoint save to avoid forked multiproce… by @ko3n1g :: PR: #3630
  • Track and plot per-token off-policy in RL by @tdene :: PR: #3515
  • Multimodal: fix VQA dataset selection by @faradawn :: PR: #3464
  • Multimodal: Fix multimodal training example - tokenizer, Triton Cache Manager patch, docs by @faradawn :: PR: #3507
  • Multimodal: Limit transformer version in Dockerfile by @faradawn :: PR: #3448
  • Support TP > GQA for inference by @santhnm2 :: PR: #3627
  • μP: Maximal Update Parameterization by @plugyawn :: PR: #3058
  • Add flexible virtual pipeline parallel (fVPP) to hybrid model by @duncanriach :: PR: #3377
  • Explicitly close and join Pool in preprocess_data.py by @weijiac0619 :: PR: #3592
  • remove indexer by @dimapihtar :: PR: #3416
  • Multimodal: add load weights only by @faradawn :: PR: #3452
  • Add single-process checkpoint save to avoid forked multiprocessing by @sbak5 :: PR: #3633
  • Update oncall schedule by @Phlip79 :: PR: #3632
  • M-FSDP: Cancel erroneous grad accumulation check by @shjwudp :: PR: #3629
  • Fix MoE aux loss tracker hang with MTP enabled by @Victarry :: PR: #3401
  • Fix test data preparation by @janEbert :: PR: #3652
  • Add GPTOSS Example with Megatron-LM + Megatron Bridge by @faradawn :: PR: #3018
  • Add thd unit test main by @kunlunl :: PR: #3617
  • Inference | KV prefix caching. by @lmcafee-nvidia :: PR: #3063
  • [Megatron-FSDP] Add dtype customization to Megatron-FSDP. by @cspades :: PR: #3067
  • CachedMetadataFileSystemReader: shared cache by @sbak5 :: PR: #3326
  • Inference Optimized MoEs by @sidsingh-nvidia :: PR: #3496
  • Log torch_memory_saver offload/onload by @tdene :: PR: #3567
  • Prefix caching | Mamba memory only. by @lmcafee-nvidia :: PR: #3657
  • Prefix caching | Coordinator scheduling. by @lmcafee-nvidia :: PR: #3665
  • Adding manual Claude reviewer by @Phlip79 :: PR: #3679
  • Nemo-RL Refit by @wdykas :: PR: #3520
  • Add extra permissions and make other changes by @Phlip79 :: PR: #3683
  • Claude should always comment something by @Phlip79 :: PR: #3685
  • [Cleanup] Remove the deprecated GroupedMLP by @dimapihtar :: PR: #3410
  • Fix illegal memory access with mamba inference by @tdene :: PR: #3631
  • Fix illegal memory access with mamba inference (bis) by @tdene :: PR: #3696
  • remove duplicate rerun_state_machine.set_mode(rerun_mode) by @YangWang92 :: PR: #3279
  • Correct indexing when cp_comms_type is a list by @jeromeku :: PR: #3389
  • Fix optional chat_completions returnables by @tdene :: PR: #3519
  • ci: Claude code review by @ko3n1g :: PR: #3704
  • ci: Fix event payload by @ko3n1g :: PR: #3705
  • ci: Use issue number by @ko3n1g :: PR: #3706
  • ci: Finalize Claude review by @ko3n1g :: PR: #3707
  • ci: Add codecov yml by @thomasdhc :: PR: #3455
  • adding public_docs_features: True to get proper legal footer… by @megnvidia :: PR: #3681
  • Robust signaling for coordinator inference by @tdene :: PR: #3563
  • Fix memory issue in mxfp8 model init by @WanZzzzzz :: PR: #3461
  • add --overlap-param-gather support for layer-wise optimizer. lots of unit tests. by @mchrzanowski :: PR: #3524
  • ci: Mount and enforce HF_HOME by @ko3n1g :: PR: #3700
  • Add flags for changing Mamba inference state tensor dtype by @santhnm2 :: PR: #3660
  • chore: CLI launch internal CI by @ko3n1g :: PR: #3695
  • Change Review Process by @Phlip79 :: PR: #3659
  • ci: Separate queues for internal/external contributors by @ko3n1g :: PR: #3718
  • Update to correct token by @Phlip79 :: PR: #3724
  • build: Bump to NGC PyTorch 26.02 by @ko3n1g :: PR: #3474
  • Claude: use Opus 4.6 and auto-review on ready by @Phlip79 :: PR: #3727
  • Claude to add complexity label by @Phlip79 :: PR: #3709
  • Offload Flask frontend to separate process by @santhnm2 :: PR: #3648
  • chore: Use PAT for CLI Launcher by @ko3n1g :: PR: #3734
  • ci: Add missing gitlab rule by @ko3n1g :: PR: #3735
  • [main] Add TE CUDA Graph Support for Vision Encoder by @tomlifu :: PR: #3293
  • fix(moe): fix TE general_gemm API change by @hxbai :: PR: #3582
  • Review process fixes by @Phlip79 :: PR: #3728
  • Print more verbose error message about incorrect model_parallel_size. by @rj42 :: PR: #2639
  • ci: Update golden values after PyT bump by @ko3n1g :: PR: #3733
  • Optimize process management and delete operations for async save by @sbak5 :: PR: #3262
  • Align gpt-oss window-size with 128-token sliding window by @returnL :: PR: #2771
  • fix: temperature validation error message 1000.0 -> 100.0 by @CreeperLKF :: PR: #2688
  • RL: Hybrid MoE training cudagraphs and fix training <-> inference transition by @mathemakitten :: PR: #3373
  • Fix dynamic inference and GRPO functional tests by @santhnm2 :: PR: #3740
  • Swap oncall by @janEbert :: PR: #3585
  • [bugfix] fix the bug that loss: 0 will not be printed by @leisuzz :: PR: #1555
  • Fused dLN + add in backwards pass by @CarlosGomes98 :: PR: #3384
  • Claude: run actions on target branch by @Phlip79 :: PR: #3745
  • revert of #2658 by @dimapihtar :: PR: #3736
  • Update README Quick Start by @ilml :: PR: #3596
  • Re-enable tests which were failing on #3373 by @mathemakitten :: PR: #3757
  • Check reviews properly by @Phlip79 :: PR: #3756
  • Add CP + Sequence Packing support for Mimo by @mehraakash :: PR: #2135
  • MXFP8 refit by @wdykas :: PR: #3742
  • Claude: update token usage by @Phlip79 :: PR: #3760
  • Handle Tool Call Argument Parsing by @sancha :: PR: #3662
  • RL support for nanov3 sft checkpoint by @jon-barker :: PR: #3741
  • add mix_hidden_states option in conversion by @yeyu-nvidia :: PR: #3655
  • ci: Optimize release-configs for GB200 by @ko3n1g :: PR: #3541
  • Add absorbed-mla by @kunlunl :: PR: #3198
  • feat(checkpoint): zero-copy storage sharing in CheckpointWithoutOutput by @Victarry :: PR: #3649
  • Fuse MLA DOWN projection GEMMs by @cjld :: PR: #3039
  • fix: skip FSDP DTensor boundary validation under fake process group by @Victarry :: PR: #3686
  • [main] fix(moe): fix the bug where gate was not sliced when kv_head < tp_size. by @yuzhongw-nvidia :: PR: #3575
  • fix(offload): reset activation offload manager after eval as well as … by @rapatel :: PR: #3739
  • Improve error logging when invalid number of tokens is requested. by @yobibyte :: PR: #3680
  • Add NVIDIA-Nemotron-3-Super-120B-A12B-BF16 to ModelOpt examples by @jenchen13 :: PR: #3805
  • build: Bump TE2.13 by @ko3n1g :: PR: #3800
  • Ensure dummy_forward does not attempt to run cudagraphs by @jalbericiola :: PR: #3789
  • Add speculative decoding support with MTP layers by @santhnm2 :: PR: #3594
  • Shanmugamr1992/megatron inference ultra by @shanmugamr1992 :: PR: #3784
  • Fix backward compatibility issue with MFSDP --grad-reduce-in-bf16 by @shjwudp :: PR: #3799
  • feat: add NCCL flight recorder configuration support by @sbak5 :: PR: #3806
  • Revert "Ensure dummy_forward does not attempt to run cudagraphs (#3789)" by @ko3n1g :: PR: #3834
  • Fix if statement in main by @tdene :: PR: #3833
  • Update golden values of weekly tests by @ko3n1g :: PR: #3829
  • build: Loosen TE restriction by @ko3n1g :: PR: #3827
  • Do not let chunked prefill generate decode logprobs by @tdene :: PR: #3777
  • Prevent double serialization inside Flask server by @tdene :: PR: #3653
  • Allow RL to run inference-only via skip-train by @tdene :: PR: #3744
  • Announce Python 3.12 migration by @ko3n1g :: PR: #3825
  • ci: Skip test_wrong_cuda_graph_impl_returns_false in LTS by @chtruong814 :: PR: #3847
  • ci: Mark TestCoordinator.test_throughput as flaky by @chtruong814 :: PR: #3849
  • find optimal number of workers by @dimapihtar :: PR: #3699
  • remove encoder_and_decoder by @dimapihtar :: PR: #3836
  • ci: Skip more tests in test_vision_cuda_graphs for LTS by @chtruong814 :: PR: #3860
  • Ensure that inference dummy_forward does not try to match on a cudagraph when running eager by @mathemakitten :: PR: #3815
  • Fix flakiness due to timing between shutdowns by @tdene :: PR: #3857
  • Add unit tests for speculative decoding by @santhnm2 :: PR: #3817
  • Exposing interleave argument for fused_apply_rotary_pos_emb_thd by @huvunvidia :: PR: #3794
  • ci: install nvidia-resiliency-ext from source by @ko3n1g :: PR: #3861
  • Miscellaneous inference bug fixes by @santhnm2 :: PR: #3840
  • Nemo-RL integration bugfixes for --transformer-impl inference_optimized by @sidsingh-nvidia :: PR: #3851
  • remove legacy mpu by @dimapihtar :: PR: #3854
  • enable async save for functional tests by @dimapihtar :: PR: #3855
  • remove legacy data by @dimapihtar :: PR: #3853
  • docs: Document python-gitlab dependency by @ko3n1g :: PR: #3863
  • Fsdp dsv3 proxy by @gautham-kollu :: PR: #3844
  • Fix token dispatched cudagraph_attrs by @asolergi-nv :: PR: #3625
  • Fix slowdown in serialization by @tdene :: PR: #3872
  • Establish reviewers for training code by @maanug-nv :: PR: #3765
  • Fix quantize.py script and support packed sequences in pretrain_gpt.py by @AAnoosheh :: PR: #3564
  • Use fp32 state dtypes for Mamba inference functional test by @santhnm2 :: PR: #3888
  • [Megatron-FSDP] Support 'auto' argument which defaults to pre-MixedPrecisionPolicy be… by @cspades :: PR: #3810
  • Bug fix: add missing packages to Multimodal Dockerfile by @faradawn :: PR: #3417
  • Reverse polarity of the off-policy measurement by @tdene :: PR: #3580
  • Update nightly golden values after TE2.13 by @ko3n1g :: PR: #3886
  • enable use_persistent_ckpt_worker for ci tests by @dimapihtar :: PR: #3898
  • Correctly generate state dict in MultiTokenPredictionBlock by @asolergi-nv :: PR: #3624
  • Add torch grouped gemm bf16 and mxfp8 support w/ cuda graphed + inference_optimized MoEs by @sidsingh-nvidia :: PR: #3858
  • ci: Fix build-test-publish summary job always passing by @ko3n1g :: PR: #3905
  • ci: Skip gpt3_mcore_te_tp1_pp4_vp1 for now by @chtruong814 :: PR: #3908
  • ci: Fix build-and-test-wheels jobs for arm by @chtruong814 :: PR: #3910
  • Add Lion optimizer support by @mchrzanowski :: PR: #3813
  • Support multimodule pipelining in 1F1B schedule by @yashaswikarnati :: PR: #3129
  • Add a config parameter for retaining pinned cpu buffers for cpu offloading by @rapatel :: PR: #3151
  • Inference | Hybrid prefix caching. by @lmcafee-nvidia :: PR: #3225
  • Hotfix for eviction issue by @tdene :: PR: #3914
  • Parity with VLLM over the reasoning field by @tdene :: PR: #3873
  • CI: add parallel GB200 integration test track by @ko3n1g :: PR: #3901
  • Track errors through the inference return path by @tdene :: PR: #3776
  • Fix: Defensively close GPU device FDs in dataloader worker processes by @hexinw-nvidia :: PR: #3684
  • Fix hybrid dynamic inference functional tests by @santhnm2 :: PR: #3924
  • Patch EOD out of inference results by @tdene :: PR: #3866
  • ci: Add mr-github-slim label by @ko3n1g :: PR: #3934
  • Revert "ci: Skip gpt3_mcore_te_tp1_pp4_vp1 for now (#3908)" by @chtruong814 :: PR: #3926
  • Exclude arguments.py from training review by @maanug-nv :: PR: #3906
  • ci: Fix sso users check by @chtruong814 :: PR: #3938
  • move router replay doc to advanced feature part by @ilml :: PR: #3929
  • Fix DDP bug with --overlap-grad-reduce and --num-distributed-optimizer-instances > 1 by @wplf :: PR: #3693
  • Fix incorrect HAVE_TE detection in multiple modules by @returnL :: PR: #3763
  • Implement forced lag in RL by @tdene :: PR: #3517
  • Refactor VisionTECudaGraphHelper to minimize overrides and clarify state tracking by @buptzyb :: PR: #3748
  • Fix external contributor concurrency to be global across all branches by @ko3n1g :: PR: #3951
  • Fix 3-way merge issue that broke main by @tdene :: PR: #3949
  • Fix Nemo_CICD_Test not catching cancelled/skipped functional tests by @ko3n1g :: PR: #3947
  • Guard cudagraph input copy on whether data pointers have actually changed by @mathemakitten :: PR: #3948
  • Enforce that flashinfer cache has been installed for inference-optimized MoE layers by @santhnm2 :: PR: #3941
  • chore: remove nv-grouped-gemm dependency by @liuyun7345 :: PR: #3770
  • Prevent failures due to prevent_retokenization by @tdene :: PR: #3958
  • ultra refit by @wdykas :: PR: #3904
  • [Fix][Main] Missing Assertion for moe layer recompute in A2A Overlap by @Wohox :: PR: #3917
  • Move Megatron-FSDP MixedPrecisionPolicy arguments from FSDP adapter t… by @cspades :: PR: #3903
  • chore: bump FW-CI-templates to v0.80.2 by @ko3n1g :: PR: #3961
  • Rename RL timers to be consistent by @tdene :: PR: #3878
  • ci: centralize run configuration in a single configure job by @ko3n1g :: PR: #3962
  • ci: Split unit tests into smaller groups by @ko3n1g :: PR: #3966
  • Refit optimization by @wdykas :: PR: #3933
  • common strategy simplification by @dimapihtar :: PR: #3229
  • Cudagraphs: Remove fwd_graph_input_surface weakref by @mathemakitten :: PR: #3970
  • fix: interpolate version correctly in release Slack notification by @ko3n1g :: PR: #3977
  • Make args and kwargs optional positional arguments for the Module hooks. by @cspades :: PR: #3976
  • ci: Add core-adlr and core-nemo to megatron/training codeowners by @chtruong814 :: PR: #3979
  • Small quality-of-life improvements in megatron/training by @deepakn94 :: PR: #3957
  • Update throughput golden values to reflect speedup by @tdene :: PR: #3983
  • Revert "ci: Add core-adlr and core-nemo to megatron/training codeowners (#3979)" by @chtruong814 :: PR: #3982
  • ci: Add --repo flag to gh pr view in configure job by @ko3n1g :: PR: #3989
  • Add common pile scripts by @Phlip79 :: PR: #3902
  • Introduce GDN to Mamba by @Phlip79 :: PR: #3535
  • Fix IndexError in uniform activation recompute when num_layers not divisible by recompute_num_layers by @saakshigupta2002 :: PR: #3562
  • Scaling for MuP over Muon optimizer. by @plugyawn :: PR: #3715
  • Pass Megatron-FSDP MixedPrecision args to DDPConfig. by @cspades :: PR: #3992
  • [OMNIML-3721] Fix tokenizer unwrapping for nested Megatron-Core tokenizer by @jenchen13 :: PR: #3967
  • Forced load imbalance by @nanz-nv :: PR: #3380
  • Add /claude copy command by @Phlip79 :: PR: #3978
  • Add multi-module heterogeneous parallelism support for MIMO model by @yashaswikarnati :: PR: #3211
  • added vllm fakequant export support by @kinjalpatel27 :: PR: #3050
  • fix(modelopt): use bash array for MLM_EXTRA_ARGS to preserve quoting by @jenchen13 :: PR: #4002
  • fix: use dump file prefix for NCCL flight recorder temp files by @sbak5 :: PR: #3955
  • Fix PersistentAsyncCaller.del crash during Python shutdown by @cluster2600 :: PR: #3781
  • ci: Run L1 MBridge tests in merge queue by @chtruong814 :: PR: #4009
  • Update Claude review by @Phlip79 :: PR: #3980
  • Migrate MoeLayer submodules from ModuleSpec to Protocols by @nschank :: PR: #3426
  • Guard non-core imports by @maanug-nv :: PR: #3993
  • Fix config compatibility with Megatron-Core by @maanug-nv :: PR: #3995
  • Add MimoOptimizer for heterogeneous parallelism by @yashaswikarnati :: PR: #4019
  • [Main] Support EP Overlap's Dynamic Computation Stream For Full-Iter CUDA Graph by @Wohox :: PR: #3820
  • fix: Handle quantized CUDA tensors in async checkpoint writer by @sbak5 :: PR: #3845
  • accept hooks marked with with_kwargs when using te.ops.sequential by @CarlosGomes98 :: PR: #4000
  • Use GroupedMLPSubmodules for InferenceGroupedMLP by @nschank :: PR: #3743
  • Fix 2D tensor communication for asymmetric DP in Bridge Communicator by @yashaswikarnati :: PR: #4021
  • Add distributed checkpoint support for non-colocated MiMo by @yashaswikarnati :: PR: #4020
  • CUDA graph support for prefix caching on hybrid models by @lmcafee-nvidia :: PR: #3922
  • Add ability to perform local gradient accumulation in FP32 for a subset of parameters in the model by @deepakn94 :: PR: #4028
  • Miscellaneous MXFP8 inference fixes by @santhnm2 :: PR: #4017
  • Use torch.int64 for grad_num_zero accumulation by @WanZzzzzz :: PR: #4015
  • Make text generation server hostname configurable by @santhnm2 :: PR: #3935
  • Add --muon-coefficient-type argument for Muon optimizer by @mchrzanowski :: PR: #3927
  • Pass gracefully if token_id not found in message by @i-riyad :: PR: #3862
  • Improve load balancing behavior for prefix cache-aware routing by @santhnm2 :: PR: #3930
  • Refactor setup.py to use get_pybind_include by @sakgoyal :: PR: #3658
  • build: Bump TE to 2.14 by @ko3n1g :: PR: #4025
  • fix traceback when interrupting run by @dimapihtar :: PR: #3439
  • chore: update goldenvalues by @ko3n1g :: PR: #4059
  • Fix TemporalAsyncCaller pin_memory lifetime in async checkpointing by @lvdunlin :: PR: #2288
  • chore: Move to Py3.12 by @ko3n1g :: PR: #3826
  • Adding NVRx as a dependency and keeping the current code base optionally by @dimapihtar :: PR: #3899
  • build: Set ENV NVTE_BUILD_NUM_PHILOX_ROUNDS=3 by @ko3n1g :: PR: #4074
  • fix checkpointing conversion by @dimapihtar :: PR: #4058
  • cp: Megatron-FSDP: Add MXFP8 transpose helper buffer for Hybrid FSDP (3918) into core_r0.17.0 by @ko3n1g :: PR: #4105
  • cp: Enable CUDA graph for ADAM optimizer (3429) into core_r0.17.0 by @ko3n1g :: PR: #4142
  • Release testing/0170 by @ko3n1g :: PR: #4147
  • cp: fix: remove weights_only=False for multimodal example (4104) into core_r0.17.0 by @ko3n1g :: PR: #4188
  • cp: Bump nvrx by @ko3n1g :: PR: #4237
  • cp: build: bump DeepEP to 34152ae (#4228) into core_r0.17.0 by @ko3n1g :: PR: #4297
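This release corresponds to the core_v0.17.0 tag of the NVIDIA/Megatron-LM repository. A minimal sketch of picking it up, assuming the library is consumed via the megatron-core package on PyPI (the package name and matching version number are assumptions, not stated in these notes):

```shell
# Install the Megatron Core library matching this release (assumed PyPI name)
pip install megatron-core==0.17.0

# Or work from the source tree pinned to the exact release tag
git clone --branch core_v0.17.0 https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
```

Working from the tagged source tree is the safer option when a PR listed above touches training scripts (e.g. the pretrain_* files), since those may not ship in the PyPI wheel.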
