Changelog Details
- Fix two minor bugs in MTP implementation for hybrid models by @deepakn94 :: PR: #3194
- Update README.md by @mvirts :: PR: #2111
- mRoPE for MTP by @BestJuly :: PR: #3114
- Fix bug in SFTDataset by @duncanriach :: PR: #3185
- Fix several syntax error by @HollowMan6 :: PR: #3004
- Fix for RL Test by @wdykas :: PR: #3148
- Fix latent moe flops and backward_dw by @buptzyb :: PR: #2977
- Use global user buffer when the bucket size does not fit FixedPoolAllocator by @shengf-nv :: PR: #2857
- ci: Checkpoint retention by @ko3n1g :: PR: #3205
- Add unit test for LatentMoE by @venmugil :: PR: #2892
- ci: Enable unit tests on merge-queue by @ko3n1g :: PR: #3186
- Fix seq pack flag in `get_logprobs` by @mathemakitten :: PR: #3206
- ci(fix): Parse unit tests in merge-queue by @ko3n1g :: PR: #3224
- Fix TE 2.12 AllGather CI failure by @BestJuly :: PR: #3101
- ci(hotfix): Pin uv by @ko3n1g :: PR: #3233
- Add a unit test to check that RL `get_logprobs` will reuse training cudagraphed forward pass by @mathemakitten :: PR: #3209
- Do not offload grad buffers when training graphs are enabled by @mathemakitten :: PR: #3231
- Fix missing PackedSeqParams import by @parthmannan :: PR: #3214
- Synchronize the request counts for EP inference with strict matching by @santhnm2 :: PR: #3033
- Fix coordinator address collision check in flask by @tdene :: PR: #3208
- Do not let requests fail silently inside inference engine by @tdene :: PR: #3228
- torch saver inference model offload by @wdykas :: PR: #3170
- enable cuda graph ut by @Autumn1998 :: PR: #3197
- Support EP with HSDP by @wplf :: PR: #2840
- [Main] Add the missing part to support 1F1B overlap for Qwen3-Next by @BestJuly :: PR: #2997
- Missing import fix by @parthmannan :: PR: #3241
- Miscellaneous inference cleanup (Replay of !2955) by @santhnm2 :: PR: #3232
- Add DistributedInitConfig by @maanug-nv :: PR: #3173
- Fix checkpoint converter missing parallel group initialization by @yashaswikarnati :: PR: #3217
- Skip empty sequences and chunks in MTP tensor roll by @BestJuly :: PR: #3035
- Implement get_parameters for ChainedOptimizer by @nschank :: PR: #3201
- ci(fix): Create main/dev image tags by @ko3n1g :: PR: #3252
- Reapply "Add MTP support for hybrid models (#2363)" by @sancha :: PR: #3207
- Fix uv install for GH actions by @Phlip79 :: PR: #3259
- Update the project structure in README by @janEbert :: PR: #3251
- Cherry-pick: Fix mtp_num_layers and clip_qk issues (#2581, #2776) by @BestJuly :: PR: #3075
- RL: training cudagraphs functional test by @mathemakitten :: PR: #3235
- [Main] fix cg missing wgrad hook by @Wohox :: PR: #3074
- Avoid .cuda call on meta device in LanguageModel by @nschank :: PR: #3202
- fix checkpointing error message by @dimapihtar :: PR: #3203
- Nano QAT/D fix with sft tokenizer and datasets by @ChenhanYu :: PR: #3254
- Revert "fix checkpointing error message (#3203)" by @ko3n1g :: PR: #3283
- Reapply "fix checkpointing error message (#3203)" (#3283) by @ko3n1g :: PR: #3285
- docs: Add changelog for 0.15.3 by @ko3n1g :: PR: #3286
- ci: Set throughput tests as flaky by @chtruong814 :: PR: #3301
- chore: Move GB200 tests to nightly by @ko3n1g :: PR: #3302
- Ensure type-checker understands use of Submodules in bert_model by @nschank :: PR: #3256
- Override extra_repr instead of repr by @nschank :: PR: #3200
- Replace ModuleSpec with Protocols for LayerNorm submodules by @nschank :: PR: #3090
- Non colocated refit by @wdykas :: PR: #3213
- Fuse permute+pad and unpermute+unpad ops for FP8/FP4 training by @xiaoxi-wangfj :: PR: #2763
- Add check to prevent MFSDP from numeric issue in gradient accumulate fusion by @shjwudp :: PR: #2904
- update get_embedding_ranks and get_position_embedding_ranks docstrings by @c1lovez1 :: PR: #3223
- Param offset in _ParamAndGradBucket should be aligned by @skydoorkai :: PR: #3007
- ci: Add secrets detector by @chtruong814 :: PR: #3180
- Ensure type-checker understands use of Submodules in llava_model by @nschank :: PR: #3257
- updates to support modelopt EAGLE training with CP by @yeyu-nvidia :: PR: #3147
- fully remove legacy tokenizer system by @dimapihtar :: PR: #2946
- M-FSDP: Remove redundant stream waits in HSDP to prevent CG fail by @shjwudp :: PR: #2941
- General README and pyproject fixes by @ahmadki :: PR: #2907
- chore: More aggressive checkpointing by @ko3n1g :: PR: #3315
- ci: Pin down setuptools to lt 82 by @ko3n1g :: PR: #3313
- fix: numpy overflow by @ko3n1g :: PR: #3306
- fix: T5 dataset by @ko3n1g :: PR: #3307
- ci: Revert "ci: Add secrets detector (#3180)" by @chtruong814 :: PR: #3330
- ci: Add more tests, run on merge-queue by @ko3n1g :: PR: #3317
- ci: Remove merge-gate environment check by @chtruong814 :: PR: #3331
- Use FP4 context for mamba by @kwyss-nvidia :: PR: #2604
- ci: Ensure we run all functional tests in merge group by @chtruong814 :: PR: #3332
- Replace ModuleSpec with Protocols for inputs to MLP by @nschank :: PR: #3084
- ci: Fix merge queue functional tests by @chtruong814 :: PR: #3337
- ci: skip queue in merge-gate by @ko3n1g :: PR: #3343
- ci: Timeout for functional tests by @ko3n1g :: PR: #3346
- update checkpointing documentation by @dimapihtar :: PR: #3347
- Update golden values to reflect improvements by @tdene :: PR: #3350
- BUGFIX: gpt vs hybrid model mtp naming mismatch by @sancha :: PR: #3334
- Disable flaky test by @tdene :: PR: #3354
- re-enable gpt grpo tests by @jon-barker :: PR: #3348
- Fix SFT Pipeline when TP>1 by @asolergi-nv :: PR: #3268
- Fixes for KD mode by @AAnoosheh :: PR: #3342
- chore: Update codeowners file by @ko3n1g :: PR: #3365
- Siddharth/fix inference functional tests by @sidsingh-nvidia :: PR: #3357
- Switch oncall by @janEbert :: PR: #3360
- Add missing RMSNorm to llama train script by @AAnoosheh :: PR: #3314
- Fix inference for MTP models by @tdene :: PR: #3297
- Add a logprobs test with real gpt model. by @yobibyte :: PR: #2870
- Add simple GRPO functional test by @tdene :: PR: #3323
- ci: Concurrency control for merge-queue by @ko3n1g :: PR: #3353
- ci: Update golden value download script to work with Github by @chtruong814 :: PR: #3335
- fix: correct typos 'seperated' and 'recieved' by @thecaptain789 :: PR: #3305
- Improved PyTorch profiler and added PyTorch execution trace by @shengf-nv :: PR: #3273
- Removing etc from main index page, shifted name of discussions by @megnvidia :: PR: #3271
- build: Bump TE on 2.12 by @ko3n1g :: PR: #3371
- ci(hotfix): job conditions by @ko3n1g :: PR: #3376
- Record moe routing decisions during inference. by @sidsingh-nvidia :: PR: #3034
- [Main] Fix EP Overlap Bugs for Full-Iter CG by @Wohox :: PR: #3164
- Avoid direct pickle import by @maanug-nv :: PR: #3375
- Delete old pretrain_* files by @Phlip79 :: PR: #3359
- Add Qwen3-VL support with Megatron-FSDP by @xuwchen :: PR: #2841
- Refactor Mamba chunked prefill by @santhnm2 :: PR: #3265
- Improved parallel logging of learning rate by @jstjohn :: PR: #3319
- Add enhanced event tracking with TTFT measurement and compact serialization. by @lmcafee-nvidia :: PR: #3253
- Add assertion that max_requests is divisible by tp_size by @santhnm2 :: PR: #3304
- Move to using the Inference OpenAI API server by @ArEsKay3 :: PR: #3107
- Update moe github test cases. by @Victarry :: PR: #3077
- Split layer_specs to return Submodules instead of ModuleSpecs by @nschank :: PR: #3255
- ci: Remove gpu sanity check by @chtruong814 :: PR: #3420
- [Critical-Bug] Fix Uneven PP for Mamba models (Nemotron3-nano) by @kevalmorabia97 :: PR: #3399
- Fix for rl by @shanmugamr1992 :: PR: #3390
- Add check for full_iteration scope before instantiating CudaGraphManager by @vasunvidia :: PR: #3362
- Fix broken links throughout by @megnvidia :: PR: #3230
- Decouple topk and loss from DSA Indexer by @kunlunl :: PR: #3248
- Extract intermediate embeddings of transformer block by @sajadn :: PR: #3060
- Move to using the Inference OpenAI API server (bis) by @tdene :: PR: #3395
- Make Mamba inference state memory ratio configurable by @santhnm2 :: PR: #3322
- Fix configs for RL model environments by @tdene :: PR: #3441
- Replace pickle with json in rl_utils by @tdene :: PR: #3351
- fix: correct typo in demo training example by @dndnda :: PR: #3428
- Clean up logging inside inference flask server by @tdene :: PR: #3437
- ci: Update release-docs workflow to use FW-CI-templates v0.72.0 by @chtruong814 :: PR: #3438
- Fix --tokenizer-hf-include-special-tokens by @jon-barker :: PR: #3422
- Update num_tokens_to_generate default for Gym by @tdene :: PR: #3453
- Fix slowdown in inference flask server by @tdene :: PR: #3445
- Add a normalized scale for MTP per token loss by @BestJuly :: PR: #3159
- [Bugfix] Fix nan loss caused by zero token in MTP by @BestJuly :: PR: #3396
- Log RL metrics per environment by @yobibyte :: PR: #3446
- Move tensor offload/onload out of RL code by @tdene :: PR: #3029
- Fix another inference flask / Gym interaction by @tdene :: PR: #3467
- Add Engine event to the follow up requests after checkpointing by @ArEsKay3 :: PR: #3473
- adding in copyright blurb at the top of md file by @megnvidia :: PR: #3394
- [Megatron-FSDP] Add fsdp_all_gather_in_start_param_sync option in DDP Config by @shjwudp :: PR: #3095
- ci: Update release workflow to include changelog and publish docs by @chtruong814 :: PR: #3472
- ci(fix): Weekly GPT tests by @ko3n1g :: PR: #3443
- ci: Remove environments by @ko3n1g :: PR: #3462
- update HF tokenizer defaults by @dimapihtar :: PR: #3440
- ci: Bump preflight to detect our svc by @ko3n1g :: PR: #3494
- build: Drop Python 3.10 support and pip install one-logger by @ko3n1g :: PR: #3485
- PTQ changes for upcoming QAD by @AAnoosheh :: PR: #3124
- ci: Bump pre-flight for Bot SSO by @ko3n1g :: PR: #3497
- Revert "build: Drop Python 3.10 support and pip install one-logger (#… by @ko3n1g :: PR: #3500
- Fix chunked prefill edge cases by @santhnm2 :: PR: #3404
- ci: Enable MBridge downstream testing via PR by @ko3n1g :: PR: #3483
- ci: Remove gitlab docs build job and set LTS integration and functional tests to allow failure by @chtruong814 :: PR: #3349
- [OMNIML-3232] ModelOpt: add full TE spec option and wire Mamba stack / post-training scripts by @yueshen2016 :: PR: #3393
- Track off-policyness across RL steps by @tdene :: PR: #3030
- ci: MBridge testing branch name during merge-queues by @ko3n1g :: PR: #3513
- ci: Enable Dependabot Automerge by @ko3n1g :: PR: #3487
- ci: Also sync direct teams by @ko3n1g :: PR: #3484
- Multimodal: fix argument checking by @faradawn :: PR: #3449
- Fix Megatron-FSDP fully_shard() optimizer state DCP checkpointing, and fix DTensor deepcopy bug from PyTorch 26.01. by @cspades :: PR: #3510
- Renable full_iteration cuda graphs for inference. Add them for the mamba block. by @sidsingh-nvidia :: PR: #3250
- do not add EoD by @arendu :: PR: #3526
- Do not Slack notify for draft PRs by @Phlip79 :: PR: #3536
- remove deprecated SampleListWebdataset by @dimapihtar :: PR: #3407
- remove deprecated get_te_version by @dimapihtar :: PR: #3413
- remove deprecated async_grad_allreduce param by @dimapihtar :: PR: #3412
- remove deprecated mamba params by @dimapihtar :: PR: #3411
- remove deprecated params from model parallel config by @dimapihtar :: PR: #3408
- Remove redundant CUDA calls in the LLaVA dataloader by @duncanriach :: PR: #3476
- Inference: Create finer grained cuda-graphs with better coverage of smaller batch sizes by @sidsingh-nvidia :: PR: #3527
- fix: skip non-tensor optimizer state entries in distrib_optimizer sav… by @ahmadki :: PR: #3537
- remove encoder_and_decoder from enums by @dimapihtar :: PR: #3406
- remove is_unitialized & get_data_modulo_expert_parallel_group by @dimapihtar :: PR: #3414
- remove deprecated TE module by @dimapihtar :: PR: #3409
- Add knobs to choose process groups for fully-parallel-save / load and load-exchange-algo by @sbak5 :: PR: #2161
- Fix off-by-2 error in RL sequence packing by @tdene :: PR: #3551
- Skip unnecessary flattening for Save / Load Planner by @sbak5 :: PR: #3263
- Multimodal: fix model provider by @faradawn :: PR: #3508
- docs: Enable nightly docs publish by @chtruong814 :: PR: #3546
- Ensure type-checker understands use of Submodules in unit tests by @nschank :: PR: #3425
- Use copy_signature to preserve typing of pass-through methods by @nschank :: PR: #3419
- Ensure type-checker understands use of Submodules in MTP by @nschank :: PR: #3308
- Add mxfp8 quantization for inference linear layers by @santhnm2 :: PR: #3447
- Fixed fp32 residuals by @mkhona-nvidia :: PR: #3504
- Move config src files into a dedicated dir by @maanug-nv :: PR: #3570
- Revert "remove encoder_and_decoder from enums (#3406)" by @ko3n1g :: PR: #3579
- Fix default cuda graph persist arg. Add persist to rl common.sh. by @yobibyte :: PR: #3584
- Optimize away add request overheads in dummy ep cuda-graphed forward passes by @sidsingh-nvidia :: PR: #3525
- ci: Test docs build by @ko3n1g :: PR: #3583
- ci: Fix docs build for release by @ko3n1g :: PR: #3597
- ci: Remove secrets by @ko3n1g :: PR: #3598
- ci: Define secrets by @ko3n1g :: PR: #3599
- ci: gh-release-from-tag by @ko3n1g :: PR: #3600
- Ko3n1g/ci/remove twine username by @ko3n1g :: PR: #3601
- Add training code to MCore wheel by @maanug-nv :: PR: #3573
- FP8 attention knob for nvFP4 recipe by @vasunvidia :: PR: #3363
- Fix error with --load-main-params-from-ckpt by @guyueh1 :: PR: #3569
- ci: Create comment by @ko3n1g :: PR: #3610
- ci: Skip cleanup-taint-node jobs during deployments by @ko3n1g :: PR: #3612
- ci: No comment for release workflow by @ko3n1g :: PR: #3615
- ci: Re-add release tag prefix by @ko3n1g :: PR: #3619
- docs: Fix version picker urls by @chtruong814 :: PR: #3621
- ci: Increase changelog generation max PRs fetched by @chtruong814 :: PR: #3620
- Add debug info to an assert. by @yobibyte :: PR: #3588
- fix: async_utils: explicit GC in persistent checkpoint worker loop by @sbak5 :: PR: #3591
- Fix: Perform sigmoid calculation in fp32 for aux loss stability by @CodersAcademy006 :: PR: #2765
- fix: forward use_te_activation_func flag in non-MoE GPT layer spec by @saakshigupta2002 :: PR: #3300
- Revert "Add single-process checkpoint save to avoid forked multiproce… by @ko3n1g :: PR: #3630
- Track and plot per-token off-policy in RL by @tdene :: PR: #3515
- Multimodal: fix VQA dataset selection by @faradawn :: PR: #3464
- Multimodal: Fix multimodal training example - tokenizer, Triton Cache Manager patch, docs by @faradawn :: PR: #3507
- Multimodal: Limit transformer version in Dockerfile by @faradawn :: PR: #3448
- Support TP > GQA for inference by @santhnm2 :: PR: #3627
- μP: Maximal Update Parameterization by @plugyawn :: PR: #3058
- Add flexible virtual pipeline parallel (fVPP) to hybrid model by @duncanriach :: PR: #3377
- Explicitly close and join Pool in preprocess_data.py by @weijiac0619 :: PR: #3592
- remove indexer by @dimapihtar :: PR: #3416
- Multimodal: add load weights only by @faradawn :: PR: #3452
- Add single-process checkpoint save to avoid forked multiprocessing by @sbak5 :: PR: #3633
- Update oncall schedule by @Phlip79 :: PR: #3632
- M-FSDP: Cancel erroneous grad accumulation check by @shjwudp :: PR: #3629
- Fix MoE aux loss tracker hang with MTP enabled by @Victarry :: PR: #3401
- Fix test data preparation by @janEbert :: PR: #3652
- Add GPTOSS Example with Megatron-LM + Megatron Bridge by @faradawn :: PR: #3018
- Add thd unit test main by @kunlunl :: PR: #3617
- Inference | KV prefix caching. by @lmcafee-nvidia :: PR: #3063
- [Megatron-FSDP] Add dtype customization to Megatron-FSDP. by @cspades :: PR: #3067
- CachedMetadataFileSystemReader: shared cache by @sbak5 :: PR: #3326
- Inference Optimized MoEs by @sidsingh-nvidia :: PR: #3496
- Log torch_memory_saver offload/onload by @tdene :: PR: #3567
- Prefix caching | Mamba memory only. by @lmcafee-nvidia :: PR: #3657
- Prefix caching | Coordinator scheduling. by @lmcafee-nvidia :: PR: #3665
- Adding manual Claude reviewer by @Phlip79 :: PR: #3679
- Nemo-RL Refit by @wdykas :: PR: #3520
- Add extra permissions and make other changes by @Phlip79 :: PR: #3683
- Claude should always comment something by @Phlip79 :: PR: #3685
- [Cleanup] Remove the deprecated GroupedMLP by @dimapihtar :: PR: #3410
- Fix illegal memory access with mamba inference by @tdene :: PR: #3631
- Fix illegal memory access with mamba inference (bis) by @tdene :: PR: #3696
- remove duplicate rerun_state_machine.set_mode(rerun_mode) by @YangWang92 :: PR: #3279
- Correct indexing when cp_comms_type is a list by @jeromeku :: PR: #3389
- Fix optional chat_completions returnables by @tdene :: PR: #3519
- ci: Claude code review by @ko3n1g :: PR: #3704
- ci: Fix event payload by @ko3n1g :: PR: #3705
- ci: Use issue number by @ko3n1g :: PR: #3706
- ci: Finalize Claude review by @ko3n1g :: PR: #3707
- ci: Add codecov yml by @thomasdhc :: PR: #3455
- adding public_docs_features: True to get proper legal footer… by @megnvidia :: PR: #3681
- Robust signaling for coordinator inference by @tdene :: PR: #3563
- Fix memory issue in mxfp8 model init by @WanZzzzzz :: PR: #3461
- add --overlap-param-gather support for layer-wise optimizer. lots of unit tests. by @mchrzanowski :: PR: #3524
- ci: Mount and enforce HF_HOME by @ko3n1g :: PR: #3700
- Add flags for changing Mamba inference state tensor dtype by @santhnm2 :: PR: #3660
- chore: CLI launch internal CI by @ko3n1g :: PR: #3695
- Change Review Process by @Phlip79 :: PR: #3659
- ci: Separate queues for internal/external contributors by @ko3n1g :: PR: #3718
- Update to correct token by @Phlip79 :: PR: #3724
- build: Bump to NGC PyTorch 26.02 by @ko3n1g :: PR: #3474
- Claude: use Opus 4.6 and auto-review on ready by @Phlip79 :: PR: #3727
- Claude to add complexity label by @Phlip79 :: PR: #3709
- Offload Flask frontend to separate process by @santhnm2 :: PR: #3648
- chore: Use PAT for CLI Launcher by @ko3n1g :: PR: #3734
- ci: Add missing gitlab rule by @ko3n1g :: PR: #3735
- [main] Add TE CUDA Graph Support for Vision Encoder by @tomlifu :: PR: #3293
- fix(moe): fix TE general_gemm API change by @hxbai :: PR: #3582
- Review process fixes by @Phlip79 :: PR: #3728
- Print more verbose error message about incorrect `model_parallel_size`. by @rj42 :: PR: #2639
- ci: Update golden values after PyT bump by @ko3n1g :: PR: #3733
- Optimize process management and delete operations for async save by @sbak5 :: PR: #3262
- Align gpt-oss window-size with 128-token sliding window by @returnL :: PR: #2771
- fix: temperature validation error message 1000.0 -> 100.0 by @CreeperLKF :: PR: #2688
- RL: Hybrid MoE training cudagraphs and fix training <-> inference transition by @mathemakitten :: PR: #3373
- Fix dynamic inference and GRPO functional tests by @santhnm2 :: PR: #3740
- Swap oncall by @janEbert :: PR: #3585
- [bugfix] fix the bug that loss: 0 will not be printed by @leisuzz :: PR: #1555
- Fused dLN + add in backwards pass by @CarlosGomes98 :: PR: #3384
- Claude: run actions on target branch by @Phlip79 :: PR: #3745
- revert of #2658 by @dimapihtar :: PR: #3736
- Update README Quick Start by @ilml :: PR: #3596
- Re-enable tests which were failing on #3373 by @mathemakitten :: PR: #3757
- Check reviews properly by @Phlip79 :: PR: #3756
- Add CP + Sequence Packing support for Mimo by @mehraakash :: PR: #2135
- MXFP8 refit by @wdykas :: PR: #3742
- Claude: update token usage by @Phlip79 :: PR: #3760
- Handle Tool Call Argument Parsing by @sancha :: PR: #3662
- RL support for nanov3 sft checkpoint by @jon-barker :: PR: #3741
- add mix_hidden_states option in conversion by @yeyu-nvidia :: PR: #3655
- ci: Optimize release-configs for GB200 by @ko3n1g :: PR: #3541
- Add absorbed-mla by @kunlunl :: PR: #3198
- feat(checkpoint): zero-copy storage sharing in CheckpointWithoutOutput by @Victarry :: PR: #3649
- Fuse MLA DOWN projection GEMMs by @cjld :: PR: #3039
- fix: skip FSDP DTensor boundary validation under fake process group by @Victarry :: PR: #3686
- [main] fix(moe): fix the bug where gate was not sliced when kv_head < tp_size. by @yuzhongw-nvidia :: PR: #3575
- fix(offload): reset activation offload manager after eval as well as … by @rapatel :: PR: #3739
- Improve error logging when invalid number of tokens is requested. by @yobibyte :: PR: #3680
- Add NVIDIA-Nemotron-3-Super-120B-A12B-BF16 to ModelOpt examples by @jenchen13 :: PR: #3805
- build: Bump TE2.13 by @ko3n1g :: PR: #3800
- Ensure dummy_forward does not attempt to run cudagraphs by @jalbericiola :: PR: #3789
- Add speculative decoding support with MTP layers by @santhnm2 :: PR: #3594
- Shanmugamr1992/megatron inference ultra by @shanmugamr1992 :: PR: #3784
- Fix backward compatibility issue with MFSDP `--grad-reduce-in-bf16` by @shjwudp :: PR: #3799
- feat: add NCCL flight recorder configuration support by @sbak5 :: PR: #3806
- Revert "Ensure dummy_forward does not attempt to run cudagraphs (#3789)" by @ko3n1g :: PR: #3834
- Fix if statement in main by @tdene :: PR: #3833
- Update golden values of weekly tests by @ko3n1g :: PR: #3829
- build: Loosen TE restriction by @ko3n1g :: PR: #3827
- Do not let chunked prefill generate decode logprobs by @tdene :: PR: #3777
- Prevent double serialization inside Flask server by @tdene :: PR: #3653
- Allow RL to run inference-only via skip-train by @tdene :: PR: #3744
- Announce Python 3.12 migration by @ko3n1g :: PR: #3825
- ci: Skip test_wrong_cuda_graph_impl_returns_false in LTS by @chtruong814 :: PR: #3847
- ci: Mark TestCoordinator.test_throughput as flaky by @chtruong814 :: PR: #3849
- find optimal number of workers by @dimapihtar :: PR: #3699
- remove encoder_and_decoder by @dimapihtar :: PR: #3836
- ci: Skip more tests in test_vision_cuda_graphs for LTS by @chtruong814 :: PR: #3860
- Ensure that inference dummy_forward does not try to match on a cudagraph when running eager by @mathemakitten :: PR: #3815
- Fix flakiness due to timing between shutdowns by @tdene :: PR: #3857
- Add unit tests for speculative decoding by @santhnm2 :: PR: #3817
- Exposing interleave argument for fused_apply_rotary_pos_emb_thd by @huvunvidia :: PR: #3794
- ci: install nvidia-resiliency-ext from source by @ko3n1g :: PR: #3861
- Miscellaneous inference bug fixes by @santhnm2 :: PR: #3840
- Nemo-RL integration bugfixes for --transformer-impl inference_optimized by @sidsingh-nvidia :: PR: #3851
- remove legacy mpu by @dimapihtar :: PR: #3854
- enable async save for functional tests by @dimapihtar :: PR: #3855
- remove legacy data by @dimapihtar :: PR: #3853
- docs: Document python-gitlab dependency by @ko3n1g :: PR: #3863
- Fsdp dsv3 proxy by @gautham-kollu :: PR: #3844
- Fix token dispatched cudagraph_attrs by @asolergi-nv :: PR: #3625
- Fix slowdown in serialization by @tdene :: PR: #3872
- Establish reviewers for training code by @maanug-nv :: PR: #3765
- Fix quantize.py script and support packed sequences in pretrain_gpt.py by @AAnoosheh :: PR: #3564
- Use fp32 state dtypes for Mamba inference functional test by @santhnm2 :: PR: #3888
- [Megatron-FSDP] Support 'auto' argument which defaults to pre-MixedPrecisionPolicy be… by @cspades :: PR: #3810
- Bug fix: add missing packages to Multimodal Dockerfile by @faradawn :: PR: #3417
- Reverse polarity of the off-policy measurement by @tdene :: PR: #3580
- Update nightly golden values after TE2.13 by @ko3n1g :: PR: #3886
- enable use_persistent_ckpt_worker for ci tests by @dimapihtar :: PR: #3898
- Correctly generate state dict in MultiTokenPredictionBlock by @asolergi-nv :: PR: #3624
- Add torch grouped gemm bf16 and mxfp8 support w/ cuda graphed + inference_optimized MoEs by @sidsingh-nvidia :: PR: #3858
- ci: Fix build-test-publish summary job always passing by @ko3n1g :: PR: #3905
- ci: Skip gpt3_mcore_te_tp1_pp4_vp1 for now by @chtruong814 :: PR: #3908
- ci: Fix build-and-test-wheels jobs for arm by @chtruong814 :: PR: #3910
- Add Lion optimizer support by @mchrzanowski :: PR: #3813
- Support multimodule pipelining in 1F1B schedule by @yashaswikarnati :: PR: #3129
- Add a config parameter for retaining pinned cpu buffers for cpu offloading by @rapatel :: PR: #3151
- Inference | Hybrid prefix caching. by @lmcafee-nvidia :: PR: #3225
- Hotfix for eviction issue by @tdene :: PR: #3914
- Parity with VLLM over the reasoning field by @tdene :: PR: #3873
- CI: add parallel GB200 integration test track by @ko3n1g :: PR: #3901
- Track errors through the inference return path by @tdene :: PR: #3776
- Fix: Defensively close GPU device FDs in dataloader worker processes by @hexinw-nvidia :: PR: #3684
- Fix hybrid dynamic inference functional tests by @santhnm2 :: PR: #3924
- Patch EOD out of inference results by @tdene :: PR: #3866
- ci: Add mr-github-slim label by @ko3n1g :: PR: #3934
- Revert "ci: Skip gpt3_mcore_te_tp1_pp4_vp1 for now (#3908)" by @chtruong814 :: PR: #3926
- Exclude arguments.py from training review by @maanug-nv :: PR: #3906
- ci: Fix sso users check by @chtruong814 :: PR: #3938
- move router replay doc to advanced feature part by @ilml :: PR: #3929
- Fix DDP bug with --overlap-grad-reduce and --num-distributed-optimizer-instances > 1 by @wplf :: PR: #3693
- Fix incorrect HAVE_TE detection in multiple modules by @returnL :: PR: #3763
- Implement forced lag in RL by @tdene :: PR: #3517
- Refactor VisionTECudaGraphHelper to minimize overrides and clarify state tracking by @buptzyb :: PR: #3748
- Fix external contributor concurrency to be global across all branches by @ko3n1g :: PR: #3951
- Fix 3-way merge issue that broke main by @tdene :: PR: #3949
- Fix Nemo_CICD_Test not catching cancelled/skipped functional tests by @ko3n1g :: PR: #3947
- Guard cudagraph input copy on whether data pointers have actually changed by @mathemakitten :: PR: #3948
- Enforce that flashinfer cache has been installed for inference-optimized MoE layers by @santhnm2 :: PR: #3941
- chore: remove nv-grouped-gemm dependency by @liuyun7345 :: PR: #3770
- Prevent failures due to prevent_retokenization by @tdene :: PR: #3958
- ultra refit by @wdykas :: PR: #3904
- [Fix][Main] Missing Assertion for moe layer recomptue in A2A Overlap by @Wohox :: PR: #3917
- Move Megatron-FSDP MixedPrecisionPolicy arguments from FSDP adapter t… by @cspades :: PR: #3903
- chore: bump FW-CI-templates to v0.80.2 by @ko3n1g :: PR: #3961
- Rename RL timers to be consistent by @tdene :: PR: #3878
- ci: centralize run configuration in a single configure job by @ko3n1g :: PR: #3962
- ci: Split unit tests into smaller groups by @ko3n1g :: PR: #3966
- Refit optimization by @wdykas :: PR: #3933
- common strategy simplification by @dimapihtar :: PR: #3229
- Cudagraphs: Remove fwd_graph_input_surface weakref by @mathemakitten :: PR: #3970
- fix: interpolate version correctly in release Slack notification by @ko3n1g :: PR: #3977
- Make args and kwargs optional positional arguments for the Module hooks. by @cspades :: PR: #3976
- ci: Add core-adlr and core-nemo to megatron/training codeowners by @chtruong814 :: PR: #3979
- Small quality-of-life improvements in `megatron/training` by @deepakn94 :: PR: #3957
- Update throughput golden values to reflect speedup by @tdene :: PR: #3983
- Revert "ci: Add core-adlr and core-nemo to megatron/training codeowners (#3979)" by @chtruong814 :: PR: #3982
- ci: Add --repo flag to gh pr view in configure job by @ko3n1g :: PR: #3989
- Add common pile scripts by @Phlip79 :: PR: #3902
- Introduce GDN to Mamba by @Phlip79 :: PR: #3535
- Fix IndexError in uniform activation recompute when num_layers not divisible by recompute_num_layers by @saakshigupta2002 :: PR: #3562
- Scaling for MuP over Muon optimizer. by @plugyawn :: PR: #3715
- Pass Megatron-FSDP MixedPrecision args to DDPConfig. by @cspades :: PR: #3992
- [OMNIML-3721] Fix tokenizer unwrapping for nested Megatron-Core tokenizer by @jenchen13 :: PR: #3967
- Forced load imbalance by @nanz-nv :: PR: #3380
- Add `/claude copy` command by @Phlip79 :: PR: #3978
- Add multi-module heterogeneous parallelism support for MIMO model by @yashaswikarnati :: PR: #3211
- added vllm fakequant export support by @kinjalpatel27 :: PR: #3050
- fix(modelopt): use bash array for MLM_EXTRA_ARGS to preserve quoting by @jenchen13 :: PR: #4002
- fix: use dump file prefix for NCCL flight recorder temp files by @sbak5 :: PR: #3955
- Fix PersistentAsyncCaller.del crash during Python shutdown by @cluster2600 :: PR: #3781
- ci: Run L1 MBridge tests in merge queue by @chtruong814 :: PR: #4009
- Update Claude review by @Phlip79 :: PR: #3980
- Migrate MoeLayer submodules from ModuleSpec to Protocols by @nschank :: PR: #3426
- Guard non-core imports by @maanug-nv :: PR: #3993
- Fix config compatibility with Megatron-Core by @maanug-nv :: PR: #3995
- Add MimoOptimizer for heterogeneous parallelism by @yashaswikarnati :: PR: #4019
- [Main] Support EP Overlap's Dynamic Computation Stream For Full-Iter CUDA Graph by @Wohox :: PR: #3820
- fix: Handle quantized CUDA tensors in async checkpoint writer by @sbak5 :: PR: #3845
- accept hooks marked with with_kwargs when using te.ops.sequential by @CarlosGomes98 :: PR: #4000
- Use GroupedMLPSubmodules for InferenceGroupedMLP by @nschank :: PR: #3743
- Fix 2D tensor communication for asymmetric DP in Bridge Communicator by @yashaswikarnati :: PR: #4021
- Add distributed checkpoint support for non-colocated MiMo by @yashaswikarnati :: PR: #4020
- CUDA graph support for prefix caching on hybrid models by @lmcafee-nvidia :: PR: #3922
- Add ability to perform local gradient accumulation in FP32 for a subset of parameters in the model by @deepakn94 :: PR: #4028
- Miscellaneous MXFP8 inference fixes by @santhnm2 :: PR: #4017
- Use `torch.int64` for grad_num_zero accumulation by @WanZzzzzz :: PR: #4015
- Make text generation server hostname configurable by @santhnm2 :: PR: #3935
- Add --muon-coefficient-type argument for Muon optimizer by @mchrzanowski :: PR: #3927
- Pass gracefully if token_id not found in message by @i-riyad :: PR: #3862
- Improve load balancing behavior for prefix cache-aware routing by @santhnm2 :: PR: #3930
- Refactor setup.py to use get_pybind_include by @sakgoyal :: PR: #3658
- build: Bump TE to 2.14 by @ko3n1g :: PR: #4025
- fix traceback when interrupting run by @dimapihtar :: PR: #3439
- chore: update goldenvalues by @ko3n1g :: PR: #4059
- Fix TemporalAsyncCaller pin_memory lifetime in async checkpointing by @lvdunlin :: PR: #2288
- chore: Move to Py3.12 by @ko3n1g :: PR: #3826
- Adding NVRx as a dependency and keeping the current code base optionally by @dimapihtar :: PR: #3899
- build: Set `ENV NVTE_BUILD_NUM_PHILOX_ROUNDS=3` by @ko3n1g :: PR: #4074
- fix checkpointing conversion by @dimapihtar :: PR: #4058
- cp: `Megatron-FSDP: Add MXFP8 transpose helper buffer for Hybrid FSDP (3918)` into `core_r0.17.0` by @ko3n1g :: PR: #4105
- cp: `Enable CUDA graph for ADAM optimizer (3429)` into `core_r0.17.0` by @ko3n1g :: PR: #4142
- Release testing/0170 by @ko3n1g :: PR: #4147
- cp: `fix: remove weights_only=False for multimodal example (4104)` into `core_r0.17.0` by @ko3n1g :: PR: #4188
- cp: `Bump nvrx` by @ko3n1g :: PR: #4237
- cp: `build: bump DeepEP to 34152ae (#4228)` into `core_r0.17.0` by @ko3n1g :: PR: #4297