github NVIDIA/Megatron-LM core_v0.16.0
NVIDIA Megatron Core 0.16.0

Changelog Details
  • ci: Fix copyright checker by @ko3n1g :: PR: #1893
  • chore: Add codeowners by @ko3n1g :: PR: #1897
  • ci: Extend queue-manager for dev branch by @ko3n1g :: PR: #1906
  • ci: Move test optimizer into its own bucket by @ko3n1g :: PR: #1909
  • ci: Configure cherrypick bot by @ko3n1g :: PR: #1925
  • Ci approve dev by @ko3n1g :: PR: #1933
  • ci: Update nightly schedule by @ko3n1g :: PR: #1934
  • ci: Bump pre-flight for runs on main/dev by @ko3n1g :: PR: #1935
  • ci: Allow skipping on main by @ko3n1g :: PR: #1936
  • Ko3n1g/ci/pr template community bot by @ko3n1g :: PR: #1937
  • ci: More granular unit tests buckets by @ko3n1g :: PR: #1932
  • Add sequence packing to RL by @tdene :: PR: #1911
  • chore: Update template by @ko3n1g :: PR: #1939
  • chore: Add description about who can merge by @ko3n1g :: PR: #1940
  • Ko3n1g/ci/fix main on eos by @ko3n1g :: PR: #1938
  • Ko3n1g/ci/internal mrs by @ko3n1g :: PR: #1942
  • ci: Fix branch of approval bot by @ko3n1g :: PR: #1944
  • ci: Approvalbot for other branches by @ko3n1g :: PR: #1947
  • ci(fix): Approval bot by @ko3n1g :: PR: #1949
  • Ko3n1g/ci/sync branches by @ko3n1g :: PR: #1956
  • Ko3n1g/ci/add milestone by @ko3n1g :: PR: #1951
  • Remove M-FSDP testing under LTS environment by @shjwudp :: PR: #1959
  • ci: Run on push to release branch by @ko3n1g :: PR: #1960
  • Fix typo in rl section of CODEOWNERS by @tdene :: PR: #1968
  • ci: Update copyright checker by @ko3n1g :: PR: #1973
  • Ko3n1g/ci/auto reminder GitHub by @ko3n1g :: PR: #1955
  • ci(fix): Run tests label by @ko3n1g :: PR: #1970
  • Make get_asyncio_loop safe to use repeatedly by @tdene :: PR: #1990
  • chore: Update codeowners by @ko3n1g :: PR: #2012
  • zarr soft deprecation by @dimapihtar :: PR: #2004
  • Deduplicate dynamic engine + coordinator. by @lmcafee-nvidia :: PR: #1981
  • Update symmetric registration interface to sync-up with upstream pytorch change by @youngeunkwon0405 :: PR: #1924
  • Safely access state dict args in load ckpt by @maanug-nv :: PR: #1957
  • Allow mixed-batch sampling in dynamic inference by @tdene :: PR: #1927
  • Stop Nemo_CICD_Test from failing in forks by @tdene :: PR: #2024
  • Clean up dynamic inference step by @tdene :: PR: #1992
  • ci: Auto-update copy-pr-bot vetters by @ko3n1g :: PR: #1850
  • ci: Fix build-push-wheel workflow by @ko3n1g :: PR: #2022
  • ci: Enable integration tests by @ko3n1g :: PR: #2023
  • chore: Update tooling for interactive jobs by @ko3n1g :: PR: #2032
  • Have datasets account for tokenizers which incorrectly define PAD by @tdene :: PR: #2017
  • revert(hotfix): ci: trustees_override by @ko3n1g :: PR: #2041
  • add missing warnings import in model parallel config by @yashaswikarnati :: PR: #2039
  • Reduce-scatter implementation with FP32 accumulation by @deepakn94 :: PR: #1967
  • ci(fix): Workflows on main by @ko3n1g :: PR: #2045
  • build: Bump modelopt by @ko3n1g :: PR: #2046
  • Remove TestCaptureFreezeGC unit test. by @lmcafee-nvidia :: PR: #1978
  • ci: Add multi-approval action by @ko3n1g :: PR: #2051
  • Ko3n1g/ci/test iteration time by @ko3n1g :: PR: #2067
  • Allow inference test throughput to vary by 10% by @mathemakitten :: PR: #2070
  • chore: Fix autoformatter by @ko3n1g :: PR: #2073
  • ci(hotfix): Bypass approvalbot in merge-queue by @ko3n1g :: PR: #2082
  • chore: Update local tooling by @ko3n1g :: PR: #2066
  • Add extra RL files by @tdene :: PR: #2077
  • Prevent summary jobs from running in forks by @tdene :: PR: #2083
  • ci: Fix test scope by @ko3n1g :: PR: #2091
  • Refactor the attention metadata into separate classes by @kanz-nv :: PR: #2001
  • Guard against incorrectly using MoE prefill graphs by @tdene :: PR: #2030
  • Run mr-slim tests in lightweight-mode by @chtruong814 :: PR: #2106
  • Inference | Lazy compile UVM allocator. by @lmcafee-nvidia :: PR: #1977
  • chore: Reenable trustees by @ko3n1g :: PR: #2108
  • Ko3n1g/chore/update release settings by @ko3n1g :: PR: #2097
  • ci(fix): Changeset of copyright checker by @ko3n1g :: PR: #2110
  • Remove unnecessary check on rotary_pos_cos by @santhnm2 :: PR: #2003
  • (Reverted) Inference | Lazy compile UVM allocator. by @lmcafee-nvidia :: PR: #2125
  • Refactor Attention Metadata to Separate Classes by @kanz-nv :: PR: #2112
  • Refactor model_provider to model_builder format for ModelOpt examples by @AAnoosheh :: PR: #2107
  • wandb Inference stats logging by @wdykas :: PR: #2026
  • Make PipelineParallelLayout always return str from __repr__ by @ananthsub :: PR: #2055
  • Add flash_attn_3 as first option for FA3 import by @santhnm2 :: PR: #2010
  • Add debugging hint for case when cudagraphs are created but no matching runner is found by @mathemakitten :: PR: #2129
  • ci: LTS container by @ko3n1g :: PR: #2133
  • Fix param init by @cuichenx :: PR: #2033
  • Hotfix to unit tests on hopper FA3 by @tdene :: PR: #2143
  • Add BytesIO to safe_globals by @tdene :: PR: #2074
  • add deprecation warning for legacy tokenizer system by @dimapihtar :: PR: #2145
  • replay: ci: Bump LTS container by @ko3n1g :: PR: #2157
  • Hotfix to unit tests on hopper FA3 (bis) by @tdene :: PR: #2179
  • Fix has_modelopt_state() for native Torch checkpoint format by @AAnoosheh :: PR: #2160
  • chore: Remove codeowners by @ko3n1g :: PR: #2175
  • Fix FP8 inference with sequence parallelism by @santhnm2 :: PR: #2009
  • Replace ModelOpt generation server by @AAnoosheh :: PR: #2147
  • Add hybrid model support for dynamic inference engine by @santhnm2 :: PR: #1907
  • Async task and event loop safety in Megatron Core by @tdene :: PR: #2025
  • Rename skip_prompt_log_probs by @tdene :: PR: #2181
  • Dynamic inference context | UVM only. by @lmcafee-nvidia :: PR: #1983
  • ci: Run auto-update-copy-pr-bot only on forks by @ko3n1g :: PR: #2191
  • Inference throughput tests: refactor goldens to be in list format by @mathemakitten :: PR: #2072
  • Enable TE custom quantization recipe by @negvet :: PR: #2005
  • Add MoE parameters to ModelOpt pruning example + conf fixes by @kevalmorabia97 :: PR: #2205
  • Add repr to pg collection class by @yashaswikarnati :: PR: #2089
  • Move data_samplers.py from legacy to training.datasets & add DistributedSignalHandler to DataLoader workers by @asolergi-nv :: PR: #2068
  • Fix Megatron-FSDP checkpoint save failure by @shjwudp :: PR: #2138
  • Fix moe CODEOWNERS. by @jaredcasper :: PR: #2200
  • chore: Update LICENSE by @ko3n1g :: PR: #2219
  • remove megatron.training dependency from megatron.core for FSDP checkpoint with EP by @ananthsub :: PR: #2113
  • Tensorize dynamic inference mixed sampling by @tdene :: PR: #2105
  • Add unit test for inference DP coordinator by @tdene :: PR: #2187
  • Inference linear layer by @sidsingh-nvidia :: PR: #1908
  • chore: Prefer Nvidia email addresses for reminder bot by @ko3n1g :: PR: #2221
  • [Megatron-FSDP] Fix hang caused by non-deterministic reduce-scatter by @shjwudp :: PR: #2218
  • Remove qwen symlink to fix for case-insensitive FS by @kevalmorabia97 :: PR: #2235
  • Optimizer refactor: clean up public get_megatron_optimizer interface and provide a more general API to support passing in different hyperparameters to subsets of parameters by @deepakn94 :: PR: #2047
  • Fix CI for PR#1983 by @lmcafee-nvidia :: PR: #2245
  • Fix aux-loss logging for hybrid models by @deepakn94 :: PR: #2197
  • Update flops calculation (for throughput) for hybrid MoEs by @deepakn94 :: PR: #2198
  • Enable kv cache in training for eagle by @yeyu-nvidia :: PR: #1895
  • Tensorize dynamic inference mixed sampling (bis) by @tdene :: PR: #2231
  • chore: Fix codeowners by @ko3n1g :: PR: #2264
  • Allow loading checkpoint from iteration 0 by @ananthsub :: PR: #2199
  • ci: Skip install test in merge queue by @chtruong814 :: PR: #2281
  • Add MoE layer type to hybrid models by @deepakn94 :: PR: #2259
  • Add the Hybrid-EP backend to the Flex Dispatcher by @Autumn1998 :: PR: #2176
  • [MAIN][NVFP4] Support NVFP4 MOE with Proper Padding by @zhongbozhu :: PR: #1985
  • Update ModelOpt example readmes and advanced usage by @kevalmorabia97 :: PR: #2273
  • Fix UVM compatibility with CUDA 13. by @lmcafee-nvidia :: PR: #2243
  • ci: Add flaky marker to LTS tests by @ko3n1g :: PR: #2290
  • Dynamic engine suspend/resume via prefill. by @lmcafee-nvidia :: PR: #1982
  • fix: Pass the timeout argument for the EP group by @yanring :: PR: #2268
  • JIT for MoE router and preprocess by @yaox12 :: PR: #1919
  • Hotfix to CI, until the fix gets reviewed by @tdene :: PR: #2298
  • Add functional test for DP coordinator throughput by @tdene :: PR: #2189
  • Add asyncio Queue like in Python 3.13 by @tdene :: PR: #2224
  • Fixes for PR#1982 by @lmcafee-nvidia :: PR: #2303
  • Fix PP KV cache allocation and enable multi-node PP inference by @santhnm2 :: PR: #2182
  • Revert active-buffer-size-gb arg name. by @lmcafee-nvidia :: PR: #2257
  • feat: check: api backwards compatibility by @pablo-garay :: PR: #2251
  • Add MambaInferenceStateConfig dataclass by @santhnm2 :: PR: #2265
  • Fix typo in inference example by @santhnm2 :: PR: #2311
  • feat: initialization of API backward compatibility verification by @pablo-garay :: PR: #2310
  • Fix Mamba TP and remove confusing legacy initialization by @jaredcasper :: PR: #2202
  • Refactor KD to use ModelOpt plugins file by @AAnoosheh :: PR: #2305
  • Fix dynamic context syntax and remove redundant tensors by @kanz-nv :: PR: #2336
  • Improve asyncio exception handling by @tdene :: PR: #2300
  • ci: Upload to testpypi only on main by @ko3n1g :: PR: #2342
  • implement graph config by @kanz-nv :: PR: #2203
  • feat: required check adjustment by @pablo-garay :: PR: #2350
  • fix: load iteration 0 for release checkpoints by @ananthsub :: PR: #2351
  • Explicitly zero out padding token activations for dynamic inference by @santhnm2 :: PR: #2008
  • Bugfix for Mamba with Chunked-Prefill by @sidsingh-nvidia :: PR: #2293
  • Break apart dynamic inference step into 2 methods by @tdene :: PR: #2192
  • Prevent unnecessarily overwriting the default Hugging Face chat template by @santhnm2 :: PR: #2183
  • Refactor KD to use ModelOpt plugins file (v2) by @AAnoosheh :: PR: #2355
  • add FIM dataset support by @dimapihtar :: PR: #2291
  • Revert "Explicitly zero out padding token activations for dynamic inference (#2008)" by @chtruong814 :: PR: #2360
  • Clean up DP coord code & unit test by @tdene :: PR: #2277
  • [4/4] Merge Megatron-RL into LM by @tdene :: PR: #2002
  • Update coordinator control logic to be compatible with RL by @tdene :: PR: #2227
  • ci: Update backwards compat check baseline to 53bbf7a by @chtruong814 :: PR: #2361
  • Account for test regression caused by prints by @tdene :: PR: #2354
  • Remove dependency on megatron.training within megatron.core by @ananthsub :: PR: #2274
  • Fixes for gpt-oss by @cuichenx :: PR: #2038
  • [HOT FIX] Fix bug of hybrid-ep backend in flex-dispatcher by @Autumn1998 :: PR: #2286
  • ci: Remove nemo-ci environment by @chtruong814 :: PR: #2364
  • ci: Pass COMMUNITY_PROJECT_ID to community bot by @chtruong814 :: PR: #2366
  • ci: Remove environment from community-bot by @chtruong814 :: PR: #2376
  • ci: Bump commit for api check to d61029f by @chtruong814 :: PR: #2386
  • Revert: trigger_mbridge_tests.yml file change by @pablo-garay :: PR: #2389
  • build: Upgrade deps by @ko3n1g :: PR: #2289
  • Change KV cache init to empty to speedup graph recording and first prefill by @kanz-nv :: PR: #2358
  • Reduce Overhead in Timers by @yaox12 :: PR: #2210
  • Remove experimental tags for fused kernels. by @Victarry :: PR: #2233
  • Handle UVM compile lock issues by @tdene :: PR: #2299
  • Fix the entropy sign. by @yobibyte :: PR: #2374
  • Remove RL use of mock dataloader and kill RL inference interface on exit by @jon-barker :: PR: #2387
  • Fix block_bag for RL by @kanz-nv :: PR: #2399
  • adding action for checking whether PR author is nvidia employee or not for selecting ephemeral ci hosts by @theothermike :: PR: #2402
  • Added top n log probs by @shanmugamr1992 :: PR: #2262
  • fix: exit failure when PR author is external contributor removed by @theothermike :: PR: #2410
  • Fix logging when no IS is enabled. by @yobibyte :: PR: #2375
  • Various small fixes for Megatron-FSDP. by @cspades :: PR: #2346
  • Add grpo loop functional test by @jon-barker :: PR: #2403
  • YARN position embedding clear forward method lru cache in init function by @guyueh1 :: PR: #2229
  • Graph Config Implementation by @kanz-nv :: PR: #2380
  • fix: adding k8s taints for ephermeral jobs by @theothermike :: PR: #2420
  • ci: Enable functional tests by @ko3n1g :: PR: #2419
  • Reapply "build: Upgrade deps (#2289)" by @ko3n1g :: PR: #2408
  • fix: use a script to do node tainting in the cicd workflow by @theothermike :: PR: #2421
  • Fix rl training with data reuse. by @yobibyte :: PR: #2428
  • Reapply - Add grpo loop functional test by @jon-barker :: PR: #2411
  • chore: Add copyright to run_simple_mcore_train_loop.py by @chtruong814 :: PR: #2441
  • Retry inference test on different device if throughput slower than expected by @mathemakitten :: PR: #2443
  • feat: mcore trigger mbridge by @pablo-garay :: PR: #2340
  • Remove redundant reduce in aux_loss logging by @BestJuly :: PR: #2095
  • chore: Update codeowners for post-training by @ko3n1g :: PR: #2462
  • [Fix] Pass metadata to sharded_state_dict in load_modelopt_checkpoint by @kevalmorabia97 :: PR: #2451
  • Add support for fake distributed process groups. by @Victarry :: PR: #2280
  • fix: Add merge_group support with pre-flight pattern by @pablo-garay :: PR: #2463
  • Add missing checkpoint arguments for MoE models by @santhnm2 :: PR: #2465
  • Add assertion for mxfp8 params without dp overlap by @kunlunl :: PR: #2271
  • Clean log probs by @shanmugamr1992 :: PR: #2404
  • ci: Bump copyright workflow by @ko3n1g :: PR: #2473
  • Fix ImportError and NameError in examples/run_simple_mcore_train_loop.py by @marksverdhei :: PR: #1980
  • fix: Revert "Clean log probs (#2404)" by @chtruong814 :: PR: #2475
  • Make grpo CI test use read-only data by @jon-barker :: PR: #2472
  • Fix default.yaml for HFDatasetAgent use in countdown by @jon-barker :: PR: #2487
  • Update golden values to allow new PRs to be merged by @tdene :: PR: #2478
  • Clean log probs copy by @shanmugamr1992 :: PR: #2477
  • Attention mask as PackedSeqParams by @jalbericiola :: PR: #2461
  • fp8 param cuda graph support main by @kunlunl :: PR: #2088
  • docs: Add changelog for 0.15 by @ko3n1g :: PR: #2499
  • feat: improve external contributor single use ephemeral nodes by @theothermike :: PR: #2503
  • Fix sequence parallel. by @yobibyte :: PR: #2444
  • update API check baseline by @pablo-garay :: PR: #2505
  • Associate default rl cuda graphs attributes with args by @yobibyte :: PR: #2453
  • No using tokenizer in request record. by @lmcafee-nvidia :: PR: #2382
  • make default --inference-dynamic-batching-cuda-graph-max-tokens value match old version by @jon-barker :: PR: #2540
  • Adjust the default CG size for functional test by @tdene :: PR: #2544
  • feat: API compat: ignore AttributeChangedValueBreakage (not a signature change) by @pablo-garay :: PR: #2543
  • feat: add decorator: experimental_api by @pablo-garay :: PR: #2539
  • ci: Add release workflows by @ko3n1g :: PR: #2507
  • Fixing PG routing for inference & training separation by @wdykas :: PR: #2485
  • ci: Fix release workflow by @ko3n1g :: PR: #2553
  • fix: Duplicate artifact names by @ko3n1g :: PR: #2556
  • ci: Avoid naming collision by @ko3n1g :: PR: #2558
  • ci: Fixing naming collision by @ko3n1g :: PR: #2559
  • fix: publish release wheel and github release version number by @ko3n1g :: PR: #2561
  • Fix MoE capacity handling by @DaizeDong :: PR: #2214
  • Avoid calling set_save_original_input with FP8 delayed scaling by @dalgarak :: PR: #1860
  • build: Bump TE to 2.10 by @ko3n1g :: PR: #2496
  • add more tokenizer arguments by @dimapihtar :: PR: #2377
  • Add per-module TE quant config. by @kwyss-nvidia :: PR: #2359
  • Make check_large_grads non-fatal by @kwyss-nvidia :: PR: #2307
  • fix for sequence packing plus sequence parallel: padding the sequence to a multiple of TP by @jalbericiola :: PR: #2574
  • Torch symmetric - new latency optimized NVLS communication kernels for sequence parallelism by @sidsingh-nvidia :: PR: #1997
  • Various quality-of-life improvements in training loop by @deepakn94 :: PR: #2580
  • [Main] Support MTP packed-seq in main branch by @BestJuly :: PR: #2173
  • Support TP greater than num_kv_heads by supporting QKV activation sub-sharding by @deepakn94 :: PR: #2565
  • Fix FA3 import by @santhnm2 :: PR: #2577
  • Fix runaway Etpt in straggler detector by resetting FLOPs accumulator by @cms42 :: PR: #1755
  • Rename TensorRT Model Optimizer to Model Optimizer by @AAnoosheh :: PR: #2373
  • Fix aux loss scale when CP is enabled. by @Victarry :: PR: #2237
  • Save memory using main_param for moe in param_l2_norm by @BestJuly :: PR: #2249
  • Changes to support latent MoEs by @deepakn94 :: PR: #2296
  • update API compat check baseline to b51db3e by @pablo-garay :: PR: #2588
  • Fix invalid argument failing tests on main by @tdene :: PR: #2589
  • Add openmathinstruct config. by @yobibyte :: PR: #2586
  • Move model configs to github. by @yobibyte :: PR: #2587
  • fix: Assign tokenizer to Encoder.tokenizer in legacy mode by @iuyo5678 :: PR: #2498
  • Delete redundant import in yaml_arguments.py by @wplf :: PR: #2139
  • Fix world size mismatch causing distributed init deadlock (Issue #2458) by @CodersAcademy006 :: PR: #2571
  • Improve performance of request_metadata logic by @tdene :: PR: #2378
  • Fix broken Table of Contents links in README.md by @JungHoyoun :: PR: #1954
  • Add minor log update by @gautham-kollu :: PR: #2080
  • Fix link to NeMo performance summary documentation by @janbernloehr :: PR: #2190
  • Prep for refit by @wdykas :: PR: #2590
  • feat: API compat: ignore ParameterMovedBreakage for init methods by @pablo-garay :: PR: #2595
  • Fix NameError in pretrain_retro.py (add import_module), remove unused… by @vignesh1507 :: PR: #2084
  • QK logits clipping (non-split version) by @BoxiangW :: PR: #1929
  • update checkpointing documentation by @dimapihtar :: PR: #2606
  • [training migration] add training config dataclass and arg generation utility by @maanug-nv :: PR: #2306
  • Check skip_prompt_log_probs in add_request by @tdene :: PR: #2593
  • Refit prep 2 by @wdykas :: PR: #2608
  • Batch Invariance by @wdykas :: PR: #2308
  • Remove flattened_range code paths for distributed optimizer checkpointing by @dimapihtar :: PR: #2126
  • update commit by @dimapihtar :: PR: #2631
  • Create separate teacher Layer Spec in KD mode by @AAnoosheh :: PR: #2429
  • [docs] Migrate docs to new Sphinx by @Phlip79 :: PR: #2489
  • Nemotron nano v2 vl changes for Megatron Bridge by @cuichenx :: PR: #2078
  • Dynamic context | Re-add max_requests arg. by @lmcafee-nvidia :: PR: #2488
  • Inference | Fix entangled request generations. by @lmcafee-nvidia :: PR: #2584
  • fix gpt3_mcore_reruns_resume_check_grads by @dimapihtar :: PR: #2646
  • Add option to only log inference every N steps by @tdene :: PR: #2637
  • [docs] Use autodoc2 and remove automodule by @Phlip79 :: PR: #2542
  • add backward compatibility support for loading mcore 0.15 checkpoints by @dimapihtar :: PR: #2648
  • add offline eagle3 instructions to readme by @yeyu-nvidia :: PR: #2246
  • Only initialize symmetric memory when needed by @sidsingh-nvidia :: PR: #2665
  • Update docstrings for dataset by @Phlip79 :: PR: #2666
  • Simplify parameter sync for checkpoint save by @ananthsub :: PR: #2344
  • [Megatron-FSDP] Support both old and new DeviceMesh APIs. by @cspades :: PR: #2575
  • Enable hybrid tensor + expert + data parallelism in mcore inference by @sidsingh-nvidia :: PR: #2470
  • Fix failing functional tests by @sidsingh-nvidia :: PR: #2679
  • M4 + Dist Checkpoint: Replace global parallel state with explicit group parameters by @dimapihtar :: PR: #2053
  • fix deprecated decorator import by @dimapihtar :: PR: #2680
  • Inference | Add request only if no paused requests. by @lmcafee-nvidia :: PR: #2600
  • Added integration for Kitchen extensions' SDPA and FA implementations by @frsun-nvda :: PR: #2232
  • Pipeline parallelism fix in RL and sequence packing rewriting by @jalbericiola :: PR: #2632
  • Add oncall rotation by @Phlip79 :: PR: #2622
  • Upgrade GitHub Actions to latest versions by @salmanmkc :: PR: #2678
  • docs: Adding documentation.md to cover building documentation. by @aschilling-nv :: PR: #2683
  • [Megatron-FSDP] Build default FSDP DeviceMesh, and remove model arg from fully_shard_optimizer(). by @cspades :: PR: #2471
  • Add moe layer perf UT. by @Victarry :: PR: #2673
  • [docs] Add ability to disable autodoc2 for local builds by @Phlip79 :: PR: #2669
  • Fix oncall assignment by @Phlip79 :: PR: #2686
  • docs(readme): update Latest News section by @sbhavani :: PR: #2684
  • Update RNG sharding to include EP rank by @paul-gibbons :: PR: #2658
  • Add CODEOWNER for API backwards compatibility check files by @pablo-garay :: PR: #2687
  • Mark API backwards compatibility checks as OPTIONAL (non-blocking) by @pablo-garay :: PR: #2697
  • pip install uv during GH action by @Phlip79 :: PR: #2695
  • Don't delete svcnvidia-nemo-ci team from oncall by @Phlip79 :: PR: #2703
  • RL: Rollouts should be distributed over the regular data parallel group by @sidsingh-nvidia :: PR: #2634
  • Use pull_request_target and don't use uv by @Phlip79 :: PR: #2702
  • Optimize TE cudagraph input memory by @buptzyb :: PR: #2392
  • ci(fix): Pin gojq to stable version by @ko3n1g :: PR: #2480
  • NVLS - fused reduce-scatter + residual + rms-norm + all-gather kernel by @sidsingh-nvidia :: PR: #2599
  • Default UVM level to 0. by @lmcafee-nvidia :: PR: #2450
  • docs: improve documentation organization and add additional guides by @sbhavani :: PR: #2671
  • Revert "Default UVM level to 0. (#2450)" by @chtruong814 :: PR: #2713
  • Add missing imports in no-triton fallback by @maanug-nv :: PR: #2711
  • Fixes for #2450. by @lmcafee-nvidia :: PR: #2714
  • Add RL parameter to set parallel generation tasks by @tdene :: PR: #2712
  • Refit prep 3 by @wdykas :: PR: #2708
  • chore: Add cudagraph codeowners by @ko3n1g :: PR: #2720
  • [docs] Add developer section to docs by @Phlip79 :: PR: #2717
  • Fix UVM argument for RL by @tdene :: PR: #2722
  • [docs] Update docs title to Megatron Core by @Phlip79 :: PR: #2729
  • remove fp16 assert in moe_grouped_gemm & EP by @HaochenYuan :: PR: #2495
  • Improve ModelOpt paths & add more Nemotron/hybrid model support by @jenchen13 :: PR: #2131
  • Add options to improve data loader initialization time, especially at scale by @asolergi-nv :: PR: #2445
  • ci: Fix copy-pr-bot update by @ko3n1g :: PR: #2736
  • Add oncall to all new PRs by @Phlip79 :: PR: #2734
  • Hsdp register submesh fix lifuz mirror by @tomlifu :: PR: #2467
  • Update sequence packing case when dummy PackedSeqParams are used by @mathemakitten :: PR: #2743
  • Add support for non-decode CUDA graphs for Mamba models by @santhnm2 :: PR: #2474
  • Fix oncall assign by @Phlip79 :: PR: #2737
  • Adding stop word support by @shanmugamr1992 :: PR: #2685
  • feat: manual registration mode for nccl-ub option when using megatron-fsdp by @youngeunkwon0405 :: PR: #2661
  • Update oncall for next few weeks by @Phlip79 :: PR: #2748
  • Prep work for migrating to types from ModuleSpec by @nschank :: PR: #2668
  • feat(MoE): Refactor cuda_graph_scope by @buptzyb :: PR: #1920
  • Fix merge conflict in #1920 by @tdene :: PR: #2781
  • ci: Allow disabling external contributors by @chtruong814 :: PR: #2784
  • Reflect the changes made by #1920 in RL by @tdene :: PR: #2780
  • Fix 2780 by @tdene :: PR: #2791
  • Update PR message by @Phlip79 :: PR: #2778
  • Ignore bot for oncall by @Phlip79 :: PR: #2756
  • Only assign oncall to main PRs by @Phlip79 :: PR: #2755
  • Explicitly zero out padding token outputs when using quantization scales by @santhnm2 :: PR: #2585
  • Synchronize total block count across pipeline parallel ranks by @santhnm2 :: PR: #2578
  • Optimize TE CUDA Graph capturing time by @buptzyb :: PR: #2482
  • Do a pass of typing fixes on transformer/ by @nschank :: PR: #2766
  • moe: remove unused variable scale_up by @WineChord :: PR: #1670
  • build: Pin down nvidia-nvshmem-cu13 (#2798) by @ko3n1g :: PR: #2803
  • DeepSeek V3 FSDP Fix for Precision-Aware Optimizer by @tomlifu :: PR: #2466
  • Minor Fixes on Post-Training ModelOpt Examples by @ChenhanYu :: PR: #2813
  • fix(moe): Support HybridEP and reduce memory overhead for 1F1B A2A overlap by @lhb8125 :: PR: #2236
  • Inference memory test by @wdykas :: PR: #2724
  • Move batch invariance mode init to initialize.py by @santhnm2 :: PR: #2832
  • Move full model init to cuda stream to avoid race condition leading to empty parameters in DDP by @jstjohn :: PR: #2652
  • [docs] Cleanup homepage by @Phlip79 :: PR: #2823
  • [docs] Update oncall doc by @Phlip79 :: PR: #2822
  • Make default for rerun_mode=disabled not terminate with non-fatal rer… by @kwyss-nvidia :: PR: #2773
  • Bugfix: ensure spawned persistent checkpoint worker sets its CUDA device correctly for CUDA context creation / hypothetical memory allocations by @ankurv-nvidia :: PR: #2710
  • Implementation of a more flexible optimizer/scheduler override system by @jstjohn :: PR: #2723
  • ci(fix): PyPI upload by @ko3n1g :: PR: #2843
  • ci(fix): Don't fail on empty var by @ko3n1g :: PR: #2850
  • Add RL support for MOEs by @jon-barker :: PR: #2742
  • ci(fix): GH release version tag by @ko3n1g :: PR: #2854
  • Reduce the scope of the side stream around DDP initialization by @jstjohn :: PR: #2852
  • Manually update first oncall rotation by @Phlip79 :: PR: #2855
  • Remove flaky iteration time functional test by @buptzyb :: PR: #2862
  • Nccl gloo refit for RL by @wdykas :: PR: #2812
  • build: Bump jet-client by @ko3n1g :: PR: #2876
  • Dynamic Inference | Evict and re-compute context requests. by @lmcafee-nvidia :: PR: #2738
  • Change oncall team name by @Phlip79 :: PR: #2861
  • Revert "Dynamic Inference | Evict and re-compute context requests. (#2738)" by @chtruong814 :: PR: #2884
  • [main] feat(moe): Support moe shared expert gate for Qwen3-Next (2/4) by @yuzhongw-nvidia :: PR: #2751
  • [main] feat(moe): Support attention output gate for Qwen3-Next (3/4) by @yuzhongw-nvidia :: PR: #2752
  • [docs] Fix docs and add generation doc by @Phlip79 :: PR: #2882
  • Fix CUDA RNG Tracker by @buptzyb :: PR: #2641
  • FP8 params support for megatron-fsdp (MXFP8/Blockwise) by @kunlunl :: PR: #2239
  • docs: fix broken images, links, and typos across documentation by @sbhavani :: PR: #2794
  • ci(fix): Release version by @ko3n1g :: PR: #2873
  • Assign mcore-oncall instead of user by @Phlip79 :: PR: #2879
  • tests: Disable Mamba MOE model test after 43b4471 by @ko3n1g :: PR: #2886
  • Fix mamba moe unit test after commit reversion by @jon-barker :: PR: #2888
  • Fix inference server to make nemogym work. by @yobibyte :: PR: #2887
  • Use DynamicInferenceCoordinator for text generation server by @santhnm2 :: PR: #1910
  • Improve error messages in mamba moe unit test by @jon-barker :: PR: #2889
  • [training migration] add RNG config dataclass by @maanug-nv :: PR: #2347
  • [training migration] Add RerunStateMachineConfig dataclass by @maanug-nv :: PR: #2436
  • Add retry loop with exponential backoff in dataloader as a form of in-application fault tolerance by @deepakn94 :: PR: #2836
  • [training migration] Add SchedulerConfig dataclass by @maanug-nv :: PR: #2400
  • RL: Fix cu_seqlens construction for PackedSeqParams by @mathemakitten :: PR: #2883
  • [training migration] Add ProfilingConfig dataclass by @maanug-nv :: PR: #2393
  • [MoE] Apply grouped gemm bias before unpadding for FP8 by @cuichenx :: PR: #2817
  • Update Slack user group when oncall changes by @Phlip79 :: PR: #2859
  • Remove unused FlashAttention3 args by @santhnm2 :: PR: #2898
  • Use different token for assign logic by @Phlip79 :: PR: #2893
  • chore: Add --no-container-mount-home to script by @ko3n1g :: PR: #2906
  • build: Bump deps by @ko3n1g :: PR: #2911
  • Fix RL sequence packing bin size by @tdene :: PR: #2909
  • feat: m4 leftover changes by @yaoyu-33 :: PR: #2506
  • Revert "Remove unused FlashAttention3 args (#2898)" by @chtruong814 :: PR: #2916
  • ci: Skip broken tests after dependency bump by @chtruong814 :: PR: #2934
  • Ko3n1g/build/downgrade flashinfer by @ko3n1g :: PR: #2937
  • ci: Skip unit test cleanup by @chtruong814 :: PR: #2940
  • build: 26.02 dependency bump main by @ko3n1g :: PR: #2923
  • RL refit pipelining support by @wdykas :: PR: #2878
  • [MAIN][NVFP4][MOE] 128 Zero Padding for Grouped Quantization kernels and Cuda Graph Support by @zhongbozhu :: PR: #2655
  • Support DDP overlap for models with repeated parameters by @deepakn94 :: PR: #2837
  • Add muon and layerwise distributed optimizer by @FDecaYed :: PR: #2241
  • Revert "[dev] Add assertion for mxfp8 params without dp overlap (#2270)" by @ko3n1g :: PR: #2901
  • Unit test for model_provider to model_builder coupling by @AAnoosheh :: PR: #2925
  • ci: Onboard GB200 by @ko3n1g :: PR: #2847
  • Install slack-sdk using uv by @Phlip79 :: PR: #2948
  • Inference | Evict overflow paused requests from context. by @lmcafee-nvidia :: PR: #2926
  • Enable training cudagraphs for RL by @mathemakitten :: PR: #2452
  • feat(moe): Support placing MTP layers into standalone stages by @BestJuly :: PR: #2136
  • Various fixes to in-job restarter and better time accounting of startup operations by @hexinw-nvidia :: PR: #2698
  • Fix minor README wording and capitalization by @Deepak-J0shi :: PR: #2928
  • ci: Restore grpo tests by @ko3n1g :: PR: #2952
  • Fix GitHub GRPO resharding functional test by @tdene :: PR: #2927
  • cp: ci(fix): GB200 racecondition (2962) into main by @ko3n1g :: PR: #2963
  • Add out-of-SLA link by @Phlip79 :: PR: #2903
  • feat(moe): Fine-grained activation offloading by @lhb8125 :: PR: #1913
  • Fix broken mamba-moe unit test by @jon-barker :: PR: #2970
  • ci: Fix GB200 change by @ko3n1g :: PR: #2969
  • Update golden values for reshard test by @tdene :: PR: #2971
  • chore: Update golden values by @ko3n1g :: PR: #2973
  • Pass through --trust-remote-code and add this to all Nemotron model configs by @ChenhanYu :: PR: #2939
  • Cuda 13 UVM by @wdykas :: PR: #2957
  • Enable phase transition iterations by @jkamalu :: PR: #2938
  • add missing import in rl_utils.py by @jon-barker :: PR: #2915
  • Add sequence packing support for hybrid model by @duncanriach :: PR: #2913
  • [Main] Partial CUDA Graph support for EP Overlap by @Wohox :: PR: #2184
  • docs(megatron-fsdp): add Megatron-FSDP user guide by @xuwchen :: PR: #2396
  • DeepSeek V3.2 support by @kunlunl :: PR: #2440
  • fully remove zarr support by @dimapihtar :: PR: #2944
  • chore: Standardize setuptools version by @ko3n1g :: PR: #2975
  • ci: Run functional tests on main by @ko3n1g :: PR: #2983
  • ci(fix): CI_COMMIT_BRANCH on forks by @ko3n1g :: PR: #2982
  • [main] feat(moe): Support gated delta net for Qwen3-Next (1/4) by @yuzhongw-nvidia :: PR: #1989
  • ci: Add more gb200 nightly tests by @ko3n1g :: PR: #2981
  • [main] feat(moe): Support apply wd to qk layernorm for Qwen3-Next (4/4) by @yuzhongw-nvidia :: PR: #2753
  • Re-submit "Various fixes to in-job restarter and better time accounting of startup operations" by @hexinw-nvidia :: PR: #2954
  • Use slack-sdk in a different manner by @Phlip79 :: PR: #2950
  • Hybrid Context Parallel Feature by @parthmannan :: PR: #2282
  • Inference | Move assert active_request_count > 0. by @lmcafee-nvidia :: PR: #2958
  • Set token_dtype_code init value in GPTDatasetConfig to fix CI by @asolergi-nv :: PR: #2912
  • [main] ci(moe): Add --apply-wd-to-qk-layernorm flag to the gdn test case by @yuzhongw-nvidia :: PR: #2995
  • ci: Disable step time on `gpt3_moe_mcore_te_tp2_pp2_ep4_etp1_no_mtp_n… by @ko3n1g :: PR: #2991
  • ci: Fix workflows on main by @ko3n1g :: PR: #2990
  • Make Megatron-FSDP torch.compile compatible by @shjwudp :: PR: #2425
  • [Megatron-FSDP] Test FP8 activations + parameter sharding with Megatron-FSDP fully-shard. Update README. by @cspades :: PR: #2894
  • chore: Escape special chars by @ko3n1g :: PR: #3014
  • Improve memory logging by @deepakn94 :: PR: #2839
  • Add a wrapper function for FA3 _flash_attn_forward call by @santhnm2 :: PR: #2933
  • chore: Set umask 0002 by @ko3n1g :: PR: #3027
  • Make attn mask inversion in-place instead of allocating it again by @mathemakitten :: PR: #3019
  • [Megatron-FSDP] Fix incorrect gradient scaling target. by @cspades :: PR: #3023
  • Replaces ModuleSpec with Protocols for some of the inputs to SelfAttention/CrossAttention by @nschank :: PR: #2761
  • Various CUDA graph improvements on capture time, replay time, memory footprint by @jiemingz :: PR: #2572
  • Update oncall schedule by @Phlip79 :: PR: #3017
  • Ensure that last prefill chunk is handled correctly by Mamba models by @santhnm2 :: PR: #2897
  • Add script for batch running CI tests across distinct nodes by @jon-barker :: PR: #3047
  • Refit EP support by @wdykas :: PR: #2972
  • Catch case of negative tokens to generate by @tdene :: PR: #2985
  • Sync GitHub and Slack teams by @Phlip79 :: PR: #3037
  • ci: Remove Github transition comment from CI by @chtruong814 :: PR: #2881
  • Support custom Router implementations in MoELayer by @nschank :: PR: #2891
  • ci: Override N_REPEAT by @ko3n1g :: PR: #3051
  • Update type hints and doc strings for moe_utils.py by @JavaZeroo :: PR: #2821
  • Supporting inference when called within an asyncio loop by @shanmugamr1992 :: PR: #2816
  • Remove calculation of padding token in moe routing loss by @HaochenYuan :: PR: #2142
  • Bug fix with --no-use-tokenizer-from-checkpoint-args by @jon-barker :: PR: #3049
  • Revert "Bug fix with --no-use-tokenizer-from-checkpoint-args (#3049)" by @thomasdhc :: PR: #3057
  • Add health endpoint to dynamic text gen server by @santhnm2 :: PR: #3009
  • ci: Skip test_precision_aware_optimizer by @thomasdhc :: PR: #3062
  • Support multimodule communication by @yaoyu-33 :: PR: #2031
  • Revert "Support multimodule communication (#2031)" by @ko3n1g :: PR: #3068
  • Revert "Remove calculation of padding token in moe routing loss (#2142)" by @ko3n1g :: PR: #3069
  • Add ability to save wgrads and dgrads by @deepakn94 :: PR: #3032
  • ci: Mark test_mode_partial_cudagraph unit tests as flaky by @chtruong814 :: PR: #3064
  • Keep FSDP's and DDP's finish_grad_sync API identical by @deepakn94 :: PR: #3070
  • (REPLAY) Bug fix with --no-use-tokenizer-from-checkpoint-args by @jon-barker :: PR: #3059
  • Optimizing post-processing of requests by @sidsingh-nvidia :: PR: #2920
  • Fix broken functional tests in #2920 by @sidsingh-nvidia :: PR: #3071
  • fix ep weight gradnorm/num_zero calculation error for muon by @FDecaYed :: PR: #3024
  • [training migration] Add LoggerConfig dataclass by @maanug-nv :: PR: #2414
  • Added --ft-num-warmup-iters option. by @hexinw-nvidia :: PR: #3052
  • Reapply "Various CUDA graph improvements on capture time, replay time, memory footprint (#2572)" by @jiemingz :: PR: #3056
  • fix(fsdp): add CLI argument for outer_dp_sharding_strategy by @liuyun7345 :: PR: #3053
  • ci: Log node name by @ko3n1g :: PR: #3081
  • docs: Release docs by @ko3n1g :: PR: #3055
  • Support NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 FP8/NVFP4 PTQ in example by @ChenhanYu :: PR: #3079
  • add all_gather process-group for overlapping in fsdp distributed training by @jeffnvidia :: PR: #2663
  • Add router replay for MoE models by @litianjian :: PR: #2101
  • ci: Disable gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq by @ko3n1g :: PR: #3099
  • ci: Repeat func tests, save logs of unit tests and lessen debug output by @ko3n1g :: PR: #3089
  • ci: Update improvement of step-time by @ko3n1g :: PR: #3104
  • ci: Add GPU health checks by @ko3n1g :: PR: #3100
  • Harden GRPO functional tests by @jon-barker :: PR: #3065
  • build: Bump to TE2.12 by @ko3n1g :: PR: #3086
  • Inference functional tests: Write outputs to INFERENCE_OUTPUT_PATH instead of TENSORBOARD_PATH by @mathemakitten :: PR: #3061
  • Update moe readme. by @Victarry :: PR: #2830
  • Logging cleanup (only log on rank 0 if possible) by @deepakn94 :: PR: #3036
  • Move all bert and t5 tests to nightly by @Phlip79 :: PR: #3106
  • Create greptile.json by @Phlip79 :: PR: #3087
  • Fix bug of reuse_grad_buf_for_mxfp8_param_ag by @kunlunl :: PR: #2802
  • Fix for Hybrid CP by @parthmannan :: PR: #3091
  • Fix GRPO re-fit functional test by @jon-barker :: PR: #3113
  • Minimize README contents by @megnvidia :: PR: #3020
  • Add end-to-end tests for M-FSDP and ND-Parallel by @shjwudp :: PR: #3031
  • [M-FSDP] Fix double buffering not working with activation recompute by @shjwudp :: PR: #2689
  • Fix Multimodal Dockerfile by @faradawn :: PR: #3006
  • [training migration] Add CheckpointConfig dataclass by @maanug-nv :: PR: #2431
  • [training migration] Add StragglerDetectionConfig dataclass by @maanug-nv :: PR: #2435
  • Standardize RL unit tests by @tdene :: PR: #3088
  • Use the latest hybrid-ep by @Autumn1998 :: PR: #3093
  • remove retro by @dimapihtar :: PR: #3001
  • ci: Mark test_compatible_with_nd_parallel as flaky by @ko3n1g :: PR: #3122
  • build: Use merge-commit-sha for container by @ko3n1g :: PR: #3123
  • Refactor rl_offload_kv_cache_during_training to offload KV cache to CPU while retaining fixed virtual address by @mathemakitten :: PR: #3048
  • Disable Greptile status comments by @Phlip79 :: PR: #3127
  • ci: Add unit tests to merge queue by @ko3n1g :: PR: #3125
  • Create CodeRabbit config by @Phlip79 :: PR: #3131
  • build: Explicitly set minimum torch version to >= 2.6.0 by @chtruong814 :: PR: #3085
  • Move kitchen extension file to private kitchen repository by @kwyss-nvidia :: PR: #2779
  • Revert "Fix RL optimizer offload (#3112)" by @ko3n1g :: PR: #3141
  • Revise and move KD docs by @AAnoosheh :: PR: #3108
  • build: Bump FLA by @ko3n1g :: PR: #3139
  • ci: Add job timeouts by @ko3n1g :: PR: #3142
  • Multiturn rollout support prep by @yobibyte :: PR: #2966
  • ci: Set NODE_RANK by @ko3n1g :: PR: #3143
  • Reapply 3955c49 by @jon-barker :: PR: #3146
  • Revert "Multiturn rollout support prep (#2966)" by @ko3n1g :: PR: #3153
  • Fix coderabbit instructions error by @Phlip79 :: PR: #3150
  • Force input ids generated by mock dataset are < vocab_size by @asolergi-nv :: PR: #2945
  • Add a check to make sure we are distributing all the layers when using --decoder-first-pipeline-num-layers & --decoder-last-pipeline-num-layers by @asolergi-nv :: PR: #2947
  • Automatically choose available ports in ZMQ by @tdene :: PR: #2278
  • Generate arguments from TransformerConfig by @maanug-nv :: PR: #2896
  • Fix for PR-2142 by @HaochenYuan :: PR: #3165
  • ci: Onboard more GB200 tests by @ko3n1g :: PR: #3145
  • ci(hotfix): Alert for GB200 by @ko3n1g :: PR: #3168
  • Fix SFTDataset truncation bug by @duncanriach :: PR: #3158
  • Vitalyk/multiturn v2 by @yobibyte :: PR: #3167
  • ci: Disable the api check for now by @chtruong814 :: PR: #3157
  • ci: Add DSv3 proxy by @ko3n1g :: PR: #3169
  • Nvshmem refit by @wdykas :: PR: #2696
  • [Community][Main] fix(moe): Fix theoretical memory calculation of layernorm. by @1195343015 :: PR: #2434
  • fix: Set --refit-method default to gloo by @wdykas :: PR: #3172
  • [fix] Bug fix for offloading in evaluate() by @lhb8125 :: PR: #3043
  • cp: Fix: nccl-ub in ddp path (3181) into main by @ko3n1g :: PR: #3182
  • Miscellaneous inference cleanup by @santhnm2 :: PR: #2955
  • ci: Fix DSv3 by @ko3n1g :: PR: #3188
  • Fix missing argument in MoELayer.forward() by @jiemingz :: PR: #3133
  • Fix H2D stream synchronization in optimizer offload by @tgkyrie :: PR: #3140
  • Add MTP support for hybrid models by @rkarimimahab :: PR: #2363
  • docs: improve Megatron-LM and Megatron Core descriptions by @sbhavani :: PR: #3115
  • Handle step key correctly in checkpoint save with --optimizer-cpu-offload by @ahmadki :: PR: #2874
  • cp: ci: Checkpoint retention (3205) into core_r0.16.0 by @ko3n1g :: PR: #3222
  • cp: Fix uv install for GH actions (#3259) by @ko3n1g :: PR: #3275
  • cp: Fix missing PackedSeqParams import (3214) into core_r0.16.0 by @ko3n1g :: PR: #3236
  • cp: fix: numpy overflow (3306) into core_r0.16.0 by @ko3n1g :: PR: #3328
  • Missing import fix (#3241) by @parthmannan :: PR: #3298
  • cp: fix: T5 dataset (#3307) by @ko3n1g :: PR: #3329
  • cp: build: Bump TE on 2.12 by @ko3n1g :: PR: #3372
  • cp: Improved parallel logging of learning rate by @ko3n1g :: PR: #3367
  • cp: ci: Update release-docs workflow to use FW-CI-templates v0.72.0 (3438) into core_r0.16.0 by @ko3n1g :: PR: #3454
  • cp: ci: Remove environments (3462) into core_r0.16.0 by @ko3n1g :: PR: #3481
  • cp: Update release workflow to include changelog and publish docs (#3472) by @chtruong814 :: PR: #3480
  • chore(beep boop 🤖): Bump uv.lock (core_r0.16.0) (2026-02-19) by @svcnvidia-nemo-ci :: PR: #3502
  • docs: Update docs for 0.16.0 by @chtruong814 :: PR: #3505
  • chore(beep boop 🤖): Bump uv.lock (core_r0.16.0) (2026-02-23) by @svcnvidia-nemo-ci :: PR: #3533
  • docs: Update docs version picker for 0.16.0 to include nightly by @chtruong814 :: PR: #3547
  • cp: ci: Test docs build (#3583) by @ko3n1g :: PR: #3593
  • cp: Changes of CICD workflow by @ko3n1g :: PR: #3603
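Several entries above concern runtime robustness; for example, "Automatically choose available ports in ZMQ" (PR #2278) addresses the common problem of picking a free port at startup. A minimal stdlib sketch of the usual technique — binding to port 0 so the OS assigns an ephemeral port — is shown below. This is illustrative only and is not the actual Megatron-LM/ZMQ implementation; the helper name is hypothetical.

```python
import socket

def find_free_port() -> int:
    """Return an OS-assigned free TCP port.

    Binding to port 0 asks the kernel to pick any available
    ephemeral port; we read it back via getsockname().
    Hypothetical helper, not the Megatron-LM API.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 -> kernel selects a free port
        return s.getsockname()[1]

port = find_free_port()
print(port)
```

Note the small race inherent to this pattern: the port is released when the probe socket closes, so another process could grab it before the real server binds; binding the actual ZMQ socket to a wildcard port directly avoids that window.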
