NVIDIA/Megatron-LM core_v0.14.0 on GitHub

Features
- Inference
  - Add async support for DynamicInferenceEngine (MR !3187)
  - Pad input tensors and enable FP8 weights for FP8 inference (MR !3341)
  - Force inference to always gather logits with tensor parallelism (MR !3442)
  - Multi batch size CUDA Graphs for Dynamic Inference (MR !3402)
- Post-training
  - ModelOpt updates (MR !3268)
    - Add speculative decoding AR validation feature
    - Add DeepSeek and Qwen model configs
- Performance
  - ModelCommProcessGroup integration (MR !3391)
  - Add HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism (MR !3398)
    - Flexible creation and management of communication groups
  - Add support for Spike No More embedding initializations and weight decay skipping (MR !3500)
- MoE
  - We're actively optimizing large-scale fine-grained MoE performance on the Blackwell platform.
  - Features:
    - Support Expert Parallel A2A Overlapping (MR !3470; MR !3074)
    - Support CP and recompute for MTP (MR !3330)
    - Add support for global aux loss (MR !3318)
  - Memory Optimization
    - Support recomputation for FP8 layernorm/moe_act/shared_experts (MR !3465)
    - Support optimizer offloading for DSV3 FP8 training (MR !3659)
  - Performance Optimization
    - Add MoE router fusion (MR !3809)
    - Updates for MoE cudagraph (MR !3631)
  - Bug fixes:
    - Fix router input jitter dtype (MR !3774)
- Model support
  - Add MiMo video VLM train example (MR !3543)
  - Add AVLM for MIMO (MR !3624)
- Ease of use
  - Add uv support for source installs (MR !3615)
  - Automated weekly prereleases (MR !3574)
Bug fixes
- Use mscale_all_dim for softmax_factor (MR !2800)
- Fix FP8 param blockwise scaling unit test (MR !3480)
- Fix blockwise scaling unit test (MR !3491)
- Optimize prefill for token-less requests (MR !3499)
- Add default values for Fp8Padding and Fp8Unpadding (MR !3501)
- Fix CUDA graph logic for flexible pp layout (MR !3505)
- Load FP8 models with strict=False (MR !3508)
- Skip rope check for torch < 1.4.0 (MR !3528)
- Disable Apex tests for stability (MR !3539)
- Fix typo in parallel_state expert parallelism (MR !3548)
- Guard modelopt on macOS (MR !3549)
- Retry on CUDA function failure (MR !3554)
- Fix NCCL mem pool creation error (MR !3557)
- Fix get_rotary_seq_len return type (MR !3559)
- Retry on CUDA function failure (MR !3560)
- Fix NCCL allocator attribute error (MR !3565)
- Ensure multi-prompt inference works (MR !3568)
- Fix MD5 on FIPS systems (MR !3577)
- Fix dynamic context and inference bugs (MR !3582)
- Fix TE version for interleaved fused RoPE (MR !3586)
- Fix MTP with MoE and TP logging (MR !3594)
- Guard TE import fix (MR !3596)
- Add assertion for NCCL UB case (MR !3599)
- Remove encoder PP-related functions (MR !3604)
- Fix segfaults in tests (MR !3605)
- Fix TE error in distributed optimizer (MR !3625)
- Remove redundant barrier in checkpoint flow (MR !3626)
- Support VPP MTP, fix logging (MR !3630)
- Retry mechanism for free(): invalid pointer errors (MR !3632)
- Fix test_replication.py issues (MR !3633)
- Fix typo in parallel_state (MR !3634)
- Fix CUDA graph logic determination (MR !3635)
- Fix TE installation error (MR !3636)
- Ensure correct sharding type in local tests (MR !3643)
- Fix cudagraphed backward buffer reuse for last layer (MR !3645)
- Set default for packed_seq_params in get_rotary_seq_len (MR !3651)
- Fix dynamic example script errors (MR !3653)
- Guard TE import fix (MR !3666)
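
For the FIPS-related MD5 fix listed above (MR !3577), the sketch below shows one common way to compute a non-security MD5 checksum on FIPS-enabled systems in Python 3.9+. It is an illustration only and is not taken from the MR; the helper name `checksum_md5` is hypothetical.

```python
import hashlib


def checksum_md5(data: bytes) -> str:
    """Return the hex MD5 digest of `data` for non-security use (hypothetical helper).

    On FIPS-enabled builds, a plain hashlib.md5() call can raise ValueError.
    Passing usedforsecurity=False (available since Python 3.9) declares that
    the digest is only used as a checksum, which such builds may permit.
    """
    return hashlib.md5(data, usedforsecurity=False).hexdigest()
```

Whether this matches the approach used in Megatron Core is an assumption; check the MR for the actual change.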
Breaking changes
- Refactored megatron.core.distributed.custom_fsdp into megatron.core.distributed.fsdp.src.megatron_fsdp (breaking change to the module path)
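
Code that has to run against both older releases and 0.14.0 may need to cope with the moved FSDP module. The following is a minimal sketch, assuming only the two module paths named above; the helper `import_first_available` and the constant `FSDP_MODULE_CANDIDATES` are hypothetical and not part of Megatron Core.

```python
import importlib
from types import ModuleType
from typing import Sequence

# Module paths taken from the breaking-change note: new location first,
# then the pre-0.14.0 location as a fallback.
FSDP_MODULE_CANDIDATES = (
    "megatron.core.distributed.fsdp.src.megatron_fsdp",
    "megatron.core.distributed.custom_fsdp",
)


def import_first_available(candidates: Sequence[str]) -> ModuleType:
    """Import and return the first module in `candidates` that can be imported."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; try the next candidate.
            continue
    raise ImportError(f"none of these modules could be imported: {list(candidates)}")


# Usage (requires Megatron Core to be installed):
#   fsdp_module = import_first_available(FSDP_MODULE_CANDIDATES)
```

The sketch only resolves the module location. Whether the names exported by the two modules are identical is not stated in these notes, so callers should verify the symbols they rely on.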
Known issues
