- Support bf16 dtype for optimizer states to enable the precision-aware optimizer in TransformerEngine.
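
  A minimal sketch of how this might look on the training command line. Only the capability itself is stated in these notes; the three optimizer flag names below and the placeholder MODEL_AND_DATA_ARGS array are assumptions and should be checked against the release's arguments.py:

  ```bash
  # Hedged sketch: keep the Adam moment tensors in bf16 via TransformerEngine's
  # precision-aware optimizer. The three optimizer flags below are assumed names;
  # verify them (and the accepted dtype strings) against the release's arguments.py.
  # MODEL_AND_DATA_ARGS is a placeholder array holding the rest of the training args.
  torchrun --nproc-per-node=8 pretrain_gpt.py \
      --bf16 \
      --use-distributed-optimizer \
      --use-precision-aware-optimizer \
      --exp-avg-dtype bf16 \
      --exp-avg-sq-dtype bf16 \
      "${MODEL_AND_DATA_ARGS[@]}"
  ```
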
- MoE
  - Features:
    - Flexible asymmetric virtual pipeline parallelism with a custom pipeline layout (--pipeline-model-parallel-layout); see the sketch after this feature list.
    - Add support for passing custom parallelism groups to MoE modules.
    - Add Hybrid Shard Data-Parallel support for MoE models (--num-distributed-optimizer-instances).
    - Support EP + custom FSDP training for DeepSeek-V3.
    - FP8 support for Multi-Token Prediction (MTP).
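
    A hedged sketch of combining the two new parallelism flags above. The two flag names come from this list; the layout placeholder, the instance count, and the MOE_MODEL_ARGS array are illustrative assumptions:

    ```bash
    # Hedged sketch: both new flags appear in the feature list above; the stage-layout
    # grammar (which tokens denote embedding, transformer, MTP, and loss stages) is
    # defined by the flag itself, so PIPELINE_LAYOUT is left as a placeholder here.
    # MOE_MODEL_ARGS is a placeholder array holding the rest of the MoE training args.
    PIPELINE_LAYOUT="..."   # fill in per the documented layout grammar
    torchrun --nproc-per-node=8 pretrain_gpt.py \
        --tensor-model-parallel-size 2 \
        --pipeline-model-parallel-size 4 \
        --pipeline-model-parallel-layout "$PIPELINE_LAYOUT" \
        --use-distributed-optimizer \
        --num-distributed-optimizer-instances 2 \
        "${MOE_MODEL_ARGS[@]}"
    ```
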
  - Memory Optimization:
    - Fine-grained recomputation to reduce activation memory (--recompute-modules with --recompute-granularity selective); see the sketch after this list.
    - Memory-efficient token permutation: the probs multiplication is moved from unpermutation into the activation function of GroupedMLP.
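
    A hedged sketch of the fine-grained recomputation flags named above. Only the two flag names are taken from these notes; the module names and the MOE_MODEL_ARGS placeholder are illustrative assumptions:

    ```bash
    # Hedged sketch: only the two recompute flags come from the notes above; the module
    # names are an assumed example (recompute just the MoE activation function and the
    # layernorms) and the valid set should be taken from the argument's --help output.
    torchrun --nproc-per-node=8 pretrain_gpt.py \
        --recompute-granularity selective \
        --recompute-modules moe_act layernorm \
        "${MOE_MODEL_ARGS[@]}"
    ```
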
  - Performance Optimization:
    - MLA RoPE fusion kernel and YaRN embedding cache.
    - FP8 padding optimization for MoE models by padding the routing map.
  - Bug fixes:
    - Fix the aux loss calculation when expert_bias or group-limited routing is used. This changes the reported load_balancing_loss values compared to the previous version.
    - Fix packed-sequence support for MLA.
  - Known Issues:
    - MTP is not compatible with the flexible pipeline layout; will be fixed in !3594.
    - MTP has a convergence issue with TP2; will be fixed in !3594.
- Features: