- Support bf16 dtype for optimizer states to enable the precision-aware optimizer in TransformerEngine.
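
  A minimal sketch of how this might look on the training command line. Only the capability itself is stated in these notes; the three optimizer flag names below and the placeholder MODEL_AND_DATA_ARGS array are assumptions and should be checked against the release's arguments.py:

  ```bash
  # Hedged sketch: keep the Adam moment tensors in bf16 via TransformerEngine's
  # precision-aware optimizer. The three optimizer flags below are assumed names;
  # verify them (and the accepted dtype strings) against the release's arguments.py.
  # MODEL_AND_DATA_ARGS is a placeholder array holding the rest of the training args.
  torchrun --nproc-per-node=8 pretrain_gpt.py \
      --bf16 \
      --use-distributed-optimizer \
      --use-precision-aware-optimizer \
      --exp-avg-dtype bf16 \
      --exp-avg-sq-dtype bf16 \
      "${MODEL_AND_DATA_ARGS[@]}"
  ```
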
- MoE
  - Features:
    - Flexible asymmetric virtual pipeline parallelism with a custom pipeline layout (--pipeline-model-parallel-layout); see the sketch after this feature list.
    - Add support for passing custom parallelism groups to MoE modules.
    - Add Hybrid Shard Data-Parallel support for MoE models (--num-distributed-optimizer-instances).
    - Support EP + custom FSDP training for DeepSeek-V3.
    - FP8 support for Multi-Token Prediction (MTP).
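
    A hedged sketch of combining the two new parallelism flags above. The two flag names come from this list; the layout placeholder, the instance count, and the MOE_MODEL_ARGS array are illustrative assumptions:

    ```bash
    # Hedged sketch: both new flags appear in the feature list above; the stage-layout
    # grammar (which tokens denote embedding, transformer, MTP, and loss stages) is
    # defined by the flag itself, so PIPELINE_LAYOUT is left as a placeholder here.
    # MOE_MODEL_ARGS is a placeholder array holding the rest of the MoE training args.
    PIPELINE_LAYOUT="..."   # fill in per the documented layout grammar
    torchrun --nproc-per-node=8 pretrain_gpt.py \
        --tensor-model-parallel-size 2 \
        --pipeline-model-parallel-size 4 \
        --pipeline-model-parallel-layout "$PIPELINE_LAYOUT" \
        --use-distributed-optimizer \
        --num-distributed-optimizer-instances 2 \
        "${MOE_MODEL_ARGS[@]}"
    ```
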
  - Memory Optimization:
    - Fine-grained recomputation to reduce activation memory (--recompute-modules with --recompute-granularity selective); see the sketch after this list.
    - Memory-efficient token permutation: the probs multiplication is moved from unpermutation into the activation function of GroupedMLP.
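
    A hedged sketch of the fine-grained recomputation flags named above. Only the two flag names are taken from these notes; the module names and the MOE_MODEL_ARGS placeholder are illustrative assumptions:

    ```bash
    # Hedged sketch: only the two recompute flags come from the notes above; the module
    # names are an assumed example (recompute just the MoE activation function and the
    # layernorms) and the valid set should be taken from the argument's --help output.
    torchrun --nproc-per-node=8 pretrain_gpt.py \
        --recompute-granularity selective \
        --recompute-modules moe_act layernorm \
        "${MOE_MODEL_ARGS[@]}"
    ```
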
  - Performance Optimization:
    - MLA RoPE fusion kernel and YaRN embedding cache.
    - FP8 padding optimization for MoE models by padding the routing map.
  - Bug fixes:
    - Fix the aux loss calculation when expert_bias or group-limited routing is used. This changes the reported load_balancing_loss values compared to the previous version.
    - Fix packed-sequence support for MLA.
  - Known Issues:
    - MTP is not compatible with the flexible pipeline layout; will be fixed in !3594.
    - MTP has a convergence issue with TP2; will be fixed in !3594.
- Features: