- Add multi-datacenter training support through N/S connection
- MoE
  - Features
    - Support DeepSeek-V3 fine-tuning (see the routing sketch after this list)
      - Aux-loss-free load balancing strategy
      - Node-limited routing and device-limited routing support
      - Tensor Parallelism support for MLA and Sequence Auxiliary Loss
      - MTP (with TP and PP support) is coming soon
    - Permutation/unpermutation fusion kernel from TransformerEngine
    - Uneven virtual pipeline parallel split support in the first and last PP stages (see the layer-split sketch after this list)
  - Bug fixes:
    - Fix the grad scale when TP != expert-TP and average_in_collective is enabled in DDP
    - Fix TEGroupedMLP distributed checkpoint (distckpt) compatibility issue with FP8 padding/unpadding
  - Known Issues:
    - When training a Dense+MoE hybrid model, the process will hang if any PP rank does not have expert params
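For readers unfamiliar with the two DeepSeek-V3 routing features listed above, the sketch below illustrates the general idea in plain PyTorch: a per-expert bias that is nudged by the observed load instead of an auxiliary loss term (aux-loss-free balancing), and group-limited top-k selection where a group stands for the experts on one node or device (node-limited/device-limited routing). This is a minimal sketch, not Megatron-Core's implementation; the class `GroupLimitedRouter` and its parameters (`num_groups`, `group_topk`, `bias_update_rate`) are hypothetical names.

```python
# Hedged sketch of aux-loss-free balancing + group-limited routing.
# Not Megatron-Core code; all names here are hypothetical.
import torch


class GroupLimitedRouter(torch.nn.Module):
    def __init__(self, hidden, num_experts, top_k, num_groups, group_topk,
                 bias_update_rate=1e-3):
        super().__init__()
        assert num_experts % num_groups == 0
        self.gate = torch.nn.Linear(hidden, num_experts, bias=False)
        self.top_k, self.num_groups, self.group_topk = top_k, num_groups, group_topk
        # Per-expert bias used only for expert *selection*; it is nudged by the
        # observed load instead of adding an auxiliary loss term.
        self.register_buffer("expert_bias", torch.zeros(num_experts))
        self.bias_update_rate = bias_update_rate

    def forward(self, x):                      # x: [tokens, hidden]
        scores = torch.sigmoid(self.gate(x))   # token-to-expert affinities
        biased = scores + self.expert_bias     # bias influences selection only

        # Group-limited routing: keep only the best `group_topk` groups per token,
        # where one group corresponds to the experts hosted on one node/device.
        tokens, num_experts = biased.shape
        grouped = biased.view(tokens, self.num_groups, -1)
        group_score = grouped.max(dim=-1).values                  # [tokens, groups]
        top_groups = group_score.topk(self.group_topk, dim=-1).indices
        mask = torch.zeros_like(group_score).scatter_(1, top_groups, 1.0)
        mask = mask.unsqueeze(-1).expand_as(grouped).reshape(tokens, num_experts)

        # Top-k experts restricted to the selected groups.
        masked = biased.masked_fill(mask == 0, float("-inf"))
        topk_idx = masked.topk(self.top_k, dim=-1).indices
        # Gate weights come from the *unbiased* scores, renormalized over the top-k,
        # so the bias steers which experts are chosen but not their contribution.
        gates = scores.gather(1, topk_idx)
        gates = gates / gates.sum(dim=-1, keepdim=True)

        # Aux-loss-free balancing: lower the bias of overloaded experts and raise
        # it for underloaded ones, based on how many tokens each expert received.
        with torch.no_grad():
            load = torch.zeros(num_experts, device=x.device)
            load.scatter_add_(0, topk_idx.reshape(-1),
                              torch.ones_like(topk_idx.reshape(-1), dtype=load.dtype))
            self.expert_bias -= self.bias_update_rate * torch.sign(load - load.mean())
        return topk_idx, gates
```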
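Similarly, the uneven virtual pipeline split can be pictured as assigning fewer transformer layers to the first and last pipeline stages, which also carry the embedding and output/loss layers, and spreading the rest evenly over the middle virtual stages. The helper below is only a sketch of that layer-count arithmetic; the function name and arguments (`first_stage_layers`, `last_stage_layers`) are assumptions, not the Megatron-Core API.

```python
# Hedged sketch of the layer arithmetic behind an uneven virtual pipeline split.
# Hypothetical helper, not Megatron-Core code.
def uneven_vpp_layer_split(num_layers, pp_size, vpp_size,
                           first_stage_layers, last_stage_layers):
    """Return per-virtual-stage layer counts; the first entry is the first stage
    on PP rank 0 and the last entry is the last stage on the last PP rank."""
    num_stages = pp_size * vpp_size
    middle_layers = num_layers - first_stage_layers - last_stage_layers
    assert middle_layers >= 0 and middle_layers % (num_stages - 2) == 0, \
        "middle layers must divide evenly over the non-edge stages"
    per_middle_stage = middle_layers // (num_stages - 2)
    split = [per_middle_stage] * num_stages
    split[0] = first_stage_layers   # smaller first stage (holds the embedding)
    split[-1] = last_stage_layers   # smaller last stage (holds the output/loss)
    return split


# Example: 61 layers, PP=8, VPP=2, with 2 layers on the first stage and 3 on the
# last, leaving 56 layers = 4 per middle virtual stage.
print(uneven_vpp_layer_split(61, 8, 2, 2, 3))
```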