- Uneven pipeline parallelism
  - Enable pipeline parallelism where the first and last pipeline ranks have fewer transformer layers than the intermediate ranks (a layer-split sketch follows below)
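  A minimal sketch of the layer-split arithmetic, assuming a hypothetical `split_layers` helper (this is not Megatron-LM's scheduler API):

  ```python
  # Split `num_layers` across `pp_size` pipeline ranks so the first and
  # last ranks hold fewer layers, leaving headroom for the embedding and
  # the output/loss computation on those ranks.
  def split_layers(num_layers: int, pp_size: int, edge_layers: int) -> list[int]:
      """Number of transformer layers assigned to each pipeline rank."""
      assert pp_size >= 2, "uneven split needs at least 2 pipeline ranks"
      middle = num_layers - 2 * edge_layers
      if pp_size == 2:
          assert middle == 0
          return [edge_layers, edge_layers]
      assert middle % (pp_size - 2) == 0, "intermediate ranks must split evenly"
      return [edge_layers] + [middle // (pp_size - 2)] * (pp_size - 2) + [edge_layers]

  print(split_layers(num_layers=30, pp_size=4, edge_layers=3))  # [3, 12, 12, 3]
  ```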
- Per-layer CUDAGraph support for GPT training with Transformer Engine modules
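  A hedged sketch of the per-layer graphing idea using PyTorch's public `torch.cuda.make_graphed_callables` (Transformer Engine ships its own variant; the layer sizes here are illustrative, not Megatron-LM defaults):

  ```python
  import torch

  layers = torch.nn.ModuleList(
      [torch.nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda() for _ in range(4)]
  )
  samples = tuple(
      (torch.randn(8, 32, 512, device="cuda", requires_grad=True),) for _ in layers
  )

  # Capture one CUDA graph per layer rather than one graph for the whole
  # model, so microbatch scheduling can still interleave between layers.
  graphed_layers = torch.cuda.make_graphed_callables(tuple(layers), samples)

  x = torch.randn(8, 32, 512, device="cuda", requires_grad=True)
  for layer in graphed_layers:
      x = layer(x)
  ```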
- Enable different TP sizes for the vision encoder
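  An illustrative sketch of why separate rank groups are needed (the `tp_groups` helper is ours, not Megatron-LM's API): with 8 GPUs, the vision encoder might use TP=2 while the language decoder uses TP=4.

  ```python
  def tp_groups(world_size: int, tp: int) -> list[list[int]]:
      """Group consecutive ranks into tensor-parallel groups of size `tp`."""
      assert world_size % tp == 0
      return [list(range(i, i + tp)) for i in range(0, world_size, tp)]

  print(tp_groups(8, tp=2))  # encoder groups: [[0, 1], [2, 3], [4, 5], [6, 7]]
  print(tp_groups(8, tp=4))  # decoder groups: [[0, 1, 2, 3], [4, 5, 6, 7]]
  ```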
- Enable pipeline parallelism for T5 & Llava models
- Support multi-tile multi-image input in Llava models
- MoE
  - FP8 support (see the sketch after this list)
  - Runtime upcycling support
  - Dispatcher implementation optimizations
  - Shared expert support with overlapping optimizations
  - Qwen model support
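  A hedged sketch of FP8 execution via Transformer Engine's public `fp8_autocast` API; whether Megatron-LM wraps its MoE experts exactly like this is an assumption, and the recipe values are illustrative:

  ```python
  import torch
  import transformer_engine.pytorch as te
  from transformer_engine.common.recipe import DelayedScaling, Format

  fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

  expert_proj = te.Linear(1024, 4096).cuda()  # stand-in for one expert's projection
  x = torch.randn(32, 1024, device="cuda")

  with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
      y = expert_proj(x)  # the GEMM runs in FP8 with delayed scaling
  y.sum().backward()
  ```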
- Mamba Hybrid
  - The main branch is no longer compatible with released checkpoints; use the ssm branch instead
  - Add distributed checkpointing (a save/load sketch follows below)
  - Fix inference-related bugs
  - Add unit tests
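  A hedged sketch of saving and reloading via `megatron.core.dist_checkpointing`; that the hybrid model exposes `sharded_state_dict()` exactly like this is our assumption, and the path is illustrative:

  ```python
  from megatron.core import dist_checkpointing

  def save_and_reload(model, ckpt_dir: str = "/checkpoints/mamba_hybrid/iter_0001000"):
      # Save: each rank writes only its own shards of the sharded state dict.
      dist_checkpointing.save(model.sharded_state_dict(), ckpt_dir)
      # Load: the same sharded layout pulls each rank's shards back in.
      state_dict = dist_checkpointing.load(model.sharded_state_dict(), ckpt_dir)
      model.load_state_dict(state_dict)
  ```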
- Known Issues
  - When sequence parallelism is enabled, dropout in the transformer block forward pass does not use the appropriate RNG context.
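  A minimal sketch of the kind of RNG-context guard involved, assuming Megatron-Core's `get_cuda_rng_tracker` (the `sp_dropout` wrapper is ours, not the actual fix): under sequence parallelism each TP rank holds different tokens, so dropout should draw from the tracker's model-parallel RNG state.

  ```python
  import torch
  from megatron.core.tensor_parallel import get_cuda_rng_tracker

  def sp_dropout(x: torch.Tensor, p: float, training: bool) -> torch.Tensor:
      # Fork into the model-parallel RNG state so each rank gets an
      # independent, reproducible dropout mask for its token shard.
      with get_cuda_rng_tracker().fork():
          return torch.nn.functional.dropout(x, p=p, training=training)
  ```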