DeepSpeed 0.1.0 Release Notes
Features
- Distributed Training with Mixed Precision
  - 16-bit mixed precision
  - Single-GPU/Multi-GPU/Multi-Node
- Model Parallelism
  - Support for Custom Model Parallelism
  - Integration with Megatron-LM
- Memory and Bandwidth Optimizations
  - Zero Redundancy Optimizer (ZeRO) stage 1 with all-reduce
  - Constant Buffer Optimization (CBO)
  - Smart Gradient Accumulation
- Training Features
  - Simplified training API (see the initialization and training-loop sketch after this list)
  - Gradient Clipping
  - Automatic loss scaling with mixed precision
- Training Optimizers
  - Fused Adam optimizer and arbitrary torch.optim.Optimizer
  - Memory bandwidth optimized FP16 Optimizer
  - Large Batch Training with LAMB Optimizer
  - Memory efficient Training with ZeRO Optimizer
- Training Agnostic Checkpointing
- Advanced Parameter Search (see the scheduler config sketch after this list)
  - Learning Rate Range Test
  - 1Cycle Learning Rate Schedule
- Simplified Data Loader
- Performance Analysis and Debugging
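
The snippet below is a minimal, illustrative sketch of how the simplified training API ties several of these features together (FP16 with automatic loss scaling, gradient clipping, gradient accumulation, and ZeRO). It is not taken from this release; the toy model `SimpleNet`, the file name `ds_config.json`, and the exact config keys are assumptions based on the public DeepSpeed config schema and may differ in this version.

```python
# Minimal illustrative sketch; SimpleNet and ds_config.json are placeholders,
# and exact config keys may differ in this release.
import argparse
import json

import torch
import deepspeed


class SimpleNet(torch.nn.Module):
    """Toy model used only to illustrate the API."""
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 10)

    def forward(self, x):
        return self.linear(x)


# Example DeepSpeed config covering FP16 with dynamic (automatic) loss
# scaling, gradient clipping, gradient accumulation, and ZeRO.
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True},
    "zero_optimization": True,  # newer releases use {"stage": 1}
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f)

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
parser = deepspeed.add_config_arguments(parser)  # adds --deepspeed_config etc.
args = parser.parse_args()

# deepspeed.initialize wraps the model and builds the configured optimizer.
model = SimpleNet()
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters())

for step in range(10):
    x = torch.randn(8, 128, dtype=torch.half, device=model_engine.device)
    y = torch.randint(0, 10, (8,), device=model_engine.device)
    loss = torch.nn.functional.cross_entropy(model_engine(x), y)
    model_engine.backward(loss)  # applies loss scaling, tracks accumulation
    model_engine.step()          # clips gradients and steps per the config
```

With a config like this, the same script is typically launched through the `deepspeed` launcher, e.g. `deepspeed train.py --deepspeed_config ds_config.json`, for single-GPU, multi-GPU, or multi-node runs.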
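
Large batch training with LAMB and the advanced parameter-search schedules are likewise driven from the JSON config. The fragments below are hypothetical examples following the public DeepSpeed config schema (`Lamb`, `OneCycle`, and `LRRangeTest` with their documented parameter names); the values are placeholders and the exact schema may differ in this release.

```python
# Hypothetical config fragments; keys and values are illustrative only.

# Large-batch training with the LAMB optimizer plus a 1Cycle schedule that
# ramps the learning rate up and back down over one long cycle.
lamb_large_batch_config = {
    "train_batch_size": 8192,
    "optimizer": {
        "type": "Lamb",
        "params": {"lr": 2e-3, "weight_decay": 0.01},
    },
    "scheduler": {
        "type": "OneCycle",
        "params": {
            "cycle_min_lr": 1e-4,
            "cycle_max_lr": 2e-3,
            "cycle_first_step_size": 1000,
        },
    },
}

# Learning Rate Range Test: sweep the learning rate upward during a short
# run to find a usable range before committing to a full training job.
lr_range_test_config = {
    "scheduler": {
        "type": "LRRangeTest",
        "params": {
            "lr_range_test_min_lr": 1e-5,
            "lr_range_test_step_size": 200,
            "lr_range_test_step_rate": 5,
            "lr_range_test_staircase": False,
        },
    },
}
```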