🌟 Summary
Better, faster multi-GPU training: v8.3.218 enables true multi-GPU validation during training with correct cross-GPU metric aggregation and a new contiguous sampler for stable evaluation. 🚀
📊 Key Changes
- Multi-GPU validation during training ✅
- Validation DataLoader and Validator are now created on all ranks for proper DistributedDataParallel (DDP) execution.
- Rank-aware device selection ensures each process validates on its own GPU.
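Rank-aware selection follows the standard DDP pattern of reading the launcher-provided `LOCAL_RANK`; a minimal sketch of the idea (the function name is illustrative, not the Ultralytics API):

```python
import os

def select_validation_device() -> str:
    """Pick this process's own GPU from the LOCAL_RANK environment variable
    set by torchrun/DDP launchers (illustrative sketch, not Ultralytics code)."""
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # 0 for single-process runs
    return f"cuda:{local_rank}"
```

With this pattern, each spawned process validates on its own GPU instead of all processes piling onto `cuda:0`.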
- New ContiguousDistributedSampler 🧩
- Preserves dataset ordering by assigning contiguous, batch-aligned chunks per GPU.
- Automatically used when `shuffle=False` (e.g., rect/size-grouped evaluation) to prevent interleaved indices.
- Falls back to PyTorch's `DistributedSampler` when `shuffle=True`.
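The idea behind batch-aligned contiguous chunks can be sketched in a few lines. This is a simplified illustration of the concept only, not the actual `ContiguousDistributedSampler` implementation:

```python
import math

def contiguous_chunks(dataset_len: int, num_replicas: int, batch_size: int):
    """Split indices 0..dataset_len-1 into contiguous, batch-aligned chunks,
    one per GPU, preserving dataset order (conceptual sketch, not Ultralytics code)."""
    num_batches = math.ceil(dataset_len / batch_size)
    batches_per_rank = math.ceil(num_batches / num_replicas)
    chunks = []
    for rank in range(num_replicas):
        start = rank * batches_per_rank * batch_size
        end = min(start + batches_per_rank * batch_size, dataset_len)
        # max() guards against ranks past the end of a small dataset
        chunks.append(list(range(start, max(start, end))))
    return chunks
```

Because each rank receives an unbroken, batch-aligned slice, size-grouped (rect) batches stay together instead of being interleaved round-robin across GPUs.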
- Correct cross-GPU metric aggregation 📈
- Validation losses are reduced across GPUs.
- Detection/classification validators gather stats from all ranks and compute results on rank 0 only.
- EMA buffers are synchronized from rank 0 to all GPUs to keep validation consistent.
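Why the reduction matters: averaging per-rank mean losses is biased whenever ranks process different numbers of samples, so per-rank sums and counts are combined first. A plain-Python sketch of that arithmetic (the real code uses distributed collectives; names here are illustrative):

```python
def aggregate_val_loss(rank_loss_sums, rank_sample_counts):
    """Combine per-rank loss totals into one dataset-wide mean loss,
    as a distributed all-reduce would (conceptual sketch, not Ultralytics code)."""
    return sum(rank_loss_sums) / sum(rank_sample_counts)

# Example: rank 0 saw 100 samples (mean loss 1.0), rank 1 saw 10 (mean loss 2.0).
# Correct mean = (100*1.0 + 10*2.0) / 110 ≈ 1.09, not (1.0 + 2.0) / 2 = 1.5.
```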
- Trainer flow improvements 🛠️
- Validation is executed outside the inner training step for cleaner DDP behavior.
- Final evaluation flow streamlined; only necessary work is done on rank 0, with safe synchronization for others.
- Documentation update 📚
- Added reference docs for `ContiguousDistributedSampler`.
Links:
- See the implementing PR: Enable multi-GPU validation during training (#22377)
- Issues addressed: Multi-GPU val during train, Cross-GPU aggregation, Sampler ordering issues
🎯 Purpose & Impact
- More reliable multi-GPU results ✅
- Proper aggregation means metrics and losses now reflect the full distributed dataset, avoiding misleading per-rank results.
- Faster and more stable validation ⚡
- Contiguous sampling avoids mixing image sizes across GPUs, which reduces padding/overhead and improves determinism, especially with `rect=True`.
- Seamless distributed training 🧠
- Users can train with multiple GPUs and get accurate, consistent validation without extra setup.
- Backward compatible ✔️
- Single-GPU behavior is unchanged; most users don’t need to modify their scripts.
Quick tip to run distributed training and benefit from these improvements:
- CLI:
```bash
yolo detect train data=coco128.yaml model=yolo11n.pt device=0,1,2,3
```
- Python:
```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")
model.train(data="coco128.yaml", device=[0, 1], imgsz=640, epochs=50)
```
Happy training and validating across GPUs! 🎉
What's Changed
Full Changelog: v8.3.217...v8.3.218