Summary
A stability-focused release that automatically recovers from NaN/Inf training issues and cleans up resume logic, plus faster Objects365 setup and unified dataset download URLs.
Key Changes
- Training robustness and resume improvements (primary change)
  - Automatic NaN/Inf detection during training, with recovery from the last checkpoint; includes DDP-aware broadcasting and capped retries (up to 3).
  - Detects fitness collapse and treats it like NaN for safe recovery.
  - Validates checkpoint weights to prevent reloading corrupted EMA states.
  - Centralized checkpoint loading via `_load_checkpoint_state()`, now used by `resume_training()`, to reduce duplication and state drift.
  - Resets scheduler state correctly after recovery to keep learning rate schedules consistent.
  - New test `test_nan_recovery` injects a NaN to verify the recovery path.
  - Details in the primary PR: NaN epoch recovery by @glenn-jocher.
- Dataset YAMLs: unified asset URLs
  - Replaced hardcoded URLs with `ASSETS_URL` across VOC, COCO, COCO-Pose, VisDrone, and LVIS for maintainable, consistent downloads.
  - See: Use ASSETS_URL in dataset YAMLs.
- Objects365 setup speedups
  - Parallelized downloads, image moves, and label generation using `ThreadPoolExecutor`.
  - Increased threads and refactored annotation processing for higher throughput.
  - See: Improve Objects365.yaml.
- CI and test updates
  - Temporarily disabled the Jetson JetPack 5 Docker build due to NaN training errors while the new recovery stabilizes.
  - Skips training tests on Jetson/Raspberry Pi since edge devices aren't intended for training workloads.
- Version bump
  - `ultralytics` is now `8.3.213`.
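The recovery behavior described above (detect a non-finite loss, roll back to the last good checkpoint, give up after three attempts) can be sketched in plain Python. This is an illustrative sketch only, not the Ultralytics trainer: `train_with_recovery`, `make_flaky_epoch`, and the dictionary standing in for model state are hypothetical names, and the DDP broadcasting step is noted only as a comment.

```python
import math

MAX_RECOVERIES = 3  # the release notes describe capped retries (up to 3)


def train_with_recovery(epoch_fn, epochs, max_recoveries=MAX_RECOVERIES):
    """Run epoch_fn once per epoch, rolling back to the last good state on NaN/Inf.

    epoch_fn(epoch, state) must return (loss, new_state). Hypothetical helper.
    """
    state = {"weights": 0.0}   # stand-in for model/optimizer/EMA state
    checkpoint = dict(state)   # stand-in for last.pt
    recoveries = 0
    epoch = 0
    while epoch < epochs:
        loss, new_state = epoch_fn(epoch, dict(state))
        # In a multi-GPU (DDP) run, this decision would have to be shared
        # across ranks so every process recovers together.
        if not math.isfinite(loss):
            recoveries += 1
            if recoveries > max_recoveries:
                raise RuntimeError("too many NaN/Inf recoveries, stopping")
            state = dict(checkpoint)  # restore the last good state, retry this epoch
            continue
        state = new_state
        checkpoint = dict(state)      # save a new "last" checkpoint
        epoch += 1
    return state, recoveries


def make_flaky_epoch(fail_at=2):
    """Build an epoch function that returns NaN exactly once, at epoch fail_at."""
    failed = {"done": False}

    def epoch_fn(epoch, state):
        if epoch == fail_at and not failed["done"]:
            failed["done"] = True
            return float("nan"), state
        state["weights"] += 1.0
        return 1.0 / (epoch + 1), state

    return epoch_fn


final_state, recoveries = train_with_recovery(make_flaky_epoch(), epochs=5)
print(final_state, recoveries)  # {'weights': 5.0} 1
```

Retrying the same epoch after a rollback keeps the total epoch count unchanged, and the cap prevents an endless loop when every attempt diverges.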
Purpose & Impact
- More resilient training at scale
  - Avoids failed runs by recovering automatically from NaN/Inf losses or sudden metric collapse.
  - Safer resumes with consistent loading of optimizer, scaler, EMA, and best fitness.
  - Protects users from corrupted checkpoints and prevents silent state inconsistencies.
- Faster dataset preparation
  - Significant time savings when preparing Objects365 thanks to multithreading across multiple steps.
- More reliable downloads
  - Centralized asset hosting means fewer broken links and easier mirror/CDN changes, with no user changes required.
- Clearer platform expectations
  - CI/test tweaks clarify that training on edge devices (e.g., Jetson, Raspberry Pi) is not a target workflow, reducing false failures.
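To illustrate why multithreading helps with dataset preparation, here is a minimal sketch of fanning I/O-bound steps out over a `ThreadPoolExecutor` from the Python standard library. It is not the code in Objects365.yaml: `process_item` and `run_parallel` are hypothetical, the thread count of 8 is an arbitrary assumption, and the `time.sleep` call merely simulates a download or file move.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_item(item):
    """Hypothetical stand-in for one I/O-bound step (download, file move, label write)."""
    time.sleep(0.01)  # simulate waiting on disk or network
    return item * 2


def run_parallel(items, threads=8):
    """Apply process_item to every item using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(process_item, items))


results = run_parallel(range(16))
print(results[:4])  # [0, 2, 4, 6]
```

Threads suit this kind of work because each task spends most of its time waiting on I/O, so many tasks can overlap even within a single Python process.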
Quick start
- Upgrade: `pip install -U ultralytics`
- Recovery is automatic; no configuration needed. If a NaN is detected, training will restore from `last.pt` and continue.
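Restoring from `last.pt` is only safe if the checkpoint itself is clean, which is why this release validates checkpoint weights before reloading them. The idea can be sketched as a recursive finiteness check. This is a simplified illustration on plain Python numbers with a hypothetical `checkpoint_is_finite` helper and a made-up checkpoint layout; a real implementation would inspect tensors instead, for example with `torch.isfinite`.

```python
import math


def checkpoint_is_finite(state):
    """Return True if every numeric value in a nested dict/list is finite."""
    if isinstance(state, dict):
        return all(checkpoint_is_finite(v) for v in state.values())
    if isinstance(state, (list, tuple)):
        return all(checkpoint_is_finite(v) for v in state)
    if isinstance(state, (int, float)):
        return math.isfinite(state)
    return True  # non-numeric entries (strings, None) are ignored in this sketch


# Hypothetical checkpoint layouts, for illustration only
good = {"ema": {"layer0": [0.1, -0.2]}, "best_fitness": 0.42}
bad = {"ema": {"layer0": [0.1, float("nan")]}, "best_fitness": 0.42}

print(checkpoint_is_finite(good), checkpoint_is_finite(bad))  # True False
```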
What's Changed
- Use `ASSETS_URL` in dataset YAMLs by @glenn-jocher in #22361
- Improve Objects365.yaml by @glenn-jocher in #22362
- `ultralytics 8.3.213` NaN epoch recovery by @glenn-jocher in #22352
Full Changelog: v8.3.212...v8.3.213