pypi ultralytics 8.3.213
v8.3.213 - `ultralytics 8.3.213` NaN epoch recovery (#22352)

4 hours ago

🌟 Summary

A stability-focused release that automatically recovers from NaN/Inf training issues and cleans up resume logic, plus faster Objects365 setup and unified dataset download URLs. πŸš‘πŸ”βš‘

πŸ“Š Key Changes

  • Training robustness and resume improvements (primary change)

    • Auto NaN/Inf detection and recovery from last checkpoint during training; includes DDP-aware broadcasting and capped retries (up to 3). πŸ›‘οΈ
    • Detects fitness collapse and treats it like NaN for safe recovery.
    • Validates checkpoint weights to prevent reloading corrupted EMA states.
    • Centralized checkpoint loading via _load_checkpoint_state() now used by resume_training() to reduce duplication and state drift.
    • Resets scheduler state correctly after recovery to keep learning rate schedules consistent.
    • New test test_nan_recovery injects a NaN to verify recovery path. βœ…
    • Details in the primary PR: NaN epoch recovery by @glenn-jocher.
  • Dataset YAMLs: unified asset URLs

    • Replaced hardcoded URLs with ASSETS_URL across VOC, COCO, COCO-Pose, VisDrone, and LVIS for maintainable, consistent downloads. πŸ”—
    • See: Use ASSETS_URL in dataset YAMLs.
  • Objects365 setup speedups

    • Parallelized downloads, image moves, and label generation using ThreadPoolExecutor.
    • Increased threads and refactored annotation processing for higher throughput. πŸš€
    • See: Improve Objects365.yaml.
  • CI and test updates

    • Temporarily disabled Jetson JetPack 5 Docker build due to NaN training errors while the new recovery stabilizes. πŸ§ͺ
    • Skips training tests on Jetson/Raspberry Pi since edge devices aren’t intended for training workloads.
  • Version bump

    • ultralytics now 8.3.213.

🎯 Purpose & Impact

  • More resilient training at scale

    • Avoids failed runs by recovering automatically from NaN/Inf losses or sudden metric collapse.
    • Safer resumes with consistent loading of optimizer, scaler, EMA, and best fitness.
    • Protects users from corrupted checkpoints and prevents silent state inconsistencies.
  • Faster dataset preparation

    • Significant time savings when preparing Objects365 thanks to multithreading across multiple steps.
  • More reliable downloads

    • Centralized asset hosting means fewer broken links and easier mirror/CDN changes, with no user changes required.
  • Clearer platform expectations

    • CI/test tweaks clarify that training on edge devices (e.g., Jetson, Raspberry Pi) is not a target workflow, reducing false failures.

Quick start

  • Upgrade: pip install -U ultralytics
  • Recovery is automatic; no configuration needed. If a NaN is detected, training will restore from last.pt and continue.

What's Changed

Full Changelog: v8.3.212...v8.3.213

Don't miss a new ultralytics release

NewReleases is sending notifications on new releases.