pypi ultralytics 8.4.87
v8.4.87 - Clean-sheet device selection: stop mutating `CUDA_VISIBLE_DEVICES` (#25021)

6 hours ago

🌟 Summary

🚀 Ultralytics v8.4.87 delivers a cleaner, safer GPU device-selection system, plus stability and performance fixes for training, inference, tracking, exports, and dataset checks.

📊 Key Changes

  • Clean-sheet CUDA device selection 🧭

    • Added parse_device() to normalize device inputs such as "cuda:0", "0,1", lists/tuples, torch.device objects, and -1 for idle-GPU auto-selection.
    • select_device() no longer mutates CUDA_VISIBLE_DEVICES, making device selection predictable across repeated calls and long-running Python processes.
    • Explicit single-GPU requests now use torch.cuda.set_device() instead of environment-variable remapping.
    • Trainer, DDP setup, validation, autobatch, and distributed barriers now consistently use resolved CUDA device indices.
    • Added documentation for ultralytics.utils.torch_utils.parse_device.
  • Stronger GPU training tests 🧪

    • Added a cold-process nonzero-GPU training test to better match real CLI and Ultralytics Platform training behavior.
    • Verifies that training on GPUs like device=1 or higher works correctly from a fresh process without relying on previous CUDA initialization.
  • Fixed DataLoader worker cleanup at training shutdown 🧹

    • Added a close() method to InfiniteDataLoader.
    • Training now explicitly shuts down persistent train and validation workers before Python exits.
    • Helps prevent end-of-run DataLoader worker ... killed by signal: Terminated errors after results are already saved.
  • Improved inference warmup for standard NMS ⚡

    • AutoBackend.warmup() now preloads torchvision for non-end-to-end models.
    • This helps later non-max suppression calls use faster torchvision NMS when appropriate, reducing first-inference latency after warmup.
  • Corrected dataset file-speed reporting 💾

    • Fixed an inverted condition in check_file_speeds().
    • Slow storage, such as network-mounted datasets, should now trigger the intended warning instead of being incorrectly reported as “Fast image access ✅”.
  • Tracking ReID device alignment 🎯

    • Trackers now pass the predictor device into ReID encoders.
    • ReID models are initialized and run on the same device as prediction where applicable, improving consistency for tracking workflows.
  • Export reliability improvements 📦

    • TensorFlow SavedModel export now distinguishes CUDA vs non-CUDA export paths more carefully.
    • CPU exports hide TensorFlow GPUs where possible to avoid unnecessary GPU memory use.
    • ONNX Runtime and Paddle dependency checks now better handle interchangeable CPU/GPU package variants to avoid unnecessary or conflicting installs.
    • Paddle export now uses the actual export device to decide whether GPU Paddle is needed.
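
To make the device-input normalization described in the first item above more concrete, here is a minimal sketch in plain Python. It is not the Ultralytics implementation: the function name normalize_device_spec, its return shape, and the "auto" placeholder for -1 are assumptions made for illustration, and it does not query any GPU. The point it shows is that a device spec can be resolved to explicit indices without touching CUDA_VISIBLE_DEVICES.

```python
from typing import List, Union


def normalize_device_spec(spec) -> Union[str, List[int]]:
    """Hypothetical helper (not the Ultralytics API): normalize a device spec.

    Returns "cpu", the placeholder "auto" for -1, or a list of GPU indices.
    It never reads or writes os.environ.
    """
    if isinstance(spec, (list, tuple)):
        parts = [str(item) for item in spec]
    else:
        # str() also covers ints and objects that print like "cuda:0".
        text = str(spec).lower().replace(" ", "")
        if text == "cpu":
            return "cpu"
        if text == "-1":
            return "auto"  # placeholder; real idle-GPU selection needs a GPU query
        parts = text.split(",")

    indices = []
    for part in parts:
        token = part.lower().strip()
        if token.startswith("cuda:"):
            token = token[len("cuda:"):]
        indices.append(int(token))  # raises ValueError on non-numeric input

    # Reject empty, negative, or duplicated indices.
    if not indices or min(indices) < 0 or len(set(indices)) != len(indices):
        raise ValueError(f"invalid device spec: {spec!r}")
    return indices


if __name__ == "__main__":
    for example in ("cuda:0", "0,1", [0, 1], 1, "cpu", -1):
        print(repr(example), "->", normalize_device_spec(example))
```

With resolved indices in hand, a caller could select a single GPU with torch.cuda.set_device(index), as the release notes describe, rather than remapping devices through an environment variable.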

🎯 Purpose & Impact

  • More reliable GPU behavior 🚀

    • Users should see fewer surprises when training, validating, predicting, or exporting repeatedly in the same Python session.
    • This is especially important for notebooks, services, CI, distributed training, and production systems where changing CUDA_VISIBLE_DEVICES mid-process can cause hard-to-debug issues.
  • Better support for nonzero GPU training 🖥️

    • Training on GPUs beyond CUDA:0 is now more robust, including cold-start CLI usage common in production and Ultralytics Platform environments.
  • Cleaner shutdowns after training ✅

    • Persistent DataLoader workers are now cleaned up explicitly, reducing noisy shutdown crashes and improving confidence that completed runs exit cleanly.
  • Lower latency after warmup ⚡

    • Standard detection workflows can benefit from smoother post-warmup inference performance by ensuring faster NMS paths are ready when needed.
  • More accurate dataset diagnostics 📊

    • Users with slow disks or network storage will receive correct warnings, helping them identify dataset I/O bottlenecks that can slow training.
  • More consistent tracking and export workflows 🔄

    • ReID tracking components now better follow the selected prediction device.
    • Export paths are less likely to allocate unwanted GPU memory or install conflicting runtime packages.
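
As a rough illustration of how a dependency check can treat CPU and GPU package variants as interchangeable, the sketch below uses the standard-library importlib.metadata module. The helper name installed_variant and its injectable get_version parameter are hypothetical and are not taken from Ultralytics; the release notes do not show how the actual check is written.

```python
from importlib import metadata
from typing import Callable, Iterable, Optional


def installed_variant(
    candidates: Iterable[str],
    get_version: Callable[[str], str] = metadata.version,
) -> Optional[str]:
    """Hypothetical helper: return the first installed distribution name, else None.

    Any one of the candidates is treated as satisfying the requirement, so a
    caller would only install something when this returns None.
    """
    for name in candidates:
        try:
            get_version(name)  # raises PackageNotFoundError when absent
        except metadata.PackageNotFoundError:
            continue
        return name
    return None


if __name__ == "__main__":
    # onnxruntime and onnxruntime-gpu are alternative distributions of the same runtime.
    found = installed_variant(("onnxruntime", "onnxruntime-gpu"))
    print("already installed:", found)
```

Checking for any acceptable variant before installing avoids pulling in a second copy of a runtime that is already present under a different distribution name.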

What's Changed

  • Add cold-process nonzero-device GPU train test by @glenn-jocher in #25019
  • Fix inverted read-speed condition in dataset file speed check by @ahmet-f-gumustas in #25025
  • Fix leaked dataloader workers at end of training (atexit killed by signal: Terminated crash) by @Bovey0809 in #25024
  • Preload torchvision during warmup for non-end2end NMS path by @Y-T-G in #25023
  • Clean-sheet device selection: stop mutating CUDA_VISIBLE_DEVICES by @glenn-jocher in #25021
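
For the read-speed fix (#25025), the sketch below shows the direction the condition needs to take: storage counts as slow when per-file latency is high or throughput is low. It is not the body of check_file_speeds(); the function names and both thresholds are illustrative assumptions.

```python
import time
from pathlib import Path
from typing import Iterable, Tuple

MAX_MS_PER_FILE = 10.0  # illustrative threshold, not taken from the release
MIN_MB_PER_S = 50.0  # illustrative threshold, not taken from the release


def measure_read_speed(paths: Iterable[str]) -> Tuple[float, float]:
    """Read each file fully and return (mean ms per file, overall MB/s)."""
    count, total_bytes = 0, 0
    start = time.perf_counter()
    for path in paths:
        total_bytes += len(Path(path).read_bytes())
        count += 1
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    if count == 0:
        raise ValueError("no files to sample")
    return elapsed * 1000 / count, total_bytes / 1e6 / elapsed


def is_slow(ms_per_file: float, mb_per_s: float) -> bool:
    """Storage is slow when latency is high OR throughput is low."""
    return ms_per_file > MAX_MS_PER_FILE or mb_per_s < MIN_MB_PER_S


def report(ms_per_file: float, mb_per_s: float) -> str:
    """Return a warning for slow storage and a confirmation otherwise."""
    if is_slow(ms_per_file, mb_per_s):
        return f"WARNING: slow image access ({ms_per_file:.1f} ms/file, {mb_per_s:.1f} MB/s)"
    return f"Fast image access ({ms_per_file:.1f} ms/file, {mb_per_s:.1f} MB/s)"
```

An inverted version of the test in report() would print the "fast" message for network-mounted datasets, which is the symptom the release notes describe.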

Full Changelog: v8.4.86...v8.4.87
