Summary
Ultralytics v8.4.87 delivers a cleaner, safer GPU device-selection system, plus stability and performance fixes for training, inference, tracking, exports, and dataset checks.
Key Changes
- Clean-sheet CUDA device selection
  - Added `parse_device()` to normalize device inputs such as `cuda:0`, `0,1`, lists/tuples, `torch.device`, and `-1` idle-GPU auto-selection.
  - `select_device()` no longer mutates `CUDA_VISIBLE_DEVICES`, making device selection predictable across repeated calls and long-running Python processes.
  - Explicit single-GPU requests now use `torch.cuda.set_device()` instead of environment-variable remapping.
  - Trainer, DDP setup, validation, autobatch, and distributed barriers now consistently use resolved CUDA device indices.
  - Added documentation for `ultralytics.utils.torch_utils.parse_device`.
- Stronger GPU training tests
  - Added a cold-process nonzero-GPU training test to better match real CLI and Ultralytics Platform training behavior.
  - Verifies that training on GPUs like `device=1` or higher works correctly from a fresh process without relying on previous CUDA initialization.
- Fixed DataLoader worker cleanup at training shutdown
  - Added a `close()` method to `InfiniteDataLoader`.
  - Training now explicitly shuts down persistent train and validation workers before Python exits.
  - Helps prevent end-of-run `DataLoader worker ... killed by signal: Terminated` errors after results are already saved.
- Improved inference warmup for standard NMS
  - `AutoBackend.warmup()` now preloads `torchvision` for non-end-to-end models.
  - This helps later non-max suppression calls use faster `torchvision` NMS when appropriate, reducing first-inference latency after warmup.
- Corrected dataset file-speed reporting
  - Fixed an inverted condition in `check_file_speeds()`.
  - Slow storage, such as network-mounted datasets, should now trigger the intended warning instead of being incorrectly reported as “Fast image access ✅”.
- Tracking ReID device alignment
  - Trackers now pass the predictor device into ReID encoders.
  - ReID models are initialized and run on the same device as prediction where applicable, improving consistency for tracking workflows.
- Export reliability improvements
  - TensorFlow SavedModel export now distinguishes CUDA vs non-CUDA export paths more carefully.
  - CPU exports hide TensorFlow GPUs where possible to avoid unnecessary GPU memory use.
  - ONNX Runtime and Paddle dependency checks now better handle interchangeable CPU/GPU package variants to avoid unnecessary or conflicting installs.
  - Paddle export now uses the actual export device to decide whether GPU Paddle is needed.
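The input forms listed for `parse_device()` under the device-selection item can be illustrated with a small stand-alone sketch. This is not the Ultralytics implementation: `normalize_device` is a hypothetical helper, its return convention (`"cpu"` or a list of CUDA indices) is an assumption, and it rejects negative values rather than performing the idle-GPU auto-selection that the release assigns to `-1`, since that would require querying the hardware. What it does share with the release's approach is that it never touches `CUDA_VISIBLE_DEVICES`.

```python
def normalize_device(device):
    """Hypothetical helper: map a device spec to "cpu" or a list of CUDA indices.

    Accepts ints, lists/tuples, and anything whose str() looks like
    "cpu", "cuda", "cuda:0", or "0,1" (torch.device objects stringify this way).
    """
    if isinstance(device, (list, tuple)):
        indices = [int(d) for d in device]
    elif isinstance(device, int):
        indices = [device]
    else:
        text = str(device).lower().replace(" ", "")
        if text == "cpu":
            return "cpu"
        if text == "cuda":
            indices = [0]  # assumption: a bare "cuda" means device 0
        else:
            indices = [int(p) for p in text.replace("cuda:", "").split(",") if p]

    if not indices:
        raise ValueError(f"no device found in {device!r}")
    if any(i < 0 for i in indices):
        # Idle-GPU auto-selection is deliberately out of scope for this sketch.
        raise ValueError(f"negative device index in {device!r}")
    return indices


print(normalize_device("cuda:0"))  # [0]
print(normalize_device("0,1"))     # [0, 1]
print(normalize_device((2, 3)))    # [2, 3]
print(normalize_device("cpu"))     # cpu
```

Returning plain indices keeps the result independent of process-wide state, so repeated calls in one Python session give the same answer.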
Purpose & Impact
- More reliable GPU behavior
  - Users should see fewer surprises when training, validating, predicting, or exporting repeatedly in the same Python session.
  - This is especially important for notebooks, services, CI, distributed training, and production systems where changing `CUDA_VISIBLE_DEVICES` mid-process can cause hard-to-debug issues.
- Better support for nonzero GPU training
  - Training on GPUs beyond `CUDA:0` is now more robust, including cold-start CLI usage common in production and Ultralytics Platform environments.
- Cleaner shutdowns after training
  - Persistent DataLoader workers are now cleaned up explicitly, reducing noisy shutdown crashes and improving confidence that completed runs exit cleanly.
- Lower latency after warmup
  - Standard detection workflows can benefit from smoother post-warmup inference performance by ensuring faster NMS paths are ready when needed.
- More accurate dataset diagnostics
  - Users with slow disks or network storage will receive correct warnings, helping them identify dataset I/O bottlenecks that can slow training.
- More consistent tracking and export workflows
  - ReID tracking components now better follow the selected prediction device.
  - Export paths are less likely to allocate unwanted GPU memory or install conflicting runtime packages.
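The point about conflicting runtime packages can be made concrete with a rough sketch of a variant-aware dependency check. The idea is that a requirement counts as satisfied when any interchangeable CPU/GPU distribution is already installed, so nothing is installed on top of it. The distribution names below are real PyPI packages, but grouping them in a `VARIANTS` table and the helper names `find_installed_variant` and `needs_install` are illustrative assumptions, not the Ultralytics code.

```python
from importlib import metadata

# Real PyPI distribution names; treating each tuple as one interchangeable
# family is an assumption made for this sketch.
VARIANTS = {
    "onnxruntime": ("onnxruntime", "onnxruntime-gpu"),
    "paddlepaddle": ("paddlepaddle", "paddlepaddle-gpu"),
}


def find_installed_variant(family, lookup=metadata.version):
    """Return the first installed distribution in the family, or None."""
    for name in VARIANTS[family]:
        try:
            lookup(name)  # raises PackageNotFoundError when not installed
        except metadata.PackageNotFoundError:
            continue
        return name
    return None


def needs_install(family, lookup=metadata.version):
    """True only when no variant of the family is present."""
    return find_installed_variant(family, lookup) is None
```

The `lookup` parameter exists only so the check can be exercised without real packages installed. With the default, `needs_install("onnxruntime")` would be `False` on a machine that has only `onnxruntime-gpu`, which is the situation where a naive name match would trigger a second, conflicting install.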
What's Changed
- Add cold-process nonzero-device GPU train test by @glenn-jocher in #25019
- Fix inverted read-speed condition in dataset file speed check by @ahmet-f-gumustas in #25025
- Fix leaked dataloader workers at end of training (atexit `killed by signal: Terminated` crash) by @Bovey0809 in #25024
- Preload torchvision during warmup for non-end2end NMS path by @Y-T-G in #25023
- Clean-sheet device selection: stop mutating `CUDA_VISIBLE_DEVICES` by @glenn-jocher in #25021
Full Changelog: v8.4.86...v8.4.87