🌟 Summary
Optional `torch.compile` acceleration lands across train/val/predict for up to ~30% faster runs, plus dataloader throughput boosts, CoreML export reliability, unified device handling, and smoother setup/CI/paths. ⚡️🧩
📊 Key Changes
- `torch.compile` acceleration (primary)
  - New `compile` arg in train/val/predict (default `False`) with end-to-end wiring via config/CLI/API.
  - New helpers: `attempt_compile(...)` to safely enable compile and `disable_dynamo(...)` to opt specific code paths out (a general-pattern sketch follows below).
  - Integrations:
    - Trainer compiles the model after initializing loss, marks dynamic tensors for stability, and unwraps models for EMA/checkpointing.
    - Validator can compile for standalone val; training-time final eval avoids compile for speed/stability.
    - Predictor supports `compile=True` for accelerated inference.
  - Utility rename: `de_parallel` ➝ `unwrap_model` (handles both parallel and compiled models).
  - Docs updated for the new args and torch utils.
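The general pattern behind these helpers is sketched below; `attempt_compile` and `unwrap_model` are real helper names from this release, but the signatures and bodies here are illustrative assumptions, not the actual Ultralytics implementations:

```python
import torch
import torch.nn as nn


def attempt_compile(model: nn.Module) -> nn.Module:
    """Illustrative: try torch.compile and fall back to the eager model on failure."""
    if not hasattr(torch, "compile"):  # torch < 2.0: no compile support
        return model
    try:
        return torch.compile(model)
    except Exception as e:  # compile can fail on unsupported ops/backends
        print(f"torch.compile unavailable, continuing in eager mode: {e}")
        return model


def unwrap_model(model: nn.Module) -> nn.Module:
    """Illustrative: strip DataParallel/DDP and torch.compile wrappers to reach the base module."""
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        model = model.module
    return getattr(model, "_orig_mod", model)  # compiled models keep the original module in _orig_mod
```

Unwrapping matters for EMA/checkpointing because a compiled model stores the original module in `_orig_mod`; saving the unwrapped module's `state_dict()` keeps checkpoints independent of the compile wrapper.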
- Faster data loading
  - Doubled default `prefetch_factor` to 4 when `num_workers > 0`, and it is omitted automatically on older PyTorch (<2.0) to avoid errors (see the sketch below).
  - Safer `drop_last` behavior during compile-enabled training to improve shape stability.
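As a rough sketch of the dataloader-side logic described above (the function name `build_loader` and the exact version guard are assumptions, not the Ultralytics builder):

```python
import torch
from torch.utils.data import DataLoader


def build_loader(dataset, batch_size: int, num_workers: int, compiled: bool = False) -> DataLoader:
    """Hypothetical sketch: pass prefetch_factor only when it is valid, and stabilize shapes for compile."""
    extra = {}
    if num_workers > 0 and int(torch.__version__.split(".")[0]) >= 2:
        extra["prefetch_factor"] = 4  # doubled from the old default of 2
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        drop_last=compiled,  # avoid a ragged last batch that would force torch.compile recompilation
        **extra,
    )
```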
- Unified, safer device handling
  - Centralized "move batch tensors to device" logic across detection/pose/segment/YOLOE to reduce CPU/GPU mismatches and duplicated code (sketch below).
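The centralized transfer likely reduces to a small helper along these lines; `batch_to_device` is a hypothetical name for illustration:

```python
import torch


def batch_to_device(batch: dict, device: torch.device, non_blocking: bool = True) -> dict:
    """Hypothetical helper: move every Tensor in the batch dict to `device`, leaving other values untouched."""
    return {
        k: v.to(device, non_blocking=non_blocking) if isinstance(v, torch.Tensor) else v
        for k, v in batch.items()
    }
```

Each task's `preprocess_batch` can then delegate to this one code path instead of re-implementing the loop per task.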
- CoreML export robustness
  - Cleanup of the CoreML NMS pipeline: direct use of spec outputs, explicit shape setting when needed, consistent IO names, and simpler wiring for more reliable exports on macOS/Linux/Windows (sketch below).
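For a flavor of spec-level IO cleanup, here is a minimal sketch using coremltools' real `rename_feature` utility; the function `tidy_coreml_io` and the target output names are illustrative assumptions, not the exporter's actual code:

```python
import coremltools as ct


def tidy_coreml_io(model: "ct.models.MLModel") -> "ct.models.MLModel":
    """Hypothetical cleanup: give exported outputs stable, consistent names via the spec."""
    spec = model.get_spec()
    current = [out.name for out in spec.description.output]
    for old, new in zip(current[:2], ("confidence", "coordinates")):  # example target names
        if old != new:
            ct.utils.rename_feature(spec, old, new)  # real coremltools utility
    return ct.models.MLModel(spec)
```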
- Plotting stability
  - Added `@plt_settings()` to `feature_visualization(...)` for backend-safe, non-blocking feature map plots (sketch below).
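A decorator like `@plt_settings()` typically swaps in a non-interactive backend around the call; the sketch below shows the general idea and is not the actual Ultralytics implementation:

```python
import functools

import matplotlib


def plt_settings(backend: str = "Agg"):
    """Illustrative decorator: run a plotting function under a non-interactive backend, then restore."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            original = matplotlib.get_backend()
            try:
                matplotlib.use(backend)  # headless-safe; no window, no blocking plt.show()
                return func(*args, **kwargs)
            finally:
                matplotlib.use(original)  # restore the caller's backend

        return wrapper

    return decorator
```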
- Config directory resolution
  - Smarter `get_user_config_dir()`: honors `YOLO_CONFIG_DIR`, follows OS conventions (XDG on Linux), and gracefully falls back to writable paths like `/tmp` (sketch below).
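The resolution order described above might look roughly like this sketch; the platform branches and fallback path are assumptions for illustration:

```python
import os
import platform
from pathlib import Path


def get_user_config_dir(sub_dir: str = "Ultralytics") -> Path:
    """Illustrative resolution: env override -> OS convention -> writable fallback."""
    if env := os.getenv("YOLO_CONFIG_DIR"):  # explicit override wins
        path = Path(env)
    elif platform.system() == "Windows":
        path = Path.home() / "AppData" / "Roaming" / sub_dir
    elif platform.system() == "Darwin":  # macOS
        path = Path.home() / "Library" / "Application Support" / sub_dir
    else:  # Linux and friends: honor XDG
        path = Path(os.getenv("XDG_CONFIG_HOME", str(Path.home() / ".config"))) / sub_dir
    try:
        path.mkdir(parents=True, exist_ok=True)
        return path
    except OSError:  # e.g. read-only home in containers
        return Path("/tmp") / sub_dir
```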
- Compatibility and CI polish
  - TorchVision compatibility matrix updated for PyTorch 2.8 / torchvision 0.23 and PyTorch 2.9 / torchvision 0.24.
  - GitHub Actions bumps: actions/setup-python v6 and actions/stale v10.
  - Minor typo fix in a deprecation warning.
🎯 Purpose & Impact
- Speedups you can feel
  - Training can be up to ~30% faster with `torch.compile`; inference can also benefit on supported devices (CUDA/CPU/MPS). 🚀
- One-line opt-in
  - CLI: `yolo train model=yolo11n.pt data=coco8.yaml epochs=100 compile=True`
  - Python:

    ```python
    from ultralytics import YOLO

    model = YOLO("yolo11n.pt")
    model.train(data="coco8.yaml", epochs=100, compile=True)

    # Also works for val/predict:
    model.val(compile=True)
    model.predict("img.jpg", compile=True)
    ```
- Fewer runtime hiccups
  - Centralized device transfer cuts down on "tensor on CPU vs GPU" issues. ✅
  - Dataloader tweaks reduce bottlenecks and avoid PyTorch version pitfalls. 🧠
- Better exports and environments
  - CoreML exports are more consistent across platforms, improving deployment on Apple ecosystems. 🍎
  - Updated PyTorch–TorchVision checks reduce install/runtime confusion on new stacks. 🧩
- Smoother headless/CI and container use
  - Plotting and config-dir improvements prevent blocked UIs, backend errors, and permission issues in constrained environments. 🛡️
Primary PR: “ultralytics 8.3.196 torch.compile acceleration” by @glenn-jocher (adds compile flag, utilities, and engine integrations).
What's Changed
- Add `@plt_settings()` decorator to `feature_visualization()` by @glenn-jocher in #21973
- Cleanup CoreML NMS pipeline code by @Y-T-G in #21970
- Double default Dataloader `prefetch_factor` to 4 by @glenn-jocher in #21974
- Update torchvision compat matrix with 2.8 and 2.9 by @glenn-jocher in #21978
- Fix overly verbose USER_CONFIG_DIR checks by @glenn-jocher in #21980
- Fix missing Tensors on device in Trainer and Validator `preprocess_batch` methods by @glenn-jocher in #21981
- Bump actions/setup-python from 5 to 6 in /.github/workflows by @dependabot[bot] in #21984
- Bump actions/stale from 9 to 10 in /.github/workflows by @dependabot[bot] in #21983
- Fix typo in `deprecation_warn` method by @RizwanMunawar in #21987
- ultralytics 8.3.196 `torch.compile` acceleration for 30% faster training by @glenn-jocher in #21975
Full Changelog: v8.3.195...v8.3.196