🌟 Summary
v8.4.65 is mainly a mobile inference performance release 📱⚡, with the biggest upgrade improving QNN exports for Snapdragon NPUs and speeding up semantic segmentation outputs by moving costly work inside the model graph.
📊 Key Changes
- 🚀 Major QNN export performance upgrade in PR #24790 by @glenn-jocher
  - QNN exports now use channel-last (NHWC) input, which better matches Qualcomm Hexagon NPU hardware and camera-native image layouts.
  - This avoids extra data reordering at inference time, reducing overhead on both the app side and the NPU side.
  - Implemented through new export wrappers like `QNNModel` instead of fragile post-export graph editing.
- 🧠 Semantic segmentation exports are much faster on-device
  - For QNN and CoreML semantic models, export now embeds ArgMax / class-map generation directly in the graph.
  - Instead of returning large floating-point logits that apps must decode afterward, exported models can now return a compact per-pixel class map directly.
  - New `ClassMapModel` wrapper handles this behavior during export.
- 📱 Better mobile runtime compatibility
  - Semantic predict/validation code was updated so Ultralytics can correctly handle both:
    - traditional logits outputs
    - new exported class-map outputs
  - This makes the faster export behavior work more cleanly across deployment and evaluation workflows.
- 📉 Measured impact highlighted in docs
  - QNN semantic performance became much more stable, replacing previously erratic decoding times with far more predictable behavior.
  - CoreML semantic exports also saw notable end-to-end speedups from in-graph class-map generation.
- 🛠️ RKNN export improvements
  - RKNN export now properly supports FP16 intermediate ONNX export.
  - Added RKNN INT8 export test coverage to improve reliability.
  - RKNN export docs/tables were updated accordingly.
- 🐳 Docker build reliability improvements
  - Dockerfiles now use more cache-conscious install patterns and extra cleanup steps to reduce disk pressure during builds.
  - The Docker workflow was simplified by removing an extra assistant-trigger step.
- 📚 Docs and export guidance improvements
  - Expanded and refreshed docs for CoreML and QNN, including clearer deployment guidance and performance notes.
  - Export docs now better explain YOLO26 end-to-end detection output formats.
  - Tracking docs now clarify support for OBB models as well.
  - Rust inference docs were updated to `ultralytics-inference` version `0.0.19`.
- ✅ CI and workflow fixes
  - Docs redeploys now trigger more reliably when documentation changes land on `main`.
  - CUDA training tests are now explicitly skipped on Jetson devices, making CI behavior cleaner and easier to maintain.
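The in-graph class-map change described in the semantic segmentation items above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Ultralytics implementation: `logits_to_class_map`, `ensure_class_map`, `output_bytes`, and the nested-list `[C][H][W]` layout are hypothetical names and shapes chosen for the example, and the one-byte-per-pixel class map is an assumed output type. In the real export, the ArgMax runs inside the model graph rather than in app code.

```python
def logits_to_class_map(logits):
    """Collapse [C][H][W] per-class scores into an [H][W] map of winning class indices."""
    num_classes = len(logits)
    height, width = len(logits[0]), len(logits[0][0])
    return [
        [max(range(num_classes), key=lambda c: logits[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


def ensure_class_map(output):
    """Accept either logits ([C][H][W]) or a ready class map ([H][W]); return a class map."""
    is_logits = isinstance(output[0][0], list)  # three nesting levels means logits
    return logits_to_class_map(output) if is_logits else output


def output_bytes(num_classes, height, width):
    """Sizes of float32 logits vs. a one-byte-per-pixel class map (assumes <= 256 classes)."""
    return num_classes * height * width * 4, height * width


# Hypothetical 2-class, 2x2 output
logits = [
    [[0.9, 0.2], [0.1, 0.7]],  # class 0 scores
    [[0.1, 0.8], [0.6, 0.3]],  # class 1 scores
]
class_map = logits_to_class_map(logits)
```

Under these assumptions, a C-class float32 logits tensor is 4C times larger than the class map, which is the data an app no longer has to copy and decode when the ArgMax is embedded in the graph. `ensure_class_map` mirrors the idea that predict/validation code must accept both output kinds.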
🎯 Purpose & Impact
- ⚡ Faster mobile inference on Snapdragon devices
  Users deploying YOLO on Qualcomm hardware should see lower overhead and better real-world efficiency, especially in camera-based apps where image buffers are already channel-last.
- 🧩 Less app-side postprocessing work
  By returning semantic class maps directly from exported models, apps no longer need to spend as much CPU time decoding huge segmentation outputs.
- 📈 Better and more stable semantic segmentation performance
  This is especially important for real-time mobile experiences, where unpredictable postprocessing delays can cause lag or jitter.
- 🔒 More robust export pipeline
  Replacing manual ONNX graph surgery with clean wrapper modules makes exports easier to maintain and less error-prone over time.
- 📱🍎 Benefits extend beyond QNN
  CoreML semantic exports also gain from the same in-graph class-map idea, so Apple-device deployment gets a speed boost too.
- 🛠️ Improved reliability for deployment workflows
  Better RKNN testing, cleaner Docker builds, and more dependable docs publishing all reduce friction for developers working across edge and production environments.
- 👥 Broad user impact
  - Mobile and edge developers get the biggest benefit from this release.
  - Semantic segmentation users should see the clearest speed and usability gains.
  - General users also benefit from clearer docs, cleaner CI, and more reliable export behavior.
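To see why channel-last input suits camera buffers, it helps to compare how the same pixel value is addressed in each layout. The sketch below is illustrative only: `nchw_offset`, `nhwc_offset`, and `nhwc_to_nchw` are hypothetical helpers written for this example, not Ultralytics or QNN APIs. An interleaved frame (R G B R G B ...) is already in NHWC order, so a model that accepts NHWC input can skip the reordering pass that `nhwc_to_nchw` represents.

```python
def nchw_offset(c, y, x, channels, height, width):
    """Flat index of one value in a channel-first (CHW) buffer; channels kept for symmetry."""
    return (c * height + y) * width + x


def nhwc_offset(c, y, x, channels, height, width):
    """Flat index of one value in a channel-last (HWC) buffer."""
    return (y * width + x) * channels + c


def nhwc_to_nchw(buffer, channels, height, width):
    """Per-frame reordering an app would need if the model expected channel-first input."""
    out = [0] * len(buffer)
    for y in range(height):
        for x in range(width):
            for c in range(channels):
                src = nhwc_offset(c, y, x, channels, height, width)
                dst = nchw_offset(c, y, x, channels, height, width)
                out[dst] = buffer[src]
    return out


# Hypothetical 1x2 RGB frame in interleaved order: R0 G0 B0 R1 G1 B1
frame = [10, 20, 30, 11, 21, 31]
planar = nhwc_to_nchw(frame, channels=3, height=1, width=2)
```

Here `planar` groups all red values, then all green, then all blue. Every value in the frame has to be moved to produce it, and that per-frame cost is what the channel-last export is described as avoiding.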
In short: v8.4.65 is a strong deployment-focused release 🎉, with the standout improvement being faster, more hardware-friendly QNN exports and smarter semantic segmentation outputs for mobile AI.
What's Changed
- Improve skipping training on Jetson Test CIs by @lakshanthad in #24557
- Trigger docs redeploys on changes by @glenn-jocher in #24784
- Replace Enterprise '50+ members' with custom team size in Platform docs by @mykolaxboiko in #24785
- Normalize YAML docs links by @glenn-jocher in #24788
- Fix docs deploy change detection by @glenn-jocher in #24787
- Improve docs, comments, and correctness by @glenn-jocher in #24789
- Enable RKNN FP16 export and add RKNN INT8 test by @Laughing-q in #24762
- Remove Docker workflow assistant trigger by @glenn-jocher in #24783
- docs: ultralytics-inference version to 0.0.19 in rust docs by @onuralpszr in #24793
- QNN channel-last + semantic ArgMax performance improvements by @glenn-jocher in #24790
Full Changelog: v8.4.64...v8.4.65