- Update to DCGM 4.5.3 and DCGM Exporter 4.8.2.
- Improve GPU health metrics, including reporting GPU-wide health incidents such as fallen-off-bus XIDs.
- Make `/debug/pprof` profiling endpoints opt-in via `--enable-pprof` / `DCGM_EXPORTER_ENABLE_PPROF`.
- Add PodMapper informer caching for Kubernetes pod mapping (#626) (@jaeeyoungkim).
- Add per-process GPU metrics for time-sharing and MIG (#594) (@krystiancastai).
- Make Helm `priorityClassName` configurable with explicit defaults (#444) (@runzhliu).
- Add MIG device support for HPC job labels (#602) (@jay-mckay).
- Update go-dcgm field metadata handling, deprecated field alias resolution, health constants, policy registration handling, and version info APIs.
- Document IPv6 address formats for remote hostengine and metrics listen addresses.
- Refresh dependencies, container base images, Docker image references, Helm chart values, Kubernetes manifests, and tests for this release.