github NVIDIA/k8s-dra-driver-gpu v25.12.0

This release marks the general availability of the GPU allocation plugin.

Important: When upgrading from a previous release, please follow the upgrade instructions below.

Highlights

  • Dynamic MIG device management (requires Kubernetes 1.34 or later and enabling the new DynamicMIG feature gate).
  • Significantly reduced ComputeDomain formation time in large-scale clusters (~10 seconds for domains comprising thousands of nodes).

Improvements

  • Added preliminary support for VFIO passthrough devices in the GPU plugin (not enabled by default; enable it with the PassthroughSupport feature gate, see #668).
  • Added the memory addressingMode as an attribute to announced GPU devices (#717).
  • Added support for GPU health checks (not enabled by default; enable them with the NVMLDeviceHealthCheck feature gate, see #689).
  • Enhanced robustness of the ComputeDomain controller in the presence of deliberately or accidentally running replicas (#868).
  • Made the ComputeDomain kubelet plugin crash on obvious MNNVL fabric configuration errors or degraded fabric health (#844, #865).
  • Added a networkPolicy parameter to the Helm chart to support clusters with restricted networks (#708).
  • Tuned binary search paths to widen Linux distribution support (#706).

All commits since last release: v25.8.1...v25.12.0

Upgrades

First, update the CRDs by running:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/tags/v25.12.0/deployments/helm/nvidia-dra-driver-gpu/crds/resource.nvidia.com_computedomains.yaml

Only then update the chart by running helm upgrade -i ... (instead of helm install).
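
The ordering above (CRDs first, chart second) can be sketched as a small shell wrapper. This is an illustration, not an official upgrade script: the run and upgrade_driver helpers and the DRY_RUN variable are hypothetical, and every argument given to upgrade_driver stands in for the "..." that the instructions leave unspecified (release name, chart reference, flags).

```shell
#!/usr/bin/env bash
# Sketch of the two-step upgrade order: CRD first, Helm chart second.

# CRD manifest URL, copied from the upgrade instructions.
CRD_URL="https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/tags/v25.12.0/deployments/helm/nvidia-dra-driver-gpu/crds/resource.nvidia.com_computedomains.yaml"

# run (hypothetical helper): execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# upgrade_driver (hypothetical helper): all arguments are passed through to
# `helm upgrade -i` unchanged; they are whatever your deployment uses in
# place of the "..." in the instructions.
upgrade_driver() {
  # Step 1: update the ComputeDomain CRD; stop here if that fails.
  run kubectl apply -f "$CRD_URL" || return 1
  # Step 2: only then upgrade (or install) the chart.
  run helm upgrade -i "$@"
}
```

Calling upgrade_driver with DRY_RUN=1 prints the two commands in order without contacting the cluster, which is a cheap way to review them before running the upgrade for real.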

New feature gates and known limitations

  • For now, enabling DynamicMIG is mutually exclusive with enabling any of MPSSupport, NVMLDeviceHealthCheck, and PassthroughSupport.
  • The new fail-fast behavior in the ComputeDomain (CD) kubelet plugin can be turned off via the new CrashOnNVLinkFabricErrors feature gate.
  • The scalability improvements for ComputeDomains come with a number of architectural changes under the hood. To restore the 25.8.x behavior, disable the ComputeDomainCliques feature gate.
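
The DynamicMIG restriction above can be checked before deploying with a small pre-flight function. This is a minimal sketch under assumptions: check_feature_gates is a hypothetical helper, and the comma-separated Name=true|false input format (without spaces) is chosen for illustration only; it is not taken from the driver's documentation.

```shell
# check_feature_gates (hypothetical helper): takes one comma-separated
# "Name=true|false" list and fails if DynamicMIG is enabled together with
# MPSSupport, NVMLDeviceHealthCheck, or PassthroughSupport.
check_feature_gates() {
  # Wrap the list in commas so every entry can be matched as ",Name=value,".
  gates=",$1,"

  case "$gates" in
    *,DynamicMIG=true,*) ;;   # DynamicMIG is on: check the other gates
    *) return 0 ;;            # DynamicMIG is off or absent: nothing to check
  esac

  for g in MPSSupport NVMLDeviceHealthCheck PassthroughSupport; do
    case "$gates" in
      *",$g=true,"*)
        echo "DynamicMIG cannot be combined with $g" >&2
        return 1
        ;;
    esac
  done
  return 0
}
```

For example, check_feature_gates "DynamicMIG=true,MPSSupport=true" returns a non-zero status and reports the conflicting gate on stderr, while check_feature_gates "DynamicMIG=true" succeeds.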
