github kubernetes-sigs/nvidia-dra-driver-gpu v25.12.0

one month ago

This release marks the general availability of the GPU allocation plugin.

When upgrading from a previous release, please follow the upgrade instructions below.

Highlights

  • Dynamic MIG device management (alpha, see remarks below).
  • Significant reduction of the ComputeDomain formation time in large-scale clusters (~10 seconds for domains comprising thousands of nodes).

Improvements

  • Added preliminary support for VFIO passthrough devices in the GPU plugin (not enabled by default, enable the PassthroughSupport feature gate, see #668).
  • Added the memory addressingMode as an attribute to announced GPU devices (#717).
  • Added support for GPU health checks (not enabled by default, use the NVMLDeviceHealthCheck feature gate, see #689).
  • Made the ComputeDomain controller more robust when multiple replicas are running, whether deliberately or accidentally (#868).
  • Made the ComputeDomain kubelet plugin fail fast (crash) on obvious MNNVL fabric configuration errors or degraded fabric health (#844, #865).
  • Added a networkPolicy parameter to the Helm chart to support clusters with restricted networks (#708).
  • Tuned binary search paths to widen Linux distribution support (#706).

All commits since last release: v25.8.1...v25.12.0

Upgrades

First, update the CRDs by running:

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/tags/v25.12.0/deployments/helm/nvidia-dra-driver-gpu/crds/resource.nvidia.com_computedomains.yaml

Only then update the chart by running helm upgrade -i ... (instead of helm install).
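
The ordering above (CRDs first, chart second) can be sketched as a small dry-run script. This is only an illustration: the helper names crd_url and print_upgrade_plan and the VERSION variable are hypothetical and not part of the driver or its chart, and the helm arguments stay elided because they depend on your deployment. The script prints the commands and does not execute them.

```shell
#!/bin/sh
set -eu

# Release tag to upgrade to (hypothetical variable; override via the environment).
VERSION="${VERSION:-v25.12.0}"

# Build the ComputeDomain CRD manifest URL for a release tag.
# The URL pattern is copied from the upgrade instructions in these notes.
crd_url() {
  printf 'https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/tags/%s/deployments/helm/nvidia-dra-driver-gpu/crds/resource.nvidia.com_computedomains.yaml\n' "$1"
}

# Print the two upgrade steps in the required order, without running them.
print_upgrade_plan() {
  # Step 1: update the CRDs.
  echo "kubectl apply -f $(crd_url "$1")"
  # Step 2: only then upgrade the chart; arguments are deployment-specific.
  echo "helm upgrade -i ..."
}

print_upgrade_plan "$VERSION"
```

Printing rather than executing keeps the sketch safe to run anywhere; review the printed commands and fill in your own helm arguments before running them against a cluster.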

Dynamic MIG

We'd like your feedback on this alpha feature. Use it in a controlled environment and at your own risk, and please report any issues you observe.

Make sure you:

  • Use H100 GPUs or newer (A100 is not supported in this first release because it does not support enabling/disabling MIG mode freely).
  • Run Kubernetes with DRAPartitionableDevices support enabled (ideally v1.34+).
  • Enable the DynamicMIG feature gate when deploying this driver.

Please keep these key concepts in mind:

  • When the feature gate is enabled, the kubelet plugin assumes full ownership of the MIG configuration on the node it runs on. On startup, it will attempt to tear down any unexpected MIG devices it finds.
  • Disable any other mechanism that creates, modifies, or manages MIG devices.
  • Correctly managing MIG state in a long-running cluster requires invasive cleanup routines. Please help us identify bugs in the current implementation and share feedback so we can evolve the cleanup design with confidence.

New feature gates and known limitations

  • For now, enabling DynamicMIG is mutually exclusive with enabling any of MPSSupport, NVMLDeviceHealthCheck, and PassthroughSupport.
  • The new fail-fast behavior in the ComputeDomain kubelet plugin can be turned off by disabling the new CrashOnNVLinkFabricErrors feature gate.
  • The scalability improvements for ComputeDomains come with a number of architectural changes under the hood. To restore the 25.8.x behavior, disable the ComputeDomainCliques feature gate.
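
The DynamicMIG exclusivity rule above lends itself to a quick pre-deployment check. The following is a minimal sketch, not something shipped with the driver: check_feature_gates and CONFLICTS_WITH_DYNAMIC_MIG are hypothetical names, and the function only inspects a list of gate names you pass in. It does not read your Helm values or talk to the cluster.

```shell
#!/bin/sh
set -eu

# Gates that, per these release notes, cannot be combined with DynamicMIG for now.
CONFLICTS_WITH_DYNAMIC_MIG="MPSSupport NVMLDeviceHealthCheck PassthroughSupport"

# check_feature_gates GATE...
# Arguments: names of the feature gates you intend to enable.
# Prints one line per conflict and returns 1 if DynamicMIG is combined
# with any excluded gate; returns 0 otherwise.
check_feature_gates() {
  # Nothing to check unless DynamicMIG is among the requested gates.
  case " $* " in
    *" DynamicMIG "*) ;;
    *) return 0 ;;
  esac
  rc=0
  for gate in "$@"; do
    case " $CONFLICTS_WITH_DYNAMIC_MIG " in
      *" $gate "*)
        echo "conflict: DynamicMIG cannot be enabled together with $gate"
        rc=1
        ;;
    esac
  done
  return "$rc"
}

# Example: this combination is rejected.
check_feature_gates DynamicMIG NVMLDeviceHealthCheck || echo "fix the gate selection before deploying"
```

A check like this can run in CI before the chart is deployed. The excluded list is described as temporary ("for now"), so it would need updating as later releases lift the restriction.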
