github kubernetes-sigs/dra-driver-nvidia-gpu v0.4.0

This is the first release of the DRA Driver for NVIDIA GPUs as part of the Kubernetes SIG community at kubernetes-sigs/dra-driver-nvidia-gpu. It also switches the versioning scheme to Semantic Versioning (SemVer), starting at v0.4.0.

Project move

The DRA Driver for NVIDIA GPUs has moved from the NVIDIA org to kubernetes-sigs as part of its donation to CNCF. The new identifiers are:

| Artifact | Identifier |
|---|---|
| Repository | kubernetes-sigs/dra-driver-nvidia-gpu |
| Go module | sigs.k8s.io/dra-driver-nvidia-gpu |
| Container images | registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu |
| Helm chart | oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu:0.4.0 |

Change to Semantic Versioning with v0.4.0

This release also switches the versioning scheme to Semantic Versioning (SemVer), starting at v0.4.0. The change makes it easier to build and publish client API bindings for ComputeDomains, and it aligns the project with Go module semantic versioning so that it can be imported as a dependency.

Refer to the following issues for more details on this change: #988, #715, and #1046.

Helm chart location and name change

Starting with v0.4.0, the Helm chart is published to two locations:

  • NGC (continuing): https://helm.ngc.nvidia.com/nvidia/charts/dra-driver-nvidia-gpu. Note that this is a different name from the previous release. Refer to the Upgrade section for details on upgrading to the new chart name.
  • Kubernetes registry (new): oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu.

Users can choose to use either chart.

The Helm chart is also renamed from nvidia-dra-driver-gpu to dra-driver-nvidia-gpu in v0.4.0. Users can keep their existing component names by passing --set nameOverride=nvidia-dra-driver-gpu when upgrading. Refer to the Upgrade section for commands and details about required flags.

Action required

  • Starting in v0.4.0, the Helm chart follows Semantic Versioning. To upgrade, you must pass --version 0.4.0 when using helm upgrade. See the Upgrade section for details and commands.
  • If you are switching from the NGC chart to the Kubernetes registry chart, pass --set nameOverride=nvidia-dra-driver-gpu on your first upgrade to keep existing Kubernetes resource names stable. The override is not required for subsequent releases. Refer to the Upgrade section for exact commands.
  • Users who hit "device cannot be reprepared" after a host reboot prior to v0.4.0 (issue #951) must remove the kubelet plugin checkpoint file manually before upgrading. The new BootID-aware checkpoint format (#1066) invalidates only checkpoints that carry a recorded BootID; legacy checkpoints written by older versions are assumed valid.

Feature gate changes

The following feature gates changed in v0.4.0. See pkg/featuregates/featuregates.go for the complete list of gates and their current defaults.

| Feature Gate | Change | Stage | Default | Required K8s Feature Gate | Notes and PRs |
|---|---|---|---|---|---|
| DeviceMetadata | New | Alpha | false | None. Driver-side only (KEP-5304). Framework support is Alpha in K8s 1.36+ | #1000 |
| PassthroughSupport | Behavior change | Alpha | false | None beyond core DRA. IOMMUFD backend additionally requires a host kernel with IOMMUFD enabled. | IOMMUFD backend added (#1036), persistence-mode toggling during vfio prep (#1038), plugin startup, GPU tracking, and validation fixes (#994) |
| NVMLDeviceHealthCheck | Behavior change | Alpha | false | DRADeviceTaints (KEP-5055): Alpha (default off) in K8s 1.34 and 1.35, Beta (default on) in K8s 1.36. Informational taints additionally require K8s ≥ 1.35. | Unhealthy GPUs are now retained in the ResourceSlice with a DeviceTaint attached. The v25.12.0 behavior of removing unhealthy GPUs from the slice (which required a driver restart to re-add after recovery) is replaced. Non-fatal XIDs surface as informational taints (#983) |
| IMEXDaemonsWithDNSNames | Behavior change | Beta | true | None | When enabled, numNodes in the ComputeDomain API is now optional (#1081) |
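
As an illustration of the IMEXDaemonsWithDNSNames change, the sketch below writes a ComputeDomain manifest that leaves out numNodes. The spec layout (channel.resourceClaimTemplate.name) and the resource.nvidia.com/v1beta1 apiVersion are assumed to match earlier driver releases, and the object name, template name, and file name are placeholders, so check the shipped CRD before relying on it.

```shell
# Write a ComputeDomain manifest without spec.numNodes.
# Assumptions: apiVersion and spec layout follow earlier driver releases;
# "example-compute-domain" and "example-channel" are placeholder names.
cat > computedomain.yaml <<'EOF'
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: example-compute-domain
spec:
  # numNodes is intentionally omitted (optional with IMEXDaemonsWithDNSNames=true)
  channel:
    resourceClaimTemplate:
      name: example-channel
EOF

echo "Wrote computedomain.yaml; apply it with: kubectl apply -f computedomain.yaml"
```

With IMEXDaemonsWithDNSNames=false, the same manifest would need an explicit numNodes value under spec.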

This release introduces enhanced validation logic for feature-gate flags:

  • The driver now enforces mutual exclusivity between PassthroughSupport and NVMLDeviceHealthCheck during startup (#994).
  • Enabling DeviceMetadata functionality now requires that the PassthroughSupport feature gate is also active (#1000).
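
The two rules above can be sketched as a small pre-flight check to run before deploying. This is not the driver's implementation (which lives in the Go code base); it only mirrors the rules as stated here. The function names are hypothetical, and the input is assumed to be a comma-separated Name=true|false list in the style of Kubernetes feature-gate flags.

```shell
# gate_enabled GATES NAME: succeeds if "NAME=true" appears in the
# comma-separated GATES string. (Hypothetical helper for illustration.)
gate_enabled() {
  case ",$1," in
    *",$2=true,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_gates GATES: applies the two rules from the release notes.
check_gates() {
  gates=$1
  # Rule 1: PassthroughSupport and NVMLDeviceHealthCheck are mutually exclusive.
  if gate_enabled "$gates" PassthroughSupport &&
     gate_enabled "$gates" NVMLDeviceHealthCheck; then
    echo "error: PassthroughSupport and NVMLDeviceHealthCheck are mutually exclusive" >&2
    return 1
  fi
  # Rule 2: DeviceMetadata requires PassthroughSupport.
  if gate_enabled "$gates" DeviceMetadata &&
     ! gate_enabled "$gates" PassthroughSupport; then
    echo "error: DeviceMetadata requires PassthroughSupport" >&2
    return 1
  fi
  return 0
}

check_gates "PassthroughSupport=true,DeviceMetadata=true" && echo "gates OK"
```

A gate that is absent from the string is treated as disabled, which matches the Alpha gates in the table but would need adjusting for gates that default to true.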

New features

  • Leader election for the compute-domain-controller. This adds high availability when running multiple controller replicas. Disabled by default. Enable by setting controller.replicas: 2 (or more) and controller.leaderElection.enabled: true in your Helm values (#851).
  • Prometheus metrics. The GPU kubelet plugin, ComputeDomain plugin, and DRA controller now expose optional Prometheus metrics under the nvidia_gpu_dra_* prefix. Enable via controller.metrics.enabled and kubeletPlugin.metrics.enabled (#964).
  • GPU health taints are now used to track health status. When the NVMLDeviceHealthCheck feature gate is enabled, unhealthy GPUs are tainted via Kubernetes DeviceTaints and remain in the ResourceSlice, replacing the v25.12.0 behavior of removing unhealthy GPUs from the slice (which required a driver restart to re-add them after recovery). Non-fatal XID errors are surfaced as informational taints for observability without affecting scheduling (#983).
  • IOMMUFD-backed VFIO passthrough. VFIO passthrough now supports the IOMMUFD kernel backend in addition to the legacy IOMMU interface. Workloads opt in via a new VfioDeviceConfig opaque config on the ResourceClaim (apiVersion: resource.nvidia.com/v1beta1) with iommu.backendPolicy: PreferIommuFD; the driver falls back to the legacy backend if IOMMUFD is not available on the node. A companion iommu.enableAPIDevice field controls whether the IOMMU API device is injected into the container. Defaults preserve v25.12.0 behavior — the default vfio.gpu.nvidia.com DeviceClass ships with no config and existing workloads continue to use the legacy backend unchanged. Requires the PassthroughSupport feature gate (Alpha, default off) and, for IOMMUFD, a host kernel with IOMMUFD support enabled (#1036).
  • Device metadata downward API (KEP-5304). When the DeviceMetadata feature gate is enabled, the kubelet plugin writes a DeviceMetadata JSON file (apiVersion metadata.resource.k8s.io/v1alpha1) per claim and injects it via CDI, exposing device attributes such as pciBusID to the workload (#1000).
  • numNodes in the ComputeDomain API is now optional. With IMEXDaemonsWithDNSNames=true (default), the field can be omitted. The default value is 0. The field remains deprecated and will be removed in a future release. When running with IMEXDaemonsWithDNSNames=false, set numNodes explicitly. The API server no longer rejects a missing value (#1081).
  • imagePullSecrets propagation to the ComputeDomain daemon. Secrets configured on the controller are now passed through to dynamically created CD daemon DaemonSet pods, resolving ImagePullBackOff against private registries on Kubernetes 1.35+ (#1033).
  • Upstream NFD GPU labels for kubelet plugin scheduling. The kubelet plugin DaemonSet now selects nodes using upstream Node Feature Discovery GPU labels in addition to NVIDIA-specific labels, allowing the chart to be used with upstream NFD without overrides (#1122).
  • ExtendedResources examples. Sample DeviceClasses and Pod specs for ExtendedResources requests are now included (#940).
  • Kubernetes 1.36 support added.
  • OpenShift 4.21 support added.
  • Plugin pods now use a higher startup-probe rate, providing faster readiness on healthy nodes (#872).
  • The controller now tolerates node-role.kubernetes.io/master, allowing it to schedule on older control planes (#899).
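
To make the IOMMUFD opt-in more concrete, here is a minimal sketch of a ResourceClaim carrying the VfioDeviceConfig opaque config described above. Only the VfioDeviceConfig apiVersion, the iommu.backendPolicy and iommu.enableAPIDevice field names, and the vfio.gpu.nvidia.com DeviceClass come from these notes. The claim name, the request name, the gpu.nvidia.com driver name, the boolean type of enableAPIDevice, and the use of the resource.k8s.io/v1 ResourceClaim API are assumptions.

```shell
# Write a ResourceClaim that requests a VFIO passthrough GPU and prefers the
# IOMMUFD backend. Names marked "assumed" are illustrative, not confirmed.
cat > vfio-claim.yaml <<'EOF'
apiVersion: resource.k8s.io/v1        # assumed: GA ResourceClaim API
kind: ResourceClaim
metadata:
  name: example-vfio-gpu              # placeholder name
spec:
  devices:
    requests:
    - name: gpu                       # placeholder request name
      exactly:
        deviceClassName: vfio.gpu.nvidia.com
    config:
    - requests: ["gpu"]
      opaque:
        driver: gpu.nvidia.com        # assumed driver name
        parameters:
          apiVersion: resource.nvidia.com/v1beta1
          kind: VfioDeviceConfig
          iommu:
            backendPolicy: PreferIommuFD
            enableAPIDevice: true     # assumed to be a boolean
EOF

echo "Wrote vfio-claim.yaml; apply it with: kubectl apply -f vfio-claim.yaml"
```

Leaving out the config entry keeps the v25.12.0 behavior, since the default DeviceClass ships with no config and uses the legacy backend.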

Bug fixes

  • Fixed "checkpoint is corrupted" errors when upgrading the GPU kubelet plugin from v25.12.0. New checkpoint fields added in #994 (ShareID, Metadata, ParentPCIBusID, PciBusID) are now annotated with omitempty so older checkpoints remain readable. A diff between on-disk and re-marshaled checkpoint contents is now logged when checksum verification fails, making upgrade failures easier to debug (#1119).
  • Fixed an issue where devices could not be re-prepared after a host reboot. The kubelet plugin now records the Linux BootID in its checkpoint file and invalidates the checkpoint on BootID mismatch. See the Action required section for users hit by this prior to v0.4.0 (#1066).
  • Fixed an issue where the GPU DRA driver attempted to toggle MIG mode on Ampere (A100) GPUs at allocation time, which requires a GPU reset. The driver now respects the pre-configured MIG state on A100s and continues to support reset-less MIG mode toggling on Hopper and newer GPUs (#1018).
  • Fixed an issue where the kubelet plugin failed to start on nodes where every GPU is in passthrough mode. nvml.Init() previously required at least one GPU bound to the NVIDIA driver (#994).
  • Fixed vfio device preparation failures caused by nvidia-persistenced holding the GPU. The driver now toggles GPU persistence mode during vfio preparation under the PassthroughSupport feature gate (#1038).
  • Fixed DynamicMIG enablement on non-MIG-capable GPUs. The driver now correctly skips non-MIG-capable GPUs on mixed-GPU nodes (#1113).
  • Fixed mps-control-daemon chroot shell execution error when nvidiaDriverRoot is set (#889).
  • Fixed MPS shared-memory directory mount path (#978).
  • Fixed cleanup of partially-prepared claims in the GPU kubelet plugin when DynamicMIG is disabled. A missing early return previously caused unprepare-time errors (#1040).
  • Fixed RBAC duplication when the driver's install namespace is also listed in the ADDITIONAL_NAMESPACES environment variable (#995).
  • Feature-gate flag validation. Unrecognized values for PassthroughSupport and NVMLDeviceHealthCheck are now rejected at startup, and the mutual-exclusivity rules between PassthroughSupport, DynamicMIG, NVMLDeviceHealthCheck, and MPSSupport are enforced (#994).
  • Improved validation messages for GPU sharing. Errors now list the supported values and what action to take (#996).

Known limitations

  • Downgrades from v0.4.0 to v25.12.0 are not supported. Two changes prevent downgrade:
    • The kubelet plugin checkpoint format added a BootID field (#1066) that older binaries cannot unmarshal.
    • The ComputeDomain API now allows numNodes to be omitted (#1081). Older controllers cannot read or reverse this schema change. Users running with IMEXDaemonsWithDNSNames=false (non-default) should continue to set numNodes explicitly.
  • The PassthroughSupport feature is not supported on NVIDIA Grace-based machines (#1042).

Installation

Install with Helm from the Kubernetes registry:

helm install dra-driver-nvidia-gpu oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu \
  --version 0.4.0 \
  --namespace dra-driver-nvidia-gpu \
  --create-namespace \
  --set gpuResourcesEnabledOverride=true
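
A variant of the install command above that also turns on the optional leader election and Prometheus metrics from the New features section might look like the following sketch. The value keys are the ones named in these notes; the file name, the run helper, and the install_with_values function are hypothetical. By default the script only prints the helm command, and setting DRY_RUN=0 executes it.

```shell
# Helm values taken from the keys named in the release notes.
cat > dra-values.yaml <<'EOF'
gpuResourcesEnabledOverride: true
controller:
  replicas: 2
  leaderElection:
    enabled: true
  metrics:
    enabled: true
kubeletPlugin:
  metrics:
    enabled: true
EOF

# Hypothetical helper: print the command unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

install_with_values() {
  run helm install dra-driver-nvidia-gpu \
    oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu \
    --version 0.4.0 \
    --namespace dra-driver-nvidia-gpu \
    --create-namespace \
    -f dra-values.yaml
}

install_with_values
```

Keeping the settings in a values file rather than a chain of --set flags makes it easier to reuse the same configuration on later upgrades.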

Upgrade

This release changes several release artifacts: the versioning scheme, the Helm chart name, and the publication of the release to the Kubernetes registry. Refer to the sections above for full details.

Upgrade to v0.4.0 and switch to using the Kubernetes registry Helm chart:

helm upgrade -i nvidia-dra-driver-gpu oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu \
  --version 0.4.0 \
  --namespace nvidia-dra-driver-gpu \
  --set gpuResourcesEnabledOverride=true \
  --set nameOverride=nvidia-dra-driver-gpu

Required flags for the first upgrade to v0.4.0 or later

  • --version 0.4.0: Starting in v0.4.0 the Helm chart follows Semantic Versioning, and the --version flag is required on helm install and helm upgrade. There is no implicit "latest" channel.
  • --set nameOverride=nvidia-dra-driver-gpu: Keeps your components deployed under the previous chart's resource names. Without this override, the new chart creates a second set of Kubernetes resources (kubelet plugin DaemonSet, controller Deployment, RBAC, etc.) under different names.
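
For users who stay on the NGC chart, the equivalent upgrade could look like the sketch below. It assumes the renamed chart is served from the NGC Helm repository at https://helm.ngc.nvidia.com/nvidia and that the repository alias is nvidia. The run helper and the upgrade_from_ngc function are hypothetical; by default the commands are only printed, and DRY_RUN=0 executes them.

```shell
# Hypothetical helper: print each command unless DRY_RUN=0.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

upgrade_from_ngc() {
  # Assumption: the NGC repository URL and the "nvidia" alias.
  run helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  run helm repo update
  # Same required flags as the Kubernetes registry upgrade; only the
  # chart reference differs (new chart name under the NGC repository).
  run helm upgrade -i nvidia-dra-driver-gpu nvidia/dra-driver-nvidia-gpu \
    --version 0.4.0 \
    --namespace nvidia-dra-driver-gpu \
    --set gpuResourcesEnabledOverride=true \
    --set nameOverride=nvidia-dra-driver-gpu
}

upgrade_from_ngc
```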

Note that downgrades from v0.4.0 to v25.12.0 are not supported. Refer to the Known limitations section for details.

Container images

  • registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu:v0.4.0

Contributors

37  @jgehrcke 
25  @dims 
20  @shivamerla 
13  @guptaNswati 
11  @varunrsekar 
 9  @shengnuo 
 7  @tariq1890 
 5  @kasia-kujawa 
 2  @xingyug
 2  @anishbista60
 1  @herb-duan 
 1  @kannon92 
 1  @leiyiz 
 1  @rajatchopra
 1  @empovit
 1  @takonomura 

Thanks to everyone who contributed to this release.

Full changelog

Compare v25.12.0...v0.4.0
