This is the first release of the DRA Driver for NVIDIA GPUs as part of a Kubernetes SIG community, hosted at kubernetes-sigs/dra-driver-nvidia-gpu. This release also updates the versioning scheme to Semantic Versioning (SemVer), starting at v0.4.0.
Project move
The DRA Driver for NVIDIA GPUs has moved from the NVIDIA org to kubernetes-sigs as part of its donation to CNCF. The new identifiers are:
| Artifact | Identifier |
|---|---|
| Repository | kubernetes-sigs/dra-driver-nvidia-gpu |
| Go module | sigs.k8s.io/dra-driver-nvidia-gpu |
| Container images | registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu |
| Helm chart | oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu:0.4.0 |
Change to Semantic Versioning with v0.4.0
Starting at v0.4.0, this release switches the versioning scheme to Semantic Versioning (SemVer). The change makes it easier to build and publish client API bindings for ComputeDomains and aligns releases with Go module semantic versioning for importing dependencies.
Refer to the following issues for more details around this change: #988, #715, and #1046.
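One general Go modules rule that is likely relevant here: a release tag with major version 2 or higher requires a matching `/vN` suffix on the module path, while v0 and v1 tags do not. The sketch below illustrates that rule only. The helper name and the `example.com/driver` path are hypothetical, and the linked issues remain the authoritative source for the project's rationale.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// modulePath is a hypothetical module path used only for illustration.
const modulePath = "example.com/driver"

// majorVersionSuffix returns the suffix Go modules require at the end of the
// module path for a release tag: "" for v0.x and v1.x, "/vN" for major
// version N >= 2. Tags that are not vMAJOR.MINOR.PATCH yield an error.
// gopkg.in paths follow a different convention and are not handled here.
func majorVersionSuffix(tag string) (string, error) {
	if !strings.HasPrefix(tag, "v") {
		return "", fmt.Errorf("tag %q does not start with 'v'", tag)
	}
	parts := strings.SplitN(strings.TrimPrefix(tag, "v"), ".", 3)
	if len(parts) != 3 {
		return "", fmt.Errorf("tag %q is not vMAJOR.MINOR.PATCH", tag)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil || major < 0 {
		return "", fmt.Errorf("tag %q has an invalid major version", tag)
	}
	if major < 2 {
		return "", nil
	}
	return "/v" + strconv.Itoa(major), nil
}

func main() {
	// Compare a calendar-style tag with the new SemVer tag.
	for _, tag := range []string{"v25.12.0", "v0.4.0"} {
		suffix, err := majorVersionSuffix(tag)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> import path %s%s\n", tag, modulePath, suffix)
	}
}
```

Under this rule, a calendar-style tag such as v25.12.0 would force a `/v25` path suffix on importers, whereas v0.4.0 needs none.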
Helm chart location and name change
Starting with v0.4.0, the Helm chart is published to two locations:
- NGC (continuing): https://helm.ngc.nvidia.com/nvidia/charts/dra-driver-nvidia-gpu. Note that this is a different name from the previous release. Refer to the Upgrade section for details on upgrading to the new chart name.
- Kubernetes registry (new): oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu.
Users can choose to use either chart.
The Helm chart is also renamed from nvidia-dra-driver-gpu to dra-driver-nvidia-gpu in v0.4.0. Users can keep their existing component names by passing `--set nameOverride=nvidia-dra-driver-gpu` when upgrading. Refer to the Upgrade section for commands and details about required flags.
Action required
- Starting in v0.4.0, the Helm chart follows Semantic Versioning. To upgrade, you must pass `--version 0.4.0` when using `helm upgrade`. See the Upgrade section for details and commands.
- If you are switching from the NGC chart to the Kubernetes registry chart, pass `--set nameOverride=nvidia-dra-driver-gpu` on your first upgrade to keep existing Kubernetes resource names stable. The override is not required for subsequent releases. Refer to the Upgrade section for exact commands.
- Users who hit "device cannot be reprepared" after a host reboot prior to v0.4.0 (issue #951) must remove the kubelet plugin checkpoint file manually before upgrading. The new BootID-aware checkpoint format (#1066) only invalidates checkpoints with a recorded BootID. Legacy checkpoints written by older versions are assumed valid.
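The BootID rule in the last item can be sketched as follows. This is a minimal illustration, not the driver's code: the `checkpoint` struct and `isStale` helper are hypothetical, and only the use of `/proc/sys/kernel/random/boot_id` as the Linux boot identifier is standard.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// checkpoint is a hypothetical, reduced stand-in for the kubelet plugin's
// checkpoint. The real struct and field names may differ.
type checkpoint struct {
	BootID string // empty for legacy checkpoints written before v0.4.0
}

// currentBootID reads the identifier the Linux kernel regenerates on each boot.
func currentBootID() (string, error) {
	b, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(b)), nil
}

// isStale reports whether a checkpoint was written during an earlier boot.
// A checkpoint without a recorded BootID is treated as valid, which is why
// legacy checkpoints are never invalidated automatically.
func isStale(cp checkpoint, bootID string) bool {
	return cp.BootID != "" && cp.BootID != bootID
}

func main() {
	bootID, err := currentBootID()
	if err != nil {
		fmt.Println("could not read boot ID:", err)
		return
	}
	fmt.Println("legacy checkpoint stale:", isStale(checkpoint{}, bootID))
	fmt.Println("previous-boot checkpoint stale:", isStale(checkpoint{BootID: "previous-boot"}, bootID))
}
```

Because a legacy checkpoint carries no BootID, a check of this shape can never flag it, which matches the manual cleanup requirement above.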
Feature gate changes
The following feature gates changed in v0.4.0. See pkg/featuregates/featuregates.go for the complete list of gates and their current defaults.
| Feature Gate | Change | Stage | Default | Required K8s Feature Gate | PRs |
|---|---|---|---|---|---|
| DeviceMetadata | New | Alpha | false | None. Driver-side only (KEP-5304). Framework support is Alpha in K8s 1.36+ | #1000 |
| PassthroughSupport | Behavior change | Alpha | false | None beyond core DRA. IOMMUFD backend additionally requires a host kernel with IOMMUFD enabled. | IOMMUFD backend added (#1036), persistence-mode toggling during vfio prep (#1038), plugin startup, GPU tracking, and validation fixes (#994) |
| NVMLDeviceHealthCheck | Behavior change | Alpha | false | DRADeviceTaints (KEP-5055) Alpha (default off) in K8s 1.34 and 1.35, Beta (default on) in K8s 1.36. Informational taints additionally require K8s ≥ 1.35. | Unhealthy GPUs are now retained in the ResourceSlice with a DeviceTaint attached. The v25.12.0 behavior of removing unhealthy GPUs from the slice (which required a driver restart to re-add after recovery) is replaced. Non-fatal XIDs surface as informational taints (#983) |
| IMEXDaemonsWithDNSNames | Behavior change | Beta | true | None | When enabled, numNodes in the ComputeDomain API is now optional (#1081) |
This release introduces enhanced validation logic for feature-gate flags:
- The driver now enforces mutual exclusivity between `PassthroughSupport` and `NVMLDeviceHealthCheck` during startup (#994).
- Enabling `DeviceMetadata` functionality now requires that the `PassthroughSupport` feature gate is also active (#1000).
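The two rules above amount to a startup check along the following lines. This is a sketch under assumptions: the `validateFeatureGates` function and its map-based input are hypothetical and do not reflect the actual code in pkg/featuregates. Only the gate names and the two constraints come from these notes.

```go
package main

import (
	"errors"
	"fmt"
)

// validateFeatureGates checks the two constraints described in these notes.
// The map-based representation is an assumption made for illustration.
func validateFeatureGates(gates map[string]bool) error {
	// Rule 1: the two gates cannot be enabled together.
	if gates["PassthroughSupport"] && gates["NVMLDeviceHealthCheck"] {
		return errors.New("PassthroughSupport and NVMLDeviceHealthCheck are mutually exclusive")
	}
	// Rule 2: DeviceMetadata depends on PassthroughSupport.
	if gates["DeviceMetadata"] && !gates["PassthroughSupport"] {
		return errors.New("DeviceMetadata requires PassthroughSupport to be enabled")
	}
	return nil
}

func main() {
	configs := []map[string]bool{
		{"PassthroughSupport": true, "NVMLDeviceHealthCheck": true},
		{"DeviceMetadata": true},
		{"DeviceMetadata": true, "PassthroughSupport": true},
	}
	for _, gates := range configs {
		if err := validateFeatureGates(gates); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", gates)
	}
}
```

Failing at startup, rather than at allocation time, surfaces an invalid gate combination before any claim is prepared.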
New features
- Leader election for the compute-domain-controller. This adds high availability when running multiple controller replicas. Disabled by default. Enable by setting `controller.replicas: 2` (or more) and `controller.leaderElection.enabled: true` in your Helm values (#851).
- Prometheus metrics. The GPU kubelet plugin, ComputeDomain plugin, and DRA controller now expose optional Prometheus metrics under the `nvidia_gpu_dra_*` prefix. Enable via `controller.metrics.enabled` and `kubeletPlugin.metrics.enabled` (#964).
- GPU health taints are now used to track health status. When the `NVMLDeviceHealthCheck` feature gate is enabled, unhealthy GPUs are tainted via Kubernetes `DeviceTaints` and remain in the `ResourceSlice`, replacing the v25.12.0 behavior of removing unhealthy GPUs from the slice (which required a driver restart to re-add them after recovery). Non-fatal XID errors are surfaced as informational taints for observability without affecting scheduling (#983).
- IOMMUFD-backed VFIO passthrough. VFIO passthrough now supports the IOMMUFD kernel backend in addition to the legacy IOMMU interface. Workloads opt in via a new `VfioDeviceConfig` opaque config on the `ResourceClaim` (`apiVersion: resource.nvidia.com/v1beta1`) with `iommu.backendPolicy: PreferIommuFD`; the driver falls back to the legacy backend if IOMMUFD is not available on the node. A companion `iommu.enableAPIDevice` field controls whether the IOMMU API device is injected into the container. Defaults preserve v25.12.0 behavior: the default `vfio.gpu.nvidia.com` DeviceClass ships with no config and existing workloads continue to use the legacy backend unchanged. Requires the `PassthroughSupport` feature gate (Alpha, default off) and, for IOMMUFD, a host kernel with IOMMUFD support enabled (#1036).
- Device metadata downward API (KEP-5304). When the `DeviceMetadata` feature gate is enabled, the kubelet plugin writes a `DeviceMetadata` JSON file (apiVersion `metadata.resource.k8s.io/v1alpha1`) per claim and injects it via CDI, exposing device attributes such as `pciBusID` to the workload (#1000).
- `numNodes` in the `ComputeDomain` API is now optional. With `IMEXDaemonsWithDNSNames=true` (default), the field can be omitted. The default value is `0`. The field remains deprecated and will be removed in a future release. When running with `IMEXDaemonsWithDNSNames=false`, set `numNodes` explicitly. The API server no longer rejects a missing value (#1081).
- `imagePullSecrets` propagation to the ComputeDomain daemon. Secrets configured on the controller are now passed through to dynamically created CD daemon DaemonSet pods, resolving `ImagePullBackOff` against private registries on Kubernetes 1.35+ (#1033).
- Upstream NFD GPU labels for kubelet plugin scheduling. The kubelet plugin DaemonSet now selects nodes using upstream Node Feature Discovery GPU labels in addition to NVIDIA-specific labels, allowing the chart to be used with upstream NFD without overrides (#1122).
- ExtendedResources examples. Sample DeviceClasses and Pod specs for ExtendedResources requests are now included (#940).
- Kubernetes 1.36 support added.
- OpenShift 4.21 support added.
- Plugin pods now use a higher startup-probe rate, providing faster readiness on healthy nodes (#872).
- The controller now tolerates `node-role.kubernetes.io/master`, allowing it to schedule on older control planes (#899).
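The IOMMUFD opt-in and fallback described in the list above can be pictured with the following sketch. It is not the driver's implementation. The Go struct, the `chooseBackend` helper, the JSON encoding of the opaque config, and the `/dev/iommu` existence probe are assumptions made for illustration. Only the kind, apiVersion, and `iommu.backendPolicy: PreferIommuFD` come from these notes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// vfioDeviceConfig is a hypothetical, reduced view of the opaque config.
// Only the fields named in these notes are modeled.
type vfioDeviceConfig struct {
	APIVersion string `json:"apiVersion"`
	Kind       string `json:"kind"`
	IOMMU      struct {
		BackendPolicy string `json:"backendPolicy"`
	} `json:"iommu"`
}

// chooseBackend returns "iommufd" only when the claim asks for it and the
// node supports it. A missing config or any other policy yields "legacy".
func chooseBackend(cfg *vfioDeviceConfig, iommufdAvailable bool) string {
	if cfg != nil && cfg.IOMMU.BackendPolicy == "PreferIommuFD" && iommufdAvailable {
		return "iommufd"
	}
	return "legacy"
}

// iommufdAvailable probes for the IOMMUFD character device. How the driver
// actually detects support is not stated in these notes.
func iommufdAvailable() bool {
	_, err := os.Stat("/dev/iommu")
	return err == nil
}

func main() {
	raw := []byte(`{
	  "apiVersion": "resource.nvidia.com/v1beta1",
	  "kind": "VfioDeviceConfig",
	  "iommu": {"backendPolicy": "PreferIommuFD"}
	}`)
	var cfg vfioDeviceConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println("selected backend:", chooseBackend(&cfg, iommufdAvailable()))
}
```

Treating a missing config as the legacy backend mirrors the stated default, where existing workloads keep their v25.12.0 behavior.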
Bug fixes
- Fixed `checkpoint is corrupted` errors when upgrading the GPU kubelet plugin from v25.12.0. New checkpoint fields added in #994 (`ShareID`, `Metadata`, `ParentPCIBusID`, `PciBusID`) are now annotated with `omitempty` so older checkpoints remain readable. A diff between on-disk and re-marshaled checkpoint contents is now logged when checksum verification fails, making upgrade failures easier to debug (#1119).
- Fixed an issue where devices could not be re-prepared after a host reboot. The kubelet plugin now records the Linux BootID in its checkpoint file and invalidates the checkpoint on BootID mismatch. See the Action required section for users hit by this prior to v0.4.0 (#1066).
- Fixed an issue where the GPU DRA driver attempted to toggle MIG mode on Ampere (A100) GPUs at allocation time, which requires a GPU reset. The driver now respects the pre-configured MIG state on A100s and continues to support reset-less MIG mode toggling on Hopper and newer GPUs (#1018).
- Fixed an issue where the kubelet plugin failed to start on nodes where every GPU is in passthrough mode. `nvml.Init()` previously required at least one GPU bound to the NVIDIA driver (#994).
- Fixed vfio device preparation failures caused by `nvidia-persistenced` holding the GPU. The driver now toggles GPU persistence mode during vfio preparation under the `PassthroughSupport` feature gate (#1038).
- Fixed DynamicMIG enablement on non-MIG-capable GPUs. The driver now correctly skips non-MIG-capable GPUs on mixed-GPU nodes (#1113).
- Fixed `mps-control-daemon` chroot shell execution error when `nvidiaDriverRoot` is set (#889).
- Fixed MPS shared-memory directory mount path (#978).
- Fixed cleanup of partially-prepared claims in the GPU kubelet plugin when DynamicMIG is disabled. A missing early return previously caused unprepare-time errors (#1040).
- Fixed RBAC duplication when the driver's install namespace is also listed in the `ADDITIONAL_NAMESPACES` environment variable (#995).
- Feature-gate flag validation. Unrecognized values for `PassthroughSupport` and `NVMLDeviceHealthCheck` are now rejected at startup, and the mutual-exclusivity rules between `PassthroughSupport`, `DynamicMIG`, `NVMLDeviceHealthCheck`, and `MPSSupport` are enforced (#994).
- Improved validation messages for GPU sharing. Errors now list the supported values and what action to take (#996).
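The `omitempty` fix for the checkpoint corruption error, listed among the bug fixes above, can be demonstrated in isolation. In this sketch, which assumes a checksum computed over the re-marshaled JSON, a legacy record only round-trips to the same bytes when the newly added fields are tagged `omitempty`. The struct names, JSON keys, and CRC32 checksum are hypothetical. Only the field names `ShareID` and `PciBusID` come from these notes.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/crc32"
)

// deviceWithoutOmitEmpty models new fields added without omitempty.
type deviceWithoutOmitEmpty struct {
	UUID     string `json:"uuid"`
	ShareID  string `json:"shareID"`
	PciBusID string `json:"pciBusID"`
}

// deviceWithOmitEmpty models the same fields tagged with omitempty.
type deviceWithOmitEmpty struct {
	UUID     string `json:"uuid"`
	ShareID  string `json:"shareID,omitempty"`
	PciBusID string `json:"pciBusID,omitempty"`
}

// survivesRoundTrip decodes data into v, re-encodes it, and reports whether
// the checksum of the re-encoded bytes matches the checksum of the input.
func survivesRoundTrip(data []byte, v any) (bool, error) {
	if err := json.Unmarshal(data, v); err != nil {
		return false, err
	}
	out, err := json.Marshal(v)
	if err != nil {
		return false, err
	}
	return crc32.ChecksumIEEE(out) == crc32.ChecksumIEEE(data), nil
}

func main() {
	// A record as an older version might have written it, without new fields.
	legacy := []byte(`{"uuid":"GPU-1234"}`)

	withTag, err := survivesRoundTrip(legacy, &deviceWithOmitEmpty{})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	withoutTag, err := survivesRoundTrip(legacy, &deviceWithoutOmitEmpty{})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("checksum stable with omitempty:", withTag)
	fmt.Println("checksum stable without omitempty:", withoutTag)
}
```

Without the tag, the empty new fields are emitted as `""` on re-marshal, so the bytes and therefore the checksum no longer match the legacy record.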
Known limitations
- Downgrades from v0.4.0 to v25.12.0 are not supported. Two changes prevent downgrade:
  - The kubelet plugin checkpoint format added a BootID field (#1066) that older binaries cannot unmarshal.
  - The ComputeDomain API now allows `numNodes` to be omitted (#1081). Older controllers cannot read or reverse this schema change. Users running with `IMEXDaemonsWithDNSNames=false` (non-default) should continue to set `numNodes` explicitly.
- `PassthroughSupport` feature is not supported on NVIDIA Grace-based machines (#1042).
Installation
Install with Helm from the Kubernetes registry:
helm install dra-driver-nvidia-gpu oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu \
--version 0.4.0 \
--namespace dra-driver-nvidia-gpu \
--create-namespace \
--set gpuResourcesEnabledOverride=true
Upgrade
This release changes several release artifacts: the versioning scheme, the Helm chart name and location, and publication to the Kubernetes registry. Refer to the sections above for full details.
Upgrade to v0.4.0 and switch to using the Kubernetes registry Helm chart:
helm upgrade -i nvidia-dra-driver-gpu oci://registry.k8s.io/dra-driver-nvidia/charts/dra-driver-nvidia-gpu \
--version 0.4.0 \
--namespace nvidia-dra-driver-gpu \
--set gpuResourcesEnabledOverride=true \
--set nameOverride=nvidia-dra-driver-gpu
Required flags for the first upgrade to v0.4.0 or later
- `--version 0.4.0`: Starting in v0.4.0 the Helm chart follows Semantic Versioning, and the `--version` flag is required on `helm install` and `helm upgrade`. There is no implicit "latest" channel.
- `--set nameOverride=nvidia-dra-driver-gpu`: Keeps your components deployed under the previous chart's resource names. Without this override, the new chart will create duplicate Kubernetes manifests (kubelet plugin DaemonSet, controller Deployment, RBAC, etc.) with different names.
Note that downgrades from v0.4.0 to v25.12.0 are not supported. Refer to the Known limitations section for details.
Container images
registry.k8s.io/dra-driver-nvidia/dra-driver-nvidia-gpu:v0.4.0
Contributors
37 @jgehrcke
25 @dims
20 @shivamerla
13 @guptaNswati
11 @varunrsekar
9 @shengnuo
7 @tariq1890
5 @kasia-kujawa
2 @xingyug
2 @anishbista60
1 @herb-duan
1 @kannon92
1 @leiyiz
1 @rajatchopra
1 @empovit
1 @takonomura
Thanks to everyone who contributed to this release.
Full changelog
New Contributors
- @kasia-kujawa made their first contribution in #889
- @herb-duan made their first contribution in #851
- @leiyiz made their first contribution in #968
- @visheshtanksale made their first contribution in #965
- @dims made their first contribution in #1003
- @kannon92 made their first contribution in #1016
- @xingyug made their first contribution in #1039
- @takonomura made their first contribution in #1053
- @RobertNorthard made their first contribution in #1054
- @anishbista60 made their first contribution in #1094
Full Changelog: v25.12.0...v0.4.0