NVIDIA/k8s-dra-driver-gpu v25.8.0


This release introduces substantial improvements to the operational ergonomics and fault tolerance of ComputeDomains.

Installation instructions can be found here.

When upgrading from a previous release, please follow the upgrade instructions below.

Highlights

  • Elasticity and fault tolerance: ComputeDomains have always been described as following the workload in terms of node placement. In the 25.3.x releases, though, a ComputeDomain remained static after initial workload scheduling: it could not expand or shrink, and it could not incorporate a replacement node upon node failure. Now, a ComputeDomain dynamically responds to workload placement changes at runtime. For example, when a workload pod fails and subsequently gets scheduled to a new node (one that was previously not part of the ComputeDomain), the domain now dynamically expands to that node.
  • Ergonomics: With ComputeDomains now being elastic, the numNodes parameter for ComputeDomains has lost relevance and can always be set to 0. One therefore no longer needs a priori knowledge of the number of nodes required for a workload when creating a ComputeDomain (see the example manifest after this list). For details and caveats, carefully review the current numNodes specification. This field will be removed in a future version of the API provided by this DRA driver.
  • Scheduling latency improvement: individual workload pods in a ComputeDomain are now released (started) much faster. Underneath a ComputeDomain, individual IMEX daemons now come online as soon as they are ready, without waiting for the entire compute domain to be formed. Specifically, an individual workload pod comes online when its corresponding local IMEX daemon is ready. This effectively removes a barrier from the system, so your workload must provide an internal replacement for it (ensuring that Multi-Node NVLink communication is only attempted once all relevant components are online). To restore the previous barrier behavior, see below.
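
To illustrate the ergonomics change above, here is a minimal ComputeDomain manifest sketch with numNodes set to 0. The API version and field layout follow the ComputeDomain examples shipped with this driver; the resource names are illustrative:

    apiVersion: resource.nvidia.com/v1beta1
    kind: ComputeDomain
    metadata:
      name: my-compute-domain
    spec:
      # Elastic behavior: the domain follows workload placement at runtime.
      numNodes: 0
      channel:
        resourceClaimTemplate:
          name: my-compute-domain-channel

Workload pods join the domain by referencing the generated ResourceClaimTemplate (here, my-compute-domain-channel) in their resourceClaims; the domain then expands and shrinks with pod placement.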

Improvements

  • A new allocationMode parameter can be used in the channel specification when creating a ComputeDomain. When set to all, all (currently 2048) IMEX channels get injected into each workload pod (#468, #506). See here for a usage example, and the sketch after this list.
  • For larger-scale deployments, startup/liveness/readiness probes were adjusted for predictable and fast scheduling, and the kubelet plugins' rollingUpdate.maxUnavailable setting was changed to allow for smooth upgrades (#616).
  • For long-running deployments, a new cleanup routine was introduced to clean up after partially prepared claims (#672).
  • Following the principle of least privilege, we have further minimized the privileges assigned to individual components (#666).
  • A new chart-wide logVerbosity parameter was introduced to control the overall verbosity of all driver components (#633); see the Helm example after this list.
  • The GPU index and minor number are no longer advertised; see the discussion in #624 and #563 for details.
  • Support was added for a validating admission webhook to inspect opaque configs (such as on ResourceClaims, see #461).
  • More diagnostics output is provided around NVLink clique ID determination (#630).
  • Individual components now systematically log their startup config (#658), and more output is shown during shutdown (#644).
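
For the new allocationMode parameter, a sketch of a channel specification requesting all IMEX channels might look as follows (field placement assumed per the v25.8.0 CRD; names are illustrative):

    apiVersion: resource.nvidia.com/v1beta1
    kind: ComputeDomain
    metadata:
      name: all-channels-domain
    spec:
      numNodes: 0
      channel:
        # Inject all (currently 2048) IMEX channels into each workload pod.
        allocationMode: all
        resourceClaimTemplate:
          name: all-channels-domain-channel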
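
Likewise, the chart-wide logVerbosity parameter can be set at chart installation/upgrade time; a minimal example (the value 6 is merely illustrative):

    helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
      --version="25.8.0" \
      --namespace nvidia-dra-driver-gpu \
      --set logVerbosity=6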

Fixes

  • Stdout/stderr emitted by probes based on nvidia-imex-ctl is now displayed as part of kubectl describe output (#636).
  • Logs are now flushed on component shutdown (#661).

Other changes

More changes were made that are not explicitly called out above. All pull requests that made it into this milestone are listed here. All commits since v25.3.2 can be reviewed here.

Upgrades

Procedure

We invested in making upgrades from version 25.3.2 of this driver work smoothly, without having to tear down workloads.

First, make sure you are running version 25.3.2 of this driver. This is required (and can be verified with, for example, helm list -A).

Then, follow this procedure:

  1. Upgrade the CRD:
    kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-dra-driver-gpu/refs/tags/v25.8.0/deployments/helm/nvidia-dra-driver-gpu/crds/resource.nvidia.com_computedomains.yaml
    
  2. Upgrade the Helm chart (append arguments for setting chart parameters as usual):
    helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu --version="25.8.0" --create-namespace --namespace nvidia-dra-driver-gpu [...]

Restoring IMEX daemon ensemble barrier behavior

With the new IMEX daemon orchestration architecture, workload pods start when their local IMEX daemon is ready, even though the underlying IMEX domain may not have fully formed yet.

Previously, all workload pods were released (started) only after the full IMEX domain had formed (according to the numNodes parameter). If your workload relies on this barrier, you can restore it for now by passing --set featureGates.IMEXDaemonsWithDNSNames=false upon Helm chart installation/upgrade, as shown below. It is advised to make your workload cooperate with the new mode of operation, though, as the old behavior has limited support and will eventually be phased out.
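
For example (other chart arguments elided; see the upgrade procedure above):

    helm upgrade -i nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
      --version="25.8.0" \
      --namespace nvidia-dra-driver-gpu \
      --set featureGates.IMEXDaemonsWithDNSNames=false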
