The NVIDIA GPU Device Plugin v0.15.0 release includes the following major changes:
Consolidated the NVIDIA GPU Device Plugin and NVIDIA GPU Feature Discovery repositories
Since the NVIDIA GPU Device Plugin and GPU Feature Discovery (GFD) components are often used together, we have consolidated the repositories. The primary goal was to streamline the development and release process; functionality remains unchanged. The user-facing changes are as follows:
- The two components will use the same version, meaning that the GFD version jumps from `v0.8.2` to `v0.15.0`.
- The two components use the same container image, meaning that `nvcr.io/nvidia/k8s-device-plugin` is to be used instead of `nvcr.io/nvidia/gpu-feature-discovery`. Note that this may mean that the `gpu-feature-discovery` command needs to be explicitly specified.
In order to facilitate the transition for users that rely on a standalone GFD deployment, this release includes a `gpu-feature-discovery` Helm chart in the device plugin Helm repository.
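
For a standalone GFD deployment, the image change described above amounts to swapping the image repository and making sure the `gpu-feature-discovery` command is set. The following is a minimal sketch in Python that operates on a container spec held as a plain dictionary. The helper names `split_image` and `migrate_gfd_container` are hypothetical, and the default tag `v0.15.0` assumes that the image tag matches the release version.

```python
OLD_REPO = "nvcr.io/nvidia/gpu-feature-discovery"
NEW_REPO = "nvcr.io/nvidia/k8s-device-plugin"


def split_image(image):
    """Split an image reference into (repository, tag); tag may be empty."""
    name, sep, tag = image.rpartition(":")
    # A "/" after the last ":" means the colon belonged to a registry port.
    if sep and "/" not in tag:
        return name, tag
    return image, ""


def migrate_gfd_container(container, tag="v0.15.0"):
    """Return a copy of a container spec that uses the consolidated image.

    Hypothetical helper: `container` is a dict shaped like one entry of a
    pod's `containers` list. The default tag is an assumption.
    """
    repo, _ = split_image(container.get("image", ""))
    migrated = dict(container)
    if repo != OLD_REPO:
        return migrated  # not a standalone GFD container; leave unchanged
    migrated["image"] = f"{NEW_REPO}:{tag}"
    # The shared image may not start GFD by default, so name the command
    # explicitly unless the spec already sets one.
    migrated.setdefault("command", ["gpu-feature-discovery"])
    return migrated
```

Users of the `gpu-feature-discovery` Helm chart should not need a rewrite like this; it is only relevant for hand-maintained manifests.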
Added experimental support for GPU partitioning using MPS.
This release of the NVIDIA GPU Device Plugin includes experimental support for GPU sharing using CUDA MPS. Feedback on this feature is appreciated.
This functionality is not production ready and has a number of known issues, including:
- The device plugin may show as started before it is ready to allocate shared GPUs while waiting for the CUDA MPS control daemon to come online.
- There is no synchronization between the CUDA MPS control daemon and the GPU Device Plugin under restarts or configuration changes. This means that workloads may crash if they lose access to shared resources controlled by the CUDA MPS control daemon.
- MPS is only supported for full GPUs.
- It is not possible to "combine" MPS GPU requests to allow for access to more memory by a single container.
Deprecation Notice
The following table shows a set of new CUDA driver and runtime version labels and their existing equivalents. The existing labels should be considered deprecated and will be removed in a future release.
New Label | Deprecated Label |
---|---|
`nvidia.com/cuda.driver-version.major` | `nvidia.com/cuda.driver.major` |
`nvidia.com/cuda.driver-version.minor` | `nvidia.com/cuda.driver.minor` |
`nvidia.com/cuda.driver-version.revision` | `nvidia.com/cuda.driver.rev` |
`nvidia.com/cuda.driver-version.full` | |
`nvidia.com/cuda.runtime-version.major` | `nvidia.com/cuda.runtime.major` |
`nvidia.com/cuda.runtime-version.minor` | `nvidia.com/cuda.runtime.minor` |
`nvidia.com/cuda.runtime-version.full` | |
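
Node selectors that reference the deprecated labels can be updated mechanically. The sketch below encodes the table above as a Python dictionary and renames matching keys in a `nodeSelector`-style mapping. The helper name `migrate_node_selector` is hypothetical, and the code assumes that a new label carries the same value as its deprecated equivalent.

```python
# Deprecated label -> new label, taken from the table above. The two
# `*.full` labels have no deprecated equivalent and are therefore absent.
DEPRECATED_TO_NEW = {
    "nvidia.com/cuda.driver.major": "nvidia.com/cuda.driver-version.major",
    "nvidia.com/cuda.driver.minor": "nvidia.com/cuda.driver-version.minor",
    "nvidia.com/cuda.driver.rev": "nvidia.com/cuda.driver-version.revision",
    "nvidia.com/cuda.runtime.major": "nvidia.com/cuda.runtime-version.major",
    "nvidia.com/cuda.runtime.minor": "nvidia.com/cuda.runtime-version.minor",
}


def migrate_node_selector(selector):
    """Return a copy of `selector` with deprecated label keys renamed.

    Hypothetical helper; assumes label values are unchanged between the
    deprecated and the new label.
    """
    return {DEPRECATED_TO_NEW.get(key, key): value for key, value in selector.items()}
```

For example, `migrate_node_selector({"nvidia.com/cuda.driver.major": "550"})` yields a selector keyed on `nvidia.com/cuda.driver-version.major`, while unrelated labels pass through untouched.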
Full Changelog: v0.14.0...v0.15.0
Changes since v0.15.0-rc.2
- Move the `nvidia-device-plugin.yml` static deployment from the root of the repository to `deployments/static/nvidia-device-plugin.yml`.
- Simplify PCI device classes in NFD worker configuration.
- Update CUDA base image version to 12.4.1.
- Switch to Ubuntu 22.04-based CUDA image for default image.
- Add new CUDA driver and runtime version labels to align with other NFD version labels.
- Update NFD dependency to v0.15.3.
v0.15.0-rc.2
- Bump CUDA base image version to 12.3.2
- Add `cdi-cri` device list strategy. This uses the CDIDevices CRI field to request CDI devices instead of annotations.
- Set MPS memory limit by device index and not device UUID. This is a workaround for an issue where these limits are not applied for devices if set by UUID.
- Update MPS sharing to disallow requests for multiple devices if MPS sharing is configured.
- Set MPS device memory limit by index.
- Explicitly set sharing.mps.failRequestsGreaterThanOne = true.
- Run tail -f for each MPS daemon to output logs.
- Enforce replica limits for MPS sharing.
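
The restriction on multi-device requests under MPS sharing can be checked before a workload is submitted. The following Python sketch models only that one rule (a container may not request more than one shared GPU when MPS sharing is configured). It is not the plugin's implementation. The function name `exceeds_mps_request_limit` is hypothetical, and the resource name `nvidia.com/gpu` is an assumption about how the shared resource is advertised.

```python
def exceeds_mps_request_limit(limits, resource="nvidia.com/gpu", mps_enabled=True):
    """Return True if a container's limits ask for more than one shared GPU.

    Hypothetical pre-flight check: `limits` is a dict shaped like the
    `resources.limits` field of a container spec. With MPS sharing
    configured, requests greater than one are expected to fail.
    """
    requested = int(limits.get(resource, 0))
    return bool(mps_enabled) and requested > 1
```

A request for two GPUs would be flagged with MPS sharing configured and accepted otherwise, which matches the note above that MPS requests cannot be combined to give one container more memory.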
v0.15.0-rc.1
- Import GPU Feature Discovery into the GPU Device Plugin repo. This means that the same version and container image are used for both components.
- Add tooling to create a kind cluster for local development and testing.
- Update `go-gpuallocator` dependency to migrate away from the deprecated `gpu-monitoring-tools` NVML bindings.
- Remove `legacyDaemonsetAPI` config option. This was only required for k8s versions < 1.16.
- Add support for MPS sharing.
- Bump CUDA base image version to 12.3.1