Changes since v0.13.0
Urgent Upgrade Notes
(No, really, you MUST read this before you upgrade)
- ProvisioningRequest: Remove setting the deprecated ProvisioningRequest annotations on Kueue-managed Pods:
  - cluster-autoscaler.kubernetes.io/consume-provisioning-request
  - cluster-autoscaler.kubernetes.io/provisioning-class-name
  If you are implementing a ProvisioningRequest reconciler used by Kueue, make sure the new annotations are supported.
- Rename the cert-manager.io/v1 Certificate kueue-metrics-certs to kueue-metrics-cert in the cert-manager manifests when installing Kueue using the Kustomize configuration.
  If you're using cert-manager and have deployed Kueue using the Kustomize configuration, you must delete the existing kueue-metrics-certs Certificate before applying the new changes to avoid conflicts, as shown below. (#6345, @mbobrovskyi)
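  A minimal cleanup sketch, assuming the default kueue-system namespace:

  ```shell
  # Delete the old Certificate before applying the renamed manifests.
  kubectl -n kueue-system delete certificate.cert-manager.io kueue-metrics-certs
  ```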
- Replace "DeactivatedXYZ" "reason" label values with "Deactivated" and introduce "underlying_cause" label to the following metrics:
- "pods_ready_to_evicted_time_seconds"
- "evicted_workloads_total"
- "local_queue_evicted_workloads_total"
- "evicted_workloads_once_total"
If you rely on the "DeactivatedXYZ" "reason" label values, you can migrate to the "Deactivated" "reason" label value and the following "underlying_cause" label values:
- ""
- "WaitForStart"
- "WaitForRecovery"
- "AdmissionCheck"
- "MaximumExecutionTimeExceeded"
- "RequeuingLimitExceeded" (#6590, @mykysha)
- TAS: Enforce stricter validation of the kueue.x-k8s.io/podset-group-name annotation value in the creation webhook.
  Make sure the values of the kueue.x-k8s.io/podset-group-name annotation are not numbers (see the sketch below). (#6708, @kshalot)
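  A minimal sketch of a Pod that passes the new validation (the Pod name and image are placeholders):

  ```shell
  kubectl apply -f - <<'EOF'
  apiVersion: v1
  kind: Pod
  metadata:
    name: group-demo
    annotations:
      # Accepted: a non-numeric group name. A purely numeric value
      # such as "123" is now rejected by the creation webhook.
      kueue.x-k8s.io/podset-group-name: workers
  spec:
    containers:
    - name: main
      image: registry.k8s.io/pause:3.9
  EOF
  ```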
Upgrading steps
1. Back Up Topology Resources (skip if you are not using the Topology API):
   ```shell
   kubectl get topologies.kueue.x-k8s.io -o yaml > topologies.yaml
   ```
2. Update the apiVersion in the Backup File (skip if not using the Topology API). Replace v1alpha1 with v1beta1 in topologies.yaml for all resources:
   ```shell
   sed -i -e 's/v1alpha1/v1beta1/g' topologies.yaml
   ```
3. Delete the Old CRD:
   ```shell
   kubectl delete crd topologies.kueue.x-k8s.io
   ```
4. Remove Finalizers from Topologies (skip if you are not using the Topology API):
   ```shell
   kubectl get topology.kueue.x-k8s.io -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | while read -r name; do
     kubectl patch topology.kueue.x-k8s.io "$name" -p '{"metadata":{"finalizers":[]}}' --type='merge'
   done
   ```
5. Install Kueue v0.14.0:
   Follow the instructions here to install.
6. Restore Topology Resources (skip if not using the Topology API):
   ```shell
   kubectl apply -f topologies.yaml
   ```
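Optionally, verify the restore; this hedged check only confirms that the Topology resources exist again:

```shell
kubectl get topologies.kueue.x-k8s.io -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
```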
Changes by Kind
Deprecation
- Stop serving the QueueVisibility feature, but keep the APIs (.status.pendingWorkloadsStatus) to avoid breaking changes. If you rely on the QueueVisibility feature (.status.pendingWorkloadsStatus in the ClusterQueue), you must migrate to VisibilityOnDemand
  (https://kueue.sigs.k8s.io/docs/tasks/manage/monitor_pending_workloads/pending_workloads_on_demand), as in the example below. (#6631, @vladikkuzn)
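  As a sketch, pending workloads can then be fetched on demand via the visibility endpoint; the ClusterQueue name is a placeholder, and the exact API group/version is documented at the link above:

  ```shell
  kubectl get --raw "/apis/visibility.kueue.x-k8s.io/v1beta1/clusterqueues/cluster-queue/pendingworkloads"
  ```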
API Change
- TAS: Graduated TopologyAwareScheduling to Beta. (#6830, @mbobrovskyi)
- TAS: Support handling failures of multiple nodes via ".status.unhealthyNodes" in the Workload. The "alpha.kueue.x-k8s.io/node-to-replace" annotation is no longer used (see the inspection example below). (#6648, @pajakd)
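  A quick, hedged way to inspect the new field (the Workload name is a placeholder):

  ```shell
  kubectl get workloads.kueue.x-k8s.io my-workload -o jsonpath='{.status.unhealthyNodes}'
  ```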
Feature
- Add an alpha integration for Kubeflow Trainer to Kueue. (#6597, @kaisoz)
- Add an exponential backoff for the TAS scheduler second pass. (#6753, @mykysha)
- Added the priority_class label for the kueue_local_queue_admitted_workloads_total metric. (#6845, @vladikkuzn)
- Added the priority_class label for the kueue_local_queue_evicted_workloads_total metric. (#6898, @vladikkuzn)
- Added the priority_class label for the kueue_local_queue_quota_reserved_workloads_total metric. (#6897, @vladikkuzn)
- Added the priority_class label for the following metrics (see the query sketch after this item):
  - kueue_admitted_workloads_total
  - kueue_evicted_workloads_total
  - kueue_evicted_workloads_once_total
  - kueue_quota_reserved_workloads_total
  - kueue_admission_wait_time_seconds
  - kueue_quota_reserved_wait_time_seconds
  - kueue_admission_checks_wait_time_seconds (#6951, @mbobrovskyi)
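  For example, a hedged promtool sketch aggregating one of these metrics by the new label, assuming a Prometheus server at $PROM_URL:

  ```shell
  promtool query instant "$PROM_URL" \
    'sum by (priority_class) (kueue_admitted_workloads_total)'
  ```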
- Added the priority_class label to kueue_local_queue_admission_checks_wait_time_seconds. (#6902, @vladikkuzn)
- Added the priority_class label to kueue_local_queue_admission_wait_time_seconds. (#6899, @vladikkuzn)
- Added the priority_class label to kueue_local_queue_quota_reserved_wait_time_seconds. (#6900, @vladikkuzn)
- Added the workload_priority_class label for optional metrics (if waitForPodsReady is enabled).
- DRA: Alpha support for Dynamic Resource Allocation in Kueue. (#5873, @alaypatel07)
- ElasticJobs: Support the in-tree RayAutoscaler for RayCluster. (#6662, @VassilisVassiliadis)
- KueueViz: Endpoint customization and optimization enhancements (see the values sketch after this list):
  - The frontend and backend Ingress no longer have hardcoded NGINX annotations. You can now set your own annotations in Helm's values.yaml using kueueViz.backend.ingress.annotations and kueueViz.frontend.ingress.annotations.
  - The Ingress resources for the KueueViz frontend and backend no longer require hardcoded TLS. You can now choose to use HTTP only by not providing kueueViz.backend.ingress.tlsSecretName and kueueViz.frontend.ingress.tlsSecretName.
  - You can set environment variables like KUEUEVIZ_ALLOWED_ORIGINS directly from values.yaml using kueueViz.backend.env. (#6682, @Smuger)
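  A hedged sketch of the corresponding Helm values; only the keys come from the item above, while the annotation, the origin, and the shape of the env entries are assumptions:

  ```shell
  cat > kueueviz-values.yaml <<'EOF'
  kueueViz:
    backend:
      ingress:
        annotations:
          nginx.ingress.kubernetes.io/ssl-redirect: "false"
        # Omit tlsSecretName to serve plain HTTP.
      env:
        - name: KUEUEVIZ_ALLOWED_ORIGINS
          value: "https://dashboard.example.com"
    frontend:
      ingress:
        annotations: {}
  EOF
  ```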
- MultiKueue: Support external frameworks.
  Introduced a generic MultiKueue adapter to support external, custom Job-like workloads. This allows users to integrate custom Job-like CRDs (e.g., Tekton PipelineRuns) with MultiKueue for resource management across multiple clusters. This feature is guarded by the MultiKueueGenericJobAdapter feature gate (see the sketch below for enabling it). (#6760, @khrm)
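  A hedged sketch for enabling the gate on an existing installation; the deployment name and namespace assume the default manifests:

  ```shell
  kubectl -n kueue-system patch deployment kueue-controller-manager --type=json \
    -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--feature-gates=MultiKueueGenericJobAdapter=true"}]'
  ```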
- MultiKueue × ElasticJobs: The elastic batchv1/Job supports MultiKueue. (#6445, @ichekrygin)
- ProvisioningRequest: Graduate the ProvisioningACC feature to GA. (#6382, @kannon92)
- TAS: Graduated to Beta the feature gates that enable and provide the default configuration of the Node Hot Swap mechanism: TASFailedNodeReplacement, TASFailedNodeReplacementFailFast, TASReplaceNodeOnPodTermination. (#6890, @mbobrovskyi)
- TAS: Implicit mode schedules consecutive indexes as close as possible (rank-ordering). (#6615, @PBundyra)
- TAS: Introduce validation against using PodSet grouping and PodSet slicing for the same PodSet, which is currently not supported. More precisely, the kueue.x-k8s.io/podset-group-name annotation cannot be set along with either kueue.x-k8s.io/podset-slice-size or kueue.x-k8s.io/podset-slice-required-topology (see the sketch below). (#7051, @kshalot)
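  A sketch of a now-rejected combination (Pod name, values, and image are placeholders):

  ```shell
  # The creation webhook rejects this Pod: PodSet grouping and
  # PodSet slicing cannot be combined on the same PodSet.
  kubectl apply -f - <<'EOF'
  apiVersion: v1
  kind: Pod
  metadata:
    name: conflict-demo
    annotations:
      kueue.x-k8s.io/podset-group-name: workers
      kueue.x-k8s.io/podset-slice-size: "2"
  spec:
    containers:
    - name: main
      image: registry.k8s.io/pause:3.9
  EOF
  ```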
- The following limits for the ClusterQueue quota specification have been relaxed:
  - the number of Flavors per ResourceGroup is increased from 16 to 64
  - the number of Resources per Flavor, within a ResourceGroup, is increased from 16 to 64
  We also provide additional limits.
- Visibility API: Add support for securing the APIService. (#6798, @MaysaMacedo)
- WorkloadRequestUseMergePatch: Allows switching the status patch type from Apply to Merge for admission-related patches. (#6765, @mszadkow)
Bug or Regression
- AFS: Fixed a kueue-controller-manager crash when the AdmissionFairSharing feature gate was enabled without the AdmissionFairSharing config. (#6670, @mbobrovskyi)
- ElasticJobs: Fix the bug in the ElasticJobsViaWorkloadSlices feature where, in case of a Job resize followed by eviction of the "old" workload, the newly created workload could get admitted along with the "old" workload. The two workloads would overcommit the quota. (#6221, @ichekrygin)
- ElasticJobs: Fix the bug that scheduling of Pending workloads was not triggered on scale-down of a running elastic Job, even though the freed quota could allow admitting one or more of the queued workloads. (#6395, @ichekrygin)
- ElasticJobs: Workloads correctly trigger workload preemption in response to a scale-up event. (#6973, @ichekrygin)
- FS: Fix the bug in the algorithm for identifying preemption candidates, which could return a different set of preemption target workloads (pseudo-randomly) in consecutive attempts in tie-break scenarios, resulting in excessive preemptions. (#6764, @PBundyra)
- FS: Fix several FairSharing bugs.
- FS: Fix a bug where a preemptor ClusterQueue was unable to reclaim its nominal quota when the preemptee ClusterQueue could borrow a large number of resources from the parent ClusterQueue / Cohort. (#6617, @pajakd)
- FS: Validate FairSharing.Weight against small values which lose precision (0 < value <= 10^-9). (#6986, @gabesaba)
- Fix accounting for the evicted_workloads_once_total metric:
  - the metric wasn't incremented for workloads evicted due to a stopped LocalQueue (LocalQueueStopped reason)
  - the reason used for the metric was "Deactivated" for workloads deactivated by both users and Kueue; now the reason label can have the following values: Deactivated, DeactivatedDueToAdmissionCheck, DeactivatedDueToMaximumExecutionTimeExceeded, DeactivatedDueToRequeuingLimitExceeded. This aligns the metric with evicted_workloads_total.
  - the metric was incremented during preemption before the preemption request was issued, so it could be incorrectly over-counted in case of a preemption request failure
  - the metric was not incremented for workloads evicted due to NodeFailures (TAS)
  The existing and newly introduced DeactivatedDueToXYZ reason label values will be replaced by the single "Deactivated" reason label value plus underlying_cause in a future release. (#6332, @mimowo)
- Fix a bug in the workload usage removal simulation that resulted in inaccurate flavor assignment. (#7077, @gabesaba)
- Fix support for the PodGroup integration used by external controllers which determine the target LocalQueue and the group size only later. In that case the hash would not be computed, resulting in downstream issues for ProvisioningRequest. Now such an external controller can indicate control over the PodGroup by adding the kueue.x-k8s.io/pod-suspending-parent annotation, and later patch the Pods by setting other metadata, like the kueue.x-k8s.io/queue-name label, to initiate scheduling of the PodGroup (see the sketch below). (#6286, @pawloch00)
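  A hedged sketch of that flow; the label selector and queue name are placeholders, and the Pods are assumed to have been created with the pod-suspending-parent annotation:

  ```shell
  # Once the external controller knows the target LocalQueue,
  # it initiates scheduling of the PodGroup by setting the queue name.
  kubectl label pods -l app=my-group kueue.x-k8s.io/queue-name=user-queue
  ```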
- Fix the bug in the StatefulSet integration which would occasionally cause a StatefulSet to be stuck without a workload after renaming the "queue-name" label. (#7028, @IrvingMg)
- Fix the bug that a workload repeatedly going through the preemption and re-admission cycle would accumulate the "Previously:" prefix in the condition message, e.g. "Previously: Previously: Previously: Preempted to accommodate a workload ...". (#6819, @amy)
- Fix the bug which could occasionally cause workloads evicted by the built-in AdmissionChecks (ProvisioningRequest and MultiKueue) to get stuck in the evicted state, preventing re-scheduling. This could happen when the AdmissionCheck controller triggered eviction by setting the admission check state to "Retry". (#6283, @mimowo)
- Fix the validation messages when attempting to remove the queue-name label from a Deployment or StatefulSet. (#6715, @Panlq)
- Fixed a bug that prevented adding the kueue- prefix to the secretName field in cert-manager manifests when installing Kueue using the Kustomize configuration. (#6318, @mbobrovskyi)
- HC: When multiple borrowing flavors are available, prefer the flavor which results in borrowing more locally (closer to the ClusterQueue, further from the root Cohort). This fixes the scenario where a flavor requiring borrowing from the root Cohort was selected even though, in another flavor, quota was available from the nearest parent Cohort. (#7024, @gabesaba)
- Helm: Fix a bug where the internal cert manager assumed that the Helm installation name is "kueue". (#6869, @cmtly)
- Helm: Fixed a bug preventing Kueue from starting after installing via Helm with a release name other than "kueue". (#6799, @mbobrovskyi)
- Helm: Fixed a bug where webhook configurations assumed the Helm install name is "kueue". (#6918, @cmtly)
- KueueViz: Fix the CORS configuration for development environments. (#6603, @yankay)
- KueueViz: Fix a bug where only localhost was a usable domain. (#7011, @kincl)
- The Pod integration now correctly handles pods stuck in the Terminating state within pod groups, preventing them from being counted as active and avoiding blocked quota release. (#6872, @ichekrygin)
- ProvisioningRequest: Fix a bug where Kueue didn't recreate the next ProvisioningRequest instance after the second (and consecutive) failed attempts. (#6322, @PBundyra)
- Support disabling client-side rate limiting in the Config API by setting clientConnection.qps to a negative value (e.g., -1), as sketched below. (#6300, @tenzen-y)
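  A hedged fragment of the Kueue Configuration showing the setting; only the clientConnection stanza is shown, and the burst value is an illustrative assumption:

  ```shell
  cat > client-connection-snippet.yaml <<'EOF'
  clientConnection:
    qps: -1      # a negative value disables client-side rate limiting
    burst: 100   # assumed illustrative value
  EOF
  ```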
- TAS: Fix a bug where the node failure controller tried to re-schedule Pods onto the failed node even after the Node recovered and reappeared. (#6325, @pajakd)
- TAS: Fix a bug where new Workloads starve, caused by inadmissible workloads frequently requeueing due to unrelated Node LastHeartbeatTime update events. (#6570, @utam0k)
- TAS: Fix the scenario when Node Hot Swap cannot find a replacement. In particular, if slices are used, an invalid assignment could be generated, resulting in a panic in TopologyUngater. Now such a workload is evicted. (#6914, @PBundyra)
- TAS: Node Hot Swap allows replacing a node for workloads using PodSet slices, i.e. when the kueue.x-k8s.io/podset-slice-size annotation is used. (#6942, @pajakd)
- TAS: Fix the bug that Kueue crashes when a PodSet has size 0, e.g. no workers in a LeaderWorkerSet instance. (#6501, @mimowo)
Other (Cleanup or Flake)
- Promote the ConfigurableResourceTransformations feature gate to stable. (#6599, @mbobrovskyi)
- Support for Kubernetes 1.34. (#6689, @mbobrovskyi)
- TAS: Stop setting the "kueue.x-k8s.io/tas" label on Pods. In case the implicit TAS mode is used, the kueue.x-k8s.io/podset-unconstrained-topology=true annotation is set on Pods. (#6895, @mimowo)