## Changes since v0.14.0-devel
## Urgent Upgrade Notes

### (No, really, you MUST read this before you upgrade)
- ProvisioningRequest: Stop setting the deprecated ProvisioningRequest annotations on Kueue-managed Pods:
  - `cluster-autoscaler.kubernetes.io/consume-provisioning-request`
  - `cluster-autoscaler.kubernetes.io/provisioning-class-name`

  If you implement a ProvisioningRequest reconciler used by Kueue, make sure the replacement annotations are supported:
  - `autoscaling.x-k8s.io/consume-provisioning-request`
  - `autoscaling.x-k8s.io/provisioning-class-name`
- Rename the `kueue-metrics-certs` cert-manager.io/v1 Certificate to `kueue-metrics-cert` in the cert-manager manifests used when installing Kueue via the Kustomize configuration.

  If you use cert-manager and have deployed Kueue using the Kustomize configuration, you must delete the existing `kueue-metrics-certs` Certificate before applying the new changes to avoid conflicts. (#6345, @mbobrovskyi)
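  For example, a minimal sketch assuming the default `kueue-system` namespace and a checkout of the Kueue Kustomize configuration:

  ```shell
  # Remove the old Certificate so the renamed one can be applied cleanly.
  kubectl delete certificate kueue-metrics-certs -n kueue-system

  # Re-apply the Kustomize configuration, which creates kueue-metrics-cert.
  kubectl apply --server-side -k config/default
  ```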
- Replace "DeactivatedXYZ" "reason" label values with "Deactivated" and introduce "underlying_cause" label to the following metrics:
- "pods_ready_to_evicted_time_seconds"
- "evicted_workloads_total"
- "local_queue_evicted_workloads_total"
- "evicted_workloads_once_total"
ACTION REQUIRED
If you rely on the "DeactivatedXYZ" "reason" label values, you can migrate to the "Deactivated" "reason" label value and the following "underlying_cause" label values:
- ""
- "WaitForStart"
- "WaitForRecovery"
- "AdmissionCheck"
- "MaximumExecutionTimeExceeded"
- "RequeuingLimitExceeded" (#6590, @mykysha)
## Changes by Kind

### Deprecation
- Stop serving the QueueVisibility feature, but keep the APIs (`.status.pendingWorkloadsStatus`) to avoid breaking changes. If you rely on the QueueVisibility feature (`.status.pendingWorkloadsStatus` in the ClusterQueue), you must migrate to VisibilityOnDemand (https://kueue.sigs.k8s.io/docs/tasks/manage/monitor_pending_workloads/pending_workloads_on_demand). (#6631, @vladikkuzn)
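  With VisibilityOnDemand, pending workloads are queried from the visibility API instead of being mirrored into the ClusterQueue status. A minimal sketch, assuming a ClusterQueue named `cluster-queue`:

  ```shell
  kubectl get --raw "/apis/visibility.kueue.x-k8s.io/v1beta1/clusterqueues/cluster-queue/pendingworkloads"
  ```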
### API Change
- Graduated TopologyAwareScheduling to Beta. (#6830, @mbobrovskyi)
- TAS: Support multiple nodes for failure handling via `.status.unhealthyNodes` in the Workload. The `alpha.kueue.x-k8s.io/node-to-replace` annotation is no longer used. (#6648, @pajakd)
### Feature
- Alpha: Introduced a generic MultiKueue adapter to support external, custom Job-like workloads. This allows users to integrate custom Job-like CRDs (e.g., Tekton PipelineRuns) with MultiKueue for resource management across multiple clusters. This feature is guarded by the `MultiKueueGenericJobAdapter` feature gate (see the sketch after this list). (#6760, @khrm)
- Add an alpha integration for Kubeflow Trainer to Kueue. (#6597, @kaisoz)
- Added the `workload_priority_class` label to the `kueue_admission_wait_time_seconds` metric. (#6885, @mbobrovskyi)
- Added the `workload_priority_class` label to the `kueue_admitted_workloads_total` metric. (#6795, @mbobrovskyi)
- Added the `workload_priority_class` label to the `kueue_evicted_workloads_once_total` metric. (#6876, @mbobrovskyi)
- Added the `workload_priority_class` label to the `kueue_evicted_workloads_total` metric. (#6860, @mbobrovskyi)
- Added the `workload_priority_class` label to the `kueue_quota_reserved_wait_time_seconds` metric. (#6887, @mbobrovskyi)
- Added the `workload_priority_class` label to the `kueue_quota_reserved_workloads_total` metric. (#6882, @mbobrovskyi)
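  The new label makes it possible to break these metrics down by priority class, e.g. (a sketch):

  ```promql
  # p90 admission wait time per priority class.
  histogram_quantile(0.9,
    sum by (workload_priority_class, le) (
      rate(kueue_admission_wait_time_seconds_bucket[5m])
    )
  )
  ```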
- DRA: Alpha support for Dynamic Resource Allocation in Kueue. (#5873, @alaypatel07)
- ProvisioningRequest: Graduate the ProvisioningACC feature to GA. (#6382, @kannon92)
- Support the in-tree RayAutoscaler for Elastic RayCluster objects. (#6662, @VassilisVassiliadis)
- TAS: Graduated to Beta the following feature gates, responsible for enabling and default configuration of the Node Hot Swap mechanism: `TASFailedNodeReplacement`, `TASFailedNodeReplacementFailFast`, `TASReplaceNodeOnPodTermination`. (#6890, @mbobrovskyi)
- TAS: Implicit mode schedules consecutive indexes as close as possible (rank-ordering). (#6615, @PBundyra)
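As a sketch of enabling the `MultiKueueGenericJobAdapter` gate mentioned above, the flag can be appended to the controller manager arguments (assumes the default `kueue-system` installation and that the manager is the first container):

```shell
# Append the feature gate to the manager args (container index 0 assumed).
kubectl -n kueue-system patch deployment kueue-controller-manager --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--feature-gates=MultiKueueGenericJobAdapter=true"}]'
```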
### Bug or Regression
- AFS: Fixed a kueue-controller-manager crash when the AdmissionFairSharing feature gate was enabled without the AdmissionFairSharing configuration. (#6670, @mbobrovskyi)
- ElasticJobs: Fix a bug where scale-down of a running elastic Job did not trigger scheduling of the pending workloads, even though the freed quota could allow admitting one or more of them. (#6395, @ichekrygin)
- FS: Fix a bug in the algorithm for identifying preemption candidates: in tie-break scenarios it could return a different (pseudo-random) set of preemption target workloads in consecutive attempts, resulting in excessive preemptions. (#6764, @PBundyra)
- FS: Fix a bug where a preemptor ClusterQueue was unable to reclaim its nominal quota when the preemptee ClusterQueue could borrow a large amount of resources from the parent ClusterQueue / Cohort. (#6617, @pajakd)
- Fix accounting for the `evicted_workloads_once_total` metric:
  - The metric wasn't incremented for workloads evicted due to a stopped LocalQueue (LocalQueueStopped reason).
  - The reason used for the metric was "Deactivated" for workloads deactivated both by users and by Kueue; now the reason label can have the following values: Deactivated, DeactivatedDueToAdmissionCheck, DeactivatedDueToMaximumExecutionTimeExceeded, DeactivatedDueToRequeuingLimitExceeded. This aligns the metric with `evicted_workloads_total`.
  - The metric was incremented during preemption before the preemption request was issued, so it could be over-counted if the preemption request failed.
  - The metric was not incremented for workloads evicted due to NodeFailures (TAS).

  The existing and newly introduced DeactivatedDueToXYZ reason label values will be replaced by the single "Deactivated" reason label value plus `underlying_cause` in a future release. (#6332, @mimowo)
- Fix support for the PodGroup integration used by external controllers which determine the target LocalQueue and the group size only later. In that case the hash would not be computed, resulting in downstream issues for ProvisioningRequest. Now such an external controller can indicate control over the PodGroup by adding the `kueue.x-k8s.io/pod-suspending-parent` annotation, and later patch the Pods with other metadata, like the `kueue.x-k8s.io/queue-name` label, to initiate scheduling of the PodGroup. (#6286, @pawloch00)
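  A sketch of the two-step flow, with hypothetical controller and queue names:

  ```yaml
  # Step 1: the external controller marks the Pods it manages.
  metadata:
    annotations:
      kueue.x-k8s.io/pod-suspending-parent: my-parent-controller  # hypothetical value
  ---
  # Step 2: later, the controller patches the Pods to initiate scheduling.
  metadata:
    labels:
      kueue.x-k8s.io/queue-name: main  # hypothetical LocalQueue name
  ```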
- Fix a bug in the ElasticJobsViaWorkloadSlices feature where, in case of a Job resize followed by eviction of the "old" workload, the newly created workload could get admitted along with the "old" workload, and the two workloads would overcommit the quota. (#6221, @ichekrygin)
- Fix a bug where a workload going repeatedly through the preemption and re-admission cycle would accumulate the "Previously:" prefix in the condition message, e.g.: "Previously: Previously: Previously: Preempted to accommodate a workload ...". (#6819, @amy)
- Fix a bug which could occasionally cause workloads evicted by the built-in AdmissionChecks (ProvisioningRequest and MultiKueue) to get stuck in the evicted state, preventing re-scheduling. This could happen when the AdmissionCheck controller triggered eviction by setting the admission check state to "Retry". (#6283, @mimowo)
- Fix the validation messages when attempting to remove the queue-name label from a Deployment or StatefulSet. (#6715, @Panlq)
- Fixed a bug that prevented adding the `kueue-` prefix to the `secretName` field in cert-manager manifests when installing Kueue using the Kustomize configuration. (#6318, @mbobrovskyi)
- Fixed a bug where the internal cert manager setup assumed that the Helm installation name was "kueue". (#6869, @cmtly)
- Helm: Fixed a bug preventing Kueue from starting after installing via Helm with a release name other than "kueue". (#6799, @mbobrovskyi)
- Helm: Fixed a bug where webhook configurations assumed the Helm install name was "kueue". (#6918, @cmtly)
- KueueViz: Fix the CORS configuration for development environments. (#6603, @yankay)
- Pod integration now correctly handles pods stuck in the Terminating state within pod groups, preventing them from being counted as active and avoiding blocked quota release. (#6872, @ichekrygin)
- ProvisioningRequest: Fix a bug where Kueue didn't recreate the next ProvisioningRequest instance after the second (and consecutive) failed attempt. (#6322, @PBundyra)
- Support disabling client-side rate limiting in the Configuration API `clientConnection.qps` field with a negative value (e.g., -1). (#6300, @tenzen-y)
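  A minimal sketch of the relevant Configuration fragment:

  ```yaml
  apiVersion: config.kueue.x-k8s.io/v1beta1
  kind: Configuration
  clientConnection:
    qps: -1  # a negative value disables client-side rate limiting
  ```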
- TAS: Fix a bug where the node failure controller tried to re-schedule Pods on the failed node even after the Node recovered and reappeared. (#6325, @pajakd)
- TAS: Fix a bug where new Workloads starve, caused by inadmissible workloads frequently requeueing due to unrelated Node LastHeartbeatTime update events. (#6570, @utam0k)
- TAS: Fix the scenario where Node Hot Swap cannot find a replacement. In particular, if slices are used, it could generate an invalid assignment, resulting in a panic from the TopologyUngater. Now such a workload is evicted. (#6914, @PBundyra)
- TAS: Fix a bug where Kueue crashed when a PodSet has size 0, e.g. no workers in a LeaderWorkerSet instance. (#6501, @mimowo)
### Other (Cleanup or Flake)
- Promote the ConfigurableResourceTransformations feature gate to stable. (#6599, @mbobrovskyi)
- Support for Kubernetes 1.34. (#6689, @mbobrovskyi)
- TAS: Stop setting the `kueue.x-k8s.io/tas` label on Pods. When the implicit TAS mode is used, the `kueue.x-k8s.io/podset-unconstrained-topology=true` annotation is set on Pods instead. (#6895, @mimowo)