Changes since v0.15.0:
Urgent Upgrade Notes
(No, really, you MUST read this before you upgrade)
- Removed the FlavorFungibilityImplicitPreferenceDefault feature gate.
  Configure flavor selection preference using the ClusterQueue field
  `spec.flavorFungibility.preference` instead; see the sketch after these notes. (#8134, @mbobrovskyi)
- The short name "wl" for workloads has been removed to avoid potential conflicts with the in-tree workload object coming into Kubernetes (#8472, @kannon92)
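As context for the flavor-fungibility note above, a minimal sketch of where the new field sits in a ClusterQueue manifest; the resource name and the `preference` value are illustrative placeholders rather than values confirmed by these notes:

```yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  flavorFungibility:
    # Replaces the behavior previously toggled by the removed
    # FlavorFungibilityImplicitPreferenceDefault feature gate.
    # "Preempt" is an illustrative placeholder, not a confirmed enum value;
    # check the ClusterQueue API reference for the accepted values.
    preference: Preempt
```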
Changes by Kind
API Change
- Add the field multiplyBy for ResourceTransformation (#7599, @calvin0327)
- V1beta2: Use v1beta2 as the storage version in v0.16.
  The v1beta1 API version will no longer be served in v0.17 (new resources cannot be created with v1beta1) and will be fully removed in v0.18.
  Migrate all existing Kueue resources from `kueue.x-k8s.io/v1beta1` to `kueue.x-k8s.io/v1beta2` after upgrading to v0.16 and before upgrading to v0.17 (see the example after this list). Kueue conversion webhooks handle structural changes automatically – the migration only updates the stored apiVersion.
  Migration instructions (including the official script): #8018. (#8020, @mbobrovskyi)
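To illustrate the storage-version change above, a minimal sketch of a resource written against the new group version (the LocalQueue and its names are illustrative; use the official migration script from #8018 for the actual migration):

```yaml
apiVersion: kueue.x-k8s.io/v1beta2   # previously kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: default
spec:
  # Re-writing the object (or running the official migration script) updates the
  # stored apiVersion; structural changes are handled by the conversion webhooks.
  clusterQueue: cluster-queue
```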
Feature
- Adds support for PodsReady when JobSet dependsOn is used. (#7889, @MaysaMacedo)
- CLI: Support "kwl" and "kueueworkload" as shortnames for Kueue Workloads. (#8379, @kannon92)
- ClusterQueues with both MultiKueue and ProvisioningRequest admission checks are now marked as inactive with reason "MultiKueueWithProvisioningRequest", as this configuration is invalid on manager clusters (see the sketch after this list). (#8451, @IrvingMg)
- Enable Pod-based integrations by default (#8096, @sohankunkerkar)
- Logs now include a `replica-role` field to identify Kueue instance roles (leader/follower/standalone). (#8107, @IrvingMg)
- MultiKueue: trigger workload eviction on the management cluster when the corresponding workload is evicted
  on the remote worker cluster. In particular, this fixes the issue with workloads using ProvisioningRequests,
  which could get stuck in a worker cluster that does not have enough capacity to ever admit the workloads. (#8477, @mszadkow)
- Observability: Add more details (the preemptionMode) to the QuotaReserved condition message,
  and the related event, about the skipped flavors which were considered for preemption.
  Before: "Quota reserved in ClusterQueue preempt-attempts-cq, wait time since queued was 9223372037s; Flavors considered: main: on-demand(Preempt;insufficient unused quota for cpu in flavor on-demand, 1 more needed)"
  After: "Quota reserved in ClusterQueue preempt-attempts-cq, wait time since queued was 9223372037s; Flavors considered: main: on-demand(preemptionMode=Preempt;insufficient unused quota for cpu in flavor on-demand, 1 more needed)" (#8024, @mykysha)
- Ray: Support RayJob InTreeAutoscaling by using the ElasticJobsViaWorkloadSlices feature. (#8082, @hiboyang)
- TAS: extend the information in condition messages and events about nodes excluded from calculating the
  assignment due to various recognized reasons such as taints, node affinity, and node resource constraints. (#8043, @sohankunkerkar)
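For the admission-check item above, a minimal sketch of the ClusterQueue configuration that is now marked inactive on manager clusters; the AdmissionCheck names are hypothetical:

```yaml
apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata:
  name: manager-cq
spec:
  # Combining a MultiKueue-backed and a ProvisioningRequest-backed
  # AdmissionCheck is invalid on a manager cluster; the ClusterQueue is
  # marked inactive with reason "MultiKueueWithProvisioningRequest".
  admissionChecks:
  - multikueue-check        # hypothetical AdmissionCheck using the MultiKueue controller
  - provisioning-check      # hypothetical AdmissionCheck using the ProvisioningRequest controller
```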
Bug or Regression
- Add lws editor and viewer roles to kustomize and helm (#8513, @kannon92)
- DRA: fix the race condition bug leading to undefined behavior due to concurrent operations
  on the Workload object, manifested by the "WARNING: DATA RACE" in test logs. (#8073, @mbobrovskyi)
- Fix ClusterQueue deletion getting stuck when pending workloads are deleted after being assumed by the scheduler. (#8543, @sohankunkerkar)
- Fix EnsureWorkloadSlices to finish the old slice when a new one is admitted as a replacement (#8456, @sohankunkerkar)
- Fix the `TrainJob` controller not correctly setting the `PodSet` count value based on `numNodes` for the expected number of training nodes. (#8135, @kaisoz)
- Fix a bug where WorkloadPriorityClass value changes did not trigger Workload priority updates. (#8442, @ASverdlov)
- Fix a performance bug where some "read-only" functions were taking an unnecessary "write" lock. (#8181, @ErikJiang)
- Fix the race condition bug where the kueue_pending_workloads metric may not be updated to 0 after the last
  workload is admitted and there are no new workloads incoming. (#8037, @Singularity23x0)
- Fixed a bug where Kueue's scheduler would re-evaluate and update already finished workloads, significantly
  impacting overall scheduling throughput. This re-evaluation of a finished workload would be triggered when:
  - Kueue is restarted
  - There is any event related to LimitRange or RuntimeClass instances referenced by the workload (#8186, @mbobrovskyi)
- Fixed the following bugs for the StatefulSet integration by ensuring the Workload object
  has the ownerReference to the StatefulSet:
  - Kueue doesn't keep the StatefulSet as deactivated
  - Kueue marks the Workload as Finished if all of the StatefulSet's Pods are deleted
  - changing the "queue-name" label could occasionally result in the StatefulSet getting stuck (#4799, @mbobrovskyi)
- HC: Avoid redundant requeuing of inadmissible workloads when multiple ClusterQueues in the same cohort hierarchy are processed. (#8441, @sohankunkerkar)
- Integrations based on Pods: skip using finalizers on the Pods created and managed by integrations.
  In particular, we skip setting finalizers for Pods managed by the built-in Serving Workloads: Deployments,
  StatefulSets, and LeaderWorkerSets. This improves the performance of suspending the workloads, and fixes occasional race conditions where a StatefulSet
  could get stuck when deactivating and re-activating in a short interval. (#8530, @mbobrovskyi)
- JobFramework: Fixed a bug that allowed a deactivated workload to be activated. (#8424, @chengjoey)
- Kubeflow TrainJob v2: fix a bug that caused duplicate pod template overrides when starting the Job is retried. (#8269, @j-skiba)
- MultiKueue now waits for WorkloadAdmitted (instead of QuotaReserved) before deleting workloads from non-selected worker clusters. To revert to the previous behavior, disable the
  `MultiKueueWaitForWorkloadAdmitted` feature gate (see the sketch after this list). (#8592, @IrvingMg)
- MultiKueue via ClusterProfile: Fix the panic if the configuration for ClusterProfiles wasn't provided in the ConfigMap. (#8071, @mszadkow)
- MultiKueue: Fix a bug where a priority change made by mutating the
  `kueue.x-k8s.io/priority-class` label on the management cluster was not propagated to the worker clusters. (#8464, @mbobrovskyi)
- MultiKueue: Fixed status sync for CRD-based jobs (JobSet, Kubeflow, Ray, etc.) that was blocked while the local job was suspended. (#8308, @IrvingMg)
- MultiKueue: fix the bug where, for the Pod integration, the AdmissionCheck status would be kept Pending indefinitely,
  even when the Pods are already running. The analogous fix is also done for the batch/Job when the MultiKueueBatchJobWithManagedBy feature gate is disabled. (#8189, @IrvingMg)
- MultiKueue: fix the eviction when initiated by the manager cluster (due to e.g. preemption or WaitForPodsReady timeout). (#8151, @mbobrovskyi)
- ProvisioningRequest: Fixed a bug that prevented events from being updated when the AdmissionCheck state changed. (#8394, @mbobrovskyi)
- Revert the changes in PR #8599 for transitioning
  the QuotaReserved and Admitted conditions to `False` for Finished workloads. This introduced a regression,
  because users lost the useful information about the timestamp of the last transition of these
  conditions to `True`, without an API replacement to serve the information. (#8599, @mbobrovskyi)
- Scheduling: fix a bug where evictions submitted by the scheduler (preemptions and evictions due to TAS NodeHotSwap failing)
  could result in conflicts in the case of concurrent workload modification by another controller.
  This could lead to indefinitely failing requests sent by the scheduler in some scenarios when eviction is initiated by
  TAS NodeHotSwap. (#7933, @mbobrovskyi)
- Scheduling: fix the bug where setting the workload priority class label (`kueue.x-k8s.io/priority-class`) on a workload that previously had none was ignored (see the sketch after this list). (#8480, @andrewseif)
- TAS NodeHotSwap: fixed the bug that allowed a workload to be requeued by the scheduler even if it was already deleted on TAS NodeHotSwap eviction. (#8278, @mbobrovskyi)
- TAS: Fix handling of admission for workloads using the LeastFreeCapacity algorithm when the "unconstrained"
  mode is used. In that case, scheduling would fail if there was at least one node in the cluster that did not have
  enough capacity to accommodate at least one Pod. (#8168, @PBundyra)
- TAS: fix the TAS resource flavor controller to extract only scheduling-relevant node updates to prevent unnecessary reconciliation. (#8452, @Ladicle)
- TAS: fix a performance bug where continuous reconciles of the TAS ResourceFlavor (and related ClusterQueues)
  were triggered by updates to Nodes' heartbeat times. (#8342, @PBundyra)
- TAS: fix a bug where, when TopologyAwareScheduling is disabled but there is a ResourceFlavor configured with topologyName, preemptions fail with "workload requires Topology, but there is no TAS cache information". (#8167, @zhifei92)
- TAS: fixed a performance issue due to an unnecessary (empty) request by the TopologyUngater (#8279, @mbobrovskyi)
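For the MultiKueue item above that mentions the `MultiKueueWaitForWorkloadAdmitted` feature gate, a patch-style sketch of disabling it via the manager's `--feature-gates` argument, assuming the default kueue-controller-manager Deployment layout (names and namespace are the usual defaults, not confirmed by these notes):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kueue-controller-manager   # assumed default Deployment name
  namespace: kueue-system          # assumed default namespace
spec:
  template:
    spec:
      containers:
      - name: manager
        args:
        # Restores the previous behavior of removing workloads from
        # non-selected worker clusters as soon as quota is reserved.
        - --feature-gates=MultiKueueWaitForWorkloadAdmitted=false
```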
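For the scheduling fix above about the `kueue.x-k8s.io/priority-class` label, a minimal sketch of a Job that sets the label alongside the queue-name label; the queue and priority class names are illustrative:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: sample-job
  labels:
    kueue.x-k8s.io/queue-name: user-queue          # target LocalQueue (illustrative name)
    kueue.x-k8s.io/priority-class: high-priority   # WorkloadPriorityClass name (illustrative)
spec:
  suspend: true
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox:1.36
        command: ["sleep", "30"]
```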
Other (Cleanup or Flake)
- Fix: Removed outdated comments incorrectly stating that deployment, statefulset, and leaderworkerset integrations require pod integration to be enabled. (#8053, @IrvingMg)
- Improve error messages for validation errors regarding WorkloadPriorityClass changes in workloads. (#8334, @olekzabl)
- MultiKueue: improve the MultiKueueCluster reconciler to skip attempting to reconcile (and throwing errors)
  when the corresponding Secret or ClusterProfile objects don't exist. The reconcile will be triggered on
  creation of the objects. (#8144, @mszadkow)
- Removes the ConfigurableResourceTransformations feature gate. (#8133, @mbobrovskyi)