kubernetes-sigs/kueue v0.16.9 on GitHub

Changes since v0.16.8:

Actions Required Before Upgrading

(No, really, you MUST read this before you upgrade)

Minor releases: Review the .0 release notes for each new minor version you cross; see: v0.15.0, v0.16.0.
Patch releases: Review the patch release notes leading up to this version, but only within this minor release line; see: v0.16.1, v0.16.2, v0.16.3, v0.16.4, v0.16.5, v0.16.6, v0.16.7, v0.16.8.

Changes by Kind

Bug or Regression

ElasticJobsViaWorkloadSlices: Fix a bug where the workload-slice-name annotation was incorrectly set on all workloads when the ElasticJobsViaWorkloadSlices feature gate was enabled, instead of only on elastic workloads. (#11624, @sohankunkerkar)
ElasticJobsViaWorkloadSlices: Fix a quota leak during elastic workload scale-up where the old slice was finished before the replacement slice was admitted. (#11558, @sohankunkerkar)
ElasticJobsViaWorkloadSlices: Fixed a bug where rapid elastic Job scaling could leave duplicate replacement Workload slices admitted indefinitely, causing quota to remain reserved. (#11554, @sohankunkerkar)
FeatureGates: Fix a bug where TAS-enhanced features could be enabled even if the TopologyAwareScheduling or TASFailedNodeReplacement feature gates they depend on were disabled. (#11793, @tenzen-y)
FeatureGates: Fixed a bug where user-specified feature gate parameters were not verified. (#11291, @MaysaMacedo)
Fix a bug where finished Workloads could remain stuck after object retention deletion if they still had Kueue's resource-in-use finalizer. (#11307, @ShaanveerS)
Fix waitForPodsReady timeout not triggering for pod groups with fast-admission when group members arrive after the first pod is Ready. (#11687, @sohankunkerkar)
Fixed a bug that prevented finishing the Workloads corresponding to Jobs deleted with --cascade=orphan.
The fix is behind the FinishOrphanedWorkloads feature gate (alpha, disabled by default). (#11682, @mbobrovskyi)
KueueViz: Fixed a bug where workload and local queue detail pages would get stuck on a loading spinner instead of showing connection/fetching error messages on failure. (#11710, @YadavAkhileshh)
MultiKueue: Add a reconnect backoff guardrail to the MultiKueueCluster reconciler to suppress redundant cluster reconciles. (#11275, @reruno)
MultiKueue: Fixed a bug in the AllAtOnce dispatcher where workloads evicted from a
worker cluster could fail to be re-admitted. Kueue now waits for the ongoing eviction to
complete before starting a new nomination and re-admission cycle. (#11473, @mszadkow)
MultiKueue: Fixed a bug where a hung watch connection to one remote cluster could block
reconciliation of other MultiKueueClusters, leaving them inactive and preventing workload
admission. Kueue now applies a circuit-breaking timeout while establishing remote-cluster
watches: the timeout starts at 1 minute and backs off exponentially on consecutive failures,
up to 10 minutes. (#11329, @trilamsr)
MultiKueue: Fixed a bug where a lost connection to a worker cluster failed to trigger the "workerLostTimeout" delay mechanism for workload requeuing. (#11559, @yuluo-yx)
MultiKueue: Fixed a bug where one slow or unresponsive remote cluster could stall
reconciliation for other MultiKueueClusters, even when
controller.groupKindConcurrency["MultiKueueCluster.kueue.x-k8s.io"] was set above 1.
This could delay or block admission through other healthy clusters. (#11332, @trilamsr)
MultiKueue: Fixed admission for Kubernetes Jobs on Kubernetes 1.36 clusters by ensuring all Job status updates comply with the updated Kubernetes Job validation rules. See kubernetes/kubernetes#139281 for more details. (#11763, @olekzabl)
MultiKueue: Fixed unnecessary status.nominatedClusterNames updates from the AllAtOnce
dispatcher when the set of nominated clusters did not change. (#11507, @mszadkow)
ObjectRetentionPolicies: Fixed a bug that prevented the deletion of orphaned finished workloads. (#11755, @mbobrovskyi)
Observability: Fix the "replica-role" information missing from the logs generated by the controller that manages
MultiKueueCluster instances. (#11271, @reruno)
Scheduling: Fix a race condition within the admission process that could cause workloads to wait indefinitely for a preemption, resulting in head-of-line blocking of the affected ClusterQueues. (#11647, @kshalot)
Scheduling: Fixed a bug where Kueue could inject duplicate tolerations when a
ResourceFlavor toleration and a PodTemplate toleration differed only by operator: ""
versus operator: "Equal", which represent the same Kubernetes toleration. This could
cause update rejections and leave Pods scheduleGated. (#11758, @benkermani)
TAS: Fixed a scheduling bug where a workload with multiple PodSets could be admitted even when the combined PodSets exceeded node pod capacity. (#11331, @yuluo-yx)
TAS: Ensure that Snapshot() does not update the list of workloads while holding the read lock. (#11295, @mimowo)
TAS: Fix over-subscription of nodes that belong to multiple ResourceFlavors sharing the same hostname-leaf Topology.
The TASHandleOverlappingFlavors feature gate is introduced as alpha (disabled by default). (#11759, @tenzen-y)
Workloads: Fixed a bug where, with the FinishOrphanedWorkloads feature gate enabled,
Workloads could be marked Finished immediately after owner Job or JobSet creation. (#11535, @mbobrovskyi)
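
The MultiKueue entry for #11329 above describes a watch-establishment timeout that starts at 1 minute and backs off exponentially on consecutive failures, up to 10 minutes. The following is a minimal Python sketch of such a schedule, not Kueue's Go implementation. The 1-minute base and 10-minute cap come from the release note; the doubling factor and the name `watch_timeout` are assumptions made for illustration only.

```python
from datetime import timedelta

BASE_TIMEOUT = timedelta(minutes=1)   # starting timeout, from the release note
MAX_TIMEOUT = timedelta(minutes=10)   # upper bound, from the release note
FACTOR = 2                            # ASSUMPTION: the note does not state the growth factor
MAX_EXPONENT = 16                     # keeps the intermediate value small; the cap applies long before this


def watch_timeout(consecutive_failures: int) -> timedelta:
    """Return the timeout to use after the given number of consecutive failures.

    Hypothetical helper: grows the base timeout geometrically and clamps it
    to MAX_TIMEOUT.
    """
    if consecutive_failures < 0:
        raise ValueError("consecutive_failures must be non-negative")
    exponent = min(consecutive_failures, MAX_EXPONENT)
    return min(BASE_TIMEOUT * (FACTOR ** exponent), MAX_TIMEOUT)


if __name__ == "__main__":
    for failures in range(6):
        print(failures, watch_timeout(failures))
```

Under the assumed factor of 2, the schedule reaches the 10-minute cap on the fourth consecutive failure and stays there; a successful connection would reset the failure count to zero.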
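
The Scheduling entry for #11758 above concerns tolerations that differ only by operator: "" versus operator: "Equal". In the Kubernetes API an empty operator defaults to Equal, so the two forms describe the same toleration. The sketch below shows one way to compare tolerations after normalizing that default so that equivalent entries are not injected twice. It is an illustration in Python, not Kueue's Go code; the `Toleration` dataclass, `normalize`, and `merge_tolerations` are hypothetical names.

```python
from dataclasses import dataclass, replace
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Toleration:
    """Simplified stand-in for a Kubernetes toleration (hypothetical type)."""
    key: str = ""
    operator: str = ""          # "" is treated as "Equal" by Kubernetes
    value: str = ""
    effect: str = ""
    toleration_seconds: Optional[int] = None


def normalize(t: Toleration) -> Toleration:
    """Return a copy with the empty operator replaced by its default, "Equal"."""
    return replace(t, operator="Equal") if t.operator == "" else t


def merge_tolerations(pod_template: Iterable[Toleration],
                      flavor: Iterable[Toleration]) -> List[Toleration]:
    """Append flavor tolerations that have no equivalent in the pod template.

    Equivalence is checked on the normalized form, so operator "" and
    operator "Equal" count as the same toleration.
    """
    merged = list(pod_template)
    seen = {normalize(t) for t in merged}
    for t in flavor:
        n = normalize(t)
        if n not in seen:
            merged.append(t)
            seen.add(n)
    return merged


if __name__ == "__main__":
    pod = [Toleration(key="gpu", operator="Equal", value="true", effect="NoSchedule")]
    flavor = [
        Toleration(key="gpu", value="true", effect="NoSchedule"),  # same toleration, operator ""
        Toleration(key="spot", operator="Exists"),
    ]
    for t in merge_tolerations(pod, flavor):
        print(t)
```

In this example only the "spot" toleration is appended, because the flavor's "gpu" toleration is equivalent to the one already present in the pod template.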