github kubernetes-sigs/kueue v0.16.9

4 hours ago

Changes since v0.16.8:

Actions Required Before Upgrading

(No, really, you MUST read this before you upgrade)

Changes by Kind

Bug or Regression

  • ElasticJobsViaWorkloadSlices: Fix a bug where the workload-slice-name annotation was incorrectly set on all workloads when the ElasticJobsViaWorkloadSlices feature gate was enabled, instead of only on elastic workloads. (#11624, @sohankunkerkar)
  • ElasticJobsViaWorkloadSlices: Fix a quota leak during elastic workload scale-up where the old slice was finished before the replacement slice was admitted. (#11558, @sohankunkerkar)
  • ElasticJobsViaWorkloadSlices: Fixed a bug where rapid elastic Job scaling could leave duplicate replacement Workload slices admitted indefinitely, causing quota to remain reserved. (#11554, @sohankunkerkar)
  • FeatureGates: Fix a bug where TAS-enhanced features could be enabled even when the TopologyAwareScheduling or TASFailedNodeReplacement feature gates they depend on were disabled. (#11793, @tenzen-y)
  • FeatureGates: Fixed a bug where user-specified feature gate parameters were not validated. (#11291, @MaysaMacedo)
  • Fix a bug where finished Workloads could remain stuck after object retention deletion if they still had Kueue's resource-in-use finalizer. (#11307, @ShaanveerS)
  • Fix waitForPodsReady timeout not triggering for pod groups with fast-admission when group members arrive after the first pod is Ready. (#11687, @sohankunkerkar)
  • Fixed a bug that prevented finishing the Workloads corresponding to Jobs deleted with --cascade=orphan.
    The fix is behind the FinishOrphanedWorkloads feature gate (alpha, disabled by default). (#11682, @mbobrovskyi)
  • KueueViz: Fixed a bug where workload and local queue detail pages would get stuck on a loading spinner instead of showing connection/fetching error messages on failure. (#11710, @YadavAkhileshh)
  • MultiKueue: Add a reconnect backoff guardrail to the MultiKueueCluster reconciler to suppress redundant cluster reconciles. (#11275, @reruno)
  • MultiKueue: Fixed a bug in the AllAtOnce dispatcher where workloads evicted from a
    worker cluster could fail to be re-admitted. Kueue now waits for the ongoing eviction to
    complete before starting a new nomination and re-admission cycle. (#11473, @mszadkow)
  • MultiKueue: Fixed a bug where a hung watch connection to one remote cluster could block
    reconciliation of other MultiKueueClusters, leaving them inactive and preventing workload
    admission. Kueue now applies a circuit-breaking timeout while establishing remote-cluster
    watches: the timeout starts at 1 minute and backs off exponentially on consecutive failures,
    up to 10 minutes. (#11329, @trilamsr)
  • MultiKueue: Fixed a bug where a lost connection to a worker cluster failed to trigger the "workerLostTimeout" delay mechanism for workload requeuing. (#11559, @yuluo-yx)
  • MultiKueue: Fixed a bug where one slow or unresponsive remote cluster could stall
    reconciliation for other MultiKueueClusters, even when
    controller.groupKindConcurrency["MultiKueueCluster.kueue.x-k8s.io"] was set above 1.
    This could delay or block admission through other healthy clusters. (#11332, @trilamsr)
  • MultiKueue: Fixed admission for Kubernetes Jobs on Kubernetes 1.36 clusters by ensuring all Job status updates comply with the updated Kubernetes Job validation rules. See kubernetes/kubernetes#139281 for more details. (#11763, @olekzabl)
  • MultiKueue: Fixed unnecessary status.nominatedClusterNames updates from the AllAtOnce
    dispatcher when the set of nominated clusters did not change. (#11507, @mszadkow)
  • ObjectRetentionPolicies: Fixed a bug that prevented the deletion of orphaned finished Workloads. (#11755, @mbobrovskyi)
  • Observability: Add the missing "replica-role" information to the logs generated by the controller managing the
    MultiKueueCluster instances. (#11271, @reruno)
  • Scheduling: Fix a race condition within the admission process that could cause workloads to wait indefinitely for a preemption, causing head-of-line blocking of the affected ClusterQueues. (#11647, @kshalot)
  • Scheduling: Fixed a bug where Kueue could inject duplicate tolerations when a
    ResourceFlavor toleration and a PodTemplate toleration differed only by operator: ""
    versus operator: "Equal", which represent the same Kubernetes toleration. This could
    cause update rejections and leave Pods scheduleGated. (#11758, @benkermani)
  • TAS: Fixed a scheduling bug where a workload with multiple PodSets could be admitted even when the combined PodSets exceeded node pod capacity. (#11331, @yuluo-yx)
  • TAS: Ensure that Snapshot() does not update the list of workloads while holding the read lock. (#11295, @mimowo)
  • TAS: Fix over-subscription of nodes that belong to multiple ResourceFlavors sharing the same hostname-leaf Topology.
    TASHandleOverlappingFlavors is introduced as an alpha feature gate (disabled by default). (#11759, @tenzen-y)
  • Workloads: Fixed a bug where, with the FinishOrphanedWorkloads feature gate enabled,
    Workloads could be marked Finished immediately after owner Job or JobSet creation. (#11535, @mbobrovskyi)
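
The Scheduling entry for #11758 relies on the fact that Kubernetes treats an empty toleration operator as "Equal". One way to avoid injecting duplicates is to normalize the operator before comparing, as in the sketch below. This is an illustration only, not the code from the fix: the Toleration struct is a simplified stand-in for the Kubernetes core/v1 type (it omits TolerationSeconds), and normalize and mergeTolerations are hypothetical names.

```go
package main

import "fmt"

// Toleration is a simplified stand-in for the Kubernetes core/v1 Toleration
// type. It omits TolerationSeconds so the struct can be used as a map key.
type Toleration struct {
	Key      string
	Operator string
	Value    string
	Effect   string
}

// normalize (hypothetical helper) maps the empty operator to "Equal",
// which is the Kubernetes default when the operator is left unset.
func normalize(t Toleration) Toleration {
	if t.Operator == "" {
		t.Operator = "Equal"
	}
	return t
}

// mergeTolerations (hypothetical helper) appends extra to base and skips any
// toleration whose normalized form has already been seen. The first
// occurrence is kept exactly as it was written.
func mergeTolerations(base, extra []Toleration) []Toleration {
	seen := make(map[Toleration]struct{}, len(base)+len(extra))
	merged := make([]Toleration, 0, len(base)+len(extra))
	for _, list := range [][]Toleration{base, extra} {
		for _, t := range list {
			key := normalize(t)
			if _, ok := seen[key]; ok {
				continue
			}
			seen[key] = struct{}{}
			merged = append(merged, t)
		}
	}
	return merged
}

func main() {
	podTemplate := []Toleration{{Key: "gpu", Operator: "Equal", Value: "true", Effect: "NoSchedule"}}
	flavor := []Toleration{{Key: "gpu", Operator: "", Value: "true", Effect: "NoSchedule"}}
	// The two tolerations differ only by operator spelling, so one survives.
	fmt.Println(len(mergeTolerations(podTemplate, flavor)))
}
```

Comparing the normalized form while keeping the original entry means an existing PodTemplate toleration is left untouched and only the redundant one is dropped.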
