Changes since v0.5.0:
Changes by Kind
API Change
- Add the config field .waitForPodsReady.requeuingTimestamp to allow admins to configure the timestamp used when sorting workloads that were evicted due to their Pods not becoming ready in time. (#1542, @nstogner)
- Extend the information returned for pending workloads in a ClusterQueue with the data used to determine each workload's position, including the position itself. (#1362, @PBundyra)
- Extend the visibility API by adding an endpoint that allows a user to fetch information about pending workloads and their position in a LocalQueue. (#1365, @PBundyra)
- Introduce an on-demand API endpoint for fetching pending workloads in a ClusterQueue (#1251, @PBundyra)
- The OwnerReferences field in PendingWorkload's metadata is now filled with the information about the owning Job (#1378, @PBundyra)
- Visibility.PendingWorkload no longer implements the runtime.Object interface (#1386, @PBundyra)
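To illustrate what the new .waitForPodsReady.requeuingTimestamp field controls, here is a minimal sketch of how the choice of timestamp can change queue order. It is not Kueue's implementation: the Workload class and both function names are hypothetical, the setting values "Eviction" and "Creation" are assumed, and priority is ignored entirely.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Workload:
    # Hypothetical, simplified stand-in for a Kueue Workload.
    name: str
    created: datetime
    # Set only when the workload was evicted because its Pods were not ready in time.
    evicted: Optional[datetime] = None


def ordering_timestamp(wl: Workload, requeuing_timestamp: str = "Eviction") -> datetime:
    """Pick the timestamp used to order a pending workload.

    Assumed values: "Eviction" orders an evicted workload by its eviction
    time; "Creation" always orders by creation time.
    """
    if requeuing_timestamp == "Eviction" and wl.evicted is not None:
        return wl.evicted
    return wl.created


def sort_pending(workloads: List[Workload], requeuing_timestamp: str = "Eviction") -> List[Workload]:
    """Return the workloads ordered oldest-first by the selected timestamp."""
    return sorted(workloads, key=lambda wl: ordering_timestamp(wl, requeuing_timestamp))
```

In this sketch, a workload that was created first but evicted later moves behind newer workloads under "Eviction", and keeps its original place under "Creation".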
Feature
- A stopPolicy field in the ClusterQueue allows holding or draining the ClusterQueue (#1299, @trasc)
- Add events for transitions of the provisioning AdmissionCheck (#1271, @stuton)
- Add prebuilt workload support for batch/job. (#1358, @trasc)
- Add support for groups of plain Pods. (#1319, @achernevskii)
- Add validation for clusterQueue: when cohort is empty, borrowingLimit must be nil. (#1525, @B1F030)
- Allow configuring featureGates on helm charts. (#1314, @B1F030)
- Allow decreasing reclaimable pods to 0 for a suspended Job (#1277, @yaroslava-serdiuk)
- At log level 6, the usage of ClusterQueues and cohorts is included in logs. The status of the internal cache and queues is also logged on demand when a SIGUSR2 is sent to kueue, regardless of the log level. (#1528, @alculquicondor)
- Basic implementation of MultiKueue for Job. This doesn't include support for live status updates. (#1313, @trasc)
- Increase the default number of reconcilers for Pod and Workload objects to 5 each. (#1589, @alculquicondor)
- Jobs preserve their position in the queue if the number of pods changes before being admitted (#1223, @yaroslava-serdiuk)
- Make the image build setting CGO_ENABLED configurable (#1391, @anishasthana)
- RBAC for visibility into LocalQueues is fixed (#1412, @PBundyra)
- Support for a mechanism to suspend a running Job without requeueing (#1252, @vicentefb)
- Support for retry of provisioning request. When ProvisioningACC is enabled and there are existing ProvisioningRequests, they are going to be recreated. This may cause failures for some long-running jobs that were using the ProvisioningRequests. (#1351, @mimowo)
- The image gcr.io/k8s-staging-kueue/debug:main, along with the script ./hack/dump_cache.sh, can be used to trigger a dump of the internal cache into the logs. (#1541, @alculquicondor)
- The leaderElection field in the Configuration API is now defaulted. Leader election is now enabled by default. (#1598, @astefanutti)
- The priority sorting within the cohort can be disabled by setting --prioritySortingWithinCohort to false (#1406, @yaroslava-serdiuk)
- The Visibility.PendingWorkload object has the metav1.CreationTimestamp field filled with the value from the corresponding kueue.Workload (#1404, @PBundyra)
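The ClusterQueue validation rule listed above (#1525) can be sketched as a small check. This is an illustration under stated assumptions, not Kueue's webhook code: the function name and error text are hypothetical, and borrowingLimit is modelled as a single optional integer where None stands for nil.

```python
from typing import List, Optional


def validate_borrowing_limit(cohort: str, borrowing_limit: Optional[int]) -> List[str]:
    """Return validation errors for a simplified ClusterQueue spec.

    Rule from the release note: when cohort is empty, borrowingLimit must be nil.
    A ClusterQueue outside any cohort has nothing to borrow from, so any
    explicit limit (including 0) is rejected.
    """
    errors: List[str] = []
    if cohort == "" and borrowing_limit is not None:
        errors.append("borrowingLimit must be nil when cohort is empty")
    return errors
```

Under this sketch, a ClusterQueue with an empty cohort and no borrowingLimit is valid, while the same ClusterQueue with borrowingLimit set to 0 is rejected.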
Bug or Regression
- Add missing RBAC on integration finalizers sub-resources (#1486, @astefanutti)
- Add Mutating WebhookConfigurations for the AdmissionCheck, RayJob, and JobSet to helm charts (#1567, @B1F030)
- Add Validating/Mutating WebhookConfigurations for the KubeflowJobs like PyTorchJob (#1460, @tenzen-y)
- Added event for QuotaReserved and fixed event for Admitted to trigger when admission checks complete (#1436, @trasc)
- Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished. (#1383, @achernevskii)
- Do not (re)create the ProvisioningRequest if the state of the admission check is Ready (#1617, @mimowo)
- Fix a bug in the pod integration that caused unexpected errors when the pod isn't found (#1512, @achernevskii)
- Fix a bug where plain pods managed by kueue remained in a terminating condition forever. (#1342, @tenzen-y)
- Fix a bug in the client-go libraries that prevented operating on cluster-scoped resources like ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)
- Fix fungibility policy Preempt where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor)
- Fix handling of preemption within a cohort when there is no borrowingLimit. In that case, during preemption, the permitted resources to borrow were calculated as if borrowingLimit=0, instead of unlimited. As a consequence, when using reclaimWithinCohort, it was possible that a workload, scheduled to a ClusterQueue with no borrowingLimit, would preempt more workloads than needed, even though it could fit by borrowing. (#1561, @mimowo)
- Fix the synchronization of the admission check state based on the second provisioning request (#1585, @mimowo)
- Fixed fungibility policy whenCanPreempt: Preempt. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)
- A pending workload from a StrictFIFO ClusterQueue doesn't block borrowing from other ClusterQueues (#1399, @yaroslava-serdiuk)
- Remove finalizer from Workloads that are orphaned (have no owners). (#1523, @achernevskii)
- Trigger an eviction for an admitted Job after an admission check changed state to Rejected. (#1562, @trasc)
- Visibility endpoints return 404 code for non-existent queues (#1415, @PBundyra)
- Webhooks are served in non-leading replicas (#1509, @astefanutti)
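The borrowingLimit fix above (#1561) comes down to how a missing limit is interpreted. The following sketch shows the difference between treating it as unlimited and treating it as 0. It is a simplified model, not the scheduler's code: the function names are hypothetical and resources are reduced to a single integer quantity.

```python
from typing import Optional


def max_borrowable(borrowing_limit: Optional[int], cohort_unused: int) -> int:
    """Resources a ClusterQueue may borrow from its cohort.

    None models "no borrowingLimit", which means unlimited borrowing,
    bounded only by what the cohort has unused.
    """
    if borrowing_limit is None:
        return cohort_unused
    return min(borrowing_limit, cohort_unused)


def fits_without_preemption(
    request: int,
    free_nominal: int,
    borrowing_limit: Optional[int],
    cohort_unused: int,
) -> bool:
    """True if the request fits using free nominal quota plus borrowing."""
    return request <= free_nominal + max_borrowable(borrowing_limit, cohort_unused)
```

With a request of 8, free nominal quota of 4, and 6 units unused in the cohort, the workload fits by borrowing when the limit is absent. Evaluating the same case as if borrowingLimit were 0 reports that it does not fit, which is the situation in which unnecessary preemptions could be issued.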