Changes since v0.16.1:
Changes by Kind
Feature
- KueueViz Helm: Add podSecurityContext and containerSecurityContext configuration options to KueueViz Helm chart for restricted pod security profile compliance (#9319, @ziadmoubayed)
- Observability: Increased the maximum finite bucket boundary for admission_wait_time_seconds histogram from ~2.84 hours to ~11.3 hours for better observability of long queue times. (#9507, @mukund-wayve)
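The two boundaries quoted for admission_wait_time_seconds differ by a factor of four, which is what two extra doubling buckets would produce. The sketch below only illustrates that arithmetic. The 1s leading bucket, the 2.5s start, the factor of 2, and the counts 14 and 16 are assumptions chosen to reproduce the quoted figures, not values taken from the Kueue source, and `bucketBounds` is a hypothetical helper.

```go
package main

import "fmt"

// bucketBounds returns finite histogram upper bounds in seconds: a leading
// 1s bucket followed by count-1 boundaries that start at 2.5s and double at
// each step. The start value and factor are assumptions picked to match the
// figures in the release note.
func bucketBounds(count int) []float64 {
	bounds := []float64{1}
	next := 2.5
	for i := 1; i < count; i++ {
		bounds = append(bounds, next)
		next *= 2
	}
	return bounds
}

// maxHours converts the largest finite boundary from seconds to hours.
func maxHours(bounds []float64) float64 {
	return bounds[len(bounds)-1] / 3600
}

func main() {
	// 14 and 16 are hypothetical bucket counts used for illustration.
	for _, n := range []int{14, 16} {
		b := bucketBounds(n)
		fmt.Printf("%d buckets: max %.0fs (%.2f h)\n", n, b[len(b)-1], maxHours(b))
	}
}
```

Under these assumptions the largest finite boundary moves from 10240s to 40960s, so wait times between roughly 2.84 and 11.4 hours fall into finite buckets rather than the overflow bucket.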
Bug or Regression
- ElasticJobs: fix the temporary double-counting of quota during workload replacement.
In particular, it was causing double-counting of quota requests for unchanged PodSets. (#9364, @benkermani)
- FairSharing: workloads fitting within their ClusterQueue's nominal quota are now preferred over workloads that require borrowing, preventing heavy borrowing on one flavor from deprioritizing a CQ's nominal entitlement on another flavor. (#9532, @mukund-wayve)
- Fix non-deterministic workload ordering in ClusterQueue by adding UID tie-breaker to queue ordering function. (#9140, @sohankunkerkar)
- Fix serverName substitution in kustomize prometheus ServiceMonitor TLS patch for cert-manager deployments. (#9188, @IrvingMg)
- Fixed invalid field name in the ClusterQueue printer columns. The "Cohort" column will now correctly display the assigned cohort in kubectl, k9s, and other UI tools instead of being blank. (#9422, @polinasand)
- Fixed the bug that prevented managing workloads with duplicated environment variable names in initContainers. This issue manifested when creating the Workload via the API. (#9126, @monabil08)
- FlavorFungability: fix the bug where the semantics of the flavorFungability.preference enum values (i.e., PreemptionOverBorrowing and BorrowingOverPreemption) were swapped. (#9486, @tenzen-y)
- LeaderWorkerSet: fix an occasional race condition resulting in workload deletion getting stuck during scale down. (#9135, @PannagaRao)
- MultiKueue: Fix a bug where the remote Job object was occasionally left behind by MultiKueue GC,
even when the corresponding Job object on the management cluster was deleted.
This issue was observed for LeaderWorkerSet. (#9310, @sohankunkerkar)
- MultiKueue: for the StatefulSet integration, copy the entire StatefulSet onto the worker clusters. This allows
for proper management (and replacement) of Pods on the worker clusters. (#9539, @IrvingMg)
- Observability: Fix missing "replica-role" in the logs from the NonTasUsageReconciler. (#9456, @IrvingMg)
- Observability: Fix the stale "replica-role" value in scheduler logs after leader election. (#9431, @IrvingMg)
- Scheduling: Fix the bug where inadmissible workloads would be re-queued too frequently at scale.
This resulted in excessive processing, lock contention, and starvation of workloads deeper in the queue.
The fix is to throttle the process with a batch period of 1s per CQ or Cohort. (#9490, @gabesaba)
- TAS: Fix a bug where, for a LeaderWorkerSet with multiple PodTemplates (.spec.leaderWorkerTemplate.leaderTemplate and .spec.leaderWorkerTemplate.workerTemplate), Pod indexes were not correctly evaluated during rank-based ordering assignments. (#9368, @tenzen-y)
- TAS: fix a bug where NodeHotSwap may assign a Pod, based on rank-ordering, to a node which is already
occupied by another running Pod. (#9282, @j-skiba)