What's New
1. Support multi-scheduler
In a Kubernetes cluster with multiple schedulers, different kinds of workloads sometimes need to be mapped to specific schedulers. For example, Kubernetes-native workloads such as Deployments in the kube-system namespace are mapped to default-scheduler, while AI and big data jobs are mapped to Volcano. This feature implements that mapping automatically. For more details, please refer to https://github.com/volcano-sh/volcano/blob/master/docs/design/multi-scheduler.md. (#1576, #1521, @huone1 @william-wang )
2. Support proportion of resources for GPU node
To make full use of scarce resources such as GPUs, one solution is to bind them to other resources as shares. For example, it is common for many CPU-intensive workloads to be scheduled to GPU nodes. When GPU-intensive workloads arrive, they cannot be scheduled because the GPU nodes lack CPU or memory. If workloads that require GPU, CPU, and memory within a certain range are scheduled to GPU nodes first, the GPUs can be fully used. For more details, please refer to https://github.com/volcano-sh/volcano/blob/master/docs/design/proportional.md. (#1527, @king-jingxiang )
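The idea of binding CPU and memory to GPUs as shares can be illustrated with a minimal Go sketch. This is not the plugin's actual code: the `Resources` and `Ratio` types, the `FitsProportion` function, and the example ratios are all hypothetical and chosen only for illustration. The check rejects a placement that would leave fewer CPU cores or less memory than the assumed per-GPU share for the GPUs that stay idle.

```go
package main

import "fmt"

// Resources is a hypothetical, simplified view of resource amounts.
type Resources struct {
	CPU    float64 // cores
	Memory float64 // GiB
	GPU    float64 // cards
}

// Ratio is the assumed amount of CPU and memory held back per idle GPU.
type Ratio struct {
	CPUPerGPU    float64
	MemoryPerGPU float64
}

// FitsProportion reports whether placing req on a node with the given idle
// resources still leaves enough CPU and memory for the GPUs that remain idle.
func FitsProportion(idle, req Resources, r Ratio) bool {
	// The request must fit into the idle resources at all.
	if req.GPU > idle.GPU || req.CPU > idle.CPU || req.Memory > idle.Memory {
		return false
	}
	gpuLeft := idle.GPU - req.GPU
	cpuLeft := idle.CPU - req.CPU
	memLeft := idle.Memory - req.Memory
	// Remaining CPU and memory must cover the share bound to the idle GPUs.
	return cpuLeft >= gpuLeft*r.CPUPerGPU && memLeft >= gpuLeft*r.MemoryPerGPU
}

func main() {
	idle := Resources{CPU: 32, Memory: 128, GPU: 4}
	ratio := Ratio{CPUPerGPU: 4, MemoryPerGPU: 16} // assumed values

	cpuOnly := Resources{CPU: 20, Memory: 32}
	gpuJob := Resources{CPU: 8, Memory: 32, GPU: 2}

	fmt.Println(FitsProportion(idle, cpuOnly, ratio)) // false
	fmt.Println(FitsProportion(idle, gpuJob, ratio))  // true
}
```

In this sketch the CPU-only task is rejected because it would leave 12 cores for 4 idle GPUs that need 16, while the GPU job is accepted because it consumes GPUs along with its CPU and memory.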
3. Support CPU NUMA-Aware scheduling
For CPU-intensive workloads, especially in the AI, big data, and HPC fields, enabling CPU NUMA-aware scheduling can yield a significant performance improvement. For more details, please refer to https://github.com/volcano-sh/volcano/blob/master/docs/design/numa-aware.md. (#1493, @huone1 )
4. Provide framework of stress test
This release provides a framework for stress testing Volcano. (#1516, @rudeigerc )
Other Notable Changes
- update roadmap for v1.5(#1708, @william-wang )
- refine scheduler framework(#1705, @wpeng102 )
- fix daily release job failures(#1701, @Thor-wl )
- optimize register controller log(#1683, @hwdef )
- lower ssh key access permission(#1682, @hwdef )
- automatically set GOMAXPROCS for scheduler & controller & webhook-manager(#1681, @SataQiu )
- add revive linter as drop-in replacement of archived golint(#1678, @gy95 )
- modify default scheduler cfg(#1673, @shinytang6 )
- expose detailed scheduling reason of pending tasks(#1672, @eggiter)
- format env key(#1660, @wpeng102 )
- do not delete failed pods on the last try(#1657, @wpeng102 )
- optimize function setOversubscription(#1653, @huone1 )
- upgrade controller tools to v0.6.0(#1649, @shinytang6 )
- enable configuring volcano webhooks(#1645, @hacker-qian)
- support a user-provided rsa key-pair and mount it in InitContainers, so that a private git repository can be cloned before the training task runs(#1644, @python279 )
- refactor the preempt function in plugin gang(#1643, @huone1 )
- delete bindingTasks from NodeInfo structure(#1636, @huone1 )
- scheduler plugin framework supports SharedInformerFactory(#1635, @wpeng102 )
- add task-spec in pod label(#1626, @python279 )
- rename resource comparison functions(#1624, @Thor-wl )
- update resource comparison doc(#1622, @Thor-wl )
- add Equal function(#1621, @Thor-wl )
- add LessEqualPartly function(#1613, @Thor-wl )
- add LessEqualInAllDimension function and remove LessEqual/LessEqualStrict functions(#1611, @thor)
- doc: add resource quota plugin doc(#1583, @merryzhou )
- improve: clean unready status when job ready(#1582, @lowang-bh )
- improve: reduce calculation of total nodes' resources instead of storing them when taking a snapshot(#1578, @lowang-bh )
- add resource comparison doc(#1573, @Thor-wl )
- update Less function(#1569, @Thor-wl )
- support oversubscription in volcano framework(#1566, @wpeng102 )
- add multi-scheduling design doc(#1565, @huone1 )
- improve: preempt break out if intersection is null(#1563, @lowang-bh )
- add arch node selectors to the installer for arm and amd64(#1556, @holdenk )
- optimize openSession(#1541, @hacker-qian)
- improve: remove nodename from taskinfo when RemoveTask(#1517, @lowang-bh )
- the member Others should be copied in func (ni *NodeInfo) Clone() (#1512, @huone1 )
- use struct NodeScoreList instead of HostPriorityList for scoring(#1510, @huone1 )
- add min success design(#1505, @zen-xu )
- change min resource to 0.1 in resource_info(#1489, @wpeng102 )
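
Several entries above rework the resource comparison functions. As a rough sketch of how the names LessEqualInAllDimension and LessEqualPartly can be read, the Go example below uses a hypothetical two-dimension `Resource` type. It assumes that "in all dimensions" means every dimension must satisfy the comparison and "partly" means at least one dimension does. The real functions in Volcano may take extra parameters and handle more dimensions, so see the resource comparison doc for the authoritative semantics.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified resource vector with two dimensions.
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// LessEqualInAllDimension is true only if every dimension of l is <= r
// (assumed reading of the function name).
func (l Resource) LessEqualInAllDimension(r Resource) bool {
	return l.MilliCPU <= r.MilliCPU && l.Memory <= r.Memory
}

// LessEqualPartly is true if at least one dimension of l is <= r
// (assumed reading of the function name).
func (l Resource) LessEqualPartly(r Resource) bool {
	return l.MilliCPU <= r.MilliCPU || l.Memory <= r.Memory
}

func main() {
	request := Resource{MilliCPU: 2000, Memory: 8}
	idle := Resource{MilliCPU: 4000, Memory: 4}

	fmt.Println(request.LessEqualInAllDimension(idle)) // false: memory exceeds idle
	fmt.Println(request.LessEqualPartly(idle))         // true: CPU fits
}
```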
Bug Fixes
- make 'existing pods anti-affinity rules' work(#1668, @eggiter)
- add setting MinResources to pg for normal pod(#1666, @huone1 )
- fix OOM that occurs if pod info is synced before node info(#1662, @huone1 )
- fix admission parsing bug(#1656, @hacker-qian)
- fix overused judgement when dealing with allocate and proportion(#1637, @Thor-wl )
- reset task.NodeName after calling DeallocateFunc(#1618, @merryzhou )
- fix a problem with the equivalence cache feature(#1593, @huone1 )
- make func FeasibleNodesToFind use a list with a certain order(#1574, @lowang-bh )
- fix bug in predicates plugin(#1547, @hacker-qian)
- fix(scheduler): reclaim action minus and comparison bug(#1540, @shinytang6 )
- fix resource comparison bug in task topology(#1546, @Thor-wl )
- fix wrong queue selection when proportion is disabled(#1497, @zen-xu )