What's New
Add JobFlow to support lightweight workflow orchestration
Workflow orchestration engines are widely used in high-performance computing, AI biomedicine, image processing, beauty, game AGI, scientific computing, and other scenarios. They help users simplify the management of multiple parallel tasks and their dependencies, and improve overall computing efficiency.
JobFlow is a lightweight task flow orchestration engine that focuses on Volcano job orchestration. It provides Volcano with job probes, job completion dependencies, job failure rate tolerance, and other diverse job dependency types, and supports complex process control primitives. The specific capabilities are as follows:
- Support large-scale job management and complex task flow orchestration.
- Support real-time query of the running status and task progress of all associated jobs.
- Support automatic job execution and scheduled start to reduce manual effort.
- Different action strategies can be set for different tasks, and the corresponding action is triggered when a task meets certain conditions, such as retrying on timeout or migrating away from a failed node.
Refer to the links for more details. (JobFlow doc, @hwdef, @lowang-bh, @zhoumingcheng)
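The job completion dependencies described above amount to ordering jobs over a dependency graph. The following is a minimal sketch of that idea only, not JobFlow's implementation: the `launch_waves` function and the dictionary-based input are hypothetical names chosen for illustration.

```python
def launch_waves(depends_on):
    """Group jobs into waves; a job only depends on jobs in earlier waves.

    depends_on maps a job name to the list of jobs that must complete first.
    Raises ValueError for an unknown dependency or a dependency cycle.
    """
    for job, deps in depends_on.items():
        for dep in deps:
            if dep not in depends_on:
                raise ValueError(f"{job} depends on unknown job {dep}")

    remaining = {job: set(deps) for job, deps in depends_on.items()}
    waves = []
    while remaining:
        # Jobs whose dependencies have all completed can start in parallel.
        ready = sorted(job for job, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        for job in ready:
            del remaining[job]
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves


# Hypothetical flow: b and c wait for a, d waits for both b and c.
flow = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(launch_waves(flow))
```

Jobs in the same wave have no dependencies on each other, so an orchestrator could start them in parallel.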
Support vGPU scheduling and isolation
Since ChatGPT became popular, research and development of large AI models has accelerated, and many different large models have been launched. In production environments, users face pain points such as low resource utilization and inflexible GPU resource allocation. They have to purchase large amounts of redundant heterogeneous computing power to meet business needs, and because heterogeneous computing power is itself expensive, this places a heavy cost burden on enterprises.
Starting from version 1.8, Volcano provides an abstract, general framework for sharing devices (GPU, NPU, FPGA, ...). Developers can customize multiple types of shared devices based on this framework. Volcano currently supports GPU device multiplexing and resource isolation based on this framework. Details are as follows:
- GPU sharing: Each task can apply to use part of the resources of a GPU card, and the GPU card can be shared among multiple tasks.
- Device memory control: A GPU can be allocated by device memory (for example, 3000M) or by proportion (for example, 50%), which provides resource isolation for virtualized GPUs.
Refer to the links for more details.
- How to use vGPU function (@archlitchi)
- How to add a new heterogeneous computing power sharing strategy (@archlitchi)
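To make the two allocation modes concrete (by device memory or by proportion), here is a minimal sketch of how a request could be resolved against a card's total memory. The string formats `"3000M"` and `"50%"` mirror the examples above; the function names and the way requests are encoded are assumptions for illustration and do not reflect Volcano's actual resource names or code.

```python
def requested_memory_mib(request, card_memory_mib):
    """Resolve a request such as "3000M" or "50%" to MiB on a given card."""
    text = request.strip()
    if text.endswith("%"):
        percent = float(text[:-1])
        if not 0 < percent <= 100:
            raise ValueError(f"percentage out of range: {request}")
        return int(card_memory_mib * percent / 100)
    if text.endswith("M"):
        return int(text[:-1])
    raise ValueError(f"unsupported request format: {request}")


def fits(card_memory_mib, allocated_mib, request):
    """Return True if the request fits in the card's unallocated memory."""
    needed = requested_memory_mib(request, card_memory_mib)
    return allocated_mib + needed <= card_memory_mib


# Hypothetical 16 GiB card with half of its memory already allocated.
print(fits(16384, 8192, "50%"))
print(fits(16384, 8192, "9000M"))
```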
Support the preemption capability for GPU and user-defined resources
Before version 1.8, Volcano supported preemption only for basic resources such as CPU and memory. Preemption of GPU resources and user-managed resources, such as NPU and network resources, was not supported.
In version 1.8, the predicate logic is refactored to return more detailed statuses, such as Unschedulable and UnschedulableAndUnresolvable, for different scenarios.
GPU preemption has been released on top of the optimized framework, and user-developed scheduling plugins built on Volcano can be adapted and upgraded according to business scenarios.
Refer to the link for more details. (#2916, @wangyang0616)
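The value of distinguishing the two statuses is that preemption only needs to consider nodes where evicting other tasks could help. The sketch below illustrates that distinction under simplified assumptions: the node and task dictionaries, the `predicate` function, and the GPU-count check are all hypothetical and are not Volcano's predicate code.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "Success"
    UNSCHEDULABLE = "Unschedulable"
    UNSCHEDULABLE_AND_UNRESOLVABLE = "UnschedulableAndUnresolvable"


def predicate(node, task):
    """Hypothetical predicate: classify why a task does or does not fit."""
    if task["gpus"] > node["gpu_total"]:
        # The node can never satisfy the request, even if every task is evicted.
        return Status.UNSCHEDULABLE_AND_UNRESOLVABLE
    if task["gpus"] > node["gpu_total"] - node["gpu_used"]:
        # Evicting lower-priority tasks could free enough GPUs.
        return Status.UNSCHEDULABLE
    return Status.SUCCESS


def preemption_candidates(nodes, task):
    """Keep only nodes where eviction could make the task schedulable."""
    return [n["name"] for n in nodes if predicate(n, task) is Status.UNSCHEDULABLE]


nodes = [
    {"name": "n1", "gpu_total": 8, "gpu_used": 8},  # full, eviction may help
    {"name": "n2", "gpu_total": 2, "gpu_used": 0},  # too small, eviction cannot help
    {"name": "n3", "gpu_total": 8, "gpu_used": 2},  # fits without preemption
]
print(preemption_candidates(nodes, {"gpus": 4}))
```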
Support Elasticsearch monitoring systems in node load-aware scheduling and rescheduling
The state of a Kubernetes cluster changes in real time as tasks are created and terminated. In scenarios such as adding or deleting nodes, changing Pod and Node affinity, or dynamically changing job lifecycles, problems can occur, such as unbalanced resource utilization and nodes going offline because of performance bottlenecks. Load-aware scheduling and rescheduling can help users solve these problems.
Prior to Volcano version 1.8, load-aware scheduling and rescheduling supported only Prometheus. Starting from version 1.8, Volcano optimizes the framework for collecting monitoring metrics and adds support for the Elasticsearch monitoring system.
Refer to the links for more details.
Optimize Volcano's ability to schedule microservices
Add Kubernetes default scheduler plugin enable and disable switch
Volcano is a unified scheduling system that supports not only computing jobs such as AI and big data, but also microservice workloads. It is compatible with Kubernetes default scheduler plugins such as PodTopologySpread, VolumeZone, VolumeLimits, NodeAffinity, and PodAffinity, and these default scheduling plugin capabilities are enabled by default in Volcano.
Since Volcano 1.8, the Kubernetes default scheduling plugins can be turned on and off individually through the configuration file; all of them are turned on by default. To turn off some plugins, for example PodTopologySpread and VolumeZone, set the corresponding values in the predicate plugin to false.
Refer to the links for more details. (#2748, @jiangkaihua)
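The default-on behaviour can be modeled as a set of switches that are all true unless the configuration overrides them. The sketch below shows that merge logic only. The argument key names are assumptions written in the style of Volcano's predicate plugin arguments and should be checked against the linked pull request before use; `effective_switches` is a hypothetical helper.

```python
# Assumed key names, for illustration only; verify them against the Volcano docs.
ASSUMED_DEFAULTS = {
    "predicate.NodeAffinityEnable": True,
    "predicate.PodAffinityEnable": True,
    "predicate.NodeVolumeLimitsEnable": True,
    "predicate.VolumeZoneEnable": True,
    "predicate.PodTopologySpreadEnable": True,
}


def effective_switches(arguments):
    """Merge user-supplied arguments over the all-enabled defaults."""
    unknown = set(arguments) - set(ASSUMED_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown switches: {sorted(unknown)}")
    return {**ASSUMED_DEFAULTS, **arguments}


# Turn off PodTopologySpread and VolumeZone, leave everything else enabled.
switches = effective_switches({
    "predicate.PodTopologySpreadEnable": False,
    "predicate.VolumeZoneEnable": False,
})
print(sorted(name for name, enabled in switches.items() if not enabled))
```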
Enhance scheduler to keep compatibility with ClusterAutoscaler
In the Kubernetes platform, Volcano is not only used as a scheduler for batch computing services, but also used as a scheduler for general services. Node horizontal scaling is one of the core functions of Kubernetes, which plays an important role in coping with the surge of user traffic and saving operating costs. Volcano optimizes job scheduling and other related logic, and enhances the compatibility and interaction with ClusterAutoscaler, mainly in the following two aspects:
- Pods that enter the pipeline state during the scheduling phase trigger capacity expansion promptly.
- Candidate nodes are grouped into gradients to reduce the impact of terminating pods on scheduling load and to prevent pods from entering invalid pipeline states, which would trigger cluster expansion by mistake.
Refer to the links for more details. (#2782, #3000, @wangyang0616)
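One way to read the gradient idea is that nodes with enough idle resources right now are tried first, and nodes that only fit once terminating pods release their resources are tried second. The sketch below illustrates that reading with a single scalar resource; the field names (`idle`, `releasing`, `score`) and the `pick_node` function are hypothetical, and the real scheduler logic is more involved.

```python
def pick_node(nodes, request):
    """Return (node_name, decision); decision is "allocate", "pipeline", or None."""
    # Gradient 1: the request fits in resources that are idle right now.
    idle = [n for n in nodes if request <= n["idle"]]
    # Gradient 2: the request fits only after terminating pods release resources.
    future_idle = [
        n for n in nodes
        if n["idle"] < request <= n["idle"] + n["releasing"]
    ]
    for decision, gradient in (("allocate", idle), ("pipeline", future_idle)):
        if gradient:
            best = max(gradient, key=lambda n: n["score"])
            return best["name"], decision
    # No gradient fits: this is the case where cluster expansion is warranted.
    return None, None


nodes = [
    {"name": "n1", "idle": 1, "releasing": 4, "score": 90},
    {"name": "n2", "idle": 4, "releasing": 0, "score": 10},
]
print(pick_node(nodes, 2))
print(pick_node(nodes, 5))
```

In this sketch a lower-scored node with idle resources wins over a higher-scored node that is still releasing resources, so a pod is pipelined only when no node can host it immediately.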
Provide tolerance for exception of device plugin
When a device plugin crashes or fails to report resources for some reason, the total resource amount of the node can become less than the allocated resource amount. Volcano then considers the node data inconsistent, marks the node as OutOfSync, isolates it, and stops scheduling any new workload to it. This isolation mechanism has side effects on the cluster; for example, the device plugin itself has no chance to be scheduled to the OutOfSync node. In Volcano v1.8, the mechanism is enhanced to tolerate device plugin exceptions: non-GPU workloads, such as the device plugin, can still be scheduled to an OutOfSync node.
Refer to the link for more details. (#2999, @Monokaix)
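The tolerance described above can be sketched as rejecting only the tasks that request a resource whose accounting is inconsistent, rather than rejecting every task on the node. This is an illustrative interpretation, not Volcano's code: the function names and the dictionary layout are hypothetical, and ordinary capacity checks are left out.

```python
def out_of_sync_resources(allocatable, used):
    """Resources whose reported total dropped below what is already allocated."""
    return {
        name for name, amount in used.items()
        if amount > allocatable.get(name, 0)
    }


def can_schedule(allocatable, used, request):
    """Reject only tasks that request a resource with inconsistent accounting."""
    broken = out_of_sync_resources(allocatable, used)
    return not any(request.get(name, 0) > 0 for name in broken)


# Hypothetical node whose device plugin crashed and now reports zero GPUs.
allocatable = {"cpu": 8, "nvidia.com/gpu": 0}
used = {"cpu": 2, "nvidia.com/gpu": 4}

print(can_schedule(allocatable, used, {"cpu": 1}))
print(can_schedule(allocatable, used, {"cpu": 1, "nvidia.com/gpu": 1}))
```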
Add helm charts for Volcano
As more and more users run Volcano in production and cloud environments, a simple and standard installation process is crucial. Since version 1.8, Volcano has optimized the publishing and archiving of its chart packages, standardized the installation and usage process, and migrated the historical v1.6 and v1.7 versions to the new Helm repository.
Refer to the link for more details. (Volcano helm-charts, @wangyang0616)
Other Notable Changes
- rework device sharing in volcano (#2643, @archlitchi)
- style(resource_info): replace 0, -1 with Zero,Infinity (#2650, @kingeasternsun)
- perf(preempt): remove used copy (#2652, @kingeasternsun)
- Add podGroup completed phase (#2667, @waiterQ)
- delete redundant import alias (#2675, @Shoothzj)
- delete redundant type conversion (#2627, @Shoothzj)
- Extract MetricsClient and NodeMetrics to support other metrics platform (#2678, @Shoothzj)
- upgrade klog package version to latest (#2682, @waiterQ)
- Update how_to_use_gpu_sharing.md (#2686, @z2Zhang)
- Rename AddPrePredicateFn annotation (#2689, @zbbkeepgoing)
- Remove duplicate import in session.go (#2690, @zbbkeepgoing)
- Optimize e2e runtime: reduce pytorch-plugin image download time (#2691, @wangyang0616)
- Fix typo in tdm-plugin.md (#2692, @Shoothzj)
- volcano metrics source support elasticsearch (#2694, @Shoothzj)
- Skip stmt when tasks is empty (#2696, @zbbkeepgoing)
- Add rescheduling related location logs (#2698, @wangyang0616)
- introduce elasticsearch index name config (#2702, @Shoothzj)
- add the impact of updating and deleting jobflow and jobtemplate. Stat… (#2704, @zhoumingcheng)
- Decouple the scanning cycle from the processing cycle (#2705, @Shoothzj)
- Fix metrics source not read es addr (#2707, @Shoothzj)
- Support using other specified namespaces when installing with helm (#2709, @hwdef)
- Allow config es source username and password (#2712, @Shoothzj)
- Add method "ignorable" to Device interface (#2716, @archlitchi)
- Use the latest example-mpi mirror (#2722, @wangyang0616)
- Vgpu feature for volcano (#2724, @archlitchi)
- Allow config metrics source tls ignore (#2729, @Shoothzj)
- Update roadmap.md (#2730, @lixin963)
- Skip early when reclaimees is empty (#2732, @zbbkeepgoing)
- pod admission mutate patch affinity (#2737, @jiamin13579)
- Allow config hostname field name (#2742, @Shoothzj)
- optimize nodeorder logprint to easy find from pod; (#2743, @waiterQ)
- Add more enable switch for predicate plugin. (#2750, @jiangkaihua)
- Make errorCache configurable for predicate helper (#2756, @jinzhejz)
- Refactor namespace fairshare function. (#2757, @jiangkaihua)
- add label to service, name to port, to make it selectable via Service Monitors (#2767, @noyoshi)
- enhancement: add keyword about allocate/pipeline/evict information in log (#2776, @lowang-bh)
- Add more configuration options for helm chart (#2788, @Aakcht)
- verify-golangci-lint fails to run in CI, code merging is not allowed (#2808, @wangyang0616)
- Volcano automatically releases the corresponding helm charts package (#2823, @wangyang0616)
- added nodeSelector & affinity & tolerations (#2836, @medicharlachiranjeevi)
- modify code comment (#2847, @Yanping-io)
- abstract Bind() param to kubernetes.Interface (#2869, @Monokaix)
- docs: make user more easily to build custom plugin (#2893, @xiao-jay)
- make log more clear so that easy to debug (#2902, @lowang-bh)
- Refactor PredicateFn for allocate and preempt actions (#2916, @wangyang0616)
- Let the make manifest work properly (#2917, @wangyang0616)
- Add jobflow-related content to Volcano automatic installation and deployment (#2936, @wangyang0616)
- Upgrade the setup-go and checkout versions in the action && fix Spark CI (#2938, @wangyang0616)
- use one helm command to install resource (#2985, @lowang-bh)
- support concurrent-podgroup-syncs (#2997, @WulixuanS)
- remove node out of sync state (#2998, @Monokaix)
- Optimize the jobflow architecture design diagram (#3016, @wangyang0616)
- enhancement: make map with cap (#3020, @lowang-bh)
Bug Fixes
- fix: fix logic when deal with resource dimension not defined in ScalarResources of r (#2637, @kingeasternsun)
- Fix the problem exposed by golangci-lint static check (#2645, @wangyang0616)
- fix local-up-volcano.sh (#2648, @hwdef)
- add healthz for admission (#2651, @elinx)
- The allocate phase only updates the podgroup in the pending state to the inqueue state (#2658, @wangyang0616)
- little interface optimization for nodes others field (#2669, @waiterQ)
- Fix typo of nonblocking (#2688, @Shoothzj)
- fix incorrect camel case (#2703, @Shoothzj)
- Fix MPI example not working in IPv6 (#2714, @gengwg)
- fix daily release CI (#2715, @hwdef)
- Remove escape character to fix broken grafana dashboard configuration (#2718, @shaobo76)
- ginkgo upgraded from v2.3.0 to v2.8.3 (#2719, @wangyang0616)
- imagePullPolicy changed from IfNotPresent to Always (#2721, @wangyang0616)
- Avoid Panic when es has no data (#2726, @Shoothzj)
- Fix trunk CI grafana.yaml file parsing error (#2728, @wangyang0616)
- Fix: Multiplying Elasticsearch data by 100 (#2746, @Shoothzj)
- Fix the problem that the program crashes if the gpuShare Devices is empty (#2751, @wangyang0616)
- ignore no metrics server error log and lower actions, plugins enter leave log level (#2752, @waiterQ)
- Upgrade the ubuntu version to fix the problem that CI cannot run (#2769, @wangyang0616)
- Change the error message. (#2773, @gj199575)
- fix: high priority job of two tasks cannot preempt low priority job of one task (#2775, @wangyang0616)
- Use cases for blocking CI probabilistic failures (#2780, @wangyang0616)
- Ensure deletion of pods is handled (#2784, @Monokaix)
- Modify binpack score on nodes with releasing resources. (#2786, @jiangkaihua)
- fix: lint error (#2801, @lowang-bh)
- fix: undefined: buildJmpDirective on arm64 machine (#2805, @lowang-bh)
- fix: UT works fine and covers (#2811, @wangyang0616)
- fix some typos (#2817, @halegreen)
- fix: nodeinfo clone (#2827, @lowang-bh)
- fix issue that sa has no permit of leases when elect leader and enable leader elect by default (#2828, @lowang-bh)
- Fix: the admission-init error is reported when Volcano is installed repeatedly. (#2834, @wangyang0616)
- fix synchronous pvc/pv bind error when volcano schedule pod with pvc (#2844, @gj199575)
- fix goimport check error (#2849, @wangyang0616)
- Ensure delete event is handled (#2850, @WulixuanS)
- solve make error on darwin os (#2857, @lowang-bh)
- fix: log format param missing (#2862, @lowang-bh)
- Remove vendor (#2863, @xiao-jay)
- resolve unused variable warning (#2864, @lowang-bh)
- add jobflow rbac in controller's sa (#2868, @lowang-bh)
- make manifests and add crd about flow.volcano.sh (#2874, @lowang-bh)
- fix pods pending but vcjob is running (#2886, @renwenlong-github)
- install ginkgo bin if not exist (#2888, @lowang-bh)
- fix controller deadlock when wait dependson task (#2898, @renwenlong-github)
- fix local-up-volcano broken (#2905, @hwdef)
- fix PodTopologySpread plugin violates maxSkew (#2929, @Monokaix)
- fix some spelling issues reported by golint (#2942, @rayoluo)
- Fix panic issue with proportional scheduling (#2968, @Cdayz)
- fix scheduler metric e2e_scheduling_latency_milliseconds (#2970, @wulixuan)
- when Volcano is uninstalled, two resources will remain (#2980, @gj199575)
- add the job creation permission for jobflow controller (#2988, @william-wang)
- Fix: the pod pipeline status is incompatible with autoscaler capacity expansion (#3001, @wangyang0616)
- add the jobflows/status,jobs/finalizers,jobtemplates/status,jobtempla… (#3005, @Mufengzhe)
- Fix panic issue with job taskMinAvailable clone (#3008, @Tongruizhe)
- Fix the problem that resource info ut test fails probabilistically (#3028, @wangyang0616)