What's New
Welcome to the v1.13.0 release of Volcano! 🚀 🎉 📣
In this release, we have brought a series of significant enhancements that have been long-awaited by community users:
- AI Training and Inference Enhancements
  - Support LeaderWorkerSet for Large Model Inference Scenarios
  - Introduce Cron VolcanoJob
  - Support Label-based HyperNode Auto Discovery
  - Add Native Ray Framework Support
  - Introduce HCCL Plugin Support
- Resource Management and Scheduling Enhancements
  - Introduce ResourceStrategyFit Plugin
    - Independent Scoring Strategy by Resource Type
    - Scarce Resource Avoidance (SRA)
  - Enhance NodeGroup Functionality
- Colocation Enhancements
  - Decouple Colocation from OS
  - Support Custom OverSubscription Resource Names
Support LeaderWorkerSet for Large Model Inference Scenarios
LeaderWorkerSet (LWS) is an API for deploying a group of Pods on Kubernetes. It is primarily used to address multi-host inference in AI/ML inference workloads, especially scenarios that require sharding large language models (LLMs) and running them across multiple devices on multiple nodes.
Since its open-source release, Volcano has actively integrated with upstream and downstream ecosystems, building a comprehensive community ecosystem for AI and big data batch computing. In its v0.7 release, LWS natively integrated Volcano's AI scheduling capabilities: when used with the new version of Volcano, LWS automatically creates PodGroups, which Volcano then schedules and manages, bringing advanced capabilities like Gang scheduling to large model inference scenarios.
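For illustration, the sketch below shows a minimal LeaderWorkerSet whose leader and worker Pod templates opt in to Volcano scheduling by setting schedulerName: volcano. Field names follow the upstream LWS v1 API; the exact integration steps (including how the PodGroup is created) are described in the usage documentation linked below.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: llm-inference
spec:
  replicas: 2                    # two inference groups
  leaderWorkerTemplate:
    size: 4                      # 1 leader + 3 workers per group, gang-scheduled together
    leaderTemplate:
      spec:
        schedulerName: volcano   # let Volcano schedule the leader Pod
        containers:
        - name: leader
          image: <your-inference-image>
    workerTemplate:
      spec:
        schedulerName: volcano   # let Volcano schedule the worker Pods
        containers:
        - name: worker
          image: <your-inference-image>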
Looking ahead, Volcano will continue to expand its ecosystem integration capabilities, providing robust scheduling and resource management support for more projects dedicated to enabling distributed inference on Kubernetes.
Usage documentation: LeaderWorkerSet With Gang.
Related PRs: kubernetes-sigs/lws#496, kubernetes-sigs/lws#498, @JesseStutler
Introduce Cron VolcanoJob
This release introduces support for Cron Volcano Jobs. Users can now periodically create and run Volcano Jobs on a predefined schedule, similar to native Kubernetes CronJobs, enabling periodic execution of AI and big data batch workloads. Detailed features are as follows:
- Scheduled Execution: Define the execution cycle of jobs using standard Cron expressions (spec.schedule).
- Timezone Support: Set the timezone in spec.timeZone to ensure jobs execute at the expected local time.
- Concurrency Policy: Control concurrent behavior via spec.concurrencyPolicy:
  - AllowConcurrent: Allows concurrent execution of multiple jobs (default).
  - ForbidConcurrent: Skips the current scheduled execution if the previous job has not completed.
  - ReplaceConcurrent: Terminates the previous job if it is still running and starts a new one.
- History Management: Configure the number of successful (successfulJobsHistoryLimit) and failed (failedJobsHistoryLimit) job history records to retain; old jobs are automatically cleaned up.
- Missed Schedule Handling: The startingDeadlineSeconds field allows tolerating scheduling delays within a certain timeframe; timeouts are considered missed executions.
- Status Tracking: The CronJob status (status) tracks currently active jobs, the last scheduled time, and the last successful completion time for easier monitoring and management.
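As a hedged illustration of these fields, a Cron Volcano Job might look like the sketch below; the kind and apiVersion shown here are assumptions rather than confirmed API names, so refer to the linked usage example for the authoritative manifest.
# Illustrative sketch only: kind/apiVersion and jobTemplate field name are assumed
apiVersion: batch.volcano.sh/v1alpha1
kind: CronJob
metadata:
  name: nightly-training
spec:
  schedule: "0 2 * * *"                 # run at 02:00 every day
  timeZone: "Asia/Shanghai"             # execute at the expected local time
  concurrencyPolicy: ForbidConcurrent   # skip this run if the previous job has not completed
  startingDeadlineSeconds: 300          # tolerate up to 5 minutes of scheduling delay
  successfulJobsHistoryLimit: 3         # keep the 3 most recent successful jobs
  failedJobsHistoryLimit: 1             # keep the most recent failed job
  jobTemplate:                          # Volcano Job created on each scheduled run
    spec:
      schedulerName: volcano
      minAvailable: 2
      tasks:
      - replicas: 2
        name: worker
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: <your-batch-image>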
Related PRs: volcano-sh/apis#192, #4560, @GoingCharlie, @hwdef, @Monokaix
Usage example: Cron Volcano Job Example.
Support Label-based HyperNode Auto Discovery
Volcano officially launched its network topology-aware scheduling capability in v1.12 and pioneered a UFM-based auto-discovery mechanism for InfiniBand (IB) networks. However, for clusters whose hardware does not support IB or that use other network architectures (such as Ethernet), manually maintaining the network topology remains cumbersome.
To address this issue, the new version introduces a Label-based HyperNode auto-discovery mechanism. This feature provides users with a universal and flexible way to describe network topology, transforming complex topology management tasks into simple node label management.
This mechanism allows users to define the correspondence between topology levels and node labels in the volcano-controller-configmap. The Volcano controller periodically scans all nodes in the cluster and automatically performs the following tasks based on their labels:
- Automatic Topology Construction: Automatically builds multi-layer HyperNode topology structures from top to bottom (e.g., rack -> switch -> node) based on a set of labels on the nodes.
- Dynamic Maintenance: When node labels change, or nodes are added or removed, the controller automatically updates the members and structure of the HyperNodes, ensuring the topology information remains consistent with the cluster state.
- Support for Multiple Topology Types: Allows users to define multiple independent network topologies simultaneously to adapt to different hardware clusters (e.g., GPU clusters, NPU clusters) or different network partitions.
Configuration example:
# volcano-controller-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-controller-configmap
  namespace: volcano-system
data:
  volcano-controller.conf: |
    networkTopologyDiscovery:
    - source: label
      enabled: true
      interval: 10m  # Discovery interval
      config:
        networkTopologyTypes:
          # Define a topology type named topology-A
          topology-A:
            # Define topology levels, ordered from top to bottom
            - nodeLabel: "volcano.sh/hypercluster"  # Top-level HyperNode
            - nodeLabel: "volcano.sh/hypernode"     # Middle-level HyperNode
            - nodeLabel: "kubernetes.io/hostname"   # Bottom-level physical node
This feature is enabled by adding the label source to the Volcano controller's ConfigMap. The above configuration defines a three-layer topology structure named topology-A:
- Top Level (Tier 2): Defined by the volcano.sh/hypercluster label.
- Middle Level (Tier 1): Defined by the volcano.sh/hypernode label.
- Bottom Level: Physical nodes, identified by the Kubernetes built-in kubernetes.io/hostname label.
When a node is labeled as follows, it will be automatically recognized and classified into the topology path cluster-s4 -> node-group-s0:
# Labels for node node-0
labels:
  kubernetes.io/hostname: node-0
  volcano.sh/hypernode: node-group-s0
  volcano.sh/hypercluster: cluster-s4
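For reference, the controller would then place node-0 into a leaf-level HyperNode roughly like the hedged sketch below; field names follow the HyperNode API introduced with network topology-aware scheduling, while the generated object's exact name and metadata may differ.
# Hedged sketch of an auto-generated HyperNode (actual name/metadata may differ)
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: node-group-s0
spec:
  tier: 1                        # middle level (Tier 1) of topology-A
  members:
  - type: Node
    selector:
      exactMatch:
        name: node-0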
The label-based network topology auto-discovery feature offers excellent generality and flexibility. It is not dependent on specific network hardware (like IB), making it suitable for various heterogeneous clusters, and allows users to flexibly define hierarchical structures of any depth through labels. It automates complex topology maintenance tasks into simple node label management, significantly reducing operational costs and the risk of errors. Furthermore, this mechanism dynamically adapts to changes in cluster nodes and labels, maintaining the accuracy of topology information in real-time without manual intervention.
Usage documentation: HyperNode Auto Discovery.
Add Native Ray Framework Support
Ray is an open-source unified distributed computing framework whose core goal is to simplify parallel computing from single machines to large-scale clusters, especially suitable for scaling Python and AI applications. To manage and run Ray on Kubernetes, the community provides KubeRay—an operator specifically designed for Kubernetes. It acts as a bridge between Kubernetes and the Ray framework, greatly simplifying the deployment and management of Ray clusters and jobs.
Historically, running Ray workloads on Kubernetes primarily relied on the KubeRay Operator. KubeRay integrated Volcano in its v0.4.0 release (2022) for scheduling and resource management of Ray Clusters, addressing issues like resource deadlocks in distributed training scenarios. With this new version of Volcano, users can now also create and manage Ray clusters and submit computational tasks directly through native Volcano Jobs. This gives Ray users an alternative approach that more directly leverages Volcano capabilities such as Gang Scheduling, queue management and fair scheduling, and job lifecycle management when running Ray workloads.
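As a hedged sketch of this approach, the Ray job plugin is assumed below to be enabled through spec.plugins, the same mechanism used by Volcano's existing job plugins (svc, ssh); the plugin name and arguments should be verified against the user guide linked below.
# Illustrative sketch only: plugin arguments are assumed, see the Ray Plugin User Guide
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ray-cluster
spec:
  schedulerName: volcano
  minAvailable: 3                # gang-schedule the head and workers together
  plugins:
    ray: []                      # enable the Ray job plugin (name/arguments assumed)
    svc: []                      # headless service for in-cluster communication between Ray nodes
  tasks:
  - replicas: 1
    name: head                   # Ray head node
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:latest
  - replicas: 2
    name: worker                 # Ray worker nodes
    template:
      spec:
        containers:
        - name: ray-worker
          image: rayproject/ray:latest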
Design documentation: Ray Framework Plugin Design Doc.
Usage documentation: Ray Plugin User Guide.
Introduce HCCL Plugin Support
The new version adds an HCCL Rank plugin (hcclrank) to Volcano Jobs, used for automatically assigning HCCL Ranks to Pods in distributed tasks. This includes:
- A new implementation of the hcclrank plugin for Volcano Jobs, supporting automatic calculation and injection of the HCCL Rank into Pod annotations based on task type (master/worker) and index.
- Support for custom master/worker task names, allowing users to specify the master/worker roles in distributed tasks.
This feature enhances Volcano's native support for HCCL communication scenarios (for example, on Huawei Ascend NPU clusters), facilitating automatic management and assignment of Ranks in AI training tasks.
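A hedged sketch of enabling the plugin on a Volcano Job follows; the plugin is assumed here to be switched on via spec.plugins like other job plugins, and any arguments for customizing master/worker task names should be taken from the related PR rather than from this sketch.
# Illustrative sketch only: plugin arguments are assumed, see PR #4524 for authoritative usage
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ascend-training
spec:
  schedulerName: volcano
  minAvailable: 3
  plugins:
    hcclrank: []                 # inject HCCL Rank annotations based on task type and index
  tasks:
  - replicas: 1
    name: master                 # master task receives rank 0
    template:
      spec:
        containers:
        - name: trainer
          image: <your-ascend-training-image>
  - replicas: 2
    name: worker                 # worker ranks assigned by index
    template:
      spec:
        containers:
        - name: trainer
          image: <your-ascend-training-image>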
Related PR: #4524, @kingeasternsun
Enhance NodeGroup Functionality
In hierarchical queue structures, repeatedly configuring the same node group affinity (nodeGroupAffinity) on each sub-queue as on its parent queue leads to configuration redundancy and maintenance difficulties.
To solve this problem, the Nodegroup plugin adds support for inheriting affinity within hierarchical queues. Once enabled, the scheduler resolves the effective affinity for a queue according to the following rules:
- Prioritize Self-Configuration: If the queue has defined spec.affinity, it uses this configuration directly.
- Upward Inheritance: If the queue has not defined spec.affinity, the scheduler searches upward through its parents and inherits the affinity configuration defined by the nearest ancestor queue.
- Override Capability: A child queue can override the inherited configuration by defining its own spec.affinity, ensuring flexibility.
This feature allows administrators to set unified node group affinity at a parent queue (e.g., department level), and all child queues (e.g., team level) will automatically inherit this setting, simplifying management.
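A hedged sketch of this pattern is shown below, with a parent queue carrying the node group affinity and a child queue inheriting it; the affinity field layout follows the NodeGroup design documentation and should be verified against the user guide.
# Illustrative sketch: parent queue defines the affinity, child queue inherits it when enableHierarchy is on
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: dept-a                   # department-level parent queue
spec:
  affinity:
    nodeGroupAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - nodegroup-dept-a         # nodes labeled volcano.sh/nodegroup-name=nodegroup-dept-a
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-1                   # team-level child queue
spec:
  parent: dept-a                 # no spec.affinity here: inherits dept-a's node group affinity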
For queues without NodeAffinity configuration, the strict parameter in the plugin controls scheduling behavior. When strict is set to true (the default value), tasks in these queues cannot be scheduled to any nodes. When strict is set to false, these tasks are allowed to be scheduled to regular nodes that do not have the volcano.sh/nodegroup-name label.
In the nodegroup plugin parameters of the scheduler configuration file, setting enableHierarchy: true enables hierarchical queue mode, and setting strict: false configures non-strict mode. Example configuration is as follows:
actions: "allocate, backfill, preempt, reclaim"
tiers:
- plugins:
- name: nodegroup
arguments:
enableHierarchy: true # Enable hierarchical support
strict: false # Set to non-strict mode, allowing tasks in the queue to be scheduled to nodes without the "volcano.sh/nodegroup-name" label
Related PRs: #4455, @JesseStutler, @wuyueandrew
NodeGroup design documentation: NodeGroup Design.
NodeGroup usage documentation: NodeGroup User Guide.
Introduce ResourceStrategyFit Plugin
In the native Kubernetes NodeResourcesFit scoring strategy, only a single aggregated (MostAllocated) or dispersed (LeastAllocated) strategy can be applied to all resources. This is limiting in complex heterogeneous computing environments (such as AI/ML clusters). To meet differentiated scheduling requirements, Volcano introduces the enhanced ResourceStrategyFit plugin.
This plugin now integrates two core features: Independent scoring strategies by resource type and Scarce Resource Avoidance (SRA).
Independent Scoring Strategy by Resource Type
This feature allows users to independently specify MostAllocated (binpack) or LeastAllocated (spread) strategies for different resources (e.g., cpu, memory, nvidia.com/gpu) and assign different weights to them. The scheduler then scores each node based on the per-resource configuration.
To simplify the management of resources within the same family (e.g., different model GPUs from the same vendor), this feature also supports suffix wildcard (*) matching for resource names.
- Syntax Rules: Only suffix wildcards are supported, e.g., nvidia.com/gpu/*. Patterns like * or vendor.*/gpu are considered invalid.
- Matching Priority: Uses the "longest prefix match" principle. Exact matches have the highest priority; when no exact match exists, the wildcard pattern with the longest prefix is selected.
Configuration Example: The following configuration sets a high-priority binpack strategy for a specific V100 GPU model, a generic binpack strategy for all other NVIDIA GPUs, and a spread strategy for CPU resources. Pod-level resource scoring strategy configuration is also supported.
actions: "enqueue, allocate, backfill, reclaim, preempt"
tiers:
- plugins:
- name: resource-strategy-fit
arguments:
resourceStrategyFitWeight: 10
resources:
# Exact match, highest priority
nvidia.com/gpu-v100:
type: MostAllocated
weight: 3
# Wildcard match, applies to all other NVIDIA GPUs
nvidia.com/gpu/*:
type: MostAllocated
weight: 2
# Exact match for CPU resource
cpu:
type: LeastAllocated
weight: 1
Scarce Resource Avoidance (SRA)
SRA is a "soft" strategy designed to improve the overall utilization of expensive or scarce resources (like GPUs). It influences node scoring to guide ordinary tasks that do not require specific scarce resources (e.g., CPU-only tasks) to avoid nodes containing those resources where possible. This helps "reserve" scarce resource nodes for tasks that truly need them, thereby reducing resource contention and task waiting time.
Mechanism:
- Users define a set of "scarce resources" (e.g., nvidia.com/gpu) in the configuration.
- When scheduling a Pod that does not request any of the defined scarce resources, the SRA policy takes effect.
- The scheduler reduces the score of nodes that possess these scarce resources. The more types of scarce resources a node has, the lower its score.
- For Pods that do request scarce resources, the SRA policy does not negatively impact their scheduling decisions.
Configuration Example: The following configuration defines nvidia.com/gpu as a scarce resource. When scheduling a CPU-only task, nodes with GPUs will have their scores reduced, making the task more likely to be scheduled onto nodes without GPUs.
actions: "enqueue, allocate, backfill, reclaim, preempt"
tiers:
- plugins:
- name: resource-strategy-fit
arguments:
# ... binpack/spread strategy configuration for resourceStrategyFit ...
resources:
nvidia.com/gpu:
type: MostAllocated
weight: 2
cpu:
type: LeastAllocated
weight: 1
# SRA policy configuration
sra:
enable: true
resources: "nvidia.com/gpu" # Define scarce resource list, comma-separated
weight: 10 # Weight of the SRA policy in the total score
resourceWeight:
nvidia.com/gpu: 1 # Define nvidia.com/gpu as a scarce resource and its weight
By combining the binpack/spread strategies of ResourceStrategyFit with the avoidance strategy of SRA, users can achieve more refined and efficient scheduling of heterogeneous resources.
Related PRs: #4391, #4454, #4512, @LY-today, @XbaoWu, @ditingdapeng, @kingeasternsun
Design documentation: ResourceStrategyFit Design
Usage documentation: ResourceStrategyFit User Guide
Decouple Colocation from OS
Volcano's co-location capability consists of two parts: application-level and kernel-level. Application-level co-location provides unified scheduling for online and offline workloads, dynamic resource overcommitment, node pressure eviction, and so on. Kernel-level co-location provides QoS guarantees for resources such as CPU, memory, and network at the kernel level, which typically requires support from a specific OS (such as openEuler). In the new version, Volcano decouples the co-location capability from the OS: users running an OS without kernel-level co-location support can still use Volcano's application-level co-location capabilities to achieve unified scheduling of online and offline tasks, dynamic resource overcommitment, and high-priority task guarantees.
Specific usage: When installing the Volcano agent, specify the --supported-features parameter:
helm install volcano . --create-namespace -n volcano-system --set custom.colocation_enable=true --set "custom.agent_supported_features=OverSubscription\,Eviction\,Resources"
Related PRs: #4409, #4630, @ShuhanYan, @Monokaix
Colocation documentation: https://volcano.sh/en/docs/colocation/
Support Custom OverSubscription Resource Names
The Volcano co-location agent adds the parameters --extend-resource-cpu-name and --extend-resource-memory-name, allowing users to customize the names of overcommitted resources. Custom names can be set for CPU and memory resources (the defaults are kubernetes.io/batch-cpu and kubernetes.io/batch-memory, respectively), enhancing flexibility in naming overcommitted resources.
Specific usage: When installing Volcano, specify the --extend-resource-cpu-name and --extend-resource-memory-name parameters:
helm install volcano . --create-namespace -n volcano-system --set custom.colocation_enable=true --set custom.agent_extend_resource_cpu_name=example.com/cpu --set custom.agent_extend_resource_memory_name=example.com/memory
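With the custom names above, a best-effort (offline) workload would request the renamed overcommitted resources instead of the defaults. The sketch below is illustrative, and the assumption that the custom resources keep the default semantics (CPU expressed in millicores) should be checked against the colocation documentation.
# Illustrative sketch: offline Pod requesting custom-named overcommitted resources
apiVersion: v1
kind: Pod
metadata:
  name: offline-task
spec:
  schedulerName: volcano
  containers:
  - name: worker
    image: <your-batch-image>
    resources:
      requests:
        example.com/cpu: "2000"      # assumed millicore semantics, as with kubernetes.io/batch-cpu
        example.com/memory: "4Gi"
      limits:
        example.com/cpu: "2000"      # extended resources require requests == limits
        example.com/memory: "4Gi"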
Related PRs: #4413, #4630, @ShuhanYan, @Monokaix
Colocation documentation: https://volcano.sh/en/docs/colocation/
Add Kubernetes 1.33 Support
Volcano keeps pace with Kubernetes community releases. v1.13 supports the latest Kubernetes v1.33 release, with functionality and reliability validated through comprehensive UT and E2E test cases.
To participate in Volcano's adaptation work for new Kubernetes versions, refer to: adapt-k8s-todo.
Related PR: #4430, @mahdikhashan
Overall Changes
- Support topology aware in the preempt action by @bibibox in #4279
- Move InitCycleState from openSession to OpenSession by @ElectricFish7 in #4378
- fix: Node resource topology awareness, stop scheduling and notReady by @LY-today in #4373
- Support the allocation callback function provided by the extender. by @zhifei92 in #4377
- Delete secrets permission for volcano agent by @JesseStutler in #4389
- [Automation Enhancement] Update cherry-pick shell and add cherry-pick guide doc by @JesseStutler in #4366
- [Automation Enhancement]: Add bump version script; Make version release more automated by @JesseStutler in #4372
- fix incorrect scheduler log by @archlitchi in #4419
- fix: rolling restart admission webhooks after helm upgrade by @junzebao in #4396
- Only refresh podgroup to running when pods are scheduled by @JesseStutler in #4384
- Fix panic while vcctl queue list after podgroups' queue was deleted by @halcyon-r in #4428
- When some scalar resources are 0 in deserved, hierarychical queues validation can not pass by @JesseStutler in #4347
- Move kube-scheduler related metrics initilization to server.go to avoid panic by @JesseStutler in #4422
- fix: add ResourceStrategyFit plugin by @LY-today in #4391
- [feature] upgrade k8s to v1.33.2 by @mahdikhashan in #4430
- Update volcano author copyright header by @JesseStutler in #4379
- Add support for skipping handler registration and QoS manager initialization based on configuration in Volcano Agent by @ShuhanYan in #4409
- [doc] Jobflow parameter overrides by @mahdikhashan in #4411
- Support for configurable extended resource names in volcano agent by @ShuhanYan in #4413
- Fix: Sync Jobflow status with vcjob Status Terminated by @dongjiang1989 in #4443
- Correct step order to enable caching in E2E workflows by @GautamBytes in #4464
- fix enable node device score plugin by @coldzerofear in #4340
- Fix the issue where SelectBestNode returns nil when plugin scores are… by @guoqinwill in #4445
- [Cherry-pick master] Check queue spec to ensure the rationality of resource size by @XbaoWu in #4470
- fix: Volcano scheduler panic when scheduling Pods with delayed binding PVCs by @ouyangshengjia in #4484
- fix node count reconcile by @Monokaix in #4473
- Fix incorrect definition of ReleaseNameEnvKey by @ouyangshengjia in #4486
- Add missing capacity metrics in hierarchical queues by @JesseStutler in #4487
- Support configuring resources managed by the extender by @zhifei92 in #4482
- Set root capability only when user not set it by @houyuting in #4354
- Bump github.com/onsi/gomega from 1.35.1 to 1.38.0 by @dependabot[bot] in #4495
- inherit non-existent scalar resources from the parent queue by @lhlxc in #4504
- Fix volume binding e2e testing by @JesseStutler in #4483
- fix #4499: scheduler panic caused by pod using gpu-number scheduled by volcano with old version (e.g v1.7) by @linuxfhy in #4500
- test: add e2e test cases for hypernode by @Xu-Wentao in #3983
- fix#4497: hypernode update bug by @cyf-2002 in #4498
- Integrate with leaderworkerset: Volcano controller doesn't need to create podgroup for statefulset if statefulset pods have associated podgroup by @JesseStutler in #4478
- fix: update podGroup when statefulSet update by @Poor12 in #4517
- fix: deepcopy podgroup before update by @Poor12 in #4526
- Fix dra flaky test: use exec env instead logs by @JesseStutler in #4531
- ✨ add hcclrank job plugin by @kingeasternsun in #4524
- add kingeasternsun as reviewer by @kingeasternsun in #4561
- Fix panic in job controller's killPods action by @neo502721 in #4569
- Add permissions for managing namespaces in admission rules by @suyiiyii in #4590
- Revert "fix: Node resource topology awareness, stop scheduling and notReady" by @dafu-wu in #4575
- Minor docs changes by @hajnalmt in #4587
- Fix queue not counting volcano.sh/vgpu-memory and volcano.sh/vgpu-number properly by @archlitchi in #4520
- feat: support wildcard syntax in resource-strategy-fit plugin by @ditingdapeng in #4512
- use node.futureidle instead when pod has nominatedNodeName by @Monokaix in #4588
- network topology aware hard mode support hypernode binpack by @kingeasternsun in #4345
- Sync kube-scheduler: Improve CSILimits plugin accuracy by using VolumeAttachments by @guoqinwill in #4608
- 🐛 fix mpi job plugin panic when mpi job only has master task by @kingeasternsun in #4610
- Fix an issue in vc-scheduler about vgpu device memory allocated by @archlitchi in #4615
- feat: add detail msg for pg event by @Poor12 in #4544
- Cleanup: update MAINTAINERS.md with latest content location by @kevin-wangzefeng in #4622
- Expose more helm config for agent by @Monokaix in #4630
- Add hierarchical queue support for nodegroup plugin by @JesseStutler in #4455
- fix: address remaining review comments from PR #4457 for ResourceStrategyFit plugin by @ditingdapeng in #4635
- fix: fix ResourceStrategyFit plugin by @LY-today in #4457
- feat: add cron volcano job by @GoingCharlie in #4560
- feat: add ray plugin for job by @Wonki4 in #4581
- Add sra policy for ResourceStrategyFit Plugin by @XbaoWu in #4454
- Support identifying network topology from node labels and converted i… by @zhaoqi612 in #4629
- Enhance README formatting with note callouts by @huntersman in #4623
- Free up disk space by @Monokaix in #4642
- fix: report all scalar metrics for each queue by @hajnalmt in #4599
- Support configuring network-topology via pod annotations by @zhifei92 in #4583
- Fix panic on volcano-vgpu when allocating multiple containers in a pod by @archlitchi in #4633
- feat(resource-strategy-fit): add per-Pod scoring strategy by @kingeasternsun in #4641
- Fix bump version script by @JesseStutler in #4519
- Add nodegroup nonstrict mode by @JesseStutler in #4652
- add nodegroup nonstrict to fit 0nodes queue by @wuyueandrew in #4602
- Revise the user documentation for hypernode auto discovery by @zhaoqi612 in #4654
- Automated: Bump version to v1.13.0 by @JesseStutler in #4655
New Contributors
- @ElectricFish7 made their first contribution in #4378
- @zhifei92 made their first contribution in #4377
- @junzebao made their first contribution in #4396
- @ShuhanYan made their first contribution in #4409
- @GautamBytes made their first contribution in #4464
- @coldzerofear made their first contribution in #4340
- @houyuting made their first contribution in #4354
- @lhlxc made their first contribution in #4504
- @cyf-2002 made their first contribution in #4498
- @neo502721 made their first contribution in #4569
- @suyiiyii made their first contribution in #4590
- @dafu-wu made their first contribution in #4575
- @ditingdapeng made their first contribution in #4512
- @GoingCharlie made their first contribution in #4560
- @Wonki4 made their first contribution in #4581
- @zhaoqi612 made their first contribution in #4629
- @huntersman made their first contribution in #4623
Full Changelog: v1.12.2...v1.13.0