Summary
Volcano v1.14.0 establishes Volcano as a unified scheduling platform for diverse workloads at scale. This release introduces a scalable multi-scheduler architecture with dynamic node scheduling shard, enabling multiple schedulers to coordinate efficiently across large clusters. A new Agent Scheduler provides fast scheduling for latency-sensitive AI Agent workloads while working seamlessly with the Volcano batch scheduler. Network topology aware scheduling gains significant enhancements including HyperNode-level binpacking, SubGroup policies, and multi-level gang scheduling across Job and SubGroup scopes. Volcano Global integration advances with HyperJob for multi-cluster training and data-aware scheduling. Colocation now supports generic operating systems with CPU Throttling, Memory QoS, and Cgroup V2. Additionally, integrated Ascend vNPU scheduling enables efficient sharing of Ascend AI accelerators.
What's New
Key Features Overview
- Scalable Multi-Scheduler with Dynamic Node Scheduling Shard (Alpha): Dynamically compute candidate node pools for schedulers with extensible strategies
- Fast Scheduling for AI Agent Workloads (Alpha): A new Agent Scheduler for latency-sensitive AI Agent workloads is introduced, working in coordination with the Volcano batch scheduler to establish a unified scheduling platform
- Network Topology Aware Scheduling Enhancements: Support HyperNode-level binpacking, SubGroup-level network topology aware scheduling, and multi-level gang scheduling across Job and SubGroup scopes for distributed workloads
- Volcano Global Enhancements: HyperJob for multi-cluster training and data-aware scheduling for federated environments
- Colocation for Generic OS: CPU Throttling, Memory QoS, CPU Burst with Cgroup V2 support on Ubuntu, CentOS, and other generic operating systems
- Ascend vNPU Scheduling: Integrated support for Ascend 310P/910 series vNPU scheduling with MindCluster and HAMi modes
Key Feature Details
Scalable Multi-Scheduler with Dynamic Node Scheduling Shard (Alpha)
Background and Motivation:
As Volcano evolves to support diverse scheduling workloads at massive scale, the single scheduler architecture faces significant challenges. Different workload types (batch training, AI agents, microservices) have distinct scheduling requirements and resource utilization patterns. A single scheduler becomes a bottleneck, and static resource allocation leads to inefficient cluster utilization.
The Sharding Controller introduces a scalable multi-scheduler architecture that dynamically computes a candidate node pool for each scheduler. Rather than enforcing hard isolation between schedulers through strict partitioning, it keeps these candidate pools dynamic and flexible. This approach enables Volcano to serve as a unified scheduling platform for diverse workloads while maintaining high throughput and low latency.
Alpha Feature Notice: This feature is currently in alpha stage. The NodeShard CRD (Node Scheduling Shard) API structure and the underlying scheduling shard concepts are actively evolving.
Key Capabilities:
- Dynamic Node Scheduling Shard Strategies: Compute dynamic candidate node pools based on various policies. Currently supports scheduling shard by CPU utilization, with an extensible design to support more policies in the future.
- NodeShard CRD: Manages dynamic candidate node pools for specific schedulers.
- Large-scale Cluster Support: Architecture designed to support large-scale clusters by distributing load across multiple schedulers
- Scheduler Coordination: Enable seamless coordination among various scheduler combinations (e.g., multiple Batch Schedulers, or a mix of Agent and Batch Schedulers), establishing Volcano as a unified scheduling platform
Configuration:
# Sharding Controller startup flags
--scheduler-configs="volcano:volcano:0.0:0.6:false:2:100,agent-scheduler:agent:0.7:1.0:true:2:100"
--shard-sync-period=60s
--enable-node-event-trigger=true
# Config format: name:type:min_util:max_util:prefer_warmup:min_nodes:max_nodes
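As an illustrative reading of this format, the first entry above, volcano:volcano:0.0:0.6:false:2:100, declares a scheduler named volcano of type volcano whose candidate node pool targets nodes in the 0.0–0.6 CPU utilization band, does not prefer warmed-up nodes, and contains between 2 and 100 nodes; since the feature is alpha, the exact field semantics may still change.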
Related:
- PR: #4777
- Design Doc: Sharding Controller Design
- Contributors: @ssfffss, @Haoran, @qi-min
Fast Scheduling for AI Agent Workloads (Alpha)
Background and Motivation:
AI Agent workloads are latency-sensitive with frequent task creation, requiring ultra-fast scheduling with high throughput. The Volcano batch scheduler is optimized for batch workloads and processes pods at fixed intervals, which cannot guarantee low latency for Agent workloads. To establish Volcano as a unified scheduling platform for both batch and latency-sensitive workloads, we introduce a dedicated Agent Scheduler.
The Agent Scheduler works in coordination with the Volcano batch scheduler through the Sharding Controller (introduced in the "Scalable Multi-Scheduler with Dynamic Node Scheduling Shard" feature above). This architecture positions Volcano as a unified scheduling platform capable of handling diverse workload types.
Alpha Feature Notice: This feature is currently in alpha stage and under active development. The Agent Scheduler related APIs, configuration options, and scheduling algorithms may be refined in future releases.
Key Capabilities:
- Fast-Path Scheduling: Independent scheduler optimized for latency-sensitive workloads such as AI Agent workloads
- Multi-Worker Parallel Scheduling: Multiple workers process pods concurrently from the scheduling queue, increasing throughput
- Optimistic Concurrency Control: Conflict-Aware Binder resolves scheduling conflicts before executing real binding
- Optimized Scheduling Queue: Enhanced queue mechanism with urgent retry support
- Unified Platform Integration: Seamless coordination with Volcano batch scheduler via Sharding Controller
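For illustration, a latency-sensitive Agent pod is steered to the Agent Scheduler through its schedulerName. A minimal sketch, assuming the scheduler is deployed under the agent-scheduler name used in the Sharding Controller example above (the pod name and image are hypothetical):
apiVersion: v1
kind: Pod
metadata:
  name: ai-agent-task
spec:
  schedulerName: agent-scheduler # assumed deployment name; adjust to your installation
  containers:
  - name: agent
    image: agent-image:v1 # hypothetical image
    resources:
      requests:
        cpu: "1"
        memory: 2Gi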
Related:
- Issue: #4722
- PRs: #4804, #4801, #4805
- Design Doc: Agent Scheduler Design
- Contributors: @qi-min, @JesseStutler, @handan-yxh
Network Topology Aware Scheduling Enhancements
Background and Motivation:
Volcano v1.14.0 brings significant enhancements to network topology aware scheduling, addressing the growing demands of distributed workloads including LLM training, HPC, and other network-intensive applications.
Key Enhancements:
- SubGroup Level Topology Awareness: Support fine-grained network topology constraints at the SubGroup/Partition level.
- Flexible Network Tier Configuration: Support highestTierName for specifying maximum network tier constraints by name.
- Multi-Level Gang Scheduling: Improved gang scheduling to support both Job-level and SubGroup-level consistency.
- Volcano Job Partitioning: Enable partitioning of Volcano Jobs for better resource management and fault isolation.
- HyperNode-Level Binpacking: Optimization for resource utilization across network topology boundaries.
Configuration Example - Volcano Job:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-training-job
spec:
  # ...other fields
  networkTopology:
    mode: hard
    highestTierAllowed: 2 # Job can cross up to Tier 2 HyperNodes
  tasks:
  - name: trainer
    replicas: 8
    partitionPolicy:
      totalPartitions: 2 # Split into 2 partitions
      partitionSize: 4 # 4 pods per partition
      minPartitions: 2 # Minimum 2 partitions required
    networkTopology:
      mode: hard
      highestTierAllowed: 1 # Each partition must stay within Tier 1
    template:
      spec:
        containers:
        - name: trainer
          image: training-image:v1
          resources:
            requests:
              nvidia.com/gpu: 8
Configuration Example - PodGroup SubGroupPolicy:
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llm-training-pg
spec:
  minMember: 4
  networkTopology:
    mode: hard
    highestTierAllowed: 2
    subGroupPolicy:
    - name: "trainer"
      subGroupSize: 4
      labelSelector:
        matchLabels:
          volcano.sh/task-spec: trainer
      matchLabelKeys:
      - volcano.sh/partition-id
      networkTopology:
        mode: hard
        highestTierAllowed: 1
Related:
- Issues: #4188, #4368, #4869
- PRs: #4721, #4810, #4795, #4785, #4889
- Design Doc: Network Topology Aware Scheduling
- Contributors: @ouyangshengjia, @3sunny, @zhaoqi, @wangyang0616, @MondayCha, @Tau721
Colocation for Generic OS
This release brings comprehensive improvements to Volcano's colocation capabilities, with a major milestone: support for generic operating systems (Ubuntu, CentOS, etc.) in addition to OpenEuler. This enables broader adoption of Volcano Agent for resource sharing between online and offline workloads.
New Features in v1.14.0:
- CPU Throttling (CPU Suppression)
The CPU usage of online pods dynamically changes. To better isolate online and offline workloads, the CPU quota allocated to offline pods needs to change dynamically according to the actual usage of online pods. When offline pods consume more CPU than their quota, CPU suppression is triggered; if not exceeded, their quota can gradually recover, enabling adaptive resource allocation.
Key design:
- Dynamically adjusts BestEffort root cgroup CPU quota based on node allocatable CPU and real-time usage
- Follows a "monitor-event-handler" architecture with conservative updates to avoid jitter
cpuThrottlingConfig:
  enable: true
  cpuThrottlingThreshold: 80 # Allow BE quota up to 80% of allocatable CPU
  cpuJitterLimitPercent: 1 # Emit updates when quota changes by >=1%
  cpuRecoverLimitPercent: 10 # Cap quota increases to 10% per update
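As an illustration, on a node with 32 allocatable cores this configuration caps the BestEffort quota at 25.6 cores (80%), suppresses quota updates smaller than roughly 0.32 cores (1%), and raises the quota by at most about 3.2 cores (10%) in a single update, assuming all three percentages are interpreted relative to allocatable CPU.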
- Memory QoS (Cgroup V2)
Cgroup V2 based memory isolation for colocation environments. This feature introduces the ColocationConfiguration CRD, which allows users to define memory QoS policies for specific workloads. The system automatically applies these policies to matching pods, calculating Cgroup V2 settings (memory.high, memory.low, memory.min) based on the configured percentages and the pod's memory requests/limits.
Key capabilities:
- New API: ColocationConfiguration CRD for defining memory isolation policies via label selectors
- Dynamic Calculation:
  memory.high = pod.limits.memory * highRatio%
  memory.low = pod.requests.memory * lowRatio%
  memory.min = pod.requests.memory * minRatio%
- Unified Interface: Robust detection and support for Cgroup V2 environment
Usage Example:
apiVersion: config.volcano.sh/v1alpha1
kind: ColocationConfiguration
metadata:
  name: colo-config1
spec:
  selector:
    matchLabels:
      app: offline-test
  memoryQos:
    # memory.high = memory.limits * 100% (No throttling if 100)
    highRatio: 100
    # memory.low = memory.requests * 50% (Protect 50% of request)
    lowRatio: 50
    # memory.min = memory.requests * 0% (No absolute guarantee)
    minRatio: 0
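For example, with this configuration an offline-test pod that requests 4Gi of memory and limits 8Gi (hypothetical sizes) would receive memory.high = 8Gi (100% of the limit), memory.low = 2Gi (50% of the request), and memory.min = 0.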
- CPU Burst for Generic OS
Extended CPU Burst support to generic operating systems (Ubuntu, CentOS, etc.), allowing latency-sensitive workloads to burst beyond their CPU limits when resources are available.
cpuBurstConfig:
  enable: true
- Cgroup V2 Full Support
Volcano Agent now fully supports Cgroup V2 environments, ensuring seamless operation on modern Linux distributions.
- Automatic Detection: Automatically detects cgroup version (v1/v2) and driver type (systemd/cgroupfs) without manual configuration.
- Unified Interface: Adapts internal resource managers (CPU, Memory) to work with both v1 and v2 interfaces.
Related:
- Issues: #4631, #4466, #4468, #4912
- PRs: #4632, #4945, #4913, #4984
- Design Docs: CPU Throttle Design, Agent Cgroup V2 Adaptation
- Contributors: @Haibara-Ai97, @JesseStutler, @ouyangshengjia
Ascend vNPU Scheduling
Background and Motivation:
Volcano v1.14.0 introduces integrated support for Ascend vNPU (virtual NPU) scheduling, enabling efficient sharing of Ascend AI processors across multiple workloads. This feature supports two modes to accommodate different deployment scenarios.
Supported Modes:
- MindCluster Mode
  - Integrated from the official Ascend cluster scheduling add-on
  - Supports Ascend 310P series with dynamic virtualization
  - Uses the huawei.com/npu-core resource name
  - Supports vnpu-level (low/high) and vnpu-dvpp configurations
- HAMi Mode
  - Developed by the HAMi community
  - Supports both Ascend 310 and 910 series
  - Supports heterogeneous Ascend clusters (910A, 910B2, 910B3, 310P)
  - Memory-based virtualization with automatic alignment
Scheduler Configuration:
# MindCluster Mode
- name: deviceshare
  arguments:
    deviceshare.AscendMindClusterVNPUEnable: true
# HAMi Mode
- name: deviceshare
  arguments:
    deviceshare.AscendHAMiVNPUEnable: true
    deviceshare.SchedulePolicy: binpack # or spread
Usage Example (HAMi Mode):
apiVersion: v1
kind: Pod
metadata:
  name: ascend-pod
spec:
  schedulerName: volcano
  containers:
  - name: inference
    image: ascend-image:v1
    resources:
      limits:
        huawei.com/Ascend310P: "1"
        huawei.com/Ascend310P-memory: "4096"
Related:
- Issues: #4718
- PRs: #4656, #4717
- User Guide: How to Use vNPU
- Contributors: @JackyTYang, @DSFans2014
Volcano Global Enhancements
The new version of Volcano Global further enhances its multi-cluster scheduling capabilities.
Key Enhancements:
- HyperJob: Enables large-scale cross-cluster training by orchestrating jobs across multiple clusters.
- Data-Aware Scheduling: Optimizes workload placement based on data locality.
For detailed information, please refer to the Volcano Global v0.3.0 Release Notes.
Related:
- Design Doc: HyperJob Design
- Contributors: @JesseStutler, @fx147
Scheduler Stability and Performance
Reclaim Refactoring and Enhancements
The Reclaim mechanism has been significantly improved through a comprehensive refactor of the Reclaim Action and critical logic fixes in the Capacity Plugin. These changes collectively enhance the accuracy, stability, and performance of resource reclamation in multi-tenant clusters.
Key improvements:
- Reclaim Action Refactoring: The reclaim workflow has been restructured to improve code readability, maintainability, and test coverage.
- Enhanced Capacity Plugin Logic:
  - Precise Victim Selection: Fixed reclaimableFn to correctly handle scalar resources and ensure victims are selected based on all relevant resource dimensions, taking the reclaimer's requirements into account.
  - Accurate Resource Comparison: Corrected preemptiveFn resource comparison logic to prevent incorrect preemption decisions when specific resource dimensions are zero.
- Improved Stability: Addressed edge cases in resource calculation to prevent scheduling loops and incorrect evictions.
Related:
Other Notable Changes
API Changes
New CRDs and API enhancements in v1.14.0:
1. NodeShard CRD (Alpha) (@ssfffss, @Haoran, @qi-min)
Node Scheduling Shard CRD. Manages dynamic candidate node pools for specific schedulers.
2. HyperJob CRD (@JesseStutler)
Multi-cluster training job abstraction for the Volcano Global enhancements.
3. DataSource/DataSourceClaim CRDs (@fx147)
Data-aware scheduling for federated environments.
4. SubGroupPolicy API (@wangyang0616)
Defines fine-grained network topology constraints and gang scheduling policies at the SubGroup level. See Network Topology Aware Scheduling section for details.
5. DequeueStrategy (@wangdongyang1)
New configuration for queue dequeue strategies.
6. Kubebuilder Validation Markers (@zhoujinyu)
Added CEL validation markers to batch, scheduling, topology, and flow APIs.
7. ColocationConfiguration CRD (@ouyangshengjia)
Introduced ColocationConfiguration CRD to configure colocation policies (currently supporting Memory QoS). This API allows users to define percentage-based coefficients (range 0-100) for highRatio, lowRatio, and minRatio, which are multiplied with pod memory limits or requests to dynamically calculate Cgroup V2 interfaces (memory.high, memory.low, memory.min).
Features and Enhancements
- ValidatingAdmissionPolicy/MutatingAdmissionPolicy: Optional enhanced admission mechanisms (#4596, @suyiiyii)
- GitHub Container Registry support: Added GHCR as an alternative to Docker Hub (#4845, #4846, @medyagh)
- In-place Pod Update: Support for updating pods without recreation (#3839, @guoqinwill)
- Scheduler performance dashboards: Added Grafana dashboards for scheduler metrics (#4885, @guoqinwill)
- Distributed Inference Support (Kthena): Enables role-based elasticity and Gang Scheduling for distributed inference workloads (#4781, @wangyang0616)
- E2E test improvements: Enhanced context collection on test failures (#4764, @neeraj542)
- Staging directory for APIs: Introduced staging directory structure for volcano.sh/apis (#4823, @PersistentJZH)
Bug Fixes
- Scheduling main loop blocked: Fixed timeout due to un-released PreBind lock (#4692, @guoqinwill)
- Scheduler panic on hierarchy queue: Fixed panic when child queue capability check fails (#4673, @hajnalmt, @zhengjiajin)
- minAvailable mismatch: Fixed job pending state caused by minAvailable/minPartitions mismatch (#4867, @zhengchenyu)
- Job aborting state transition: Fixed early transition to Aborted when terminating pods are removed (#4828, @kingeasternsun)
- Memory leak from metrics: Fixed memory leak caused by leftover metrics data due to race conditions (#4821, @halcyon-r)
- Queue hierarchy validation: Fixed cluster-wide scheduling failure due to missing validation for child queue guarantee sum (#4818, @zjj2wry)
- RestartTask policy on PodEvicted: Fixed unexpected behavior when pod is evicted (#4814, @ouyangshengjia)
- Task validity check failure: Fixed scheduling failures caused by task member changes in PodGroup (#4802, @ouyangshengjia)
- Scheduler memory leak: Fixed memory leak in volcano-scheduler deletedJobs queue (#4745, @fengruotj)
- VGPU multi-card memory bug: Fixed VGPU request when multiple cards have different total video memory (#4713, @weapons97)
- GPU resource calculation error: Fixed gpu resource error at allocate state in checkNodeGPUSharingPredicateAndScore (#4884, @ChenW66)
- Capacity plugin preemptiveFn: Fixed resource comparison preventing reclaim (#4658, @hajnalmt)
- Capacity plugin queue share: Fixed queue share calculation for best-effort queues (#4865, @guoqinwill)
- Colocation cgroup configuration: Fixed systemd driver with containerd runtime oversubscribed cgroups (#4912, @JesseStutler)
- NUMA topology snapshot panic: Fixed panic when numatopology is added and updated in short time before snapshot (#4891, @qi-min)
- FailedScheduling message: Fixed inaccurate message when podgroup is ready (#4978, @ruanwenjun)
- Nil pod handling: Fixed nil pod when AlreadyExists error on create (#4985, @aman-kumar)
- vNPU multi-type cluster: Fixed vNPU not working in multi-type NPU clusters (#4779, @kiritoxkiriko)
- Hami vNPU memory allocation: Fixed memory allocation restrictions in Hami vNPU scene (#4859, @haorenhui)
- MindCluster bugs: Fixed multiple bugs in MindCluster vNPU mode (#4775, @DSFans2014)
- PodGroup controller replicaset annotation: Fixed annotation handling for replicaset (#4792, @hajnalmt)
- Reclaimed resource calculation: Fixed to add node futureIdle resource first (#4774, @jiahuat)
- syncTask and retryResyncTask: Fixed cache communication issues (#4938, @hajnalmt)
Dependencies
Upgrade Instructions
To upgrade to Volcano v1.14.0:
# Using Helm
helm repo update
helm upgrade volcano volcano-sh/volcano --version 1.14.0
# Using kubectl
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.14.0/installer/volcano-development.yaml
Note: Review the new CRDs (NodeShard, HyperJob) if you plan to use the Agent Scheduler, Sharding Controller, or cross-cluster training features.
Contributors
Thank you to all contributors who made this release possible:
@3sunny, @3th4novo, @acsoto, @Aman-Cool, @archlitchi, @dafu-wu, @DSFans2014, @FAUST-BENCHOU, @fengruotj, @Freshwlnd, @goyalpalak18, @guoqinwill, @Haibara-Ai97, @hajnalmt, @halcyon-r, @handan-yxh, @JackyTYang, @JesseStutler, @jiahuat, @kingeasternsun, @kiritoxkiriko, @kube-gopher, @LiZhenCheng9527, @medyagh, @MondayCha, @Monokaix, @mvinchoo, @neeraj542, @nitindhiman314e, @ouyangshengjia, @PersistentJZH, @qi-min, @rhh777, @ruanwenjun, @RushabhMehta2005, @sailorvii, @ssfffss, @suyiiyii, @Tau721, @tunedev, @wangyang0616, @weapons97, @Wonki4, @zhaoqi612, @zhengchenyu, @zjj2wry