volcano-sh/volcano v1.14.0

Summary

Volcano v1.14.0 establishes Volcano as a unified scheduling platform for diverse workloads at scale. This release introduces a scalable multi-scheduler architecture with dynamic node scheduling shards, enabling multiple schedulers to coordinate efficiently across large clusters. A new Agent Scheduler provides fast scheduling for latency-sensitive AI Agent workloads while working seamlessly with the Volcano batch scheduler. Network topology aware scheduling gains significant enhancements including HyperNode-level binpacking, SubGroup policies, and multi-level gang scheduling across Job and SubGroup scopes. Volcano Global integration advances with HyperJob for multi-cluster training and data-aware scheduling. Colocation now supports generic operating systems with CPU Throttling, Memory QoS, and Cgroup V2. Additionally, integrated Ascend vNPU scheduling enables efficient sharing of Ascend AI accelerators.

What's New

Key Features Overview

  • Scalable Multi-Scheduler with Dynamic Node Scheduling Shard (Alpha): Dynamically compute candidate node pools for schedulers with extensible strategies
  • Fast Scheduling for AI Agent Workloads (Alpha): A new Agent Scheduler for latency-sensitive AI Agent workloads is introduced, working in coordination with Volcano batch scheduler to establish a unified scheduling platform
  • Network Topology Aware Scheduling Enhancements: Support HyperNode-level binpacking, SubGroup-level network topology aware scheduling, and multi-level gang scheduling across Job and SubGroup scopes for distributed workloads
  • Volcano Global Enhancements: HyperJob for multi-cluster training and data-aware scheduling for federated environments
  • Colocation for Generic OS: CPU Throttling, Memory QoS, CPU Burst with Cgroup V2 support on Ubuntu, CentOS, and other generic operating systems
  • Ascend vNPU Scheduling: Integrated support for Ascend 310P/910 series vNPU scheduling with MindCluster and HAMi modes

Key Feature Details

Scalable Multi-Scheduler with Dynamic Node Scheduling Shard (Alpha)

Background and Motivation:

As Volcano evolves to support diverse scheduling workloads at massive scale, the single scheduler architecture faces significant challenges. Different workload types (batch training, AI agents, microservices) have distinct scheduling requirements and resource utilization patterns. A single scheduler becomes a bottleneck, and static resource allocation leads to inefficient cluster utilization.

The Sharding Controller introduces a scalable multi-scheduler architecture that dynamically computes candidate node pools for each scheduler. Unlike strict partitioning, the Sharding Controller calculates dynamic candidate node pools rather than enforcing hard isolation between schedulers. This flexible approach enables Volcano to serve as a unified scheduling platform for diverse workloads while maintaining high throughput and low latency.

Alpha Feature Notice: This feature is currently in alpha stage. The NodeShard CRD (Node Scheduling Shard) API structure and the underlying scheduling shard concepts are actively evolving.

Key Capabilities:

  • Dynamic Node Scheduling Shard Strategies: Compute dynamic candidate node pools based on various policies. Currently supports sharding by CPU utilization, with an extensible design to support more policies in the future.
  • NodeShard CRD: Manages dynamic candidate node pools for specific schedulers.
  • Large-scale Cluster Support: Architecture designed to support large-scale clusters by distributing load across multiple schedulers
  • Scheduler Coordination: Enable seamless coordination among various scheduler combinations (e.g., multiple Batch Schedulers, or a mix of Agent and Batch Schedulers), establishing Volcano as a unified scheduling platform

Configuration:

# Sharding Controller startup flags
--scheduler-configs="volcano:volcano:0.0:0.6:false:2:100,agent-scheduler:agent:0.7:1.0:true:2:100"
--shard-sync-period=60s
--enable-node-event-trigger=true

# Config format: name:type:min_util:max_util:prefer_warmup:min_nodes:max_nodes
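# Reading the first entry above against this format: a scheduler named "volcano" of type
# "volcano", with candidate nodes in the 0.0-0.6 CPU-utilization band, warmup preference
# disabled, and a candidate pool bounded to 2-100 nodes (an interpretation of the
# positional fields, shown here for illustration).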

Fast Scheduling for AI Agent Workloads (Alpha)

Background and Motivation:

AI Agent workloads are latency-sensitive with frequent task creation, requiring ultra-fast scheduling with high throughput. The Volcano batch scheduler is optimized for batch workloads and processes pods at fixed intervals, which cannot guarantee low latency for Agent workloads. To establish Volcano as a unified scheduling platform for both batch and latency-sensitive workloads, we introduce a dedicated Agent Scheduler.

The Agent Scheduler works in coordination with the Volcano batch scheduler through the Sharding Controller (introduced in the "Scalable Multi-Scheduler with Dynamic Node Scheduling Shard" feature above). This architecture positions Volcano as a unified scheduling platform capable of handling diverse workload types.

Alpha Feature Notice: This feature is currently in alpha stage and under active development. The Agent Scheduler related APIs, configuration options, and scheduling algorithms may be refined in future releases.

Key Capabilities:

  • Fast-Path Scheduling: Independent scheduler optimized for latency-sensitive workloads such as AI Agent workloads
  • Multi-Worker Parallel Scheduling: Multiple workers process pods concurrently from the scheduling queue, increasing throughput
  • Optimistic Concurrency Control: Conflict-Aware Binder resolves scheduling conflicts before executing real binding
  • Optimized Scheduling Queue: Enhanced queue mechanism with urgent retry support
  • Unified Platform Integration: Seamless coordination with Volcano batch scheduler via Sharding Controller
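
A minimal sketch of how a latency-sensitive pod could target the Agent Scheduler. The scheduler name "agent-scheduler" simply mirrors the Sharding Controller flags example above and is an assumption; use whatever name your deployment registers.

apiVersion: v1
kind: Pod
metadata:
  name: agent-task
spec:
  schedulerName: agent-scheduler  # assumed name; must match your Agent Scheduler deployment
  containers:
  - name: agent
    image: agent-image:v1
    resources:
      requests:
        cpu: "1"
        memory: 2Gi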

Network Topology Aware Scheduling Enhancements

Background and Motivation:

Volcano v1.14.0 brings significant enhancements to network topology aware scheduling, addressing the growing demands of distributed workloads including LLM training, HPC, and other network-intensive applications.

Key Enhancements:

  • SubGroup Level Topology Awareness: Support fine-grained network topology constraints at the SubGroup/Partition level.
  • Flexible Network Tier Configuration: Support highestTierName for specifying maximum network tier constraints by name (see the sketch after the configuration examples below).
  • Multi-Level Gang Scheduling: Improved gang scheduling to support both Job-level and SubGroup-level consistency.
  • Volcano Job Partitioning: Enable partitioning of Volcano Jobs for better resource management and fault isolation.
  • HyperNode-Level Binpacking: Optimization for resource utilization across network topology boundaries.

Configuration Example - Volcano Job:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-training-job
spec:
  # ...other fields
  networkTopology:
    mode: hard
    highestTierAllowed: 2  # Job can cross up to Tier 2 HyperNodes
  tasks:
  - name: trainer
    replicas: 8
    partitionPolicy:
      totalPartitions: 2    # Split into 2 partitions
      partitionSize: 4      # 4 pods per partition
      minPartitions: 2      # Minimum 2 partitions required
      networkTopology:
        mode: hard
        highestTierAllowed: 1  # Each partition must stay within Tier 1
    template:
      spec:
        containers:
        - name: trainer
          image: training-image:v1
          resources:
            requests:
              nvidia.com/gpu: 8
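
In this example, the 8 trainer replicas are split into 2 partitions of 4 pods each: the Job as a whole may span HyperNodes up to Tier 2, while each partition must be placed within a single Tier-1 HyperNode.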

Configuration Example - PodGroup SubGroupPolicy:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llm-training-pg
spec:
  minMember: 4
  networkTopology:
    mode: hard
    highestTierAllowed: 2
  subGroupPolicy:
  - name: "trainer"
    subGroupSize: 4
    labelSelector:
      matchLabels:
        volcano.sh/task-spec: trainer
    matchLabelKeys:
    - volcano.sh/partition-id
    networkTopology:
      mode: hard
      highestTierAllowed: 1
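
The examples above express tier constraints numerically via highestTierAllowed. The release also adds highestTierName for referencing the maximum tier by name; the sketch below assumes the field sits alongside mode in the networkTopology block and uses a hypothetical tier name, so check the network topology documentation for the exact schema.

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llm-training-pg-by-name
spec:
  minMember: 4
  networkTopology:
    mode: hard
    highestTierName: spine   # hypothetical tier name; tiers are defined by your HyperNode topology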

Colocation for Generic OS

This release brings comprehensive improvements to Volcano's colocation capabilities, with a major milestone: support for generic operating systems (Ubuntu, CentOS, etc.) in addition to OpenEuler. This enables broader adoption of Volcano Agent for resource sharing between online and offline workloads.

New Features in v1.14.0:

  1. CPU Throttling (CPU Suppression)

The CPU usage of online pods changes dynamically. To better isolate online and offline workloads, the CPU quota allocated to offline pods needs to adjust dynamically according to the actual usage of online pods. When offline pods consume more CPU than their quota, CPU suppression is triggered; when their usage stays within the quota, the quota can gradually recover, enabling adaptive resource allocation.

Key design:

  • Dynamically adjusts BestEffort root cgroup CPU quota based on node allocatable CPU and real-time usage
  • Follows a "monitor-event-handler" architecture with conservative updates to avoid jitter

Configuration Example:

cpuThrottlingConfig:
  enable: true
  cpuThrottlingThreshold: 80      # Allow BE quota up to 80% of allocatable CPU
  cpuJitterLimitPercent: 1        # Emit updates when quota changes by >=1%
  cpuRecoverLimitPercent: 10      # Cap quota increases to 10% per update

  2. Memory QoS (Cgroup V2)

Cgroup V2 based memory isolation for colocation environments. This feature introduces the ColocationConfiguration CRD, which allows users to define memory QoS policies for specific workloads. The system automatically applies these policies to matching pods, calculating Cgroup V2 settings (memory.high, memory.low, memory.min) based on the configured percentages and the pod's memory requests/limits.

Key capabilities:

  • New API: ColocationConfiguration CRD for defining memory isolation policies via label selectors
  • Dynamic Calculation:
    • memory.high = pod.limits.memory * highRatio %
    • memory.low = pod.requests.memory * lowRatio %
    • memory.min = pod.requests.memory * minRatio %
  • Unified Interface: Robust detection and support for Cgroup V2 environment

Usage Example:

apiVersion: config.volcano.sh/v1alpha1
kind: ColocationConfiguration
metadata:
  name: colo-config1
spec:
  selector:
    matchLabels:
      app: offline-test
  memoryQos:
    # memory.high = memory.limits * 100% (No throttling if 100)
    highRatio: 100
    # memory.low = memory.requests * 50% (Protect 50% of request)
    lowRatio: 50
    # memory.min = memory.requests * 0% (No absolute guarantee)
    minRatio: 0
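
Applying the formulas above to a matching pod that requests 4Gi and limits 8Gi of memory (illustrative values, not taken from the release notes), the agent would derive:

# memory.high = 8Gi * 100% = 8Gi  (no throttling)
# memory.low  = 4Gi * 50%  = 2Gi  (best-effort protection for half of the request)
# memory.min  = 4Gi * 0%   = 0    (no absolute guarantee)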

  3. CPU Burst for Generic OS

Extended CPU Burst support to generic operating systems (Ubuntu, CentOS, etc.), allowing latency-sensitive workloads to burst beyond their CPU limits when resources are available.

cpuBurstConfig:
  enable: true
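
For background (a general Linux note, not Volcano-specific documentation): CFS burst lets a container briefly exceed its CPU limit by consuming quota left unused in previous periods, surfaced by the kernel as cpu.cfs_burst_us on cgroup v1 and cpu.max.burst on cgroup v2.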

  4. Cgroup V2 Full Support

Volcano Agent now fully supports Cgroup V2 environments, ensuring seamless operation on modern Linux distributions.

  • Automatic Detection: Automatically detects cgroup version (v1/v2) and driver type (systemd/cgroupfs) without manual configuration.
  • Unified Interface: Adapts internal resource managers (CPU, Memory) to work with both v1 and v2 interfaces.

Ascend vNPU Scheduling

Background and Motivation:

Volcano v1.14.0 introduces integrated support for Ascend vNPU (virtual NPU) scheduling, enabling efficient sharing of Ascend AI processors across multiple workloads. This feature supports two modes to accommodate different deployment scenarios.

Supported Modes:

  1. MindCluster Mode

    • Integrated from the official Ascend cluster scheduling add-on
    • Supports Ascend 310P series with dynamic virtualization
    • Uses huawei.com/npu-core resource name
    • Supports vnpu-level (low/high) and vnpu-dvpp configurations
  2. HAMi Mode

    • Developed by the HAMi community
    • Supports both Ascend 310 and 910 series
    • Supports heterogeneous Ascend clusters (910A, 910B2, 910B3, 310P)
    • Memory-based virtualization with automatic alignment

Scheduler Configuration:

# MindCluster Mode
- name: deviceshare
  arguments:
    deviceshare.AscendMindClusterVNPUEnable: true

# HAMi Mode
- name: deviceshare
  arguments:
    deviceshare.AscendHAMiVNPUEnable: true
    deviceshare.SchedulePolicy: binpack  # or spread

Usage Example (HAMi Mode):

apiVersion: v1
kind: Pod
metadata:
  name: ascend-pod
spec:
  schedulerName: volcano
  containers:
  - name: inference
    image: ascend-image:v1
    resources:
      limits:
        huawei.com/Ascend310P: "1"
        huawei.com/Ascend310P-memory: "4096"
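
Usage Example (MindCluster Mode, hedged sketch):

A MindCluster-mode pod requests vNPU cores through the huawei.com/npu-core resource name noted above. The vnpu-level label key/value below is illustrative only; take the exact keys from the MindCluster add-on's documentation.

apiVersion: v1
kind: Pod
metadata:
  name: ascend-vnpu-pod
  labels:
    vnpu-level: low   # hypothetical key/value; low/high levels per MindCluster configuration
spec:
  schedulerName: volcano
  containers:
  - name: inference
    image: ascend-image:v1
    resources:
      limits:
        huawei.com/npu-core: "2"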

Volcano Global Enhancements

The new version of Volcano Global further enhances its multi-cluster scheduling capabilities.

Key Enhancements:

  • Introduce HyperJob for large-scale cross-cluster training: Enables large-scale cross-cluster training by orchestrating jobs across multiple clusters.
  • Data-Aware Scheduling: Optimizes workload placement based on data locality.

For detailed information, please refer to the Volcano Global v0.3.0 Release Notes.

Scheduler Stability and Performance

Reclaim Refactoring and Enhancements

The Reclaim mechanism has been significantly improved through a comprehensive refactor of the Reclaim Action and critical logic fixes in the Capacity Plugin. These changes collectively enhance the accuracy, stability, and performance of resource reclamation in multi-tenant clusters.

Key improvements:

  • Reclaim Action Refactoring: The reclaim workflow has been restructured to improve code readability, maintainability, and test coverage.
  • Enhanced Capacity Plugin Logic:
    • Precise Victim Selection: Fixed reclaimableFn to correctly handle scalar resources and ensure victims are selected based on all relevant resource dimensions, taking the reclaimer's requirements into account.
    • Accurate Resource Comparison: Corrected preemptiveFn resource comparison logic to prevent incorrect preemption decisions when specific resource dimensions are zero.
  • Improved Stability: Addressed edge cases in resource calculation to prevent scheduling loops and incorrect evictions.

Other Notable Changes

API Changes

New CRDs and API enhancements in v1.14.0:

1. NodeShard CRD (Alpha) (@ssfffss, @Haoran, @qi-min)

Node Scheduling Shard CRD. Manages dynamic candidate node pools for specific schedulers.

2. HyperJob CRD (@JesseStutler)

Multi-cluster training job abstraction for the Volcano Global enhancements.

3. DataSource/DataSourceClaim CRDs (@fx147)

Data-aware scheduling for federated environments.

4. SubGroupPolicy API (@wangyang0616)

Defines fine-grained network topology constraints and gang scheduling policies at the SubGroup level. See Network Topology Aware Scheduling section for details.

5. DequeueStrategy (@wangdongyang1)

New configuration for queue dequeue strategies.

6. Kubebuilder Validation Markers (@zhoujinyu)

Added CEL validation markers to batch, scheduling, topology, and flow APIs.

7. ColocationConfiguration CRD (@ouyangshengjia)

Introduced ColocationConfiguration CRD to configure colocation policies (currently supporting Memory QoS). This API allows users to define percentage-based coefficients (range 0-100) for highRatio, lowRatio, and minRatio, which are multiplied with pod memory limits or requests to dynamically calculate Cgroup V2 interfaces (memory.high, memory.low, memory.min).

Features and Enhancements

  • ValidatingAdmissionPolicy/MutatingAdmissionPolicy: Optional enhanced admission mechanisms (#4596, @suyiiyii)
  • GitHub Container Registry support: Added GHCR as an alternative to Docker Hub (#4845, #4846, @medyagh)
  • In-place Pod Update: Support for updating pods without recreation (#3839, @guoqinwill)
  • Scheduler performance dashboards: Added Grafana dashboards for scheduler metrics (#4885, @guoqinwill)
  • Distributed Inference Support (Kthena): Enables role-based elasticity and Gang Scheduling for distributed inference workloads (#4781, @wangyang0616)
  • E2E test improvements: Enhanced context collection on test failures (#4764, @neeraj542)
  • Staging directory for APIs: Introduced staging directory structure for volcano.sh/apis (#4823, @PersistentJZH)

Bug Fixes

  • Scheduling main loop blocked: Fixed timeout due to un-released PreBind lock (#4692, @guoqinwill)
  • Scheduler panic on hierarchy queue: Fixed panic when child queue capability check fails (#4673, @hajnalmt, @zhengjiajin)
  • minAvailable mismatch: Fixed job pending state caused by minAvailable/minPartitions mismatch (#4867, @zhengchenyu)
  • Job aborting state transition: Fixed early transition to Aborted when terminating pods are removed (#4828, @kingeasternsun)
  • Memory leak from metrics: Fixed memory leak caused by leftover metrics data due to race conditions (#4821, @halcyon-r)
  • Queue hierarchy validation: Fixed cluster-wide scheduling failure due to missing validation for child queue guarantee sum (#4818, @zjj2wry)
  • RestartTask policy on PodEvicted: Fixed unexpected behavior when pod is evicted (#4814, @ouyangshengjia)
  • Task validity check failure: Fixed scheduling failures caused by task member changes in PodGroup (#4802, @ouyangshengjia)
  • Scheduler memory leak: Fixed memory leak in volcano-scheduler deletedJobs queue (#4745, @fengruotj)
  • VGPU multi-card memory bug: Fixed VGPU request when multiple cards have different total video memory (#4713, @weapons97)
  • GPU resource calculation error: Fixed GPU resource calculation error in the allocate phase in checkNodeGPUSharingPredicateAndScore (#4884, @ChenW66)
  • Capacity plugin preemptiveFn: Fixed resource comparison preventing reclaim (#4658, @hajnalmt)
  • Capacity plugin queue share: Fixed queue share calculation for best-effort queues (#4865, @guoqinwill)
  • Colocation cgroup configuration: Fixed systemd driver with containerd runtime oversubscribed cgroups (#4912, @JesseStutler)
  • NUMA topology snapshot panic: Fixed panic when a numatopology is added and updated within a short time before the snapshot (#4891, @qi-min)
  • FailedScheduling message: Fixed inaccurate message when podgroup is ready (#4978, @ruanwenjun)
  • Nil pod handling: Fixed nil pod when AlreadyExists error on create (#4985, @aman-kumar)
  • vNPU multi-type cluster: Fixed vNPU not working in multi-type NPU clusters (#4779, @kiritoxkiriko)
  • HAMi vNPU memory allocation: Fixed memory allocation restrictions in the HAMi vNPU scenario (#4859, @haorenhui)
  • MindCluster bugs: Fixed multiple bugs in MindCluster vNPU mode (#4775, @DSFans2014)
  • PodGroup controller replicaset annotation: Fixed annotation handling for replicaset (#4792, @hajnalmt)
  • Reclaimed resource calculation: Fixed the calculation to add the node's futureIdle resources first (#4774, @jiahuat)
  • syncTask and retryResyncTask: Fixed cache communication issues (#4938, @hajnalmt)

Upgrade Instructions

To upgrade to Volcano v1.14.0:

# Using Helm
helm repo update
helm upgrade volcano volcano-sh/volcano --version 1.14.0

# Using kubectl
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.14.0/installer/volcano-development.yaml

Note: Review the new CRDs (NodeShard, HyperJob) if you plan to use the Agent Scheduler, Sharding Controller, or cross-cluster training features.

Contributors

Thank you to all contributors who made this release possible:

@3sunny, @3th4novo, @acsoto, @Aman-Cool, @archlitchi, @dafu-wu, @DSFans2014, @FAUST-BENCHOU, @fengruotj, @Freshwlnd, @goyalpalak18, @guoqinwill, @Haibara-Ai97, @hajnalmt, @halcyon-r, @handan-yxh, @JackyTYang, @JesseStutler, @jiahuat, @kingeasternsun, @kiritoxkiriko, @kube-gopher, @LiZhenCheng9527, @medyagh, @MondayCha, @Monokaix, @mvinchoo, @neeraj542, @nitindhiman314e, @ouyangshengjia, @PersistentJZH, @qi-min, @rhh777, @ruanwenjun, @RushabhMehta2005, @sailorvii, @ssfffss, @suyiiyii, @Tau721, @tunedev, @wangyang0616, @weapons97, @Wonki4, @zhaoqi612, @zhengchenyu, @zjj2wry

Full Changelog

v1.13.0...v1.14.0
