Highlights
Ray Label Selector API
Ray v2.49 introduced a label selector API. Correspondingly, KubeRay v1.5 now features a top-level API for defining Ray labels and resources. This new top-level API is the preferred method going forward, replacing the previous practice of setting labels and custom resources within rayStartParams.
The new API will be consumed by the Ray autoscaler, improving autoscaling decisions based on task and actor label selectors. Furthermore, labels configured through this API are mirrored directly into the Pods. This mirroring allows users to more seamlessly combine Ray label selectors with standard Kubernetes label selectors when managing and interacting with their Ray clusters.
You can use the new API in the following way:
```yaml
apiVersion: ray.io/v1
kind: RayCluster
spec:
  ...
  headGroupSpec:
    rayStartParams: {}
    resources:
      Custom1: "1"
    labels:
      ray.io/zone: us-west-2a
      ray.io/region: us-west-2
  workerGroupSpecs:
  - replicas: 1
    rayStartParams: {}
    resources:
      Custom1: "1"
    labels:
      ray.io/zone: us-west-2a
      ray.io/region: us-west-2
```
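Because these labels are mirrored onto the Pods, any standard Kubernetes selector can consume them. Below is a minimal sketch (the Service name and port are hypothetical) of a Service that targets only the Ray Pods in one zone:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: ray-us-west-2a        # hypothetical name
spec:
  selector:
    ray.io/zone: us-west-2a   # label mirrored from the Ray label API above
  ports:
    - port: 8000              # hypothetical port
```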
RayJob Sidecar submission mode
The RayJob resource now supports a new value for spec.submissionMode called SidecarMode.
Sidecar mode directly addresses a key limitation of both K8sJobMode and HttpMode: both require network connectivity from an external Pod or the KubeRay operator to submit the job. With Sidecar mode, job submission is handled by a sidecar container injected into the Head Pod. This eliminates the need for an external client to perform the submission and reduces job submission failures caused by network issues.
To use this feature, set spec.submissionMode to SidecarMode in your RayJob:
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: my-rayjob
spec:
  submissionMode: "SidecarMode"
  ...
```
Advanced deletion policies for RayJob
KubeRay now supports a more advanced and flexible API for expressing deletion policies within the RayJob specification. This new design moves beyond the single boolean field spec.shutdownAfterJobFinishes and lets users define different cleanup strategies with configurable TTL values based on the Ray job's status.
This API unlocks new use cases that require specific resource retention after a job completes or fails. For example, users can now implement policies that:
- Preserve only the Head Pod for a set duration after job failure to facilitate debugging.
- Retain the entire Ray Cluster for a longer TTL after a successful run for post-analysis or data retrieval.
By linking specific TTLs to Ray job statuses (e.g., success, failure) and strategies (e.g., DeleteWorkers, DeleteCluster, DeleteSelf), users gain fine-grained control over resource cleanup and cost management.
Below is an example of how to use this new, flexible API structure:
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-deletion-rules
spec:
  deletionStrategy:
    deletionRules:
      - policy: DeleteWorkers
        condition:
          jobStatus: FAILED
          ttlSeconds: 100
      - policy: DeleteCluster
        condition:
          jobStatus: FAILED
          ttlSeconds: 600
      - policy: DeleteCluster
        condition:
          jobStatus: SUCCEEDED
          ttlSeconds: 0
```
This feature is disabled by default and requires enabling the RayJobDeletionPolicy feature gate.
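Feature gates are toggled on the KubeRay operator. With the Helm chart, enabling the gate looks roughly like the sketch below, assuming your chart version exposes a featureGates values key:
```yaml
# kuberay-operator Helm values (sketch; the key layout may vary by chart version)
featureGates:
  - name: RayJobDeletionPolicy
    enabled: true
```
The same mechanism applies to the other gated features in these notes, such as RayServiceIncrementalUpgrade and RayMultiHostIndexing.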
Incremental upgrade support for RayService
KubeRay v1.5 introduces the capability to enable zero-downtime incremental upgrades for RayServices. This new feature improves the upgrade process by leveraging the Gateway API and Ray autoscaling to incrementally migrate user traffic from the existing Ray cluster to the newly upgraded one.
This approach is more efficient and reliable than the previous mechanism, which required creating the upgraded Ray cluster at full capacity and then shifting all traffic at once, risking disruptions and unnecessary resource usage. By contrast, the incremental approach gradually scales up the new cluster and migrates traffic in small, controlled steps, improving stability and resource utilization during the upgrade.
To enable this feature, set the following fields in RayService:
```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: example-rayservice
spec:
  upgradeStrategy:
    type: "NewClusterWithIncrementalUpgrade"
    clusterUpgradeOptions:
      maxSurgePercent: 40
      stepSizePercent: 5
      intervalSeconds: 10
      gatewayClassName: "cluster-gateway"
```
This feature is disabled by default and requires enabling the RayServiceIncrementalUpgrade feature gate.
Improved multi-host support for RayCluster
Previous KubeRay versions supported multi-host worker groups via the numOfHosts API, but it lacked fundamental capabilities required for managing multi-host accelerators. First, there was no logical grouping of worker Pods belonging to the same multi-host unit (or slice), so operations like "replace all workers in this group" were not possible. In addition, there was no ordered indexing, which is often required for coordinating multi-host workers when using TPUs.
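For context, a multi-host worker group is still declared through numOfHosts; a minimal sketch (the group name and sizes are illustrative):
```yaml
workerGroupSpecs:
- groupName: tpu-group   # illustrative name
  replicas: 2            # two multi-host units (slices)
  numOfHosts: 4          # four worker Pods per slice
  ...
```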
When using multi-host worker groups in KubeRay v1.5, KubeRay automatically sets the following labels on multi-host Ray workers:
```yaml
labels:
  ray.io/worker-group-replica-name: tpu-group-af03de
  ray.io/worker-group-replica-index: "0"
  ray.io/replica-host-index: "1"
```
Below is a description of each label and its purpose:
- ray.io/worker-group-replica-name: a unique identifier for each replica (i.e., host group or slice) in a worker group. This label enables KubeRay to rediscover all other Pods in the same group and apply group-level operations.
- ray.io/worker-group-replica-index: an ordered replica index within the worker group. This label is particularly important for cases like multi-slice TPUs, where each slice must be aware of its slice index.
- ray.io/replica-host-index: an ordered host index within each replica (host group or slice).
These changes collectively enable reliable, production-level scaling and management of multi-host GPU workers or TPU slices.
This feature is disabled by default and requires enabling the RayMultiHostIndexing feature gate.
Breaking Changes
For RayCluster objects created by a RayJob, KubeRay no longer attempts to recreate the Head Pod if it fails or is deleted after its initial successful provisioning. To retry failed jobs, use spec.backoffLimit, which causes KubeRay to provision a new RayCluster for each retry.
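For example, a minimal sketch that retries a failed job up to two times, each on a freshly provisioned RayCluster:
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: my-rayjob
spec:
  backoffLimit: 2   # on failure, provision a new RayCluster and retry, up to twice
  ...
```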
CHANGELOG
- [release-1.5] update version to v1.5.0 (#4177, @andrewsykim)
- [CherryPick][Feature Enhancement] Set ordered replica index label to support mult… (#4171, @ryanaoleary)
- [releasey-1.5] update version to v1.5.0-rc.1 (#4170, @andrewsykim)
- [release-1.5] fix: dashboard build for kuberay 1.5.0 (#4169, @andrewsykim)
- [release-1.5] update versions to v1.5.0-rc.0 (#4155, @andrewsykim)
- [Bug] Sidecar mode shouldn't restart head pod when head pod is delete… (#4156, @rueian)
- Bump Kubernetes dependencies to v0.34.x (#4147, @mbobrovskyi)
- [Chore] Remove duplicate `test-e2e-rayservice` in Makefile (#4145, @seanlaii)
- [Scheduler] Replace AddMetadataToPod with AddMetadataToChildResource across all schedulers (#4123, @win5923)
- [Feature] Add initializing timeout for RayService (#4143, @seanlaii)
- [RayService] Support Incremental Zero-Downtime Upgrades (#3166, @ryanaoleary)
- Example RayCluster spec with `Labels` and `label_selector` API (#4136, @ryanaoleary)
- [RayCluster] Fix for multi-host indexing worker creation (#4139, @chiayi)
- Support uppercase default resource names for top-level Resources (#4137, @ryanaoleary)
- [Bug] [KubeRay Dashboard] Misclassifies RayCluster type (#4135, @CheyuWu)
- [RayCluster] Add multi-host indexing labels (#3998, @chiayi)
- [Grafana] Use Range option instead of instant for RayCluster Provisioned Duration panel (#4062, @win5923)
- [Feature] Separate controller namespace and CRD namespaces for KubeRay-Operator Dashboard (#4088, @400Ping)
- Update grafana dashboards to ray 2.49.2 + add README instructions on how to update (#4111, @alanwguo)
- fix: update broken and outdated links (#4129, @ErikJiang)
- [Feature] Provide multi-arch images for apiserver and security proxy (#4131, @seanlaii)
- test: add LastTransition to fix test (#4132, @machichima)
- Add top-level Labels and Resources Structed fields to `HeadGroupSpec` and `WorkerGroupSpec` (#4106, @ryanaoleary)
- [rayjob] add grace period after submitter finished (#4091, @machichima)
- [kubectl-plugin] avoid race condition in jobId (#4071, @JosefNagelschmidt)
- [Feature] Support inject specific env vars to all Ray containers in all RayCluster CRs by ConfigMap (#4103, @win5923)
- [Feature] [KubeRay DashBoard] Reimplement and replace the Compute Template section in the New Job (#4119, @CheyuWu)
- [Feature] Support Volcano Network Topology Aware Scheduling for kuberay (#4105, @mtian29)
- [FEAT] show event message when raycluster not found in clusterSelector in rayjob (#4125, @machichima)
- [Helm] Add priorityClassName for kuberay-operator chart (#3703, @win5923)
- [Bug] [KubeRay DashBoard] RayJob table cannot get Ray Dashboard link in Pending status (#4122, @CheyuWu)
- Increase rayJob e2e timeout (#4124, @seanlaii)
- [RayJob] Enhance RayJob DeletionStrategy to Support Multi-Stage Deletion (#4040, @seanlaii)
- RayJob Volcano Integration (#3972, @win5923)
- [Feature] Make Ray and Logs links proxy to their Ray dashboards (#4112, @CheyuWu)
- [Fix] [KubeRay Dashboard] The current filter status is not shown in the status filter (#4108, @CheyuWu)
- Get details of only declarative serve apps (#4084, @jugalshah291)
- [Autoscaler][Sample] Add comment for RAY_LOGGER_LEVEL (#4104, @nadongjun)
- fix: when rayJobInstance.Spec.RayClusterSpec is nil, Kuberay Operator crash (#4102, @KunWuLuan)
- [Feature] [KubeRay Dashboard] hidden the grafana dashboard if link is not provided (#4094, @CheyuWu)
- [Feature][APIServer v2] Support Compute Template in APIServer v2 (#3959, @machichima)
- Improve log message wording when service already exists during reconciliation (#4096, @acrewdson)
- [RayJob] avoid RayCluster resource leak in k8s job mode(#3903) (#4080, @dushulin)
- Fix light weight job submitter e2e flaky test (#4092, @owenowenisme)
- [RayJob] Yunikorn Integration (#3948, @owenowenisme)
- [Doc] Increase head node memory limit for RayService sample to avoid OOM (#4089, @seanlaii)
- [Feature] integrate RayDashboard with apiserver V2 (#4054, @CheyuWu)
- AGC gateway api example (#4076, @snehachhabria)
- [RayCluster] yunikorn batchscheduler respect gang scheduling (#4075, @fscnick)
- [RayJob] ClusterSelector shouldn't support SidecarMode (#4074, @Future-Outlier)
- [RayJob] Directly fail CR if is invalid (#3981, @machichima)
- [Bug] `apiserversdk` may return incomplete response body causing `ERR_INCOMPLETE_CHUNKED_ENCODING` (#4061, @CheyuWu)
- [Feature] Add allow CORS in apiserversdk (#4059, @CheyuWu)
- feat: mirror RayService svc object creation for RayJob (#3996, @lowjoel)
- [RayJob] add Light-weight RayJob Submitter (#3943, @owenowenisme)
- [Refactor] RayJob Spec ClusterSelector validation logic (#4032, @Future-Outlier)
- [ApiServer] Fix nil HostPath type in GetVolumeHostPathType and add unit tests. (#3965, @daiping8)
- [refactor][7/N] Make dashboard http client a new package (#4057, @owenowenisme)
- [Feature][APIServer] Support decimal memory values in KubeRay APIServer (#3956, @daiping8)
- Use RayClusterSpec as input for `GetDefaultSubmitterTemplate` (#4056, @owenowenisme)
- [feat][python-client]: Add suspend, resubmit, & delete logic, and improve status reporting in python client (#4026, @kryanbeane)
- feat[python-client]: add support for loading kubeconfig from within pod (#4004, @kryanbeane)
- [Feature] Include CR UID in kuberay metrics (#4003, @YuxiaoWang-520)
- [Refactor] Eliminate redundant range variable capture with Go 1.22 scoped iteration (#4044, @Tomlord1122)
- [Fix] Fix rayClusterScaleExpectation deletion to use request object when instance is nil (#4039, @phantom5125)
- [refactor][6/N] Move types in ray-operator utils into new package (#3979, @owenowenisme)
- Revert "Bump crd-ref-docs to v0.2.0 for Go 1.24+ compatibility" (#4031, @Future-Outlier)
- Bump crd-ref-docs to v0.2.0 for Go 1.24+ compatibility (#4029, @seanlaii)
- [RayJob] Sidecar Mode (#3971, @Future-Outlier)
- [RayCluster] grant pods and pods/resize patch permissions for IPPR (#3960, @rueian)
- Use ctrl logger in Volcano scheduler to include context (#4023, @win5923)
- [refactor][5/N] Refactor `httpproxy_httpclient.go` (#4010, @owenowenisme)
- [Feature] Remove checking CRD in Volcano scheduler initialization (#4011, @win5923)
- [Refactor] Refactor testRayJob global variable to avoid test side effects (#4017, @400Ping)
- [Helm] Make Kube Client QPS and Burst configurable for kuberay-operator (#4002, @Future-Outlier)
- [Feature] Add cleanup for terminated RayJob/RayCluster metrics (#3923, @phantom5125)
- [refactor][4/N] Remove ctrl in dashboard http client in `dashboard-httpclient.go` (#4009, @owenowenisme)
- [Follow Up][Test] Support to set QPS and burst by configuration (#3999, @Future-Outlier)
- [Test] Add ReconcileConcurrency Configuration Test (#4000, @Future-Outlier)
- Follow up 3992: Remove logs and add comments (#4006, @kevin85421)
- [refactor][3/N] Refactor dashbpard httpclient (#3992, @owenowenisme)
- Support to set QPS and burst by configuration. (#3969, @KunWuLuan)
- Support --address flag for kubectl ray job submit (#3922, @JosefNagelschmidt)
- Remove unecessary raycluster log in kai-scheduler logger (#3997, @owenowenisme)
- Use ctrl logger and create logger in function in kai-scheduler (#3995, @owenowenisme)
- [refactor][1/N] Move `FetchHeadServiceURL` to `util.go` to reduce imported packages in `dashboard_httpclient.go` (#3983, @kevin85421)
- [apiserver]: merge http utils (timeout) of apiserver v1/v2 (#3946, @kenchung285)
- [Feature] Add eslint and Prettier to ray dashboard (#3975, @CheyuWu)
- [Community][2/N] Governance model (#3977, @kevin85421)
- Add validation for RAY_enable_autoscaler_v2 environment variable (#3963, @liugs0213)
- [RayJob] remove redundant RayJob status-transition logs in reconciler (#3976, @Future-Outlier)
- [Experimental] Fix Makefile tool check: replace `-s` with `test -s` (#3970, @MiniSho)
- [Dashboard-client] Add proper error checking in dashboard client (#3953, @owenowenisme)
- [Helm] Make reconcile concurrency configurable for kuberay-operator (#3962, @win5923)
- [Dashboard-client] replace http method from string to constant (#3961, @fscnick)
- [Feature] update yarn version from v1 to latest (#3945, @CheyuWu)
- Update RayCluster `values.yaml` (#3950, @SheldonTsen)
- [Test] Split E2E nightly operator tests into RayCluster/GCS and RayJob runners (#3932, @Future-Outlier)
- Correct `sumGPUs` to include MIGs in count (#3933, @kimminw00)
- [CI] Use golang:1.24-bookworm (Debian 12) in CI for Python-3.11 support (#3949, @Future-Outlier)
- Integration: KAI Scheduler (#3886, @EkinKarabulut)
- Move `BatchSchedulerManager` into reconciler option (#3935, @owenowenisme)
- [apiserver]: merge http utils of apiserver v1/v2 (#3924, @kenchung285)
- Add seccompProfile to KubeRay operator deployment for PSS compliance (#3931, @akagami-harsh)
- fix: add missing logging for RayJob HTTP mode status transition (#3936, @Future-Outlier)
- [Test] Fix Apiserver flaky test (#3934, @machichima)
- [feat][python-client]: add rayjob support to kuberay python-client (#3830, @kryanbeane)
- [Helm] Use helm-docs to generate README for chart api-server automatically (#3916, @win5923)
- fix: kustomize download fails for Apple Silicon (arm64) architecture (#3913, @Future-Outlier)
- [apiserver] Add migration doc from v1 to v2 (#3812, @nadongjun)
- [Feature] Add e2e test for setting RayCluster deletion delay in RayService (#3912, @machichima)
- test: add worker pod to upgrade tests for stability (#3891, @pawelpaszki)
- Revert "Feature/cron scheduling rayjob 2426 (#3836)" (#3911, @DW-Han)
- Update RayServices section title in Grafana Dashboard json (#3906, @seanlaii)
- [apiserversdk] add RayService envtests (#3904, @win5923)
- Feature/cron scheduling rayjob 2426 (#3836, @DW-Han)
- dashboard: allow changing the dashboard api url using an env var (#3892, @aurbano)
- [apiserver] Add retry and timeout to apiserver V2 (#3869, @kenchung285)
- [Helm] Add missing environment variables to operator chart (#3867, @win5923)
- Use DeletePodAndWait in e2e test (#3901, @owenowenisme)
- Add a test util function for killing the head Pod and wait (#3890, @owenowenisme)
- [Helm] Use helm-docs to generate README for chart ray-cluster automatically (#3887, @win5923)
- [Test] add e2e tests for autoscaler v1 and v2 with GCS FT (#3888, @rueian)
- [feat][operator] validate Ray resource metadata in webhook (#3831, @davidxia)
- [Feature] [scheduler-plugins] Support second scheduler mode (#3852, @CheyuWu)
- adding local deplyment script using kind (#3863, @DW-Han)
- [kubeclt-plugin] use solid value as default value in get and create (#3815, @fscnick)
- [Helm] Add gcsFaultToleranceOptions in RayCluster chart (#3881, @win5923)
- [Bug] Add default value for entrypoint flags in job_submit.go (#3808, @400Ping)
- [apiserversdk] add RayJob envtests (#3862, @win5923)
- [RayCluster] Toggle usage of deterministic/non-deterministic head pod name with feature flag (#3873, @machichima)
- Refactor Apiserver e2e run in cluster (#3529, @machichima)
- Pin DeepSeek example to stable Ray release (#3885, @eicherseiji)
- [Feature][APIServer] add retry for http client (#3551, @machichima)
- [kubectl-plugin] fix incorrect flag name in help message (#3875, @davidxia)
- [Feature] Support configurable RayCluster deletion delay in RayService (#3864, @machichima)
- Chore: fix indentation issues in RayJob sample YAML (#3874, @win5923)
- [RayCluster] Make headpod name back to non-deterministic (#3872, @machichima)
- [RayService][Test] create curl pod waiting until running (#3740, @fscnick)
- test: enable upgrade to image built from source (#3736, @pawelpaszki)
- [refactor] Move inconsistency check functions to a new util function file (#3866, @kevin85421)
- chore: update obsolete reconcileServe description (#3865, @fscnick)
- [kubectl-plugin] Validate empty resouce quantity strings (#3821, @win5923)
- [Community] Add KubeRay community guide (#3859, @kevin85421)
- Add DeepSeek example RayService (#3838, @eicherseiji)
- pass client when call batchscheduler.New() (#3785, @KunWuLuan)
- chore: reduce memory allocation on handling http response (#3800, @fscnick)
- [Scheduler-plugin] Handle case when numOfHosts > 1 (#3844, @troychiu)
- Use Go 1.24.0 in go module (#3835, @tenzen-y)
- [docs]: add badge release (#3842, @Olexandr88)
- Add RayCluster YAML for verl example (#3833, @kevin85421)
- Fix ray nightly image env var setup (#3826, @dayshah)
- chore: remove unnecessary empty `rayStartParams` (#3586, @davidxia)
- [Test][Release] Change upgrade test version to test upgrade from 1.3.2 to 1.4.0 (#3825, @MortalHappiness)
- [Fix] changelog-generator.py failed to parse some commit messages (#3818, @MortalHappiness)
- [Fix][Release] Fix Krew release indenetation error (#3823, @MortalHappiness)
- [Chore] Remove CHANGELOG.md (#3819, @MortalHappiness)
- [kubeclt-plugin] fix get cluster all namespace (#3809, @fscnick)
- [Docs] Add kubectl plugin create cluster sample yaml config files (#3804, @MortalHappiness)
- [Helm Chart] Set honorLabel of serviceMonitor to `true` (#3805, @owenowenisme)
- [Metrics] Remove serviceMonitor.yaml (#3795, @owenowenisme)
- [Chore][Sample-yaml] Upgrade pytorch-lightning to 1.8.5 for `ray-job.pytorch-distributed-training.yaml` (#3796, @MortalHappiness)
- [RayJob] Support deletion policies based on job status (#3731, @weizhaowz)
- Use ImplementationSpecific in ray-cluster.separate-ingress.yaml (#3781, @troychiu)
- Remove vLLM examples in favor of Ray Serve LLM (#3786, @kevin85421)
- Update update-ray-job.kueue-toy-sample.yaml (#3782, @troychiu)
- [Feat] Add e2e test for applying `ray-job.interactive-mode.yaml` (#3779, @CheyuWu)
- [Doc][Fix] correct the indention of storageClass in ray-cluster.persistent-redis.yaml (#3780, @rueian)
- [doc] Improve APIServer v2 doc (#3773, @kevin85421)
- [Doc] Reference helm chart version in `helm-chart/kuberay-operator/README.md.gotmpl` with go template (#3763, @MortalHappiness)
- Revert "Fix issue where unescaped semicolons caused task execution failures. (#3691)" (#3771, @MortalHappiness)
- [DOCS] KubeRay APIServer V2 document (#3594, @machichima)
- support scheduler plugins (#3612, @KunWuLuan)
- fix ray-service.different-port.yaml (#3721, @zjx20)
- Remove `ray-pod.tls.yaml` (#3762, @kevin85421)
- [doc] Update GitHub pages's home page (#3761, @kevin85421)
- [doc] Remove the HA document in favor of the Ray doc (#3760, @kevin85421)
- Added Ray-Serve Config For LLMs (#3517, @Blaze-DSP)