# Highlights
## Ray History Server (alpha)
KubeRay v1.6 introduces alpha support for the Ray History Server. This project enables users to collect and aggregate events from a Ray cluster, replaying them to restore historical snapshots of the cluster's state. By providing an alternative backend for the Ray Dashboard, the History Server allows users to view the Ray dashboard and debug ephemeral clusters (such as those managed via RayJob) even after they have been terminated.
Try the history server here: History Server Quick Start Guide.
⚠️ Warning: This feature is in alpha status, meaning future KubeRay releases may include breaking updates. We’d love to hear your experience with it! Please drop your feedback in this tracking issue to help us shape its development.
## Ray Token Authentication using Kubernetes RBAC
Starting in KubeRay v1.6 and Ray v2.55, you can use Kubernetes RBAC to manage user access to Ray clusters that have token authentication enabled. With this feature enabled, Ray delegates token authentication to Kubernetes: users can access Ray clusters with the same credentials they use with Kubernetes, and platform operators can use standard Kubernetes RBAC to control access to Ray clusters. See Configure Ray clusters to use Kubernetes RBAC authentication for a step-by-step guide.
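As a sketch of the RBAC side, a platform operator might grant a user access to a specific Ray cluster with a standard Role and RoleBinding. The names, resources, and verbs below are illustrative assumptions, not the authoritative configuration — see the step-by-step guide for the exact permissions KubeRay checks:

```yaml
# Illustrative only: the exact resource/verb mapping KubeRay validates may differ.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ray-cluster-user
  namespace: default
rules:
  - apiGroups: ["ray.io"]
    resources: ["rayclusters"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ray-cluster-user-binding
  namespace: default
subjects:
  - kind: User
    name: alice@example.com   # hypothetical user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ray-cluster-user
  apiGroup: rbac.authorization.k8s.io
```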
You can now also reference Secrets containing static auth tokens for Ray cluster token authentication.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ray-cluster-token
type: Opaque
stringData:
  auth_token: "super-secret-example-token"
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster-with-auth
spec:
  authOptions:
    mode: token
    secretName: ray-cluster-token
  rayVersion: '2.53.0'
  headGroupSpec:
    rayStartParams: {}
```

## RayCronJob
KubeRay v1.6 introduces the RayCronJob Custom Resource Definition (CRD), enabling users to schedule RayJobs on a recurring schedule using standard cron expressions. This is useful for periodic batch processing, scheduled training runs, or recurring data pipelines.
⚠️ Warning: RayCronJob is an Alpha feature and is disabled by default. To enable it, set the feature gate on the kuberay-operator:
```
--feature-gates=RayCronJob=true
```
Below is an example of the new custom resource:
```yaml
apiVersion: ray.io/v1
kind: RayCronJob
metadata:
  name: raycronjob-sample
spec:
  schedule: "* * * * *"
  jobTemplate:
    entrypoint: python /home/ray/samples/sample_code.py
    shutdownAfterJobFinishes: true
    ttlSecondsAfterFinished: 600
    runtimeEnvYAML: |
      pip:
        - requests==2.26.0
        - pendulum==2.1.2
      env_vars:
        counter_name: "test_counter"
    rayClusterSpec:
      rayVersion: '2.52.0'
      headGroupSpec:
        ...
```

See ray-cronjob.sample.yaml for a full example.
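To make the schedule format concrete: RayCronJob uses the standard five-field cron syntax (minute, hour, day of month, month, day of week). The toy matcher below illustrates the field order; it supports only `*` and bare integers, while real cron also supports ranges, steps, and lists:

```python
from datetime import datetime

def cron_matches(expr: str, dt: datetime) -> bool:
    """Check a simple 5-field cron expression (minute hour dom month dow)
    against a datetime. Only '*' and bare integers are supported."""
    fields = expr.split()
    assert len(fields) == 5, "cron expressions have five fields"
    # Cron weekdays run 0-6 with 0 = Sunday; Python's weekday() uses 0 = Monday.
    values = [dt.minute, dt.hour, dt.day, dt.month, (dt.weekday() + 1) % 7]
    return all(f == "*" or int(f) == v for f, v in zip(fields, values))

# "* * * * *" (the sample schedule above) fires every minute:
print(cron_matches("* * * * *", datetime(2025, 1, 1, 12, 30)))  # True
print(cron_matches("0 3 * * *", datetime(2025, 1, 1, 3, 0)))    # True: daily at 03:00
print(cron_matches("0 3 * * *", datetime(2025, 1, 1, 12, 30)))  # False
```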
## RayJob Deletion Policy API
The RayJobDeletionPolicy feature gate is graduating to Beta and is now enabled by default. It provides a more flexible API for expressing deletion policies in the RayJob specification: instead of the single boolean field `spec.shutdownAfterJobFinishes`, users can define different cleanup strategies with configurable TTL values keyed on the Ray job's status.
Below is an example of how to use this new, flexible API structure:
```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-deletion-rules
spec:
  deletionStrategy:
    deletionRules:
      - policy: DeleteWorkers
        condition:
          jobStatus: FAILED
        ttlSeconds: 100
      - policy: DeleteCluster
        condition:
          jobStatus: FAILED
        ttlSeconds: 600
```

See ray-job.deletion-rules.yaml for a comprehensive example.
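The intended semantics of these rules can be sketched with a toy model, assuming each TTL is measured from the moment the job reaches the status named in its condition (the function and class names here are illustrative, not part of the KubeRay API):

```python
from dataclasses import dataclass

@dataclass
class DeletionRule:
    policy: str        # e.g. "DeleteWorkers" or "DeleteCluster"
    job_status: str    # job status that triggers the rule
    ttl_seconds: int   # delay after the status is reached

def actions_due(rules, job_status, seconds_since_status):
    """Return the policies whose TTL has elapsed for the current job status."""
    return [
        r.policy
        for r in rules
        if r.job_status == job_status and seconds_since_status >= r.ttl_seconds
    ]

rules = [
    DeletionRule("DeleteWorkers", "FAILED", 100),
    DeletionRule("DeleteCluster", "FAILED", 600),
]

# 150s after failure: workers are cleaned up, the cluster survives until 600s.
print(actions_due(rules, "FAILED", 150))  # ['DeleteWorkers']
```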
## Other Notable Features
- RayJob now supports `spec.preRunningDeadlineSeconds` to automatically mark jobs as failed if they do not reach the `Running` state within the specified timeout.
- RayService now supports `spec.managedBy` for improved support with MultiKueue.
- The `RayMultiHostIndexing` feature gate is graduating to Beta and is enabled by default. It provides ordered replica and host index labels that are useful for managing Ray clusters for multi-host TPU/GPU workloads that require atomic scheduling and scaling. These labels are only applied when `numOfHosts > 1` in the worker group configuration.
- KubeRay v1.6 adds a new `spec.upgradeStrategy` field to RayCluster. Supported values are `Recreate` and `None`. The `Recreate` strategy automatically recreates all Ray cluster Pods when the Ray cluster spec changes; it is not recommended if Ray cluster state needs to be persisted.
- The RayService incremental upgrade feature (alpha) now supports rollback.
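As a sketch of the first bullet, a RayJob using the new deadline might look like this. The placement follows directly from the field path `spec.preRunningDeadlineSeconds`; the surrounding fields are illustrative, so check the RayJob API reference for the authoritative schema:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-with-deadline   # hypothetical name
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  # Assumed usage: mark the job failed if it is not Running within 10 minutes.
  preRunningDeadlineSeconds: 600
```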
# Breaking Changes
When using the Sidecar submission mode with RayJob, the Head Pod will no longer be automatically recreated after initial provisioning. Because the submission container runs alongside the Head container, recreating the Pod would restart the job from the beginning. See #4141 for more details.
# CHANGELOG
- Add support for Ray token auth (#4179, @andrewsykim)
- feat: upgrade to Ray 2.52.0 to support token auth mode (#4152, @Future-Outlier)
- update minimum Ray version required for token authentication to 2.52.0 (#4201, @andrewsykim)
- add samples for RayCluster token auth (#4200, @andrewsykim)
- [RayCluster] Enable Secret informer watch/list and remove unused RBAC verbs (#4202, @Future-Outlier)
- [RayJob] Add token authentication support for All mode (#4210, @Future-Outlier)
- Support X-Ray-Authorization fallback header for accepting auth token via proxy (#4213, @Future-Outlier)
- [RayJob] Add token authentication support for light weight job submitter (#4215, @Future-Outlier)
- [RayCluster] make auth token secret name consistency (#4216, @fscnick)
- feat: kubectl ray get token command (#4218, @rueian)
- [RayService] auth token mode e2e test (#4225, @ryankert01)
- [e2e] RayJob Auth Mode E2E (#4229, @seanlaii)
- [e2e] Enhance RayCluster Auth E2E (#4231, @seanlaii)
- [RayJob] light weight job submitter upgrade to 1.5.1 to support auth token mode (#4235, @Future-Outlier)
- Support enabling RAY_ENABLE_K8S_TOKEN_AUTH (#4509, @andrewsykim)
- [Enhance] Refactor IsK8sAuthEnabled to accept AuthOptions and add token audience TODO (#4543, @Future-Outlier)
- [Enhancement] Add sample YAML for K8s token auth RayCluster (#4544, @Future-Outlier)
- improve API docs for EnableK8sTokenAuth (#4553, @andrewsykim)
- [ray-operator] add support for referencing Secret names for auth tokens (#4554, @andrewsykim)
- [Fix] Skip auth Secret reconciliation when K8s token auth is enabled (#4556, @Future-Outlier)
- [ray-operator][validation] K8s token auth mode does not support RayJob and RayService (#4562, @Future-Outlier)
- fix: upgrade Ray image in ray-cluster.auth.yaml to 2.53.0 to resolve dashboard 'Failed to load' error (#4310, @win5923)
- introduce historyserver directory and project structure (#4232, @andrewsykim)
- add the implementation of historyserver collector (#4241, @KunWuLuan)
- [history server] Remove go.work and go.work.sum to follow Go's best practices (#4301, @Future-Outlier)
- [history server] move storage interface (#4302, @my-vegetable-has-exploded)
- [historyserver][collector] Remove unused function processAllLogs (#4316, @ryankert01)
- [historyserver][collector] Add file-level idempotency check for prev-logs processing on container restart (#4321, @my-vegetable-has-exploded)
- [history server] Web Server + Event Processor (#4329, @Future-Outlier)
- [historyserver] Ensure at least one worker in sample RayCluster (#4330, @Future-Outlier)
- historyserver: remove unused function in RayLogHandler (#4336, @AndySung320)
- [history server][collector] Fix getJobID for job event collection (#4342, @Future-Outlier)
- [KubeRay Dashboard][Feature] Integrate History Server into KubeRay Dashboard (#4395, @CheyuWu)
- [Feat] [history server] Enable running the history server outside the K8s cluster (#4404, @JiangJiaWei1103)
- [history server][collector] Add WaitGroup for graceful shutdown (#4409, @fweilun)
- [1/N][Feature][history server] support endpoint /api/v0/logs/file (#4411, @machichima)
- [history server][storage] Add Azure Blob Storage support (#4413, @ikchifo)
- [historyserver][collector] use filepath functions to handle file path (#4417, @400Ping)
- [Feature][history server] support endpoint /api/cluster_status (#4421, @justinyeh1995)
- [1/N] [history server] Support job event processing and endpoint (#4422, @chiayi)
- [historyserver] implement grafana health for live session (#4425, @fscnick)
- [history server][collector] Enable run with race condition checker (#4430, @machichima)
- [Feat] [history server] Add node endpoint (#4436, @JiangJiaWei1103)
- [Feature][history server] support endpoint /api/v0/tasks/timeline (#4437, @AndySung320)
- [history server] Add logs/file resolution and logs/stream endpoint (#4456, @machichima)
- [Feat] [history server] Add actor task endpoint (#4463, @JiangJiaWei1103)
- [history server] use sessionName in task, actor, and job endpoints (#4464, @Future-Outlier)
- [Feat] [history server] Support endpoint /v0/tasks/summarize (#4469, @win5923)
- [Feature][history server] Add Google Cloud Storage (GCS) support to history server (#4478, @chiayi)
- [Feature][history server] support endpoint /events (#4479, @seanlaii)
- [Feature][history server] support endpoint /timezone (#4510, @CheyuWu)
- [Feature][history server] support endpoint /api/v0/cluster_metadata (#4519, @alimaazamat)
- [Feat] [history server] Correctly use glob in logs/ endpoint (#4526, @machichima)
- [history server] Remove unused filter logic from /logical/actors and clarify submission_id on /api/v0/logs/file (#4528, @Future-Outlier)
- [Feature][history server] Support arbitrary Ray Dashboard endpoint collection (#4529, @Future-Outlier)
- [Refactor][History Server] merge logs/file and stream together with media_type path parameter (#4552, @machichima)
- [Chore] [history server] Add GOPROXY support for history server builds (#4410, @fangyinc)
- chore: Add historyserver option to bug report dropdown (#4441, @JiangJiaWei1103)
- Fix history server config path in set_up_historyserver.md (#4492, @win5923)
- [history server][e2e] Fix missing manifestPath argument in ApplyHistoryServer call (#4495, @win5923)
- [Test] [history server] [collector] Add collector e2e tests (#4308, @JiangJiaWei1103)
- [Test] [history server] [collector] Ensure event type coverage (#4343, @JiangJiaWei1103)
- [Test] [historyserver] [collector] Enhance log file coverage test for both head and worker pods (#4351, @JiangJiaWei1103)
- [Test][HistoryServer] E2E test for live clusters (#4406, @win5923)
- [historyserver] add prometheus health e2e test for live session (#4449, @my-vegetable-has-exploded)
- [history server][e2e] Add Azure Blob Storage E2E test for History Server (#4460, @ikchifo)
- [Test][HistoryServer] E2E test for dead cluster actor endpoint (#4461, @fangyinc)
- [Docs] Add history server collector setup doc (#4303, @JiangJiaWei1103)
- [Docs] [history server] Create service account for history server deployment (#4396, @JiangJiaWei1103)
- [Docs][History Server] update instructions for live cluster section (#4408, @machichima)
- [History Server] Fix API response format to match Ray Dashboard frontend schema (#4615, @Future-Outlier)
- [Feat] Add Ray Cron Job (#4159, @machichima)
- [chore] fix cronjob crd inconsistent (#4292, @rueian)
- [Feat] Cron job add suspend (#4313, @machichima)
- [E2E] [RayCronJob] add e2e test for suspend behavior (#4349, @AndySung320)
- [CronJob] Add RayCronJob related YAML (#4577, @machichima)
- [Feature] Support recreate pods for RayCluster using RayClusterSpec.upgradeStrategy (#4185, @win5923)
- [RayCluster] Status includes head container status message (#4196, @spencer-p)
- [RayCluster] Improved the efficiency when checking rayclusters' expectations (#4209, @harryge00)
- [Feature Enhancement] Set ordered replica index label to support multi-slice (#4163, @ryanaoleary)
- ray-operator: always set RAY_CLUSTER_NAMESPACE using downward API (#4467, @andrewsykim)
- respect explicit plasma-directory for shared memory injection (#4524, @lorriexingfang)
- Exclude tolerations and scheduling gates from RayCluster spec hash (#4569, @lorriexingfang)
- Promote RayMultiHostIndexing feature to Beta (#4572, @ryanaoleary)
- Disable RayMultiHostIndexing feature for TestReconcile_Multihost_Replicas (#4585, @Future-Outlier)
- [RayCluster] Add more context why we don't recreate head Pod for RayJob (#4175, @kevin85421)
- feat: Add runtimeClassName support for head and worker Pods (#4184, @Narwhal-fish)
- Use HTTP health check when possible (#4448, @spencer-p)
- [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted (#4141, @400Ping)
- [Bug][RayJob] Sidecar mode shouldn't restart head pod when head pod is deleted (#4234, @400Ping)
- [RayJob] Lift cluster status while initializing (#4191, @spencer-p)
- [Fix] rayjob update raycluster status (#4192, @machichima)
- [RayJob] Remove updateJobStatus call (#4198, @spencer-p)
- [Feature] Support JobDeploymentStatus as the deletion condition (#4262, @JiangJiaWei1103)
- [Fix][RayJob] Do not requeue RayJobs when suspended (#4443, @EthanGuoliang)
- [Feature][RayJob] Remove wget dependency in Sidecar GCS wait (#4468, @400Ping)
- [Feat] Rayjob add preRunningDeadlineSeconds (#4525, @machichima)
- add validation and delete check with test (#4527, @hango880623)
- [RayJob] Promote RayJobDeletionPolicy feature gate to Beta (#4576, @Future-Outlier)
- background goroutine get job info (#4160, @fscnick)
- [RayJob] update light weight submitter image from quay.io (#4181, @Future-Outlier)
- [flaky] RayJob fails when head Pod is deleted when job is running (#4182, @Future-Outlier)
- [RayJob] Fix sidecar mode flaky test (#4208, @Future-Outlier)
- Feat/background goroutine get job info test (#4368, @fscnick)
- Revert "Feat/background goroutine get job info test" (#4433, @rueian)
- [Refactor] [Test] Add helpers and use auto cleanup for testing the RayJob deletion strategy (#4363, @JiangJiaWei1103)
- [RayService] Directly fail CR if is invalid (#4228, @win5923)
- [RayService] Migrate from Endpoints API to EndpointSlice API for RayService (#4245, @seanlaii)
- Remove erroneous call in applyServeTargetCapacity (#4212, @ryanaoleary)
- Add managedBy field to RayService (#4491, @lorriexingfang)
- Add RayService IncrementalUpgrade E2E tests to Buildkite (#4497, @ryanaoleary)
- Include serve-x prefixed ports in ray serve service for grpc support (#4558, @lorriexingfang)
- [Autoscaler] Add validation to require RayCluster v2 when using idleTimeoutSeconds (#4162, @alimaazamat)
- [Autoscaler] validate idleTimeoutSeconds for AutoscalerOptions (#4267, @alimaazamat)
- Revert "[Test][Autoscaler] deflaky unexpected dead actors in tests by setting max_restarts=-1 (#3700)" (#4271, @win5923)
- [scheduler] Update kai_scheduler.go to support RayJob and RayService (#4418, @enoodle)
- [scheduler] Make KAI-Scheduler support RayJob InteractiveMode (#4508, @Future-Outlier)
- [Feat] Sync RayJob or RayCluster annotations to Volcano PodGroup (#4340, @dushulin)
- [Fix][operator] Fix volcano podgroup stuck in inqueue state after rayjob completes (#4476, @fangyinc)
- [Chore] Remove unused variable in volcano scheduler (#4223, @seanlaii)
- Clean up unused label for volcano scheduler (#4305, @seanlaii)
- [Fix] Remove quotes from numeric fields in KAI Scheduler sample YAMLs (#4533, @AndySung320)
- [Helm] Fix: inject flag-based env into ConfigMap when configuration.enabled=true (#4270, @win5923)
- Add Helm values for ResourceClaims to RayCluster (#4290, @nojnhuh)
- [helm chart] add default resources for additionalWorkerGroups (#4511, @yuhuan130)
- Fix issue in helm charts where ingress backend name might not match route/service name (#4532, @marosset)
- Make replicas configurable for kuberay-operator (#4195, @divyamraj18)
- Feature/kubectl plugin/improve support for autoscaling clusters (#4146, @AndySung320)
- [Feat][kubectl-plugin] Add shell completion for kubectl ray get [workergroups|nodes] (#4291, @justinyeh1995)
- [kubectl-plugin][Test] Use client-go reactors for FieldSelector filtering in fake client tests (#4361, @ikchifo)
- [Dockerfile] [KubeRay Dashboard]: Fix Dockerfile warnings (ENV format, CMD JSON args) (#4167, @cchung100m)
- [Dockerfile] [KubeRay Dashboard] Update docker base image (#4193, @kash2104)
- [KubeRay Dashboard] Remove exampleJobs.ts which is no longer used (#4247, @CheyuWu)
- feat: add allow method to api server when allow cors (#4259, @CheyuWu)
- [APIServer][Docs] Add user guide for retry behavior & configuration (#4144, @justinyeh1995)
- [PodPool-VK] add podpool vk README (#4251, @lw309637554)
- [PodPool-VK] add skeleton code for cache pod manager (#4475, @lw309637554)
- [Bug] Fix health probes to use custom ports from rayStartParams (#4041, @MiniSho)
- [Fix] Resolve int32 overflow by having the calculation in int64 (#4158, @justinyeh1995)
- fix: Return upon update error for active and pending clusters (#4273, @JiangJiaWei1103)
- Fix: Move replica validation logic to right place (#4307, @kash2104)
- fix: dashboard http client tests discovered and passing (#4173, @alimaazamat)
- fix: hardening kuberay operator security context (#4243, @LilyLinh)
- [CI] Pin Docker api version to avoid API version mismatch (#4188, @win5923)
- [CI] Increase RayJob E2E test timeout from 40m to 60m (#4432, @Future-Outlier)
- [CI] Fix apiserversdk test failures by updating setup-envtest (#4499, @AndySung320)
- fixing covdata errors when running go tests via make targets (#4535, @marosset)
- [master] Fix Ray CI integration for release automation (#4370, @Future-Outlier)
- generate clientset with 1.35 code-generator (#4347, @KunWuLuan)
- Support Multi-Arch Image in CI (#4348, @KunWuLuan)
- Updating k8s.io/api/admission/v1beta1 usage to k8s.io/api/admission/v1 (#4571, @marosset)
- Update setup-envtest and use K8s v1.34 by default for ray-controller tests (#4434, @marosset)
- [Test] Fixing SubmittedFinishedTimeout e2e test for K8s v1.33+ clusters (#4428, @marosset)
- [Chore] Upgrade golangci-lint to v2.7.2 and adjust linting configurations (#4007, @seanlaii)
- [Chore] Upgrade Golang version to v1.25 (#4269, @JiangJiaWei1103)
- [Chore] Enable modernize linter (#4317, @seanlaii)
- [Chore] Fix testifylint and gci lint issues (#4293, @seanlaii)
- [Chore] Fix errorlint lint issues (#4306, @justinyeh1995)
- [Chore] Fix gosec, govet and errcheck lint issues (#4309, @win5923)
- [Chore] Fix staticcheck lint errors (#4326, @seanlaii)
- [Chore] Fix noctx, revive lint issues (#4333, @justinyeh1995)
- [Chore] Upgrade operator version in test-sample-yamls (#4248, @seanlaii)
- chore: Bump up KuberayUpgradeVersion default version for e2e test (#4331, @JiangJiaWei1103)
- chore: update Dockerfile deps (#4435, @alimaazamat)
- feature: Remove empty resource list initialization (#4168, @kash2104)
- [Refactor] Consolidate duplicate test utilities for maintainability (#4038, @Tomlord1122)
- update stale feature gate comments (#4174, @andrewsykim)
- Edit RayCluster example config for label selectors (#4151, @ryanaoleary)
- Add RayService incremental upgrade sample for guide (#4164, @ryanaoleary)
- Update README with additional resource links (#4230, @win5923)
- Add example in GKE to enable Ray resource isolation using cgroupsv2 and writable cgroup containers (#4236, @andrewsykim)
- add sample that uses --system-reserved-cpu and --system-reserved-memory (#4237, @andrewsykim)
- [Docs] Upgrade kind base image to v1.26.0 (#4252, @JiangJiaWei1103)
- docs: Show missing phony targets and align styles (#4295, @JiangJiaWei1103)
- docs: Clarify multi-arch phony comments (#4311, @JiangJiaWei1103)
- Update head and worker pod resources in sample manifests (#4288, @ChenYi015)
- [Config] Change all RayCluster headGroupSpec limit memory to 5Gi (#4328, @CheyuWu)
- chore: Use double quoted resource values in sample manifest files (#4339, @kash2104)
- chore: Use RayCluster name as SA name for RBAC auth (#4611, @Future-Outlier)
- [Helm] Update ray-cluster default resource values (#4588, @Future-Outlier)
- Add Google Artifact Registry image build/push guide (#4618, @Future-Outlier)
- Bump glob from 10.4.5 to 10.5.0 in /dashboard (#4207, @dependabot)
- Bump next from 15.4.9 to 15.4.10 in /dashboard (#4266, @dependabot)
- Bump next from 15.4.8 to 15.4.9 in /dashboard (#4264, @dependabot)
- Bump next from 15.2.4 to 15.4.8 in /dashboard (#4254, @dependabot)
- Bump urllib3 from 2.5.0 to 2.6.0 in /clients/python-client (#4260, @dependabot)
- Bump js-yaml from 4.1.0 to 4.1.1 in /dashboard (#4194, @dependabot)
- [Release][Helm] update KubeRay version to v1.6.0 (#4622, @Future-Outlier)
- [Release] Sync master changes to release-1.6 (#4600, @rueian)
- [Release] Update KubeRay version references for 1.6.0 (#4586, @Future-Outlier)
- Change Ray/Kuberay Google Calendar and Kuberay Sync link (#4401, @Future-Outlier)
- Revert "Change Ray/Kuberay Google Calendar link (#4401)" (#4536, @Future-Outlier)
- [Test][Release] Change upgrade test version to test upgrade from 1.5.1 to 1.6.0 (#4623, @Future-Outlier)
# Contributors
Thanks to all the contributors who made this release possible!
@400Ping, @AndySung320, @ChenYi015, @CheyuWu, @EthanGuoliang, @Future-Outlier, @JiangJiaWei1103, @KunWuLuan, @LilyLinh, @MiniSho, @Narwhal-fish, @Tomlord1122, @alimaazamat, @andrewsykim, @cchung100m, @chiayi, @divyamraj18, @dushulin, @enoodle, @fangyinc, @fscnick, @fweilun, @hango880623, @harryge00, @ikchifo, @justinyeh1995, @kash2104, @kevin85421, @lorriexingfang, @lw309637554, @machichima, @marosset, @my-vegetable-has-exploded, @nojnhuh, @rueian, @ryanaoleary, @ryankert01, @seanlaii, @spencer-p, @win5923, @yuhuan130