ray-project/kuberay v1.6.0

Highlights

Ray History Server (alpha)

KubeRay v1.6 introduces alpha support for the Ray History Server. This feature collects and aggregates events from a Ray cluster and replays them to restore historical snapshots of the cluster's state. By providing an alternative backend for the Ray Dashboard, the History Server lets users view the dashboard and debug ephemeral clusters (such as those managed via RayJob) even after they have been terminated.

Try the history server here: History Server Quick Start Guide.

⚠️ Warning: This feature is in alpha status, meaning future KubeRay releases may include breaking changes. We’d love to hear about your experience with it! Please drop your feedback in this tracking issue to help us shape its development.

Ray Token Authentication using Kubernetes RBAC

Starting in KubeRay v1.6 and Ray v2.55, you can use Kubernetes RBAC to manage user access control to Ray clusters that have token authentication enabled. With this feature enabled, Ray delegates token authentication to Kubernetes: you can use the same credentials you use with Kubernetes to access Ray clusters, and platform operators can use standard Kubernetes RBAC to control that access. See Configure Ray clusters to use Kubernetes RBAC authentication for a step-by-step guide.
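The exact RBAC rules Ray checks are covered in the step-by-step guide linked above. As an illustrative sketch only, granting a user access to a Ray cluster might look like the following; the resource, verb, and all names here are assumptions, not the documented API:

```yaml
# Illustrative sketch only: the resource and verb that Ray actually
# checks are defined in the step-by-step guide; names are assumptions.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ray-cluster-user
  namespace: default
rules:
- apiGroups: ["ray.io"]
  resources: ["rayclusters"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alice-ray-cluster-user
  namespace: default
subjects:
- kind: User
  name: alice          # hypothetical user
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ray-cluster-user
  apiGroup: rbac.authorization.k8s.io
```

The key design point is that no Ray-specific credentials need to be minted: the same Kubernetes identity and RBAC machinery governs who may reach the cluster.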

You can now also reference Secrets containing static auth tokens for Ray cluster token authentication.

apiVersion: v1
kind: Secret
metadata:
  name: ray-cluster-token
type: Opaque
stringData:
  auth_token: "super-secret-example-token"
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: ray-cluster-with-auth
spec:
  authOptions:
    mode: token
    secretName: ray-cluster-token
  rayVersion: '2.53.0'
  headGroupSpec:
    rayStartParams: {}

RayCronJob

KubeRay v1.6 introduces the RayCronJob Custom Resource Definition (CRD), enabling users to run RayJobs on a recurring schedule defined with standard cron expressions. This is useful for periodic batch processing, scheduled training runs, or recurring data pipelines.

⚠️ Warning: RayCronJob is an alpha feature and is disabled by default. To enable it, set the feature gate on the kuberay-operator:

--feature-gates=RayCronJob=true
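For example, if you run the operator from a plain Deployment manifest, the flag goes in the container args. The fragment below is illustrative; your manifest layout (or Helm values) may differ:

```yaml
# Illustrative fragment of a kuberay-operator Deployment spec.
spec:
  template:
    spec:
      containers:
      - name: kuberay-operator
        args:
        - --feature-gates=RayCronJob=true
```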

Below is an example of the new custom resource:

apiVersion: ray.io/v1
kind: RayCronJob
metadata:
  name: raycronjob-sample
spec:
  schedule: "* * * * *"
  jobTemplate:
    entrypoint: python /home/ray/samples/sample_code.py
    shutdownAfterJobFinishes: true
    ttlSecondsAfterFinished: 600
    runtimeEnvYAML: |
      pip:
        - requests==2.26.0
        - pendulum==2.1.2
      env_vars:
        counter_name: "test_counter"
    rayClusterSpec:
      rayVersion: '2.52.0'
      headGroupSpec:
        ...
        ...

See ray-cronjob.sample.yaml for a full example.

RayJob Deletion Policy API

The RayJobDeletionPolicy feature gate is graduating to Beta and is now enabled by default. This feature provides a more flexible API for expressing deletion policies within the RayJob specification. The new design moves beyond the single boolean field spec.shutdownAfterJobFinishes and lets users define distinct cleanup strategies, with configurable TTL values, based on the Ray job's status.

Below is an example of how to use this new, flexible API structure:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-deletion-rules
spec:
  deletionStrategy:
    deletionRules:
    - policy: DeleteWorkers
      condition:
        jobStatus: FAILED
        ttlSeconds: 100
    - policy: DeleteCluster
      condition:
        jobStatus: FAILED
        ttlSeconds: 600

See ray-job.deletion-rules.yaml for a comprehensive example.

Other Notable Features

  • RayJob now supports spec.preRunningDeadlineSeconds to automatically mark jobs as failed if they do not reach the Running state within the specified deadline.
  • RayService now supports spec.managedBy for improved integration with Multi-Kueue.
  • The RayMultihostIndexing feature gate is graduating to Beta and is now enabled by default. This feature provides ordered replica and host index labels that are useful for managing Ray clusters for multi-host TPU/GPU workloads that require atomic scheduling and scaling. These labels are only applied if numOfHosts > 1 in the worker group configuration.
  • KubeRay v1.6 adds a new spec.upgradeStrategy field to RayCluster. Supported values are Recreate and None. The Recreate strategy automatically recreates all Ray cluster Pods when the Ray cluster spec changes. This strategy is not recommended if Ray cluster state needs to be persisted.
  • The RayService incremental upgrade feature (alpha) now supports rollback.
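A minimal sketch combining two of the new fields above (values are illustrative, and other required spec fields are omitted; whether upgradeStrategy is a plain string or a nested object should be confirmed against the v1.6 CRD):

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-with-deadline
spec:
  # Mark the job as failed if it has not reached Running within 5 minutes.
  preRunningDeadlineSeconds: 300
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-recreate
spec:
  # Recreate all Ray cluster Pods when the cluster spec changes.
  # Not recommended if cluster state must be persisted.
  upgradeStrategy: Recreate
```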

Breaking Changes

When using Sidecar submission mode with RayJob, the Head Pod will no longer be automatically recreated after initial provisioning. This is because the submission container runs alongside the Head container, so recreating the Pod would restart the job entirely. See #4141 for more details.

CHANGELOG

Contributors

Thanks to all the contributors who made this release possible!

@400Ping, @AndySung320, @ChenYi015, @CheyuWu, @EthanGuoliang, @Future-Outlier, @JiangJiaWei1103, @KunWuLuan, @LilyLinh, @MiniSho, @Narwhal-fish, @Tomlord1122, @alimaazamat, @andrewsykim, @cchung100m, @chiayi, @divyamraj18, @dushulin, @enoodle, @fangyinc, @fscnick, @fweilun, @hango880623, @harryge00, @ikchifo, @justinyeh1995, @kash2104, @kevin85421, @lorriexingfang, @lw309637554, @machichima, @marosset, @my-vegetable-has-exploded, @nojnhuh, @rueian, @ryanaoleary, @ryankert01, @seanlaii, @spencer-p, @win5923, @yuhuan130
