ray-project/kuberay v1.3.0


Highlights

RayCluster Conditions API

The RayCluster conditions API is graduating to Beta status in v1.3. The new API surfaces details about a RayCluster’s observable state that could not be expressed in the old API. The following conditions are supported in v1.3: AllPodRunningAndReadyFirstTime, RayClusterPodsProvisioning, HeadPodNotFound, and HeadPodRunningAndReady. More conditions will be added in future releases.
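
For illustration, here is a hypothetical status excerpt showing how these conditions might appear on a RayCluster. The condition type/reason pairing, names, and timestamps below are assumptions; consult the CRD for the exact schema.

```yaml
# Hypothetical `kubectl get raycluster raycluster-sample -o yaml` excerpt.
status:
  conditions:
  - type: HeadPodReady                      # assumed condition type
    status: "True"
    reason: HeadPodRunningAndReady
    message: Head Pod is running and ready
    lastTransitionTime: "2025-02-01T00:00:00Z"
  - type: RayClusterProvisioned             # assumed condition type
    status: "True"
    reason: AllPodRunningAndReadyFirstTime
    message: All Ray Pods are ready for the first time
    lastTransitionTime: "2025-02-01T00:00:00Z"
```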

Ray Kubectl Plugin

The Ray Kubectl Plugin is graduating to Beta status. The following commands are supported with KubeRay v1.3:

  • kubectl ray logs <cluster-name>: download Ray logs to a local directory
  • kubectl ray session <cluster-name>: initiate a port-forwarding session to the Ray head
  • kubectl ray create cluster <cluster-name>: create a Ray cluster
  • kubectl ray job submit: create a RayJob and submit a job using a local working directory

See the Ray Kubectl Plugin docs for more details.

RayJob Stability Improvements

Several improvements have been made to enhance the stability of long-running RayJobs. In particular, when using submissionMode=K8sJobMode, job submissions will no longer fail due to duplicate submission IDs. Now, if a submission ID already exists, the logs of the existing job are retrieved instead.
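
As a reminder of the configuration involved, below is a minimal, hypothetical RayJob using K8sJobMode; the name, image, entrypoint, and cluster spec are placeholders.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample                # hypothetical name
spec:
  submissionMode: K8sJobMode         # submitter runs as a Kubernetes Job; retries reuse the existing submission
  entrypoint: python -c "import ray; ray.init()"
  rayClusterSpec:
    rayVersion: "2.41.0"
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.41.0
```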

RayService API Improvements

RayService strives to deliver zero-downtime serving. When changes in the RayService spec cannot be applied in place, it attempts to migrate traffic to a new RayCluster in the background. However, users might not always have sufficient resources for a new RayCluster. Beginning with KubeRay 1.3, users can customize this behavior using the new UpgradeStrategy option within the RayServiceSpec.
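
A minimal sketch of how the new option might be set follows; the names, Serve config, and cluster details are placeholders, and the RayService docs remain the authoritative reference for the schema.

```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-sample            # hypothetical name
spec:
  upgradeStrategy:
    type: None                       # do not create a new RayCluster; NewCluster keeps zero-downtime upgrades
  serveConfigV2: |
    applications:
    - name: example_app              # hypothetical Serve application
      import_path: example.module:app
      route_prefix: /
  rayClusterConfig:
    rayVersion: "2.41.0"
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.41.0
```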

Previously, the serviceStatus field in RayService was inconsistent and did not accurately represent the actual state. Starting with KubeRay v1.3.0, we have introduced two conditions, Ready and UpgradeInProgress, to RayService. Following the approach taken with RayCluster, we have decided to deprecate serviceStatus. In the future, serviceStatus will be removed, and conditions will serve as the definitive source of truth. For now, serviceStatus remains available but is limited to two possible values: "Running" or an empty string.
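
An illustrative (not authoritative) status excerpt with the new conditions and the now-limited serviceStatus field might look like this; the messages shown are assumptions.

```yaml
status:
  serviceStatus: Running             # deprecated; only "Running" or "" going forward
  conditions:
  - type: Ready
    status: "True"
    message: RayService is ready to serve requests   # assumed message
  - type: UpgradeInProgress
    status: "False"
    message: No pending RayCluster                   # assumed message
```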

GCS Fault Tolerance API Improvements

The new GcsFaultToleranceOptions field in the RayCluster spec provides a streamlined way for users to enable GCS Fault Tolerance on a RayCluster. This eliminates the previous need to spread related settings across Pod annotations, container environment variables, and rayStartParams. Furthermore, users can now specify their Redis username in the newly introduced field (requires Ray 2.41.0 or later). To see the impact of this change on a YAML configuration, please refer to the example manifest.
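
A hedged sketch of what this looks like in a RayCluster spec is shown below; the Redis address, Secret name, and credential wiring are assumptions, and the example manifest in the repository is the source of truth.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-gcs-ft            # hypothetical name
spec:
  gcsFaultToleranceOptions:
    redisAddress: "redis:6379"       # external Redis endpoint (placeholder)
    redisUsername:                   # see the Ray version requirement noted above
      value: default
    redisPassword:
      valueFrom:
        secretKeyRef:
          name: redis-password-secret   # hypothetical Secret
          key: password
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray:2.41.0
```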

Breaking Changes

RayService API

Starting from KubeRay v1.3.0, we have removed all possible values of RayService.Status.ServiceStatus except Running, so the only valid values for ServiceStatus are Running and empty. If ServiceStatus is Running, it means that RayService is ready to serve requests. In other words, ServiceStatus is equivalent to the Ready condition. It is strongly recommended to use the Ready condition instead of ServiceStatus going forward.

Features

  • RayCluster Conditions API is graduating to Beta status. The feature gate RayClusterStatusConditions is now enabled by default.
  • New events were added for RayCluster, RayJob and RayService for improved observability
  • Various improvements to Ray autoscaler v2
  • Introduce a new API, spec.upgradeStrategy, in the RayService spec. The upgrade strategy type can be set to NewCluster or None to modify the behavior of zero-downtime upgrades for RayService.
  • Add RayCluster controller expectations to mitigate stale informer caches
  • RayJob now supports submission mode InteractiveMode. Use this submission mode when you want to submit jobs from a local working directory on your laptop.
  • RayJob now supports the spec.deletionPolicy API; this feature requires the RayJobDeletionPolicy feature gate to be enabled. Initial deletion policies are DeleteCluster, DeleteWorkers, DeleteSelf, and DeleteNone (see the sketch after this list).
  • KubeRay now detects TPU and Neuron Core resources and passes them as custom resources in the ray start parameters
  • Introduce RayClusterSuspending and RayClusterSuspended conditions
  • Container CPU requests are now used for Ray --num-cpus if the CPU limit is not specified
  • Various example manifests for using TPU v6 with KubeRay
  • Add ManagedBy field to RayJob and RayCluster. This is required for MultiKueue support.
  • Add support for kubectl ray create cluster command
  • Add support for kubectl ray create workergroup command
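
For the deletionPolicy feature referenced above, here is a minimal, hypothetical RayJob illustrating the intent; the field shape, names, and cluster spec are placeholders, and the RayJobDeletionPolicy feature gate must be enabled.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-cleanup-sample        # hypothetical name
spec:
  deletionPolicy: DeleteCluster      # assumed field shape; delete the RayCluster when the job finishes
  submissionMode: K8sJobMode
  entrypoint: python my_script.py    # placeholder entrypoint
  rayClusterSpec:
    rayVersion: "2.41.0"
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.41.0
```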

Guides & Tutorials

Changelog

  • [Fix][RayCluster] fix missing pod name in CreatedWorkerPod and Failed… (#3057, @rueian)
  • [Refactor] Use constants for image tag, image repo, and versions in golang to avoid hard-coded strings (#2978, @400Ping)
  • Update TPU Ray CR manifests to use Ray 2.41.0 (#2965, @ryanaoleary)
  • Update samples to use Ray 2.41.0 images (#2964, @andrewsykim)
  • [Test] Use GcsFaultToleranceOptions in test and backward compatibility (#2972, @fscnick)
  • [chore][docs] enable Markdownlint rule MD004 (#2973, @davidxia)
  • [release] Update Volcano YAML files to Ray 2.41 (#2976, @win5923)
  • [release] Update Yunikorn YAML file to Ray 2.41 (#2969, @kenchung285)
  • [CI] Change Pre-commit-shellcheck-to-shellcheck-py (#2974, @owenowenisme)
  • [chore][docs] enable Markdownlint rule MD010 (#2975, @davidxia)
  • [Release] Upgrade ray-job.batch-inference.yaml image to 2.41 (#2971, @MortalHappiness)
  • [RayService] adapter vllm 0.6.1.post2 (#2823, @pxp531)
  • [release][9/N] Update text summarizer RayService to Ray 2.41 (#2961, @kevin85421)
  • [RayService] Deflaky RayService envtest (#2962, @kevin85421)
  • [RayJob] Deflaky RayJob e2e tests (#2963, @kevin85421)
  • [fix][kubectl-plugin] set worker group CPU limit (#2958, @davidxia)
  • [docs][kubectl-plugin] fix incorrect example commands (#2951, @davidxia)
  • [release][8/N] Upgrade Stable Diffusion RayService to Ray 2.41 (#2960, @kevin85421)
  • [kubectl-plugin] Fix panic when GPU resource is not set (#2954, @win5923)
  • [docs][kubectl-plugin] improve help messages (#2952, @davidxia)
  • [CI] Enable testifylint len rule (#2945, @LeoLiao123)
  • [release][7/N] Update RayService YAMLs (#2956, @kevin85421)
  • [Fix][RayJob] Invalid quote for RayJob submitter (#2949, @MortalHappiness)
  • [chore][kubectl-plugin] use consistent capitalization (#2950, @davidxia)
  • [chore] add Markdown linting pre-commit hook (#2953, @davidxia)
  • [chore][kubectl-plugin] use better test assertions (#2955, @davidxia)
  • [CI] Add shellcheck and fix error of it (#2933, @owenowenisme)
  • [docs][kubectl-plugin] add dev docs (#2912, @davidxia)
  • [release][6/N] Remove unnecessary YAMLs (#2946, @kevin85421)
  • [release][5/N] Update some RayJob YAMLs from Ray 2.9 to Ray 2.41 (#2941, @kevin85421)
  • [release][4/N] Update Ray images / versions in kubectl plugin (#2938, @kevin85421)
  • [release][3/N] Update RayService e2e tests YAML files from Ray 2.9 to Ray 2.41 (#2937, @kevin85421)
  • [release][2/N] Update RayCluster Helm chart from Ray 2.9 to Ray 2.41 (#2936, @kevin85421)
  • Delete [raycluster|rayjob|rayservice]_types_test.go unnecessary tests (#2935, @kevin85421)
  • [release][1/N] Update YAMLs from Ray 2.9 to Ray 2.41 (#2934, @kevin85421)
  • [CI] Generate CRD json schema separately in pre-commit (#2930, @MortalHappiness)
  • [CI] Enable testifylint expected-actual rule (#2914, @davidxia)
  • [docs] move pre-commit instructions to main dev docs (#2921, @davidxia)
  • [CI] Enable testifylint float-compare rule (#2910, @MortalHappiness)
  • [CI] Fix lint error (require-error) (#2931, @MortalHappiness)
  • [kubectl-plugin] support general kubectl switches like --context (#2883, @davidxia)
  • [CI] Enable testifylint require-error rule (#2909, @MortalHappiness)
  • [chore][kubectl-plugin] use consistent capitalization (#2922, @davidxia)
  • [RayService] Refactor unit tests for ShouldPrepareNewCluster (#2928, @kevin85421)
  • [RayService] Add a safeguard to prevent overriding the pending cluster during a upgrade (#2887, @rueian)
  • [CI] Auto download golang tools in pre-commit (#2917, @MortalHappiness)
  • [CI] Enable testifylint bool-compare rule (#2911, @400Ping)
  • [CI] Enable testifylint empty rule (#2908, @400Ping)
  • [CI] Enable testifylint formatter rule (#2915, @400Ping)
  • [Fix][kubectl-plugin] make tests use a temporary kube config (#2894, @davidxia)
  • [kubectl-plugin] update context error messages (#2891, @davidxia)
  • Use webhook.CustomValidator instead of deprecated webhook.Validator. (#2803, @mbobrovskyi)
  • [kubectl-plugin][feat] support specifying number of head GPUs (#2895, @davidxia)
  • [CI] Enable testifylint error-nil rule (#2907, @MortalHappiness)
  • [CI] Enable testifylint rule (#2896, @MortalHappiness)
  • [Fix][kubectl-plugin] Fix no context nil error SIGSEGV in tests (#2892, @MortalHappiness)
  • [docs][ray-operator] fix typo in Golang version (#2893, @davidxia)
  • [RayService] Refactor envtests (#2888, @kevin85421)
  • [RayService] Remove outdated env tests (#2886, @kevin85421)
  • [RayService] More envtests that follow the most common scenario in the RayService code path (#2880, @rueian)
  • [Fix][kubectl-plugin]: make version handle digests (#2876, @davidxia)
  • [kubectl-plugin] fix minor typo (#2884, @davidxia)
  • [RayService] Add zero-downtime triggered test after rayVersion is updated (#2881, @owenowenisme)
  • [CI] Remove compatibility-test.py and modified CI (#2882, @owenowenisme)
  • [RayService] Refactor updateRayClusterInstance (#2875, @kevin85421)
  • [RayService] Refactor createRayClusterInstance (#2874, @kevin85421)
  • [RayService] Create k8s events after creating/updating k8s resources (#2873, @rueian)
  • Rewrite detached actor test with go (#2722, @owenowenisme)
  • [RayService] Add an envtest for autoscaler (#2872, @kevin85421)
  • [RayService] Add unit tests for isZeroDowntimeUpgradeEnabled (#2871, @kevin85421)
  • [RayService] Setting observedGeneration inside calculateStatus (#2869, @kevin85421)
  • [RayService] Add an envtest for RayService happy path (#2868, @kevin85421)
  • [RayService] Trim Redis Cleanup job less than 63 chars (#2846, @aviadshimoni)
  • [CI]: change kubectl plugin e2e test to buildkite (#2861, @hcc429)
  • [Refactor] Move test name from map key to struct field (#2865, @win5923)
  • [RayService] Merge initConditions into calculateConditions (#2866, @rueian)
  • [RayService] Add checks of RayService conditions in e2e tests (#2864, @kevin85421)
  • [RayService] Mark ServiceStatus as deprecated (#2863, @kevin85421)
  • [RayService] Refactor reconcileRayCluster to avoid updating CR status in the function (#2859, @kevin85421)
  • Refactor multiple cases in single test function with array (#2857, @owenowenisme)
  • [Grafana] Add a Cluster variable to the Grafana Dashboard to enable filtering of different RayClusters (#2685, @win5923)
  • [Autoscaler][Test] Fix flaky idleTimeoutSeconds test (#2862, @ryanaoleary)
  • Add KubeRay e2e Test for custom idleTimeoutSeconds with v2 Autoscaler (#2725, @ryanaoleary)
  • Best practice for fault-tolerant redis with kuberay (#2684, @spencer-p)
  • [RayService] Remove the dependencies between constructRayClusterForRayService and the reconciler to make it more unit testable (#2853, @kevin85421)
  • [RayCluster] e2e test for GCS FT with Redis Username (#2855, @rueian)
  • [RayCluster] Update sample yamls to use the new gcsFaultToleranceOptions option (#2856, @rueian)
  • [RayService] Use Ready condition in e2e tests (#2854, @rueian)
  • [Compatibility] Update Redis image for compatibility tests (#2852, @rueian)
  • [Chore] Make error as a local variable (#2841, @fscnick)
  • [RayService] reword the comment on ServiceStatus = rayv1.Running (#2848, @rueian)
  • [RayService] Use Ready condition in e2e tests (#2849, @kevin85421)
  • [RayService] e2e for redeploying RayServe application after recreating a new Head Pod (#2834, @rueian)
  • [RayService] Remove WaitForServeDeploymentReady (#2842, @kevin85421)
  • [RayService] Move cleanUpRayClusterInstance from reconcileRayCluster to Reconcile (#2838, @kevin85421)
  • [RayService] Passing serve applications to calculateStatus and avoid calling Status().Update(...) inside reconcileServe (#2831, @kevin85421)
  • [RayService] refactor envtest by adding a util function rayServiceTemplate (#2833, @kevin85421)
  • Fix FromAsCasing warning. (#2830, @mbobrovskyi)
  • [RayService] Avoid passing RayServiceStatus to functions in reconcileServe (#2828, @kevin85421)
  • [RayService] Remove updateStatusForActiveCluster (#2827, @kevin85421)
  • [RayService] Move the update of RayClusterStatus to calculateStatus (#2826, @kevin85421)
  • [RayService] Remove HealthLastUpdateTime from ServeDeploymentStatus (#2825, @kevin85421)
  • [Chore] Modify pre-commit yaml to allow golangci-lint version with prefix "v" (#2824, @owenowenisme)
  • [RayService] make checkIfNeedSubmitServeApplications more unit testable (#2822, @kevin85421)
  • [Refactor][RayService] Add conditions to RayService (#2807, @MortalHappiness)
  • [RayService] Add logs and remove in-place update for the TestOldHeadPodFailDuringUpgrade e2e test (#2819, @kevin85421)
  • [RayService] e2e for check the readiness of head Pods for both pending / active clusters (#2806, @rueian)
  • [Refactor] Move validateRayServiceSpec to validation.go and its unit test to validation_test.go (#2816, @CheyuWu)
  • [RayService] Calculate status based on K8s resources (#2818, @kevin85421)
  • [RayService] Unify the cluster switch over logic together (#2805, @rueian)
  • [Refactor] Move function ValidateRayJobSpec to validation.go and its unit test (#2812, @CheyuWu)
  • [Chore] make ingressClassName as a local variable (#2815, @fscnick)
  • [autoscaler] Bump Ray e2e test image (#2814, @ryanaoleary)
  • [Refactor] Move ValidateRayJobStatus to validation.go and create its unit test (#2813, @CheyuWu)
  • [Chore] remove redundant var declaration (#2811, @fscnick)
  • Bump golang.org/x/net from 0.26.0 to 0.33.0 in /proto (#2723, @dependabot[bot])
  • [Refactor] Move ValidateRayClusterSpec to validation.go and its unit test to validation_test.go (#2790, @CheyuWu)
  • [Chore] update comment for headGroupSpec and entrypoint (#2802, @Abirdcfly)
  • Bump golang.org/x/net to v0.33.0 fix upstream vulnerability (#2799, @ryanaoleary)
  • [CI] Skip kubectl plugin flaky e2e tests (#2800, @MortalHappiness)
  • [RayCluster][Feature] reject redis username to head pod out side of GcsFaultToleranceOptions (#2796, @rueian)
  • [CI] Make kubectl plugin release can only be triggered manually (#2798, @MortalHappiness)
  • [kubectl-plugin] silence warnings when creating worker groups (#2792, @andrewsykim)
  • [Refactor] Move validateRayClusterStatus function to validation.go and move unit test to validation_test.go (#2780, @CheyuWu)
  • [RayJob][Chore] make err as a local variable (#2789, @fscnick)
  • [RayService] Rename Restarting to PreparingNewCluster (#2785, @kevin85421)
  • [RayService] Always check the readiness of head Pods for both pending / active clusters if cluster exists (#2783, @kevin85421)
  • [RayCluster][Feature] add redis username to head pod from GcsFaultToleranceOptions (#2760, @win5923)
  • [Refactor]: Move IsGCSFaultToleranceEnabled to utils.go (#2779, @CheyuWu)
  • [RayService] Add a safeguard and remove the dead code to ensure that both clusters are not empty before reconciling serve (#2778, @kevin85421)
  • [RayService] Move the cluster switch logic from reconcileServe to Reconcile (#2777, @kevin85421)
  • [RayService] Avoid sending health check requests to the head Pod when excludeHeadPodFromServeSvc is true (#2776, @kevin85421)
  • [GCS FT] Redis e2e cleanup check (#2773, @rueian)
  • [Refactor] Add a util function IsAutoscalingEnabled and refactor validations of RayJob deletion policy (#2775, @kevin85421)
  • [RayJob] RayJob deletion policy validation (#2771, @rueian)
  • [GCS FT] More validations for configuring GCS FT with envs and annotations (#2772, @rueian)
  • [GCS FT] Add e2e tests for configuring GCS FT with annotations (#2766, @kevin85421)
  • Move matching labels to association.go (#2734, @owenowenisme)
  • skip suspending worker groups if the RayJobDeletionPolicy feature flag is not enabled (#2770, @rueian)
  • [RayCluster][Feature] skip suspending worker groups if the in-tree autoscaler is enabled (#2748, @rueian)
  • [RayJob] Follow up of RayJob deletion policy PR (#2763, @kevin85421)
  • [GCS FT] Unify configuring Gcs FT into a single function (#2755, @kevin85421)
  • [RayJob] implement deletion policy API (#2643, @andrewsykim)
  • [Refactor] Move functions that don’t rely on the controller to non-controller member functions (#2747, @win5923)
  • [kubectl-plugin] add create workergroup command (#2673, @andrewsykim)
  • [kubectl-plugin] add --worker-gpu flag for cluster creation (#2675, @andrewsykim)
  • [RayCluster][Refactor] use RayClusterAllPodsAssociationOptions instead (#2756, @fscnick)
  • [RayCluster] Validate GCSFaultToleranceOptions and redis password (#2754, @kevin85421)
  • [RayCluster][Feature] add redis password to head pod from GcsFaultToleranceOptions (#2731, @fscnick)
  • [Fix][kubectl-plugin] Create separate namespaces for each kubectl plugin e2e test (#2745, @MortalHappiness)
  • [chore] Refactor GetHeadPort (#2750, @kevin85421)
  • [RayCluster] Validate RayClusterSpec for empty containers and GCS FT (#2749, @kevin85421)
  • [Feature] Validation of RayFTEnabled is false and GcsFaultToleranceOption is not nil (#2726, @CheyuWu)
  • [RayCluster][Fix] DesiredReplicas, MinReplicas and MaxReplicas should respect workerGroupSpec.Suspend (#2728, @rueian)
  • [Refactor] move validateRayClusterStatus out of RayClusterReconciler (#2738, @CheyuWu)
  • [Fix] Update Ray Service Troubleshooting Link (#2727, @simotw)
  • Remove preStop hooks from Ray CR Samples (#2724, @ryanaoleary)
  • [Chore] specify the capacity on calling make (#2719, @fscnick)
  • [RayCluster][Feature] setup GCS FT annotations and the RAY_REDIS_ADDRESS env by the GcsFaultToleranceOptions (#2721, @rueian)
  • [Feat][kubectl-plugin] Retry port-forward when connection lost (#2704, @MortalHappiness)
  • [Refactor][Kubectl-plugin] Replace dynamic client with Ray client (#2703, @MortalHappiness)
  • [kubectl-plugin] Add support for retrieving logs for different ray resource types (#2677, @chiayi)
  • [RayCluster][Feature] add GcsFaultToleranceOptions to the RayCluster CRD [1/N] (#2715, @rueian)
  • [Refactor][RayService] Unify ClusterAction decision to single function (#2716, @MortalHappiness)
  • [Chore] make err as local variable in if-statement (#2718, @fscnick)
  • Refactor validateRayServiceSpec (#2711, @hcc429)
  • [CI] Downgrade runner image from ubuntu-latest to ubuntu-22.04 (#2714, @owenowenisme)
  • [Refactor] Replace Hard-Coded HTTP Values with Constants (#2702, @simotw)
  • Refactor UpgradeStrategy to UpgradeSpec.Type (#2678, @ryanaoleary)
  • [Chore] remove unnecessary line break in log (#2709, @fscnick)
  • [Refactor] Remove ingress in service controller (#2708, @owenowenisme)
  • [RayService][refactor] Remove updateState (#2705, @kevin85421)
  • [Feature] Support ARM image for test (#2699, @simotw)
  • [chore] remove redundant interface check (#2700, @fscnick)
  • [Feature]: Add a new event type FailedToDeleteWorkerPodCollection (#2680, @CheyuWu)
  • [Feature] Add an e2e test for K8s Job submitter failures (#2688, @simotw)
  • [RayCluster][CI] add e2e tests for the RayClusterSuspended status condition (#2686, @rueian)
  • [RayCluster] support suspending worker groups (#2663, @andrewsykim)
  • [Fix][RayService] Use LRU cache for ServeConfigs (#2683, @MortalHappiness)
  • [Prometheus] Use PodMonitor instead of ServiceMonitor for the Head Node to avoid metric duplication (#2689, @win5923)
  • [Feature] Print KubeRay logs in Buildkite runner when tests fail (#2690, @LeoLiao123)
  • [Fix][Doc] Fix development markdown example (#2687, @owenowenisme)
  • [CI] split rayservice e2e test into another runner and decrease timeout to 30m (#2667, @fscnick)
  • [kubectl-plugin] fix worker resources in 'kubectl ray create cluster' command (#2671, @andrewsykim)
  • [CI][Hotfix] Increase the timeout of Test E2E from 30m to 1h (#2664, @kevin85421)
  • [Feature][RayService]Add kubernetes event to inform user of upgrade strategy (#2592, @chiayi)
  • [RayCluster][CI] add e2e tests for RayClusterStatusCondition (#2661, @rueian)
  • [CI] Deflaky TestRayServiceGCSFaultTolerance (#2660, @kevin85421)
  • [kubectl-plugin] Add rayjob yaml generation to ray job submit command (#2644, @chiayi)
  • [Feature] RayService HA test - GCS fault tolerance + kill GCS process (#2590, @CheyuWu)
  • [RayService] Use waitGroup to ensure a goroutine's completion before the RayService HA test ends (#2657, @win5923)
  • [kubectl-plugin] Add kubectl ray delete rayservice/job/cluster (#2635, @chiayi)
  • default RayClusterStatusConditions=true in helm-chart (#2656, @andrewsykim)
  • [RayJob][Refactor] use ray job status and ray job logs to be tolerant of duplicated job submissions (#2579, @rueian)
  • [CI] Unjail TestRayServiceInPlaceUpdate (#2650, @kevin85421)
  • [kubectl-plugin] Make sure kubectl ray logs only get ray container logs (#2649, @chiayi)
  • [Refactor] Use global constants for Ray versions and Ray image versions in go tests (#2641, @simotw)
  • prioritize memory limits over requests for /dev/shm size (#2642, @andrewsykim)
  • [Chore][CI] Remove StreamKubeRayOperatorLogs (#2637, @MortalHappiness)
  • [CI] Move e2e tests to buildkite (#2639, @MortalHappiness)
  • [CI] Jail flaky test: TestRayServiceInPlaceUpdate (#2638, @kevin85421)
  • [RayService] follow up for #2598 (#2636, @kevin85421)
  • Update swagger-initializer.js (#2543, @metasyn)
  • [Feature] Add an e2e test for Autoscaler to scale up by manually updating minReplicas (#2634, @LeoLiao123)
  • [Feat]: Add a field to configure whether to add a proxy actor on the head Pod to the K8s serve service or not (#2598, @CheyuWu)
  • [kubectl-plugin] Add kubectl ray create cluster (#2607, @chiayi)
  • [Feature] Add ManagedBy field to RayCluster (#2597, @mszadkow)
  • [Cleanup] Align RayJob's ManagedBy with RayCluster's ManagedBy. (#2630, @mszadkow)
  • [Feature] Add e2e tests for Autoscaler V2 (#2588, @simotw)
  • fix: update topology spread constraints for custom worker pools (#2633, @TessaIO)
  • [RayCluster][Fix] leave .Status.State untouched when there is a reconcile error (#2622, @rueian)
  • Convert byte slice and string without copy (#2628, @dentiny)
  • [Chore][CI] Upgrade ray version to 2.40 except for TestRayServiceInPlaceUpdate (#2629, @MortalHappiness)
  • Fix/make helm and kustomize consistent (#2624, @fscnick)
  • Add a util function to convert string and bytes array (#2621, @dentiny)
  • [kubectl-plugin] Add e2e test for kubectl ray job submit (#2614, @chiayi)
  • [Feature] Add ManagedBy field to RayJob (#2589, @mszadkow)
  • [Bug] TestRayServiceInPlaceUpdate is flaky (#2620, @kevin85421)
  • [Test] Add ray-cluster.fluentbit.yaml to sample YAML tests (#2611, @win5923)
  • Add test for autoscaler and its desired state (#2601, @dentiny)
  • [Feature][kubectl-plugin] e2e test for 'kubectl ray log' (#2486, @chiayi)
  • [Doc] Fix RayCluster auth sample to include --config-file in kube-rbac-proxy (#2604, @andrewsykim)
  • [RayCluster][Feature] Make RayClusterStatusConditions feature gate Beta and enabled by default (#2562, @rueian)
  • [Docs] Add sample yaml RayCluster with FluentBit sidecar to persist Ray logs (#2602, @win5923)
  • [Doc] Remove KubeRay CLI references and add Python client details (#2521, @nadongjun)
  • [RayService][Refactor] Change the ServeConfigs to nested map (#2591, @MortalHappiness)
  • [no-op] Avoid implicit package import (#2599, @dentiny)
  • [Feature][kubectl-plugin] return usage error when no entrypoint input (#2503, @chiayi)
  • [Docs] add sample RayCluster using kube-rbac-proxy for dashboard access control (#2578, @andrewsykim)
  • feat: support default function for containerEnv on additionalWorkerGroups in ray-cluster helm chart (#2570, @TessaIO)
  • [RayCluster][Fix] Add expectations of RayCluster (#2150, @Eikykun)
  • [Feat] Remove RayService sample YAML Python tests (#2565, @CheyuWu)
  • [Test] Implement RayService In-place update test in Golang (#2536, @CheyuWu)
  • Add workerGroupSpec.idleTimeoutSeconds to v1 RayCluster CRD (#2558, @ryanaoleary)
  • remove autoscaler's permission of patch pods (#2559, @KunWuLuan)
  • some logs are not json format (#2535, @fscnick)
  • [Feature] Disable zero downtime upgrade for a RayService using RayServiceSpec (#2468, @chiayi)
  • [RayCluster] don't allow overriding ray.io/cluster label (#2555, @andrewsykim)
  • [Refactor][kubectl-plugin] Change kubectl ray cluster get to get cluster (#2493, @MortalHappiness)
  • [Test][HA] Test high-availability during zero-downtime upgrade (#2539, @MortalHappiness)
  • [RayService][Refactor] Remove ctrlResult (#2545, @kevin85421)
  • [RayService][Refactor] Avoid flooding Kubernetes events (#2546, @kevin85421)
  • [API Server] Add Ray Job output - start/end time and ray cluster name (#2533, @han-steve)
  • [API Server] Add security context to Ray Cluster (#2538, @han-steve)
  • [Chore][precommit] Replace grep with awk in pre-commit hooks for BSD compatibility (#2541, @nadongjun)
  • [Helm] add sizeLimit for emptyDir (#2532, @win5923)
  • [Logging] Remove duplicate info in CR logs (#2531, @nadongjun)
  • [Test][HA] Ray Autoscaler enabled + Ray Serve autoscaling-enabled (#2485, @MortalHappiness)
  • [Logging] add context info for yunikorn logger (#2522, @win5923)
  • [BUG] Fix Dockerfile WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (#2527, @win5923)
  • Revert "[BUG] Fix Dockerfile Error: WARN: FromAsCasing: 'as' and 'FROM' Keywords' Casing Do Not match (#2527)" (#2529, @kevin85421)
  • [metrics] Add ray_io_cluster to all Ray metrics (#2524, @kevin85421)
  • [CI] Fix RayService CI (#2525, @kevin85421)
  • [Feat] Add sample yaml for RayJob clusterSelector config (#2505, @MortalHappiness)
  • [Feat][Sample-yaml] Deprecated python sample yaml test cleanup (#2507, @MortalHappiness)
  • Update v6e-256 KubeRay Sample (#2466, @ryanaoleary)
  • [Logging] Avoid using fmt.Sprintf inside logging functions (#2508, @vincent-626)
  • Add TPU to Known Custom Accelerators for generated rayStartCommand (#2495, @ryanaoleary)
  • [Test] Query dashboard to get the serve application status in head pod (#2489, @CheyuWu)
  • [Refactor] Extract KubectlApplyYaml and yaml deserialization to support package (#2498, @MortalHappiness)
  • [Test] Check all applications in Ray Serve are running (#2496, @win5923)
  • test: add sample yaml rayjob test cases (#2487, @fscnick)
  • Add topology spread constraints test for RayCluster (#2472, @YoussefEssDS)
  • [Test] Check .status.numServeEndpoints is greater than zero (#2488, @win5923)
  • [Test] Check RayService can successfully create RayCluster (#2475, @win5923)
  • Drop unused permission + configurable binary path (#2478, @bpineau)
  • [Feature][kubectl-plugin]'ray log command' Add check and cleanup directory when no ray node exist (#2473, @chiayi)
  • [REFACTOR]: refactor execute pod cmd with client-go function (#2467, @CheyuWu)
  • [Fix][Helm] Fix ClusterRole for volcano if .Values.batchScheduler.name is set (#2474, @MortalHappiness)
  • [scheduler] Setting both EnableBatchScheduler and BatchScheduler at the same time is not allowed (#2471, @kevin85421)
  • [Refactor][Test] Don't compose Gomega in Test struct (#2470, @MortalHappiness)
  • [Feature][kubectl-plugin] Add all and worker node type to kubectl ray log (#2442, @chiayi)
  • [Feature][kubectl-plugin] Fix for setting job submission ID in kubectl ray job submit (#2469, @chiayi)
  • test: add check ray import testcase (#2459, @CheyuWu)
  • [Chore][kubectl-plugin] Fix wrong homepage link in krew template file (#2461, @MortalHappiness)
  • [Fix] Consistent parsing of custom accelerator resources (#2464, @mounchin)
  • Upgrade kustomization files to Kustomize v5 (#2352, @oksanabaza)
  • [Docs][kubectl-plugin] Add doc for install via Krew (#2458, @MortalHappiness)
  • fix(apiserver): env MEMORY_* use memory ResourceField (#2438, @Abirdcfly)
  • [Feature][kubectl-plugin] add KubeRay operator version query (#2443, @win5923)
  • [Docs][kubectl-plugin] Add instructions for downloading from GitHub release (#2450, @MortalHappiness)
  • [Test][sample-yaml] Check RayCluster create correct number of pods (#2434, @MortalHappiness)
  • [RayJob] UserMode -> InteractiveMode and check rayjob.spec.jobId instead of annotation (#2446, @andrewsykim)
  • Update V6e TPU Ray Samples (#2448, @ryanaoleary)
  • Fix v6e TPU Scripts and RayJob CRs (#2447, @ryanaoleary)
  • feat: add InvalidRayJobSpec, InvalidRayJobStatus, InvalidRayServiceSpec and InvalidRayClusterStatus events (#2441, @rueian)
  • Add TPU v6e sample manifests and scripts (#2445, @ryanaoleary)
  • [Fix] Return error when marshaling tolerations fails in NewComputeTemplate (#2444, @YQ-Wang)
  • support extended resources for Ray pods (#2436, @YQ-Wang)
  • [Fix][RayService] Raise error if spec.rayClusterConfig.headGroupSpec.headService.metadata.name is set (#2440, @MortalHappiness)
  • Fall back to CPU requests if limit is not specified (#2365, @andrewsykim)
  • Add dnsConfig to head, worker and additional workers (#2377, @edward2a)
  • [Refactor] Support adding custom accelerator to resources in rayStartParams (#2425, @mounchin)
  • [Feature][RayCluster]: introduce RayClusterSuspending and RayClusterSuspended conditions (#2403, @rueian)
  • [Chore] Use Ray 2.9.0 for Apache YuniKorn example (#2427, @kevin85421)
  • [Feat][kubectl-plugin] Add kubectl ray version command (#2424, @MortalHappiness)
  • cleanup: remove unused initConnectionTimeout for RayClient in apiserver (#2399, @Abirdcfly)
  • [Chore][YuniKorn] Add sample yaml file for Apache YuniKorn (#2412, @MortalHappiness)
  • [Feat][kubectl-plugin] Include LICENSE file into kubectl plugin tar (#2422, @MortalHappiness)
  • Add support for parsing neuron core resource limit and pass it as ray… (#2409, @mounchin)
  • Add a variant of the ray data processing job with GCSFuse CSI driver (#2401, @saikat-royc)
  • [v1.2.2] Update KUBERAY_VERSION (#2417, @kevin85421)
  • [release v1.2.2] Update tags and versions (#2416, @kevin85421)
  • [release] Update Ray image to 2.34.0 (#2303, @kevin85421)
  • Revert "[release] Update Ray image to 2.34.0 (#2303)" (#2413, @kevin85421)
  • Revert "[release] Update Ray image to 2.34.0 (#2303)" (#2413) (#2415, @kevin85421)
