Highlights
RayCluster Conditions API
The RayCluster conditions API is graduating to Beta status in v1.3. The new API provides more details about the RayCluster’s observable state that were not possible to express in the old API. The following conditions are supported for v1.3: AllPodRunningAndReadyFirstTime, RayClusterPodsProvisioning, HeadPodNotFound and HeadPodRunningAndReady. We will be adding more conditions in future releases.
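As a rough illustration of how a client might read these conditions, the sketch below scans a RayCluster status for a given entry. It assumes the conditions follow the standard Kubernetes shape (type, status, reason) under status.conditions. The helper names and the sample payload are hypothetical, and the type names HeadPodReady and RayClusterProvisioned, as well as their pairing with the reasons shown, are assumptions to verify against the RayCluster CRD.

```python
from typing import Optional


def find_condition(status: dict, cond_type: str) -> Optional[dict]:
    """Return the first condition whose type matches, or None if absent."""
    for cond in status.get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


def condition_reason(status: dict, cond_type: str) -> Optional[str]:
    """Return the reason of a condition, but only when its status is "True"."""
    cond = find_condition(status, cond_type)
    if cond is None or cond.get("status") != "True":
        return None
    return cond.get("reason")


# Hypothetical status payload; a real one would come from the Kubernetes API.
# The type/reason pairing below is an assumption, not taken from these notes.
sample_status = {
    "conditions": [
        {"type": "HeadPodReady", "status": "True",
         "reason": "HeadPodRunningAndReady"},
        {"type": "RayClusterProvisioned", "status": "False",
         "reason": "RayClusterPodsProvisioning"},
    ]
}
```

Matching on the condition type and then inspecting status and reason keeps the caller independent of the order in which the operator writes conditions.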
Ray Kubectl Plugin
The Ray Kubectl Plugin is graduating to Beta status. The following commands are supported with KubeRay v1.3:
- kubectl ray logs <cluster-name>: download Ray logs to a local directory
- kubectl ray session <cluster-name>: initiate port-forwarding session to the Ray head
- kubectl ray create <cluster>: create a Ray cluster
- kubectl ray job submit: create a RayJob and submit a job using a local working directory
See the Ray Kubectl Plugin docs for more details.
RayJob Stability Improvements
Several improvements have been made to enhance the stability of long-running RayJobs. In particular, when using submissionMode=K8sJobMode, job submissions will no longer fail due to the submission of duplicate IDs. Now, if a submission ID already exists, the logs of the existing job will be retrieved instead.
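The duplicate-ID behavior can be pictured as an idempotent "submit or attach" step. The sketch below is not KubeRay's implementation: InMemoryJobClient and submit_or_attach are hypothetical names, and the client is an in-memory stand-in rather than a real Ray job client.

```python
from typing import Dict, List, Tuple


class InMemoryJobClient:
    """Hypothetical stand-in for a Ray job client; keeps job logs in a dict."""

    def __init__(self) -> None:
        self._logs: Dict[str, List[str]] = {}

    def exists(self, submission_id: str) -> bool:
        return submission_id in self._logs

    def submit(self, submission_id: str, entrypoint: str) -> None:
        # A plain submit rejects duplicates, which is what used to fail retries.
        if self.exists(submission_id):
            raise ValueError(f"duplicate submission ID: {submission_id}")
        self._logs[submission_id] = [f"started: {entrypoint}"]

    def logs(self, submission_id: str) -> List[str]:
        return list(self._logs[submission_id])


def submit_or_attach(
    client: InMemoryJobClient, submission_id: str, entrypoint: str
) -> Tuple[str, List[str]]:
    """Submit once; on a retry, read the existing job's logs instead of failing."""
    if client.exists(submission_id):
        return "attached", client.logs(submission_id)
    client.submit(submission_id, entrypoint)
    return "submitted", client.logs(submission_id)
```

With this shape, a submitter that restarts and replays the same submission ID observes the original job rather than an error.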
RayService API Improvements
RayService strives to deliver zero-downtime serving. When changes in the RayService spec cannot be applied in place, it attempts to migrate traffic to a new RayCluster in the background. However, users might not always have sufficient resources for a new RayCluster. Beginning with KubeRay 1.3, users can customize this behavior using the new UpgradeStrategy option within the RayServiceSpec.
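A minimal sketch of the decision this option controls is shown below. The type values NewCluster and None come from this release's feature list. The function name, the returned labels, and the choice of NewCluster as the default are assumptions made for illustration only.

```python
VALID_UPGRADE_TYPES = ("NewCluster", "None")


def plan_upgrade(can_update_in_place: bool, upgrade_type: str = "NewCluster") -> str:
    """Pick an action for a RayService spec change (labels are hypothetical)."""
    if upgrade_type not in VALID_UPGRADE_TYPES:
        raise ValueError(f"unsupported upgrade type: {upgrade_type!r}")
    if can_update_in_place:
        # Changes that can be applied in place never need a second cluster.
        return "update-in-place"
    if upgrade_type == "NewCluster":
        # Zero-downtime path: bring up a new RayCluster, then migrate traffic.
        return "prepare-new-cluster"
    # "None": do not create a new cluster, e.g. when resources are scarce.
    return "keep-current-cluster"
```

For example, plan_upgrade(False, "None") models a user who opts out of provisioning a second RayCluster.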
Previously, the serviceStatus field in RayService was inconsistent and did not accurately represent the actual state. Starting with KubeRay v1.3.0, we have introduced two conditions, Ready and UpgradeInProgress, to RayService. Following the approach taken with RayCluster, we have decided to deprecate serviceStatus. In the future, serviceStatus will be removed, and conditions will serve as the definitive source of truth. For now, serviceStatus remains available but is limited to two possible values: "Running" or an empty string.
GCS Fault Tolerance API Improvements
The new GcsFaultToleranceOptions field in the RayCluster now provides a streamlined way for users to enable GCS Fault Tolerance on a RayCluster. This eliminates the previous need to distribute related settings across Pod annotations, container environment variables, and the RayStartParams. Furthermore, users can now specify their Redis username in the newly introduced field (requires Ray 2.4.1 or later). To see the impact of this change on a YAML configuration, please refer to the example manifest.
Breaking Changes
RayService API
Starting from KubeRay v1.3.0, we have removed all possible values of RayService.Status.ServiceStatus except Running, so the only valid values for ServiceStatus are Running and empty. If ServiceStatus is Running, it means that RayService is ready to serve requests. In other words, ServiceStatus is equivalent to the Ready condition. It is strongly recommended to use the Ready condition instead of ServiceStatus going forward.
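One way a client could follow this recommendation is to prefer the Ready condition and consult ServiceStatus only when no such condition is present. The sketch below assumes the status is available as a dict with the standard Kubernetes condition shape; the function name and the fallback behavior are illustrative choices, not part of KubeRay.

```python
def rayservice_is_ready(status: dict) -> bool:
    """Prefer the Ready condition; fall back to the deprecated serviceStatus."""
    for cond in status.get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    # No Ready condition found: use the deprecated field, whose only
    # remaining values are "Running" and the empty string.
    return status.get("serviceStatus") == "Running"
```

Once the fallback branch is no longer needed, it can be removed without changing callers.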
Features
- RayCluster Conditions API is graduating to Beta status. The feature gate RayClusterStatusConditions is now enabled by default.
- New events were added for RayCluster, RayJob and RayService for improved observability
- Various improvements to Ray autoscaler v2
- Introduce a new API in RayService, spec.upgradeStrategy. The upgrade strategy type can be set to NewCluster or None to modify the behavior of zero-downtime upgrades for RayService.
- Add RayCluster controller expectations to mitigate stale informer caches
- RayJob now supports submission mode InteractiveMode. Use this submission mode when you want to submit jobs from a local working directory on your laptop.
- RayJob now supports the spec.deletionPolicy API. This feature requires the RayJobDeletionPolicy feature gate to be enabled. Initial deletion policies are DeleteCluster, DeleteWorkers, DeleteSelf and DeleteNone.
- KubeRay now detects TPUs and Neuron Core resources and specifies them as custom resources to ray start parameters
- Introduce RayClusterSuspending and RayClusterSuspended conditions
- Container CPU requests are now used in Ray --num-cpus if CPU limits are not specified
- Various example manifests for using TPU v6 with KubeRay
- Add ManagedBy field in RayJob and RayCluster. This is required for Multi-Kueue support.
- Add support for kubectl ray create cluster command
- Add support for kubectl ray create workergroup command
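The CPU fallback in the feature list above (use container CPU requests for --num-cpus when no CPU limit is set) can be sketched as follows. This is an illustration, not KubeRay's code: the function names are hypothetical, only plain and milli-CPU quantities are handled, and rounding fractional CPUs up to a whole number is an assumption.

```python
import math
from typing import Optional


def parse_cpu_quantity(quantity: str) -> float:
    """Parse a Kubernetes CPU quantity such as "2", "0.5", or "500m" into cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def resolve_num_cpus(resources: dict) -> Optional[int]:
    """Return a --num-cpus value from CPU limits, else from CPU requests."""
    limits = resources.get("limits") or {}
    requests = resources.get("requests") or {}
    cpu = limits.get("cpu")
    if cpu is None:
        # No CPU limit: fall back to the CPU request.
        cpu = requests.get("cpu")
    if cpu is None:
        return None
    # Assumption: fractional CPUs are rounded up to the next whole CPU.
    return math.ceil(parse_cpu_quantity(str(cpu)))
```

A container with only requests of cpu: "1500m" would resolve to 2 under these assumptions, while a container with neither field yields None.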
Guides & Tutorials
- Use Ray Kubectl Plugin
- New sample manifests using TPU v6e chips
- Tuning Redis for a Persistent Fault Tolerant GCS
- Reducing image pull latency on Kubernetes
- Configure Ray clusters with authentication and access control using KubeRay
- RayService + vLLM examples updated to use vLLM v0.6.2
- All YAML samples in the KubeRay repo have been updated to use Ray v2.41.0
Changelog
- [Fix][RayCluster] fix missing pod name in CreatedWorkerPod and Failed… (#3057, @rueian)
- [Refactor] Use constants for image tag, image repo, and versions in golang to avoid hard-coded strings (#2978, @400Ping)
- Update TPU Ray CR manifests to use Ray 2.41.0 (#2965, @ryanaoleary)
- Update samples to use Ray 2.41.0 images (#2964, @andrewsykim)
- [Test] Use GcsFaultToleranceOptions in test and backward compatibility (#2972, @fscnick)
- [chore][docs] enable Markdownlint rule MD004 (#2973, @davidxia)
- [release] Update Volcano YAML files to Ray 2.41 (#2976, @win5923)
- [release] Update Yunikorn YAML file to Ray 2.41 (#2969, @kenchung285)
- [CI] Change Pre-commit-shellcheck-to-shellcheck-py (#2974, @owenowenisme)
- [chore][docs] enable Markdownlint rule MD010 (#2975, @davidxia)
- [Release] Upgrade ray-job.batch-inference.yaml image to 2.41 (#2971, @MortalHappiness)
- [RayService] adapter vllm 0.6.1.post2 (#2823, @pxp531)
- [release][9/N] Update text summarizer RayService to Ray 2.41 (#2961, @kevin85421)
- [RayService] Deflaky RayService envtest (#2962, @kevin85421)
- [RayJob] Deflaky RayJob e2e tests (#2963, @kevin85421)
- [fix][kubectl-plugin] set worker group CPU limit (#2958, @davidxia)
- [docs][kubectl-plugin] fix incorrect example commands (#2951, @davidxia)
- [release][8/N] Upgrade Stable Diffusion RayService to Ray 2.41 (#2960, @kevin85421)
- [kubectl-plugin] Fix panic when GPU resource is not set (#2954, @win5923)
- [docs][kubectl-plugin] improve help messages (#2952, @davidxia)
- [CI] Enable testifylint len rule (#2945, @LeoLiao123)
- [release][7/N] Update RayService YAMLs (#2956, @kevin85421)
- [Fix][RayJob] Invalid quote for RayJob submitter (#2949, @MortalHappiness)
- [chore][kubectl-plugin] use consistent capitalization (#2950, @davidxia)
- [chore] add Markdown linting pre-commit hook (#2953, @davidxia)
- [chore][kubectl-plugin] use better test assertions (#2955, @davidxia)
- [CI] Add shellcheck and fix error of it (#2933, @owenowenisme)
- [docs][kubectl-plugin] add dev docs (#2912, @davidxia)
- [release][6/N] Remove unnecessary YAMLs (#2946, @kevin85421)
- [release][5/N] Update some RayJob YAMLs from Ray 2.9 to Ray 2.41 (#2941, @kevin85421)
- [release][4/N] Update Ray images / versions in kubectl plugin (#2938, @kevin85421)
- [release][3/N] Update RayService e2e tests YAML files from Ray 2.9 to Ray 2.41 (#2937, @kevin85421)
- [release][2/N] Update RayCluster Helm chart from Ray 2.9 to Ray 2.41 (#2936, @kevin85421)
- Delete [raycluster|rayjob|rayservice]_types_test.go unnecessary tests (#2935, @kevin85421)
- [release][1/N] Update YAMLs from Ray 2.9 to Ray 2.41 (#2934, @kevin85421)
- [CI] Generate CRD json schema separately in pre-commit (#2930, @MortalHappiness)
- [CI] Enable testifylint expected-actual rule (#2914, @davidxia)
- [docs] move pre-commit instructions to main dev docs (#2921, @davidxia)
- [CI] Enable testifylint float-compare rule (#2910, @MortalHappiness)
- [CI] Fix lint error (require-error) (#2931, @MortalHappiness)
- [kubectl-plugin] support general kubectl switches like --context (#2883, @davidxia)
- [CI] Enable testifylint require-error rule (#2909, @MortalHappiness)
- [chore][kubectl-plugin] use consistent capitalization (#2922, @davidxia)
- [RayService] Refactor unit tests for ShouldPrepareNewCluster (#2928, @kevin85421)
- [RayService] Add a safeguard to prevent overriding the pending cluster during a upgrade (#2887, @rueian)
- [CI] Auto download golang tools in pre-commit (#2917, @MortalHappiness)
- [CI] Enable testifylint bool-compare rule (#2911, @400Ping)
- [CI] Enable testifylint empty rule (#2908, @400Ping)
- [CI] Enable testifylint formatter rule (#2915, @400Ping)
- [Fix][kubectl-plugin] make tests use a temporary kube config (#2894, @davidxia)
- [kubectl-plugin] update context error messages (#2891, @davidxia)
- Use webhook.CustomValidator instead of deprecated webhook.Validator. (#2803, @mbobrovskyi)
- [kubectl-plugin][feat] support specifying number of head GPUs (#2895, @davidxia)
- [CI] Enable testifylint error-nil rule (#2907, @MortalHappiness)
- [CI] Enable testifylint rule (#2896, @MortalHappiness)
- [Fix][kubectl-plugin] Fix no context nil error SIGSEGV in tests (#2892, @MortalHappiness)
- [docs][ray-operator] fix typo in Golang version (#2893, @davidxia)
- [RayService] Refactor envtests (#2888, @kevin85421)
- [RayService] Remove outdated env tests (#2886, @kevin85421)
- [RayService] More envtests that follow the most common scenario in the RayService code path (#2880, @rueian)
- [Fix][kubectl-plugin]: make version handle digests (#2876, @davidxia)
- [kubectl-plugin] fix minor typo (#2884, @davidxia)
- [RayService] Add zero-downtime triggered test after rayVersion is updated (#2881, @owenowenisme)
- [CI] Remove compatibility-test.py and modified CI (#2882, @owenowenisme)
- [RayService] Refactor updateRayClusterInstance (#2875, @kevin85421)
- [RayService] Refactor createRayClusterInstance (#2874, @kevin85421)
- [RayService] Create k8s events after creating/updating k8s resources (#2873, @rueian)
- Rewrite detached actor test with go (#2722, @owenowenisme)
- [RayService] Add an envtest for autoscaler (#2872, @kevin85421)
- [RayService] Add unit tests for isZeroDowntimeUpgradeEnabled (#2871, @kevin85421)
- [RayService] Setting observedGeneration inside calculateStatus (#2869, @kevin85421)
- [RayService] Add an envtest for RayService happy path (#2868, @kevin85421)
- [RayService] Trim Redis Cleanup job less than 63 chars (#2846, @aviadshimoni)
- [CI]: change kubectl plugin e2e test to buildkite (#2861, @hcc429)
- [Refactor] Move test name from map key to struct field (#2865, @win5923)
- [RayService] Merge initConditions into calculateConditions (#2866, @rueian)
- [RayService] Add checks of RayService conditions in e2e tests (#2864, @kevin85421)
- [RayService] Mark ServiceStatus as deprecated (#2863, @kevin85421)
- [RayService] Refactor reconcileRayCluster to avoid updating CR status in the function (#2859, @kevin85421)
- Refactor multiple cases in single test function with array (#2857, @owenowenisme)
- [Grafana] Add a Cluster variable to the Grafana Dashboard to enable filtering of different RayClusters (#2685, @win5923)
- [Autoscaler][Test] Fix flaky idleTimeoutSeconds test (#2862, @ryanaoleary)
- Add KubeRay e2e Test for custom idleTimeoutSeconds with v2 Autoscaler (#2725, @ryanaoleary)
- Best practice for fault-tolerant redis with kuberay (#2684, @spencer-p)
- [RayService] Remove the dependencies between constructRayClusterForRayService and the reconciler to make it more unit testable (#2853, @kevin85421)
- [RayCluster] e2e test for GCS FT with Redis Username (#2855, @rueian)
- [RayCluster] Update sample yamls to use the new gcsFaultToleranceOptions option (#2856, @rueian)
- [RayService] Use Ready condition in e2e tests (#2854, @rueian)
- [Compatibility] Update Redis image for compatibility tests (#2852, @rueian)
- [Chore] Make error as a local variable (#2841, @fscnick)
- [RayService] reword the comment on ServiceStatus = rayv1.Running (#2848, @rueian)
- [RayService] Use Ready condition in e2e tests (#2849, @kevin85421)
- [RayService] e2e for redeploying RayServe application after recreating a new Head Pod (#2834, @rueian)
- [RayService] Remove WaitForServeDeploymentReady (#2842, @kevin85421)
- [RayService] Move cleanUpRayClusterInstance from reconcileRayCluster to Reconcile (#2838, @kevin85421)
- [RayService] Passing serve applications to calculateStatus and avoid calling Status().Update(...) inside reconcileServe (#2831, @kevin85421)
- [RayService] refactor envtest by adding a util function rayServiceTemplate (#2833, @kevin85421)
- Fix FromAsCasing warning. (#2830, @mbobrovskyi)
- [RayService] Avoid passing RayServiceStatus to functions in reconcileServe (#2828, @kevin85421)
- [RayService] Remove updateStatusForActiveCluster (#2827, @kevin85421)
- [RayService] Move the update of RayClusterStatus to calculateStatus (#2826, @kevin85421)
- [RayService] Remove HealthLastUpdateTime from ServeDeploymentStatus (#2825, @kevin85421)
- [Chore] Modify pre-commit yaml to allow golangci-lint version with prefix "v" (#2824, @owenowenisme)
- [RayService] make checkIfNeedSubmitServeApplications more unit testable (#2822, @kevin85421)
- [Refactor][RayService] Add conditions to RayService (#2807, @MortalHappiness)
- [RayService] Add logs and remove in-place update for the TestOldHeadPodFailDuringUpgrade e2e test (#2819, @kevin85421)
- [RayService] e2e for check the readiness of head Pods for both pending / active clusters (#2806, @rueian)
- [Refactor] Move validateRayServiceSpec to validation.go and its unit test to validation_test.go (#2816, @CheyuWu)
- [RayService] Calculate status based on K8s resources (#2818, @kevin85421)
- [RayService] Unify the cluster switch over logic together (#2805, @rueian)
- [Refactor] Move function ValidateRayJobSpec to validation.go and its unit test (#2812, @CheyuWu)
- [Chore] make ingressClassName as a local variable (#2815, @fscnick)
- [autoscaler] Bump Ray e2e test image (#2814, @ryanaoleary)
- [Refactor] Move ValidateRayJobStatus to validation.go and create its unit test (#2813, @CheyuWu)
- [Chore] remove redundant var declaration (#2811, @fscnick)
- Bump golang.org/x/net from 0.26.0 to 0.33.0 in /proto (#2723, @dependabot[bot])
- [Refactor] Move ValidateRayClusterSpec to validation.go and its unit test to validation_test.go (#2790, @CheyuWu)
- [Chore] update comment for headGroupSpec and entrypoint (#2802, @Abirdcfly)
- Bump golang.org/x/net to v0.33.0 fix upstream vulnerability (#2799, @ryanaoleary)
- [CI] Skip kubectl plugin flaky e2e tests (#2800, @MortalHappiness)
- [RayCluster][Feature] reject redis username to head pod out side of GcsFaultToleranceOptions (#2796, @rueian)
- [CI] Make kubectl plugin release can only be triggered manually (#2798, @MortalHappiness)
- [kubectl-plugin] silence warnings when creating worker groups (#2792, @andrewsykim)
- [Refactor] Move validateRayClusterStatus function to validation.go and move unit test to validation_test.go (#2780, @CheyuWu)
- [RayJob][Chore] make err as a local variable (#2789, @fscnick)
- [RayService] Rename Restarting to PreparingNewCluster (#2785, @kevin85421)
- [RayService] Always check the readiness of head Pods for both pending / active clusters if cluster exists (#2783, @kevin85421)
- [RayCluster][Feature] add redis username to head pod from GcsFaultToleranceOptions (#2760, @win5923)
- [Refactor]: Move IsGCSFaultToleranceEnabled to utils.go (#2779, @CheyuWu)
- [RayService] Add a safeguard and remove the dead code to ensure that both clusters are not empty before reconciling serve (#2778, @kevin85421)
- [RayService] Move the cluster switch logic from reconcileServe to Reconcile (#2777, @kevin85421)
- [RayService] Avoid sending health check requests to the head Pod when excludeHeadPodFromServeSvc is true (#2776, @kevin85421)
- [GCS FT] Redis e2e cleanup check (#2773, @rueian)
- [Refactor] Add a util function IsAutoscalingEnabled and refactor validations of RayJob deletion policy (#2775, @kevin85421)
- [RayJob] RayJob deletion policy validation (#2771, @rueian)
- [GCS FT] More validations for configuring GCS FT with envs and annotations (#2772, @rueian)
- [GCS FT] Add e2e tests for configuring GCS FT with annotations (#2766, @kevin85421)
- Move matching labels to association.go (#2734, @owenowenisme)
- skip suspending worker groups if the RayJobDeletionPolicy feature flag is not enabled (#2770, @rueian)
- [RayCluster][Feature] skip suspending worker groups if the in-tree autoscaler is enabled (#2748, @rueian)
- [RayJob] Follow up of RayJob deletion policy PR (#2763, @kevin85421)
- [GCS FT] Unify configuring Gcs FT into a single function (#2755, @kevin85421)
- [RayJob] implement deletion policy API (#2643, @andrewsykim)
- [Refactor] Move functions that don’t rely on the controller to non-controller member functions (#2747, @win5923)
- [kubectl-plugin] add create workergroup command (#2673, @andrewsykim)
- [kubectl-plugin] add --worker-gpu flag for cluster creation (#2675, @andrewsykim)
- [RayCluster][Refactor] use RayClusterAllPodsAssociationOptions instead (#2756, @fscnick)
- [RayCluster] Validate GCSFaultToleranceOptions and redis password (#2754, @kevin85421)
- [RayCluster][Feature] add redis password to head pod from GcsFaultToleranceOptions (#2731, @fscnick)
- [Fix][kubectl-plugin] Create separate namespaces for each kubectl plugin e2e test (#2745, @MortalHappiness)
- [chore] Refactor GetHeadPort (#2750, @kevin85421)
- [RayCluster] Validate RayClusterSpec for empty containers and GCS FT (#2749, @kevin85421)
- [Feature] Validation of RayFTEnabled is false and GcsFaultToleranceOption is not nil (#2726, @CheyuWu)
- [RayCluster][Fix] DesiredReplicas, MinReplicas and MaxReplicas should respect workerGroupSpec.Suspend (#2728, @rueian)
- [Refactor] move validateRayClusterStatus out of RayClusterReconciler (#2738, @CheyuWu)
- [Fix] Update Ray Service Troubleshooting Link (#2727, @simotw)
- Remove preStop hooks from Ray CR Samples (#2724, @ryanaoleary)
- [Chore] specify the capacity on calling make (#2719, @fscnick)
- [RayCluster][Feature] setup GCS FT annotations and the RAY_REDIS_ADDRESS env by the GcsFaultToleranceOptions (#2721, @rueian)
- [Feat][kubectl-plugin] Retry port-forward when connection lost (#2704, @MortalHappiness)
- [Refactor][Kubectl-plugin] Replace dynamic client with Ray client (#2703, @MortalHappiness)
- [kubectl-plugin] Add support for retrieving logs for different ray resource types (#2677, @chiayi)
- [RayCluster][Feature] add GcsFaultToleranceOptions to the RayCluster CRD [1/N] (#2715, @rueian)
- [Refactor][RayService] Unify ClusterAction decision to single function (#2716, @MortalHappiness)
- [Chore] make err as local variable in if-statement (#2718, @fscnick)
- Refactor validateRayServiceSpec (#2711, @hcc429)
- [CI] Downgrade runner image from ubuntu-latest to ubuntu-22.04 (#2714, @owenowenisme)
- [Refactor] Replace Hard-Coded HTTP Values with Constants (#2702, @simotw)
- Refactor UpgradeStrategy to UpgradeSpec.Type (#2678, @ryanaoleary)
- [Chore] remove unnecessary line break in log (#2709, @fscnick)
- [Refactor] Remove ingress in service controller (#2708, @owenowenisme)
- [RayService][refactor] Remove updateState (#2705, @kevin85421)
- [Feature] Support ARM image for test (#2699, @simotw)
- [chore] remove redundant interface check (#2700, @fscnick)
- [Feature]: Add a new event type FailedToDeleteWorkerPodCollection (#2680, @CheyuWu)
- [Feature] Add an e2e test for K8s Job submitter failures (#2688, @simotw)
- [RayCluster][CI] add e2e tests for the RayClusterSuspended status condition (#2686, @rueian)
- [RayCluster] support suspending worker groups (#2663, @andrewsykim)
- [Fix][RayService] Use LRU cache for ServeConfigs (#2683, @MortalHappiness)
- [Prometheus] Use PodMonitor instead of ServiceMonitor for the Head Node to avoid metric duplication (#2689, @win5923)
- [Feature] Print KubeRay logs in Buildkite runner when tests fail (#2690, @LeoLiao123)
- [Fix][Doc] Fix development markdown example (#2687, @owenowenisme)
- [CI] split rayservice e2e test into another runner and decrease timeout to 30m (#2667, @fscnick)
- [kubectl-plugin] fix worker resources in 'kubectl ray create cluster' command (#2671, @andrewsykim)
- [CI][Hotfix] Increase the timeout of Test E2E from 30m to 1h (#2664, @kevin85421)
- [Feature][RayService]Add kubernetes event to inform user of upgrade strategy (#2592, @chiayi)
- [RayCluster][CI] add e2e tests for RayClusterStatusCondition (#2661, @rueian)
- [CI] Deflaky TestRayServiceGCSFaultTolerance (#2660, @kevin85421)
- [kubectl-plugin] Add rayjob yaml generation to ray job submit command (#2644, @chiayi)
- [Feature] RayService HA test - GCS fault tolerance + kill GCS process (#2590, @CheyuWu)
- [RayService] Use waitGroup to ensure a goroutine's completion before the RayService HA test ends (#2657, @win5923)
- [kubectl-plugin] Add kubectl ray delete rayservice/job/cluster (#2635, @chiayi)
- default RayClusterStatusConditions=true in helm-chart (#2656, @andrewsykim)
- [RayJob][Refactor] use ray job status and ray job logs to be tolerant of duplicated job submissions (#2579, @rueian)
- [CI] Unjail TestRayServiceInPlaceUpdate (#2650, @kevin85421)
- [kubectl-plugin] Make sure kubectl ray logs only get ray container logs (#2649, @chiayi)
- [Refactor] Use global constants for Ray versions and Ray image versions in go tests (#2641, @simotw)
- prioritize memory limits over requests for /dev/shm size (#2642, @andrewsykim)
- [Chore][CI] Remove StreamKubeRayOperatorLogs (#2637, @MortalHappiness)
- [CI] Move e2e tests to buildkite (#2639, @MortalHappiness)
- [CI] Jail flaky test: TestRayServiceInPlaceUpdate (#2638, @kevin85421)
- [RayService] follow up for #2598 (#2636, @kevin85421)
- Update swagger-initializer.js (#2543, @metasyn)
- [Feature] Add an e2e test for Autoscaler to scale up by manually updating minReplicas (#2634, @LeoLiao123)
- [Feat]: Add a field to configure whether to add a proxy actor on the head Pod to the K8s serve service or not (#2598, @CheyuWu)
- [kubectl-plugin] Add kubectl ray create cluster (#2607, @chiayi)
- [Feature] Add ManagedBy field to RayCluster (#2597, @mszadkow)
- [Cleanup] Align RayJob's ManagedBy with RayCluster's ManagedBy. (#2630, @mszadkow)
- [Feature] Add e2e tests for Autoscaler V2 (#2588, @simotw)
- fix: update topology spread constraints for custom worker pools (#2633, @TessaIO)
- [RayCluster][Fix] leave .Status.State untouched when there is a reconcile error (#2622, @rueian)
- Convert byte slice and string without copy (#2628, @dentiny)
- [Chore][CI] Upgrade ray version to 2.40 except for TestRayServiceInPlaceUpdate (#2629, @MortalHappiness)
- Fix/make helm and kustomize consistent (#2624, @fscnick)
- Add a util function to convert string and bytes array (#2621, @dentiny)
- [kubectl-plugin] Add e2e test for kubectl ray job submit (#2614, @chiayi)
- [Feature] Add ManagedBy field to RayJob (#2589, @mszadkow)
- [Bug] TestRayServiceInPlaceUpdate is flaky (#2620, @kevin85421)
- [Test] Add ray-cluster.fluentbit.yaml to sample YAML tests (#2611, @win5923)
- Add test for autoscaler and its desired state (#2601, @dentiny)
- [Feature][kubectl-plugin] e2e test for 'kubectl ray log' (#2486, @chiayi)
- [Doc] Fix RayCluster auth sample to include --config-file in kube-rbac-proxy (#2604, @andrewsykim)
- [RayCluster][Feature] Make RayClusterStatusConditions feature gate Beta and enabled by default (#2562, @rueian)
- [Docs] Add sample yaml RayCluster with FluentBit sidecar to persist Ray logs (#2602, @win5923)
- [Doc] Remove KubeRay CLI references and add Python client details (#2521, @nadongjun)
- [RayService][Refactor] Change the ServeConfigs to nested map (#2591, @MortalHappiness)
- [no-op] Avoid implicit package import (#2599, @dentiny)
- [Feature][kubectl-plugin] return usage error when no entrypoint input (#2503, @chiayi)
- [Docs] add sample RayCluster using kube-rbac-proxy for dashboard access control (#2578, @andrewsykim)
- feat: support default function for containerEnv on additionalWorkerGroups in ray-cluster helm chart (#2570, @TessaIO)
- [RayCluster][Fix] Add expectations of RayCluster (#2150, @Eikykun)
- [Feat] Remove RayService sample YAML Python tests (#2565, @CheyuWu)
- [Test] Implement RayService In-place update test in Golang (#2536, @CheyuWu)
- Add workerGroupSpec.idleTimeoutSeconds to v1 RayCluster CRD (#2558, @ryanaoleary)
- remove autoscaler's permission of patch pods (#2559, @KunWuLuan)
- some logs are not json format (#2535, @fscnick)
- [Feature] Disable zero downtime upgrade for a RayService using RayServiceSpec (#2468, @chiayi)
- [RayCluster] don't allow overriding ray.io/cluster label (#2555, @andrewsykim)
- [Refactor][kubectl-plugin] Change kubectl ray cluster get to get cluster (#2493, @MortalHappiness)
- [Test][HA] Test high-availability during zero-downtime upgrade (#2539, @MortalHappiness)
- [RayService][Refactor] Remove ctrlResult (#2545, @kevin85421)
- [RayService][Refactor] Avoid flooding Kubernetes events (#2546, @kevin85421)
- [API Server] Add Ray Job output - start/end time and ray cluster name (#2533, @han-steve)
- [API Server] Add security context to Ray Cluster (#2538, @han-steve)
- [Chore][precommit] Replace grep with awk in pre-commit hooks for BSD compatibility (#2541, @nadongjun)
- [Helm] add sizeLimit for emptyDir (#2532, @win5923)
- [Logging] Remove duplicate info in CR logs (#2531, @nadongjun)
- [Test][HA] Ray Autoscaler enabled + Ray Serve autoscaling-enabled (#2485, @MortalHappiness)
- [Logging] add context info for yunikorn logger (#2522, @win5923)
- [BUG] Fix Dockerfile WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (#2527, @win5923)
- Revert "[BUG] Fix Dockerfile Error: WARN: FromAsCasing: 'as' and 'FROM' Keywords' Casing Do Not match (#2527)" (#2529, @kevin85421)
- [BUG] Fix Dockerfile WARN: FromAsCasing: 'as' and 'FROM' keywords' casing do not match (#2527, @win5923)
- [metrics] Add ray_io_cluster to all Ray metrics (#2524, @kevin85421)
- [CI] Fix RayService CI (#2525, @kevin85421)
- [Feat] Add sample yaml for RayJob clusterSelector config (#2505, @MortalHappiness)
- [Feat][Sample-yaml] Deprecated python sample yaml test cleanup (#2507, @MortalHappiness)
- Update v6e-256 KubeRay Sample (#2466, @ryanaoleary)
- [Logging] Avoid using fmt.Sprintf inside logging functions (#2508, @vincent-626)
- Add TPU to Known Custom Accelerators for generated rayStartCommand (#2495, @ryanaoleary)
- [Test] Query dashboard to get the serve application status in head pod (#2489, @CheyuWu)
- [Refactor] Extract KubectlApplyYaml and yaml deserialization to support package (#2498, @MortalHappiness)
- [Test] Check all applications in Ray Serve are running (#2496, @win5923)
- test: add sample yaml rayjob test cases (#2487, @fscnick)
- Add topology spread constraints test for RayCluster (#2472, @YoussefEssDS)
- [Test] Check .status.numServeEndpoints is greater than zero (#2488, @win5923)
- [Test] Check RayService can successfully create RayCluster (#2475, @win5923)
- Drop unused permission + configurable binary path (#2478, @bpineau)
- [Feature][kubectl-plugin]'ray log command' Add check and cleanup directory when no ray node exist (#2473, @chiayi)
- [REFACTOR]: refactor execute pod cmd with client-go function (#2467, @CheyuWu)
- [Fix][Helm] Fix ClusterRole for volcano if .Values.batchScheduler.name is set (#2474, @MortalHappiness)
- [scheduler] Setting both EnableBatchScheduler and BatchScheduler at the same time is not allowed (#2471, @kevin85421)
- [Refactor][Test] Don't compose Gomega in Test struct (#2470, @MortalHappiness)
- [Feature][kubectl-plugin] Add all and worker node type to kubectl ray log (#2442, @chiayi)
- [Feature][kubectl-plugin] Fix for setting job submission ID in kubectl ray job submit (#2469, @chiayi)
- test: add check ray import testcase (#2459, @CheyuWu)
- [Chore][kubectl-plugin] Fix wrong homepage link in krew template file (#2461, @MortalHappiness)
- [Fix] Consistent parsing of custom accelerator resources (#2464, @mounchin)
- Upgrade kustomization files to Kustomize v5 (#2352, @oksanabaza)
- [Docs][kubectl-plugin] Add doc for install via Krew (#2458, @MortalHappiness)
- fix(apiserver): env MEMORY_* use memory ResourceField (#2438, @Abirdcfly)
- [Feature][kubectl-plugin] add KubeRay operator version query (#2443, @win5923)
- [Docs][kubectl-plugin] Add instructions for downloading from GitHub release (#2450, @MortalHappiness)
- [Test][sample-yaml] Check RayCluster create correct number of pods (#2434, @MortalHappiness)
- [RayJob] UserMode -> InteractiveMode and check rayjob.spec.jobId instead of annotation (#2446, @andrewsykim)
- Update V6e TPU Ray Samples (#2448, @ryanaoleary)
- Fix v6e TPU Scripts and RayJob CRs (#2447, @ryanaoleary)
- feat: add InvalidRayJobSpec, InvalidRayJobStatus, InvalidRayServiceSpec and InvalidRayClusterStatus events (#2441, @rueian)
- Add TPU v6e sample manifests and scripts (#2445, @ryanaoleary)
- [Fix] Return error when marshaling tolerations fails in NewComputeTemplate (#2444, @YQ-Wang)
- support extended resources for Ray pods (#2436, @YQ-Wang)
- [Fix][RayService] Raise error if spec.rayClusterConfig.headGroupSpec.headService.metadata.name is set (#2440, @MortalHappiness)
- Fall back to CPU requests if limit is not specified (#2365, @andrewsykim)
- Add dnsConfig to head, worker and additional workers (#2377, @edward2a)
- [Refactor] Support adding custom accelerator to resources in rayStartParams (#2425, @mounchin)
- [Feature][RayCluster]: introduce RayClusterSuspending and RayClusterSuspended conditions (#2403, @rueian)
- [Chore] Use Ray 2.9.0 for Apache YuniKorn example (#2427, @kevin85421)
- [Feat][kubectl-plugin] Add kubectl ray version command (#2424, @MortalHappiness)
- cleanup: remove unused initConnectionTimeout for RayClient in apiserver (#2399, @Abirdcfly)
- [Chore][YuniKorn] Add sample yaml file for Apache YuniKorn (#2412, @MortalHappiness)
- [Feat][kubectl-plugin] Include LICENSE file into kubectl plugin tar (#2422, @MortalHappiness)
- Add support for parsing neuron core resource limit and pass it as ray… (#2409, @mounchin)
- Add a variant of the ray data processing job with GCSFuse CSI driver (#2401, @saikat-royc)
- [v1.2.2] Update KUBERAY_VERSION (#2417, @kevin85421)
- [release v1.2.2] Update tags and versions (#2416, @kevin85421)
- [release] Update Ray image to 2.34.0 (#2303, @kevin85421)
- Revert "[release] Update Ray image to 2.34.0 (#2303)" (#2413, @kevin85421)
- Revert "[release] Update Ray image to 2.34.0 (#2303)" (#2413) (#2415, @kevin85421)
- [release] Update Ray image to 2.34.0 (#2303, @kevin85421)
- Revert "[release] Update Ray image to 2.34.0 (#2303)" (#2413, @kevin85421)