Highlights
- RayCluster CRD status observability improvement: design doc
- Support retry in RayJob: #2192
- Coding style improvement
RayCluster
- [RayCluster][Fix] evicted head-pod can be recreated or restarted (#2217, @JasonChen86899)
- [Test][RayCluster] Add tests for RestartPolicyOnFailure for eviction (#2302, @MortalHappiness)
- kuberay autoscaler pod use same command and args as ray head container (#2268, @cswangzheng)
- Updated default timeout seconds for probes (#2265, @HarshAgarwal11)
- Buildkite autoscaler e2e (#2199, @rueian)
- [Test][Autoscaler][2/n] Add Ray Autoscaler e2e tests for GPU workers (#2181, @rueian)
- [Test][Autoscaler][1/n] Add Ray Autoscaler e2e tests (#2168, @kevin85421)
- [Bug] Fix RayCluster with an overridden app.kubernetes.io/name (#2147) (#2166, @rueian)
- [Feat][RayCluster] Make the Head service headless (#2117, @rueian)
- [Refactor][RayCluster] Make ray.io/group=headgroup be constant (#1970, @rueian)
- [Feature][autoscaler v2] Set RAY_NODE_TYPE_NAME when starting ray node (#1973, @kevin85421)
- feat: add RayCluster.status.readyWorkerReplicas (#1930, @davidxia)
- [Chore][Samples] Rename ray-cluster.mini.yaml and add workerGroupSpecs (#2100, @MortalHappiness)
- [Chore] Delete redundant pod existence checking (#2113, @MortalHappiness)
- [Autoscaler V2] Polish Autoscaler V2 YAML (#2064, @kevin85421)
- [Refactor] Use RayClusterHeadPodsAssociationOptions to replace MatchingLabels (#2056, @evalaiyc98)
- [Sample][autoscaler v2] Add sample yaml for autoscaler v2 (#1974, @rickyyx)
- Allow configuration of restartPolicy (#2197, @c0dearm)
- [Chore][Log] Delete error loggings right before returned errors (#2103, @MortalHappiness)
- [Refactor] Follow-up for PR 1930 (#2124, @MortalHappiness)
- [Test] Move StateTransitionTimes envtest to a better place (#2111, @kevin85421)
- support using proxy subresources when connecting to Ray head node (#1980, @andrewsykim)
- [Bug] All worker Pods are deleted if using KubeRay v1.0.0 CRD with KubeRay operator v1.1.0 image (#2087, @kevin85421)
- [Bug] KubeRay operator failed to watch endpoint (#2080, @kevin85421)
- [Refactor] Remove cleanupInvalidVolumeMounts (#2104, @kevin85421)
- [Chore] Run operator outside the cluster (#2090, @MortalHappiness)
- [Feat] Deprecate ForcedClusterUpgrade (#2075, @MortalHappiness)
- [Bug] Ray operator crashes when specifying RayCluster with resources.limits but no resources.requests (#2077, @kevin85421)
RayCluster CRD status improvement
- RayClusterProvisioned status should be set while cluster is being provisioned for the first time (#2304, @andrewsykim)
- Add RayClusterProvisioned Condition Type (#2301, @Yicheng-Lu-llll)
- [Test][RayCluster] Add envtests for RayCluster conditions (#2283, @MortalHappiness)
- [Fix][RayCluster] Make the RayClusterReplicaFailureReason to capture the correct reason (#2282, @rueian)
- Add RayClusterReady Condition Type (#2271, @Yicheng-Lu-llll)
- [Feature][RayCluster]: Implement the HeadReady condition (#2261, @cchen777)
- [Feature] REP 54: Add PodName to the HeadInfo (#2266, @rueian)
- [Feat][RayCluster] Use a new RayClusterReplicaFailure condition to reflect the result of reconcilePods (#2259, @rueian)
- Don’t assign the rayv1.Failed to the State field (#2258, @Yicheng-Lu-llll)
- [Refactor][RayCluster] Unify status update to single place (#2249, @MortalHappiness)
- [Feat][RayCluster] Introduce the RayClusterStatus.Conditions field (#2214, @rueian)
- [Test][Autoscaling] Add custom resource test (#2193, @MortalHappiness)
- feat: record last state transition times (#2053, @davidxia)
- [RayCluster] Add serviceName to status.headInfo (#2089, @andrewsykim)
- [RayCluster][Status][1/n] Remove ClusterState Unhealthy (#2068, @kevin85421)
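The conditions introduced above follow the usual Kubernetes pattern of a list of typed conditions with a True/False/Unknown status. The following is a minimal sketch of how a client might read them, not the operator's implementation. It assumes a local Condition struct standing in for metav1.Condition, and it takes the condition names from the PR titles above; the exact string values the operator writes may differ. The helper names findCondition and isConditionTrue are hypothetical.

```go
package main

import "fmt"

// Condition is a local stand-in for metav1.Condition, trimmed to the fields
// this sketch needs so that it compiles without Kubernetes dependencies.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// Condition type names copied from the PR titles in this section.
// Assumption: the strings stored by the operator may not match exactly.
const (
	RayClusterProvisioned    = "RayClusterProvisioned"
	HeadReady                = "HeadReady"
	RayClusterReplicaFailure = "RayClusterReplicaFailure"
)

// findCondition (hypothetical helper) returns the condition with the given
// type, or nil if the list does not contain it.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// isConditionTrue (hypothetical helper) reports whether the condition is
// present and its Status is "True". A missing condition counts as not true.
func isConditionTrue(conds []Condition, condType string) bool {
	c := findCondition(conds, condType)
	return c != nil && c.Status == "True"
}

func main() {
	// Example status as a client might see it after the head Pod is ready.
	conds := []Condition{
		{Type: HeadReady, Status: "True", Reason: "ExampleReason"},
		{Type: RayClusterReplicaFailure, Status: "False"},
	}

	fmt.Println("head ready:", isConditionTrue(conds, HeadReady))
	fmt.Println("provisioned:", isConditionTrue(conds, RayClusterProvisioned))
	fmt.Println("replica failure:", isConditionTrue(conds, RayClusterReplicaFailure))
}
```

Treating a missing condition as "not true" keeps callers safe against older operators that do not populate the Conditions field.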
Coding style improvement
- [Style] Fix golangci-lint rule: govet (#2144, @MortalHappiness)
- [Chore] Fix golangci-lint rule: gosec (#2163, @MortalHappiness)
- [Style] Fix golangci-lint rule: nolintlint (#2196, @MortalHappiness)
- [Style] Fix golangci-lint rule: unparam (#2195, @MortalHappiness)
- [Fix][CI] Fix revive error (#2183, @MortalHappiness)
- [Style] Fix golangci-lint rule: revive (#2167, @MortalHappiness)
- [Style] Fix golangci-lint rule: ginkgolinter (#2164, @MortalHappiness)
- [Style] Fix golangci-lint rule: errorlint (#2141, @MortalHappiness)
- [Chore] Use new golangci-lint rules only for ray-operator (#2152, @MortalHappiness)
- [Docs][Development] Delete linting docs (#2145, @MortalHappiness)
- [Style] Fix golangci-lint rule: unconvert (#2143, @MortalHappiness)
- [Style] Fix golangci-lint rule: noctx (#2142, @MortalHappiness)
- [Fix][precommit] Fix pre-commit golangci-lint always succeed (#2140, @MortalHappiness)
- [N/N][Chore] Add golangci-lint rules (#2128, @MortalHappiness)
- [Chore] Turn off no-commit-to-branch rule (#2139, @MortalHappiness)
- [5/N][Refactor] Run golangci-lint for all files (only autofix rules) (#2133, @MortalHappiness)
- [4/N][Chore] Turn off golangci-lint rules except ray-operator (#2138, @MortalHappiness)
- [3/N][CI] Replace lint CI with pre-commit (#2129, @MortalHappiness)
- [2/N][Refactor] Run pre-commit for all files (without golangci-lint) (#2130, @MortalHappiness)
- [1/N][Chore] Add pre-commit hooks (#2127, @MortalHappiness)
RayJob
- [RayJob] allow create verb for services/proxy, which is required for HTTPMode (#2321, @andrewsykim)
- [Fix][Sample-Yaml] Increase ray head CPU resource for pytorch mnist (#2330, @MortalHappiness)
- Support Apache YuniKorn as one batch scheduler option (#2184, @yangwwei)
- [RayJob] add RayJob pass Deadline e2e-test with retry (#2241, @karta1502545)
- add feature gate mechanism to ray-operator (#2219, @andrewsykim)
- [RayJob] add Failing RayJob in HTTPMode e2e test for rayjob with retry (#2242, @tinaxfwu)
- [Feat][RayJob] Delete RayJob CR after job termination (#2225, @MortalHappiness)
- reconcile concurrency flag should apply for RayJob and RayService controllers (#2228, @andrewsykim)
- [RayJob] add Failing submitter K8s Job e2e test for rayjob with retry (#2226, @cchen777)
- [Chore] Create example Modin RayJob (#2221, @MortalHappiness)
- [RayJob] Unified checkBackoffLimitAndUpdateStatusIfNeeded codepath and add an e2e test for retry (#2215, @kevin85421)
- [RayJob] Add spec.backoffLimit for retrying RayJobs with new clusters (#2192, @andrewsykim)
- fix: update ray-job.pytorch-image-classifier.yaml (#2178, @davidxia)
- Add test for configurable k8s job backoff limit (#2134, @jjyao)
- Init dashboardClientFunc and httpProxyClientFunc by the config arg (#2092, @bugorz)
- [RayJob] Add Tests for Atomic Suspend Operation (#2050, @Yicheng-Lu-llll)
- Show cluster name in kubectl get rayjob (#2065, @jjyao)
- [RayJob] Add Cluster Name For Rayjob. (#2046, @slfan1989)
- Make k8s job backoff limit configurable for RayJob (#2091, @jjyao)
- [Refactor] Renaming RayHttpProxyClient attribute UseProxy #1980 (#2093, @Xiao75896453)
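The retry support added through spec.backoffLimit (#2192) can be illustrated with a small sketch of the retry decision. This is not the operator's code. It assumes the semantics mirror the Kubernetes Job backoffLimit, meaning one initial attempt plus up to backoffLimit retries, each on a new cluster; shouldRetry is a hypothetical helper name.

```go
package main

import "fmt"

// shouldRetry (hypothetical helper) reports whether another attempt is
// allowed after `failed` attempts have already failed.
// Assumption: total attempts = backoffLimit + 1, as with Kubernetes Jobs.
func shouldRetry(failed, backoffLimit int32) bool {
	return failed <= backoffLimit
}

func main() {
	const backoffLimit = 2

	var failed int32
	attempts := 0
	for {
		attempts++
		failed++ // pretend every attempt fails
		if !shouldRetry(failed, backoffLimit) {
			break
		}
	}

	// With backoffLimit = 2 the job is tried three times before giving up.
	fmt.Println("attempts:", attempts)
}
```

Under this assumption a backoffLimit of 0 means a failed job is never retried.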
RayService
- Generate RayCluster Hash on KubeRay Version Change (#2320, @ryanaoleary)
- [release blocker][v1.2.0] Fix HA OOM issue (#2313, @kevin85421)
- add vLLM + RayService sample (#2289, @andrewsykim)
- Fix logging issue for FetchHeadServiceURL (#2216, @tinaxfwu)
- Add RayService Manifests for Stable Diffusion TPU Examples (#2198, @ryanaoleary)
- [Test][HA] RayService high-availability test without autoscaling enabled (#2176, @kevin85421)
- RayService: Omits Min and Max replicas from hash calculation (#2172, @kpitzen)
- Ray serve gke gateway ingress (#1978, @ravishtiwari)
- [RayService] Add RayService High Availability Test Doc (#1986, @Yicheng-Lu-llll)
- [Refactor][RayService] Add GetRayClusterWithRayServiceAssociationOptions (#2070, @evalaiyc98)
- [Hotfix] Increase the timeout of the ProxyActor health check (#2082, @kevin85421)
KubeRay kubectl plugin
- add kubectl-plugin directory to kuberay.code-workspace (#2291, @andrewsykim)
- Add basic e2e test for kubectl plugin (#2287, @chiayi)
- Add unit test for cluster get and add steps in workflows (#2263, @chiayi)
- Add kubectl-plugin lint to pre-commit (#2255, @chiayi)
- Add kubectl plugin with basic command (#2243, @chiayi)
- Fix for deploy error with deprecated cli (#2251, @chiayi)
- Deprecate Kuberay CLI for Ray Kubectl plugin (#2246, @chiayi)
Helm
- [Helm] Enable leader election when leaderElectionEnabled is not set (#2284, @kevin85421)
- Support disable leader election for manager go binary via Values.yaml to mitigate kuberay restarts (#2262, @aviadshimoni)
- Change the rules in role.yaml and multiple_namespaces_role.yaml to use the same template in _helpers.tpl to ensure consistency. (#2244, @LeoLiao123)
- fix: Add deletecollection for multi-namespace role to helm charts (#2231, @spencer-barton-klaviyo)
- [Chore] Use safe YAML for helm-chart-verify-rbac (#2230, @spencer-p)
- Properly set env field based on containerEnv values (#2175, @arueth)
- add priority and priorityClassName to ray-cluster template (#2171, @walterddr)
- Added Pod securityContext value to Helm charts (#2160, @arueth)
- [Fix][Helm chart] Move service.headService -> head.headService in values.yaml (#1998, @jjaniec)
- Add NumOfHosts to RayCluster helm-chart template (#1969, @ryanaoleary)
Benchmark
- [perf-tests] make the bucket name and prefix configurable for ray data image resize job (#2156, @andrewsykim)
- [perf-test] update 100 RayJob perf tests to use PyTorch trainer and Ray Data examples (#2149, @andrewsykim)
- Add RayJob training example using pytorch resnet image classifier (#2107, @andrewsykim)
- [Perf] Add a CPU-based image resizing workload using Ray Data (#2135, @kevin85421)
- [Perf] Add NUM_WORKERS and CPUS_PER_WORKER env to the mnist workload (#2126, @rueian)
- [Perf] Add a CPU-based training workload (#2116, @kevin85421)
- [Perf] Improve perf-test YAMLs and README (#2110, @kevin85421)
- add initial perf tests for 100 RayCluster and 100 RayJob (#2102, @andrewsykim)
Others
- [release v1.2.0] Update tags and versions (#2342, @kevin85421)
- [release v1.2.0-rc.1] Update tags and versions (#2341, @kevin85421)
- Bump go to 1.22.4 to fix ray-operator vulnerabilities (#2325, @ryanaoleary)
- [Telemetry][v1.2.0] Update KUBERAY_VERSION (#2309, @kevin85421)
- [release v1.2.0-rc.0] Update tags and versions (#2308, @kevin85421)
- [release] Update Ray image to 2.34.0 (#2303, @kevin85421)
- [Minor][Chore] Fix wrong comment (#2294, @MortalHappiness)
- [Fix][Envtest] Decorate container nodes with Ordered (#2285, @MortalHappiness)
- [Minor] Remove redundant variable (#2281, @MortalHappiness)
- [Bug] Issue with glibc version GLIBC_2.34 and GLIBC_2.32 not found in earlier operator tags (#2272, @kevin85421)
- Bump google.golang.org/grpc from 1.64.0 to 1.64.1 in /experimental (#2248, @dependabot[bot])
- Bump google.golang.org/grpc from 1.64.0 to 1.64.1 in /cli (#2229, @dependabot[bot])
- [core] Bump Go Dependencies (#2205, @thomasdesr)
- [Fix] Use go 1.22 on Buildkite autoscaler e2e tests (#2211, @rueian)
- fix invalid link in docs/index.md (#2179, @yf4n)
- added autoscaling support to Python APIs (#2159, @blublinsky)
- [CI] Remove unnecessary sample YAML symbolic links (#2118, @kevin85421)
- [release][v1.1.0] Improve release doc and update KubeRay API server chart's repository (#1960, @kevin85421)
- Post release 1.1.0 (#2040, @kevin85421)
- [Post v1.1.0] Run the sample YAML tests with KubeRay v1.1.0 (#2039, @kevin85421)
- [Release] Update release Makefile (#2037, @kevin85421)
- [Hotfix][CI] Pin setup-envtest dep (#2038, @kevin85421)
- [Grafana] Update Grafana dashboard (#2106, @kevin85421)
- [CI] Pin kustomize to v5.3.0 (#2067, @kevin85421)
- [Doc] Fix Doc Typos (#2060, @slfan1989)
- [Doc] Fix Yaml Typos (#2049, @slfan1989)
- Remove extraneous arguments from examples (#2051, @thomasdesr)
- Support for Image pull policy (#2101, @blublinsky)
- CVE fix - Upgrade golang.org/x/net (#2081, @ChristianZaccaria)
- [Refactor] Rename raycluster_controller_fake_test.go to XXX_unit_test.go (#2074, @MortalHappiness)
- [Bug] Change image repository for make deploy (#2059, @kevin85421)
- Include KUBERAY_VERSION in the user-agent (#2042, @andrewsykim)
- ray-operator coding style change (#2096, @LeoLiao123)