Highlights
The KubeRay 0.4.0 release includes the following improvements.
- Integrations for the MCAD and Volcano batch scheduling systems.
- Stable Helm support for the KubeRay Operator, KubeRay API Server, and Ray clusters. These charts are now hosted at a Helm repo.
- Critical stability improvements to the Ray Autoscaler integration. (To benefit from these improvements, use KubeRay >=0.4.0 and Ray >=2.2.0.)
- Numerous improvements to CI, tests, and developer workflows, including a new configuration test framework.
- Numerous improvements to documentation.
- Bug fixes for alpha features, such as RayJobs and RayServices.
- Various improvements and bug fixes for the core RayCluster controller.
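The version requirement for the autoscaler improvements (KubeRay >=0.4.0 and Ray >=2.2.0) can be checked with a short script. The following is a minimal sketch, not part of KubeRay: the helper names `parse_version` and `has_autoscaler_fixes` are hypothetical, and it assumes plain dotted numeric version strings (pre-release tags such as `rc1` are not handled).

```python
def parse_version(text):
    """Turn a dotted numeric version string such as '2.2.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Minimum versions stated in the highlights above.
MIN_KUBERAY = parse_version("0.4.0")
MIN_RAY = parse_version("2.2.0")


def has_autoscaler_fixes(kuberay_version, ray_version):
    """Return True if both versions meet the stated minimums (hypothetical helper)."""
    return (
        parse_version(kuberay_version) >= MIN_KUBERAY
        and parse_version(ray_version) >= MIN_RAY
    )
```

Comparing integer tuples rather than raw strings keeps a version like `2.10.0` ordered after `2.2.0`.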
Contributors
The following individuals contributed to KubeRay 0.4.0. This list is alphabetical and incomplete.
@AlessandroPomponio @architkulkarni @asm582 @Basasuya @davidxia @dhaval0108 @DmitriGekhtman @haoxins @IceKhan13 @iycheng @jasoonn @Jeffwan @jianyuan @kaushik143 @kevin85421 @lizzzcai @orcahmlee @pcmoritz @peterghaddad @rafvasq @scarlet25151 @shrekris-anyscale @sigmundv @sihanwang41 @simon-mo @tbabej @tgaddair @ulfox @wilsonwang371 @wuisawesome
New features and integrations
- [Feature] Support Volcano for batch scheduling (#755, @tgaddair)
- kuberay int with MCAD (#598, @asm582)
Helm
These changes pertain to KubeRay's Helm charts.
- [Bug] Remove an unused field (ingress.enabled) from KubeRay operator chart (#812, @kevin85421)
- [helm] Add memory limits and resource documentation. (#789, @DmitriGekhtman)
- [Helm] Expose security context in helm chart. (#773, @DmitriGekhtman)
- [Helm] Clean up RayCluster Helm chart ahead of KubeRay 0.4.0 release (#751, @DmitriGekhtman)
- [Feature] Expose initContainer image in RayCluster chart (#674, @kevin85421)
- [Feature][Helm] Expose the autoscalerOptions (#666, @orcahmlee)
- [Feature][Helm] Align the key of minReplicas and maxReplicas (#663, @orcahmlee)
- Helm: add service type configuration to head group for ray-cluster (#614, @IceKhan13)
- Allow annotations in ray cluster helm chart (#574, @sigmundv)
- [Feature][Helm] Enable sidecar configuration in Helm chart (#604, @kevin85421)
- [bugfix][apiserver helm]: Adding missing rbacenable value (#594, @dhaval0108)
- [Bug] Modification of nameOverride will cause label selector mismatch for head node (#572, @kevin85421)
- [Helm][minor] Make "disabled" flag for worker groups optional (#548, @kevin85421)
- helm: Uncomment the disabled key for the default workergroup (#543, @tbabej)
- Fix Helm chart default configuration (#530, @kevin85421)
- helm-chart/ray-cluster: Allow setting pod lifecycle (#494, @ulfox)
CI
The changes in this section pertain to KubeRay CI, testing, and developer workflows.
- [Feature] Improve the observability of integration tests (#775, @jasoonn)
- [CI] Pin go version in CRD consistency check (#794, @DmitriGekhtman)
- [Feature] Test sample RayService YAML to catch invalid or out of date one (#731, @jasoonn)
- Replace kubectl wait command with RayClusterAddCREvent (#705, @kevin85421)
- [Feature] Test sample RayCluster YAMLs to catch invalid or out of date ones (#678, @kevin85421)
- [Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_ray_serve flaky (#650, @jasoonn)
- Configuration Test Framework Prototype (#605, @kevin85421)
- Update tests for better Mac M1 compatibility (#654, @shrekris-anyscale)
- [Bug] Update wait function in test_detached_actor (#635, @kevin85421)
- [Bug] Misuse of Docker API and misunderstanding of Ray HA cause test_detached_actor flaky (#619, @kevin85421)
- [Feature] Docker support for chart-testing (#623, @jasoonn)
- [Feature] Optimize the wait functions in E2E tests (#609, @kevin85421)
- [Feature] Running end-to-end tests on local machine (#589, @kevin85421)
- [CI]use fixed version of gofumpt (#596, @wilsonwang371)
- update test files before separating them (#591, @wilsonwang371)
- Add reminders to avoid RBAC synchronization bug (#576, @kevin85421)
- [Feature] Consistency check for RBAC (#577, @kevin85421)
- [Feature] Sync for manifests and helm chart (#564, @kevin85421)
- [Feature] Add a chart-test script to enable chart lint error reproduction on laptop (#563, @kevin85421)
- [Feature] Add helm lint check in Github Actions (#554, @kevin85421)
- [Feature] Add consistency check for types.go, CRDs, and generated API in GitHub Actions (#546, @kevin85421)
- support ray 2.0.0 in compatibility test (#508, @wilsonwang371)
KubeRay Operator deployment
The changes in this section pertain to deployment of the KubeRay Operator.
- Fix finalizer typo and re-create manifests (#631, @AlessandroPomponio)
- Change Kuberay operator Deployment strategy type to Recreate (#566, @haoxins)
- [Bug][Doc] Increase default operator resource requirements, improve docs (#727, @kevin85421)
- [Feature] Sync logs to local file (#632, @Basasuya)
- [Bug] label rayNodeType is useless (#698, @kevin85421)
- Revise sample configs, increase memory requests, update Ray versions (#761, @DmitriGekhtman)
RayCluster controller
The changes in this section pertain to the RayCluster controller sub-component of the KubeRay Operator.
- [autoscaler] Expose autoscaler container security context. (#752, @DmitriGekhtman)
- refactor: log more descriptive info from initContainer (#526, @davidxia)
- [Bug] Fail to create ingress due to the deprecation of the ingress.class annotation (#646, @kevin85421)
- [kuberay] Fix inconsistent RBAC truncation for autoscaling clusters. (#689, @DmitriGekhtman)
- [raycluster controller] Always honor maxReplicas (#662, @DmitriGekhtman)
- [Autoscaler] Pass pod name to autoscaler, add pod patch permission (#740, @DmitriGekhtman)
- [Bug] Shallow copy causes different worker configurations (#714, @kevin85421)
- Fix duplicated volume issue (#690, @wilsonwang371)
- [fix][raycluster controller] No error if head ip cannot be determined. (#701, @DmitriGekhtman)
- [Feature] Set default appProtocol for Ray head service to tcp (#668, @kevin85421)
- [Telemetry] Inject env identifying KubeRay. (#562, @DmitriGekhtman)
- fix: correctly set GPUs in rayStartParams (#497, @davidxia)
- [operator] enable bashrc before container start (#427, @Basasuya)
- [Bug] Pod reconciliation fails if worker pod name is supplied (#587, @kevin85421)
Ray Jobs (alpha)
The changes in this section pertain to the RayJob controller sub-component of the KubeRay Operator.
- [Feature] [RayJobs] Use finalizers to implement stopping a job upon cluster deletion (#735, @kevin85421)
- [ray job] support stop job after job cr is deleted in cluster selector mode (#629, @Basasuya)
- [RayJob] Fix example misconfiguration. (#602, @DmitriGekhtman)
- [operator] support clusterselector in job crd (#470, @Basasuya)
Ray Services (alpha)
The changes in this section pertain to the RayService controller sub-component of the KubeRay Operator.
- [RayService] Skip update events without change (#811, @sihanwang41)
- [RayService] Track whether Serve app is ready before switching clusters (#730, @shrekris-anyscale)
- [RayService] Compare cached hashed config before triggering update (#655, @shrekris-anyscale)
- Disable async serve handler in Ray Service cluster. (#447, @iycheng)
- [RayService] Revert "Disable async serve handler in Ray Service cluster (#447)" (#606, @shrekris-anyscale)
- add support for rayserve in apiserver (#456, @scarlet25151)
- Fix initial health check not obeying deploymentUnhealthySecondThreshold (#540, @jianyuan)
KubeRay API Server
- [Bug][apiserver] fix apiserver create rayservice missing serve port (#734, @scarlet25151)
- Support updating RayServices using the KubeRay API Server (#633, @scarlet25151)
- [api server] enable job spec server (#416, @Basasuya)
Security
- [Bug] client_golang used by KubeRay has a vulnerability (#728, @kevin85421)
Observability
- feat: update RayCluster .status.reason field with pod creation error (#639, @davidxia)
- feat: enrich RayCluster status with head IPs (#468, @davidxia)
- config/prometheus: add metrics exporter for workers (#469, @ulfox)
Documentation
- [docs] Updated Volcano integration documentation (#776, @tgaddair)
- [0.4.0 Release] Minor doc improvements (#780, @DmitriGekhtman)
- Update gcs-ft.md (#777, @wilsonwang371)
- [Feature] Refactor test framework & test kuberay-operator chart with configuration framework (#759, @kevin85421)
- fix docs: typo in README.md (#760, @davidxia)
- [APIServer][Docs] Identify API server as community-managed and optional (#753, @DmitriGekhtman)
- Add documentations for the release process of Helm charts (#723, @kevin85421)
- [docs] Fix markdown in ray services (#712, @lizzzcai)
- Cross-reference docs. (#703, @DmitriGekhtman)
- Adding example of manually setting up NGINX Ingress (#699, @jasoonn)
- [docs] State version requirement for kubectl (#702, @DmitriGekhtman)
- Remove ray-cluster.without-block.yaml (#675, @kevin85421)
- [doc] Add instructions about how to use SSL/TLS for redis connection. (#652, @iycheng)
- [Feature][Docs] AWS Application Load Balancer (ALB) support (#658, @kevin85421)
- [Feature][Doc] Explain that RBAC should be synchronized manually (#641, @kevin85421)
- [doc] Reformat README.md (#599, @rafvasq)
- [doc] Copy-Edit RayJob (#608, @rafvasq)
- [doc] VS Code IDE setup (#613, @kevin85421)
- [doc] Copy-Edit RayService (#607, @rafvasq)
- fix mkdocs URL (#600, @asm582)
- [doc] Add a tip on docker images (#586, @DmitriGekhtman)
- Update ray-operator documentation and image version in ray-cluster.heterogeneous.yaml (#585, @jasoonn)
- [Doc] Cannot build kuberay with Go 1.16 (#575, @kevin85421)
- docs: Add instructions for working with Argo CD (#535, @haoxins)
- Update Helm doc. (#531, @DmitriGekhtman)
- Failure happened when install operator with kubectl apply (#525, @kevin85421)
- fix examples: bad K8s log config causing logs to be lost (#501, @davidxia)
- Helm instructions: kubectl apply -> kubectl create (#505, @DmitriGekhtman)
- apiserver add new api docs (#498, @scarlet25151)