Release Highlights
- This release contains fixes for the Ray Dashboard. Additional context can be found here: https://www.anyscale.com/blog/update-on-ray-cves-cve-2023-6019-cve-2023-6020-cve-2023-6021-cve-2023-48022-cve-2023-48023
- Ray Train now has improved support for spot node preemption, allowing it to handle node preemption failures differently from application errors.
- Ray is now compatible with Pydantic versions <2.0.0 and >=2.5.0, addressing a piece of user feedback we’ve consistently received.
- The Ray Dashboard now has a page for Ray Data to monitor real-time execution metrics.
- Streaming generator is now officially a public API (#41436, #38784). Streaming generator allows writing streaming applications easily on top of Ray via the Python generator API and has been used by Ray Serve and Ray Data for several releases. See the documentation for details.
- We’ve added experimental support for new accelerators: Intel GPU (#38553), Intel Gaudi Accelerators (#40561), and Huawei Ascend NPU (#41256).
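The Pydantic compatibility range in the highlights above (`<2.0.0` or `>=2.5.0`) leaves a gap of unsupported 2.x versions. The following is a minimal sketch, not part of Ray, of how such a range check could be written with the standard library only. The helper names `parse_version` and `pydantic_supported` are hypothetical, and pre-release suffixes are deliberately ignored.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse the leading 'major.minor.patch' numbers of a version string.

    Hypothetical helper: non-numeric suffixes (e.g. 'b1', 'rc2') are ignored,
    so pre-releases are treated like the corresponding final release.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return (parts[0], parts[1], parts[2])


def pydantic_supported(version: str) -> bool:
    """Return True if the version is in the range stated in these notes:
    <2.0.0 or >=2.5.0."""
    v = parse_version(version)
    return v < (2, 0, 0) or v >= (2, 5, 0)


if __name__ == "__main__":
    for candidate in ("1.10.13", "2.0.0", "2.4.2", "2.5.0"):
        print(candidate, pydantic_supported(candidate))
```

In practice, the installed version string could be obtained with `importlib.metadata.version("pydantic")` and passed to the check.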
Ray Libraries
Ray Data
🎉 New Features:
- Add the dashboard for Ray Data to monitor real-time execution metrics and log file for debugging (https://docs.ray.io/en/master/data/monitoring-your-workload.html).
- Introduce `concurrency` argument to replace `ComputeStrategy` in map-like APIs (#41461)
- Allow task failures during execution (#41226)
- Support PyArrow 14.0.1 (#41036)
- Add new API for reading and writing Datasource (#40296)
- Enable group-by over multiple keys in datasets (#37832)
- Add support for multiple group keys in `map_groups` (#40778)
💫 Enhancements:
- Optimize `OpState.outqueue_num_blocks` (#41748)
- Improve stall detection for `StreamingOutputsBackpressurePolicy` (#41637)
- Enable read-only Datasets to be executed on new execution backend (#41466, #41597)
- Inherit block size from downstream ops (#41019)
- Use runtime object memory for scheduling (#41383)
- Add retries to file writes (#41263)
- Make range datasource streaming (#41302)
- Test core performance metrics (#40757)
- Allow `ConcurrencyCapBackpressurePolicy._cap_multiplier` to be set to 1.0 (#41222)
- Create `StatsManager` to manage `_StatsActor` remote calls (#40913)
- Expose `max_retry_cnt` parameter for `BigQuery` Write (#41163)
- Add rows outputted to data metrics (#40280)
- Add fault tolerance to remote tasks (#41084)
- Add operator-level dropdown to ray data overview (#40981)
- Avoid slicing too-small blocks (#40840)
- Ray Data jobs detail table (#40756)
- Update default shuffle block size to 1GB (#40839)
- Log progress bar to data logs (#40814)
- Operator level metrics (#40805)
🔨 Fixes:
- Partial fix for `Dataset.context` not being sealed after creation (#41569)
- Fix the issue that `DataContext` is not propagated when using `streaming_split` (#41473)
- Fix Parquet partition filter bug (#40947)
- Fix split read output blocks (#41070)
- Fix `BigQueryDatasource` fault tolerance bugs (#40986)
📖 Documentation:
- Add example of how to read and write custom file types (#41785)
- Fix `ray.data.read_databricks_tables` doc (#41366)
- Add `read_json` docs example for setting PyArrow block size when reading large files (#40533)
- Add `AllToAllAPI` to dataset methods (#40842)
Ray Train
🎉 New Features:
- Support reading `Result` from cloud storage (#40622)
💫 Enhancements:
- Sort local Train workers by GPU ID (#40953)
- Improve logging for Train worker scheduling information (#40536)
- Load the latest unflattened metrics with `Result.from_path` (#40684)
- Skip incrementing failure counter on preemption node died failures (#41285)
- Update TensorFlow `ReportCheckpointCallback` to delete temporary directory (#41033)
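The preemption change above means a node lost to spot preemption no longer consumes the run's failure budget, while application errors still do. The following is a minimal sketch of that accounting idea, assuming a simple retry budget. It is not Ray Train's implementation, and `FailureKind` and `FailureTracker` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    """Hypothetical classification of why a training run stopped."""
    APPLICATION_ERROR = "application_error"
    NODE_PREEMPTED = "node_preempted"


@dataclass
class FailureTracker:
    """Hypothetical retry budget that ignores preemption failures."""
    max_failures: int
    num_failures: int = 0

    def record(self, kind: FailureKind) -> bool:
        """Record a failure and return True if the run may be retried."""
        # Only application errors count against the budget; a preempted
        # node is treated as an infrastructure event, not a user bug.
        if kind is not FailureKind.NODE_PREEMPTED:
            self.num_failures += 1
        return self.num_failures <= self.max_failures


if __name__ == "__main__":
    tracker = FailureTracker(max_failures=1)
    print(tracker.record(FailureKind.NODE_PREEMPTED))     # budget untouched
    print(tracker.record(FailureKind.APPLICATION_ERROR))  # uses the budget
    print(tracker.record(FailureKind.APPLICATION_ERROR))  # budget exceeded
```

With this kind of accounting, a long-running job on spot instances can survive any number of preemptions and still stop promptly when the training code itself keeps failing.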
🔨 Fixes:
- Update config dataclass repr to check against None (#40851)
- Add a barrier in Lightning `RayTrainReportCallback` to ensure synchronous reporting (#40875)
- Restore Tuner and `Result`s properly from moved storage path (#40647)
📖 Documentation:
- Improve torch, lightning quickstarts and migration guides + fix torch restoration example (#41843)
- Clarify error message when trying to use local storage for multi-node distributed training and checkpointing (#41844)
- Copy edits and adding links to docstrings (#39617)
- Fix the missing ray module import in PyTorch Guide (#41300)
- Fix typo in lightning_mnist_example.ipynb (#40577)
- Fix typo in deepspeed.rst (#40320)
🏗 Architecture refactoring:
- Remove Legacy Trainers (#41276)
Ray Tune
🎉 New Features:
- Support reading `Result` from cloud storage (#40622)
💫 Enhancements:
- Skip incrementing failure counter on preemption node died failures (#41285)
🔨 Fixes:
- Restore Tuner and `Result`s properly from moved storage path (#40647)
📖 Documentation:
- Remove low value Tune examples and references to them (#41348)
- Clarify when to use `MLflowLoggerCallback` and `setup_mlflow` (#37854)
🏗 Architecture refactoring:
- Delete legacy `TuneClient`/`TuneServer` APIs (#41469)
- Delete legacy `Searcher`s (#41414)
- Delete legacy persistence utilities (`air.remote_storage`, etc.) (#40207)
Ray Serve
🎉 New Features:
- Introduce logging config so that users can set different logging parameters for different applications & deployments.
- Added a gRPC context object to gRPC deployments so that users can set a custom status code and details to send back to the client.
- Introduce a runtime environment feature that allows running applications in different containers with different images. This feature is experimental and a new guide can be found in the Serve docs.
💫 Enhancements:
- Explicitly handle gRPC proxy task cancellation when the client dropped a request to not waste compute resources.
- Enable async `__del__` in the deployment to execute custom clean up steps.
- Make Ray Serve compatible with Pydantic versions <2.0.0 and >=2.5.0.
🔨 Fixes:
- Fixed gRPC proxy streaming request latency metrics to include the entire lifecycle of the request, including the time to consume the generator.
- Fixed gRPC proxy timeout request status from CANCELLED to DEADLINE_EXCEEDED.
- Fixed Serve shutdown spamming log files with a log line for each event loop; it now logs only once on shutdown.
- Fixed an issue where dropping a request during batching killed the batch loop, so that no future requests were processed.
- Updated replica log filenames to only include POSIX-compliant characters (removed the “#” character).
- Replicas will now be gracefully shut down after being marked unhealthy due to health check failures instead of being force killed.
  - The previous force-kill behavior can be restored by setting the environment variable RAY_SERVE_FORCE_STOP_UNHEALTHY_REPLICAS=1, but this option is planned to be removed in the near future. If you rely on this behavior, please file an issue on GitHub.
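The replica log filename fix above restricts names to POSIX-compliant characters. The following is a minimal sketch of that kind of sanitization, assuming the POSIX portable filename character set (`A-Z`, `a-z`, `0-9`, `.`, `_`, `-`). The function name, the example filename, and the choice of `_` as the replacement character are assumptions for illustration and do not reflect how Ray Serve names its log files.

```python
import re

# Anything outside the POSIX portable filename character set.
_NON_PORTABLE = re.compile(r"[^A-Za-z0-9._-]")


def to_portable_filename(name: str, replacement: str = "_") -> str:
    """Replace every non-portable character in a filename (hypothetical helper)."""
    return _NON_PORTABLE.sub(replacement, name)


if __name__ == "__main__":
    # Hypothetical log filename containing the '#' character.
    print(to_portable_filename("replica_app#deployment#abc.log"))
```

Names that are already portable pass through unchanged, so the helper is safe to apply unconditionally.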
RLlib
🎉 New Features:
- New API stack (in progress):
  - New `MultiAgentEpisode` class introduced. Basis for upcoming multi-agent EnvRunner, which will replace RolloutWorker APIs. (#40263, #40799)
  - PPO runs with new `SingleAgentEnvRunner` (w/o Policy/RolloutWorker APIs). CI learning tests added. (#39732, #41074, #41075)
  - PPO reverted to use the old API stack by default, for now, pending feature-completion of the new API stack (incl. multi-agent, RNN support, new EnvRunners, etc.). (#40706)
- Old API stack:
💫 Enhancements:
🔨 Fixes:
- Restoring from a checkpoint from an older wheel (where `AlgorithmConfig.rl_module_spec` was NOT a `@property` yet) breaks when trying to load from this checkpoint. (#41157)
- SampleBatch slicing crashes when using tf + SEQ_LENS + zero-padding. (#40905)
- Other fixes: #39978, #40788, #41168, #41204
📖 Documentation:
- Updated codeblocks in RLlib. (#37271)
Ray Core and Ray Clusters
Ray Core
🎉 New Features:
- Streaming generator is now officially a public API (#41436, #38784). Streaming generator allows writing streaming applications easily on top of Ray via the Python generator API and has been used by Ray Serve and Ray Data for several releases. See the documentation for details.
  - As part of the change, `num_returns="dynamic"` is planned to be deprecated, and its return type is changed from `ObjectRefGenerator` to `DynamicObjectRefGenerator`.
- Add experimental accelerator support for new hardware.
- Add the initial support to run MPI-based code on top of Ray (#40917, #41349)
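As a rough illustration of the streaming generator API mentioned above, the sketch below wraps a plain Python generator as a Ray task and consumes its results one at a time. It assumes Ray is installed and that `num_returns="streaming"` is accepted by `ray.remote`; the function names `stream_squares` and `collect_with_ray` are hypothetical. Ray is imported lazily so that the generator itself remains usable without it.

```python
from typing import Iterator, List


def stream_squares(n: int) -> Iterator[int]:
    """Plain Python generator; the same function body can run as a Ray task."""
    for i in range(n):
        yield i * i


def collect_with_ray(n: int) -> List[int]:
    """Run the generator as a streaming Ray task and gather its outputs.

    Assumption: requires a working Ray installation.
    """
    import ray  # lazy import, only needed for the remote variant

    # Wrap the generator function as a remote task with streaming returns.
    remote_stream = ray.remote(num_returns="streaming")(stream_squares)

    results = []
    # Iterating the returned generator yields one ObjectRef per yielded
    # value, as each value becomes available.
    for ref in remote_stream.remote(n):
        results.append(ray.get(ref))
    return results


if __name__ == "__main__":
    # Local run without Ray, for comparison.
    print(list(stream_squares(4)))
```

The point of the streaming form is that the caller can start processing early results while the task is still producing later ones, instead of waiting for the whole task to finish.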
💫 Enhancements:
- Optimize next/anext performance for streaming generator (#41270)
- Make the number of connections and thread number of the object manager client tunable. (#41421)
- Add `__ray_call__` default actor method (#41534)
🔨 Fixes:
- Fix NullPointerException caused by an empty raylet ID when getting actor info in the Java worker (#40560)
- Fix a bug where SIGTERM sent to worker processes is ignored (#40210)
- Fix mmap file leak. (#40370)
- Fix the lifetime issue in Plasma server client releasing object. (#40809)
- Upgrade grpc from 1.50.2 to 1.57.1 to include security fixes (#39090)
- Fix the bug where two head nodes are shown by `ray list nodes` (#40838)
- Fix the crash when the GCS address is not valid. (#41253)
- Fix the issue of unexpectedly high socket usage in Ray Core worker processes. (#41121)
- Make worker_process_setup_hook work with strings instead of Python functions (#41479)
Ray Clusters
💫 Enhancements:
- Stability improvements for the vSphere cluster launcher
- Better CLI output for cluster launcher
🔨 Fixes:
- Fixed `run_init` for TPU command runner
📖 Documentation:
- Added missing steps and simplified YAML in top-level clusters quickstart
- Clarify that job entrypoints run on the head node by default and how to override it
Dashboard
💫 Enhancements:
- Improvements to the Ray Data Dashboard
  - Added Ray Data-specific overview on jobs page, including a table view with Dataset-level metrics
  - Added operator-level metrics granularity to drill down on Dataset operators
  - Added additional metrics for monitoring iteration over Datasets
Docs
🎉 New Features:
- Updated to Sphinx version 7.1.2. Previously, the docs build used Sphinx 4.3.2. Upgrading to a recent version provides a more modern user experience while fixing many long standing issues. Let us know how you like the upgrade or any other docs issues on your mind, on the Ray Slack #docs channel.
Thanks
Many thanks to all those who contributed to this release!
@justinvyu, @zcin, @avnishn, @jonathan-anyscale, @shrekris-anyscale, @LeonLuttenberger, @c21, @JingChen23, @liuyang-my, @ahmed-mahran, @huchen2021, @raulchen, @scottjlee, @jiwq, @z4y1b2, @jjyao, @JoshTanke, @marxav, @ArturNiederfahrenhorst, @SongGuyang, @jerome-habana, @rickyyx, @rynewang, @batuhanfaik, @can-anyscale, @allenwang28, @wingkitlee0, @angelinalg, @peytondmurray, @rueian, @KamenShah, @stephanie-wang, @bryanjuho, @sihanwang41, @ericl, @sofianhnaide, @RaffaGonzo, @xychu, @simonsays1980, @pcmoritz, @aslonnie, @WeichenXu123, @architkulkarni, @matthew29tang, @larrylian, @iycheng, @hongchaodeng, @rudeigerc, @rkooo567, @robertnishihara, @alanwguo, @emmyscode, @kevin85421, @alexeykudinkin, @michaelhly, @ijrsvt, @ArkAung, @mattip, @harborn, @sven1977, @liuxsh9, @woshiyyya, @hahahannes, @GeneDer, @vitsai, @Zandew, @evalaiyc98, @edoakes, @matthewdeng, @bveeramani