Highlights
- Ray Train is now in beta! If you are using Ray Train, we’d love to hear your feedback here!
- Ray Docker images for multiple CUDA versions are now provided (#19505)! You can specify a -cuXXX suffix to pick a specific version. The ray-ml:cpu images are now deprecated. The ray-ml images are only built for GPU.
- Ray Datasets now supports groupby and aggregations! See the groupby API and GroupedDataset docs for usage.
- We continue to improve Ray stability and usability on Windows. We encourage you to try it out and report feedback or issues at https://github.com/ray-project/ray/issues.
- We are launching a Ray Job Submission server + CLI & SDK clients to make it easier to submit and monitor Ray applications when you don’t want an active connection using Ray Client. This is currently in alpha, so the APIs are subject to change, but please test it out and file issues / leave feedback on GitHub & discuss.ray.io!
Ray Autoscaler
💫Enhancements:
- Graceful termination of Ray nodes prior to autoscaler scale down (#20013)
- Ray Clusters on AWS are colocated in one Availability Zone to reduce costs & latency (#19051)
Ray Client
🔨 Fixes:
- ray.put on a list of objects now returns a single object ref (#19737)
Ray Core
🎉 New Features:
- Support remote file storage for runtime_env (#20280, #19315)
- Added Ray job submission client, CLI, and REST API (#19567, #19657, #19765, #19845, #19851, #19843, #19860, #19995, #20094, #20164, #20170, #20192, #20204)
💫Enhancements:
- Garbage collection for runtime_env (#20009, #20072)
- Improved logging and error messages for runtime_env (#19897, #19888, #18893)
🔨 Fixes:
- Fix runtime_env hanging issues (#19823)
- Fix specifying runtime env in @ray.remote decorator with Ray Client (#19626)
- Threaded actor / core worker / named actor race condition fixes (#19751, #19598, #20178, #20126)
📖Documentation:
- New page “Handling Dependencies”
- New page “Ray Job Submission: Going from your laptop to production”
Ray Java
API Changes:
- Fully supported namespace APIs. (Check out the namespace for more information.) #19468 #19986 #20057
- Removed global named actor APIs and global placement group APIs. #20219 #20135
- Added timeout parameter for the Ray.get() API. #20282
Note:
- Use the Ray.getActor(name, namespace) API to get a named actor between jobs instead of Ray.getGlobalActor(name).
- Use the PlacementGroup.getPlacementGroup(name, namespace) API to get a placement group between jobs instead of PlacementGroup.getGlobalPlacementGroup(name).
Ray Datasets
🎉 New Features:
- Added groupby and aggregations (#19435, #19673, #20010, #20035, #20044, #20074)
- Support custom write paths (#19347)
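The groupby and aggregations item above follows the usual split-apply-combine pattern. As a rough illustration of those semantics only, here is a minimal pure-Python sketch. It does not use or reproduce the Ray Datasets API; the function name `group_and_aggregate` and its parameters are hypothetical, and everything runs on a single in-memory list rather than on distributed blocks.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List


def group_and_aggregate(
    rows: Iterable[Any],
    key_fn: Callable[[Any], Any],
    agg_fn: Callable[[List[Any]], Any],
) -> Dict[Any, Any]:
    """Split rows by key, then reduce each group with agg_fn.

    Hypothetical helper for illustration; not part of Ray.
    """
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    # Apply the aggregation to every group independently.
    return {key: agg_fn(members) for key, members in groups.items()}


# Group the integers 0..9 by their remainder modulo 3.
counts = group_and_aggregate(range(10), lambda x: x % 3, len)
sums = group_and_aggregate(range(10), lambda x: x % 3, sum)
```

Because each group is reduced independently, the same pattern can in principle be distributed by shuffling rows with equal keys to the same worker. See the groupby API and GroupedDataset docs for the actual calls.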
🔨 Fixes:
- Support custom CSV write options (#19378)
🏗 Architecture refactoring:
- Optimized block compaction (#19681)
Ray Workflow
🎉 New Features:
- Workflows now support events (#19239)
- Allow users to specify metadata for workflows and steps (#19372)
- Allow running a step in place if the resources match (#19928)
🔨 Fixes:
- Fix the S3 path issue (#20115)
RLlib
🏗 Architecture refactoring:
- “framework=tf2” + “eager_tracing=True” is now (almost) as fast as “framework=tf”. A check for tf2.x eager re-traces has been added making sure re-tracing does not happen outside the initial function calls. All CI learning tests (CartPole, Pendulum, FrozenLake) are now also run as framework=tf2. (#19273, #19981, #20109)
- Prepare deprecation of build_trainer/build_(tf_)?policy utility functions. Instead, use sub-classing of Trainer or Torch|TFPolicy. POCs done for PGTrainer, PPO[TF|Torch]Policy. (#20055, #20061)
- V-trace (APPO & IMPALA): Keeping the last timestep (instead of dropping it) can now optionally be switched on. The default is still to drop it, but this may be changed in a future release. (#19601)
- Upgrade to gym 0.21. (#19535)
🔨 Fixes:
- Minor bugs/issues fixes and enhancements: #19069, #19276, #19306, #19408, #19544, #19623, #19627, #19652, #19693, #19805, #19807, #19809, #19881, #19934, #19945, #20095, #20128, #20134, #20144, #20217, #20283, #20366, #20387
📖Documentation:
- RLlib main page (“RLlib in 60sec”) overhaul. (#20215, #20248, #20225, #19932, #19982)
- Major docstring cleanups in preparation for complete overhaul of API reference pages. (#19784, #19783, #19808, #19759, #19829, #19758, #19830)
- Other documentation enhancements. (#19908, #19672, #20390)
Tune
💫Enhancements:
- Refactored and improved experiment analysis (#20197, #20181)
- Refactored cloud checkpointing API/SyncConfig (#20155, #20418, #19632, #19641, #19638, #19880, #19589, #19553, #20045, #20283)
- Remove magic results (e.g. config) before calculating trial result metrics (#19583)
- Removal of tech debt (#19773, #19960, #19472, #17654)
- Improve testing (#20016, #20031, #20263, #20210, #19730)
- Various enhancements (#19496, #20211)
🔨Fixes:
- Documentation fixes (#20130, #19791)
- Tutorial fixes (#20065, #19999)
- Drop 0 value keys from PGF (PlacementGroupFactory) (#20279)
- Fix shim error message for scheduler (#19642)
- Avoid looping through _live_trials twice in _get_next_trial. (#19596)
- Clean up legacy branch in update_avail_resources. (#20071)
- Fix Train/Tune integration on Client (#20351)
Train
Ray Train is now in Beta! The beta version includes various usability improvements for distributed PyTorch training and checkpoint management, support for Ray Client, and an integration with Ray Datasets for distributed data ingest.
Check out the docs here, and the migration guide from Ray SGD to Ray Train here. If you are using Ray Train, we’d love to hear your feedback here!
🎉 New Features:
- New train.torch.prepare_model(...) and train.torch.prepare_data_loader(...) API to automatically handle preparing your PyTorch model and DataLoader for distributed training (#20254).
- Checkpoint management and support for custom checkpoint strategies (#19111).
- Easily configure what and how many checkpoints to save to disk.
- Support for Ray Client (#20123, #20351).
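To make the checkpoint management item above more concrete, the following is a minimal sketch of one possible retention policy: keep only the best N checkpoints by some score. It is an assumption-laden illustration, not Ray Train's implementation. The class `TopKCheckpoints`, its `num_to_keep` argument, and the `ckpt_*` paths are hypothetical names, and higher scores are assumed to be better.

```python
import heapq
from typing import List, Optional, Tuple


class TopKCheckpoints:
    """Keep at most `num_to_keep` checkpoints, ranked by score (higher is better).

    Hypothetical helper for illustration; not part of Ray Train.
    """

    def __init__(self, num_to_keep: int) -> None:
        if num_to_keep < 1:
            raise ValueError("num_to_keep must be at least 1")
        self.num_to_keep = num_to_keep
        # Min-heap of (score, path): the worst retained checkpoint sits at index 0.
        self._heap: List[Tuple[float, str]] = []

    def add(self, score: float, path: str) -> Optional[str]:
        """Register a checkpoint and return the path that should be deleted, if any."""
        if len(self._heap) < self.num_to_keep:
            heapq.heappush(self._heap, (score, path))
            return None
        # heappushpop inserts the new entry and pops the smallest one,
        # which is the new entry itself when it scores worst.
        _, evicted_path = heapq.heappushpop(self._heap, (score, path))
        return evicted_path

    def retained(self) -> List[str]:
        """Paths of the retained checkpoints, best score first."""
        return [path for _, path in sorted(self._heap, reverse=True)]


tracker = TopKCheckpoints(num_to_keep=2)
scores = [0.61, 0.74, 0.58, 0.80]
evicted = [tracker.add(score, f"ckpt_{i}") for i, score in enumerate(scores)]
```

With these inputs the third checkpoint is discarded as soon as it is reported, and the first one is evicted once a better fourth checkpoint arrives. A real strategy would also delete the evicted files from disk and might rank by a lower-is-better metric such as loss.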
💫Enhancements:
- Simplify workflow for training with a single worker (#19814).
- Ray Placement Groups are used for scheduling the training workers (#20091). The PACK strategy is used by default but can be changed by setting the TRAIN_ENABLE_WORKER_SPREAD environment variable.
- Automatically unwrap Torch DDP model and convert to CPU when saving a model as checkpoint (#20333).
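The difference between the default PACK placement mentioned above and a spread placement can be sketched with a toy model. This is a simplified illustration and not Ray's scheduler: the function `place_workers`, the node names, and the one-slot-per-worker resource model are all assumptions made for the example.

```python
from typing import Dict, List


def place_workers(num_workers: int, node_slots: Dict[str, int], strategy: str) -> List[str]:
    """Assign each worker to a node and return the node name per worker.

    Hypothetical toy model for illustration; not part of Ray.
    """
    free = dict(node_slots)  # copy so the caller's mapping is left untouched
    if num_workers > sum(free.values()):
        raise ValueError("not enough slots for all workers")
    placement: List[str] = []
    if strategy == "PACK":
        # Fill one node completely before moving on to the next.
        for node in free:
            while free[node] > 0 and len(placement) < num_workers:
                free[node] -= 1
                placement.append(node)
    elif strategy == "SPREAD":
        # Round-robin over the nodes that still have a free slot.
        while len(placement) < num_workers:
            for node in free:
                if free[node] > 0 and len(placement) < num_workers:
                    free[node] -= 1
                    placement.append(node)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return placement


nodes = {"node-a": 4, "node-b": 4}
packed = place_workers(4, nodes, "PACK")
spread = place_workers(4, nodes, "SPREAD")
```

In this toy model, packing puts all four workers on node-a, which tends to reduce cross-node communication, while spreading alternates between the two nodes, which limits how many workers are lost if one node fails.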
🔨Fixes:
📖Documentation:
Serve
We would love to hear from you! Fill out the Ray Serve survey here.
🎉 New Features:
- New checkpoint_path configuration allows Serve to save its internal state to external storage (disk, S3, and GCS) and recover upon failure. (#19166, #19998, #20104)
- Replica autoscaling is ready for testing out! (#19559, #19520)
- Native Pipeline API for model composition is ready for testing as well!
🔨Fixes:
- Serve deployment functions or classes can take no parameters (#19708)
- Replica slow start message is improved. You can now see whether it is slow to allocate resources or slow to run constructor. (#19431)
- pip install ray[serve] will now install ray[default] as well. (#19570)
🏗 Architecture refactoring:
- The terms “backend” and “endpoint” are officially deprecated in favor of “deployment”. (#20229, #20085, #20040, #20020, #19997, #19947, #19923, #19798)
- Progress towards Java API compatibility (#19463).
Dashboard
- Ray Dashboard is now enabled on Windows! (#19575)
Thanks
Many thanks to all those who contributed to this release!
@krfricke, @stefanbschneider, @ericl, @nikitavemuri, @qicosmos, @worldveil, @triciasfu, @AmeerHajAli, @javi-redondo, @architkulkarni, @pdames, @clay4444, @mGalarnyk, @liuyang-my, @matthewdeng, @suquark, @rkooo567, @mwtian, @chenk008, @dependabot[bot], @iycheng, @jiaodong, @scv119, @oscarknagg, @Rohan138, @stephanie-wang, @Zyiqin-Miranda, @ijrsvt, @roireshef, @tkaymak, @simon-mo, @ashione, @jovany-wang, @zenoengine, @tgaddair, @11rohans, @amogkam, @zhisbug, @lchu-ibm, @shrekris-anyscale, @pcmoritz, @yiranwang52, @mattip, @sven1977, @Yard1, @DmitriGekhtman, @ckw017, @WangTaoTheTonic, @wuisawesome, @kcpevey, @kfstorm, @rhamnett, @renos, @TeoZosa, @SongGuyang, @clarkzinzow, @avnishn, @iasoon, @gjoliver, @jjyao, @xwjiang2010, @dmatrix, @edoakes, @czgdp1807, @heng2j, @sungho-joo, @lixin-wei