Highlights
- Ray AI Runtime (AIR), an open-source toolkit for building end-to-end ML applications on Ray, is now in Alpha. AIR is an effort to unify the experience of using different Ray libraries (Ray Data, Train, Tune, Serve, RLlib). You can find more information in the docs or in the public RFC.
- Getting involved with Ray AIR. We’ll be holding office hours, development sprints, and other activities as we get closer to the Ray AIR Beta/GA release. Want to join us? Fill out this short form!
- Ray usage data collection is now off by default. If you have any questions or concerns, please comment on the RFC.
- New algorithms added to RLlib: SlateQ & Bandits (for recommender-system use cases) and AlphaStar (multi-agent, multi-GPU, with league-based self-play).
- Ray Datasets: new lazy execution model with automatic task fusion and memory-optimizing move semantics; first-class support for Pandas DataFrame blocks; efficient random access datasets.
Ray Autoscaler
🎉 New Features
💫 Enhancements
- Improved documentation and standards around built-in autoscaler node providers. (#22236, #22237)
- Improved KubeRay support (#22987, #22847, #22348, #22188)
- Remove Redis requirement (#22083)
🔨 Fixes
- No longer print infeasible warnings for internal placement group resources. Placement groups which cannot be satisfied by the autoscaler still trigger warnings. (#22235)
- Default AMIs per AWS region are updated/fixed. (#22506)
- GCP node termination updated (#23101)
- Retry legacy k8s operator on monitor failure (#22792)
- Cap min and max workers for manually managed on-prem clusters (#21710)
- Fix initialization artifacts (#22570)
- Ensure initial scaleup with high upscaling_speed isn't limited. (#21953)
Ray Client
🎉 New Features:
- ray.init now has a consistent return value in client mode and driver mode (#21355); see the sketch below
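A minimal sketch of how the returned context might be used; the `dashboard_url` attribute and the context-manager behavior are assumptions, not confirmed by the notes above.

```python
import ray

# Hedged sketch: ray.init() now returns the same kind of context object in
# both driver mode and Ray Client mode ("ray://...").
ctx = ray.init()          # or ray.init("ray://<head-node>:10001") for client mode
print(ctx.dashboard_url)  # assumed attribute; the context may also support `with`
ray.shutdown()
```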
💫Enhancements:
🔨 Fixes:
- Fix Ray Client object ref being released in the wrong context (#22025)
Ray Core
🎉 New Features
- RuntimeEnv:
- Support setting timeout for runtime_env setup. (#23082)
- Support setting pip_check and pip_version for runtime_env. (#22826, #23306)
- env_vars now take effect when the pip install command is executed (temporarily not supported for conda). (#22730) See the sketch after this list.
- Support strongly-typed API ray.runtime.RuntimeEnv to define runtime env. (#22522)
- Introduce virtualenv to isolate the pip type runtime env. (#21801,#22309)
- The raylet now shares fate with the dashboard agent, and the dashboard agent stays alive when it catches port conflicts. (#22382, #23024)
- Enable dashboard in the minimal ray installation (#21896)
- Add task and object reconstruction status to the ray memory CLI tool (#22317)
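A minimal sketch of the dict-based runtime_env used with the features above; the exact option names for pip_check, pip_version, and setup timeouts may differ by Ray version, so treat this as illustrative.

```python
import ray

# Illustrative dict-based runtime_env; exact keys for pip_check / pip_version
# and setup timeouts may vary by Ray version (check the docs for your release).
runtime_env = {
    # pip requirements are installed into an isolated virtualenv per this release
    "pip": ["requests==2.26.0"],
    # env_vars now also take effect while the pip install command runs
    "env_vars": {"MY_FLAG": "1"},
}

ray.init(runtime_env=runtime_env)

@ray.remote
def probe():
    import os
    import requests
    return requests.__version__, os.environ.get("MY_FLAG")

print(ray.get(probe.remote()))  # e.g. ("2.26.0", "1")
```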
🔨 Fixes
- Report only memory usage of pinned object copies to improve scaledown. (#22020)
- Scheduler:
- Object store:
- Improve ray stop behavior (#22159)
- Avoid warning when receiving too many logs from a different job (#22102)
- GCS resource manager bug fixes and cleanup. (#22462, #22459)
- Release GIL when running `parallel_memcopy()` / `memcpy()` during serializations. (#22492)
- Fix registering serializer before initializing Ray. (#23031)
🏗 Architecture refactoring
- Ray distributed scheduler refactoring: (#21927, #21992, #22160, #22359, #22722, #22817, #22880, #22893, #22885, #22597, #22857, #23124)
- Removed support for bootstrapping with Redis.
Ray Data Processing
🎉 New Features
- Big Performance and Stability Improvements:
- Add lazy execution mode with automatic stage fusion and optimized memory reclamation via block move semantics (#22233, #22374, #22373, #22476)
- Support for random access datasets, providing efficient random access to rows via binary search (#22749)
- Add automatic round-robin load balancing for reading and shuffle reduce tasks, obviating the need for the `_spread_resource_prefix` hack (#21303)
- More Efficient Tabular Data Wrangling:
- Groupby + Aggregations Improvements:
- Improved Dataset Windowing:
- Better Text I/O:
- New Operations:
- Add `add_column()` utility for adding derived columns (#21967); see the sketch after this list
- Add
- Support for metadata provider callback for read APIs (#22896)
- Support configuring autoscaling actor pool size (#22574)
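A small sketch of two of the new operations, `add_column()` and random access datasets. The `to_random_access_dataset()` call and its keyword arguments are assumptions based on the feature description above, so check the Datasets API docs for your release.

```python
import pandas as pd
import ray

ds = ray.data.from_pandas(pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]}))

# add_column(): derive a new column from existing ones (per #21967).
ds = ds.add_column("value_x2", lambda df: df["value"] * 2)
print(ds.take(3))

# Random access by key via binary search (per #22749); method name and
# arguments below are assumptions -- consult the Datasets docs.
rad = ds.to_random_access_dataset(key="id", num_workers=1)
print(ray.get(rad.get_async(2)))
```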
🔨 Fixes
- Force lazy datasource materialization in order to respect `DatasetPipeline` stage boundaries (#21970)
- Simplify lifetime of designated block owner actor, and don't create it if dynamic block splitting is disabled (#22007)
- Respect 0 CPU resource request when using manual resource-based load balancing (#22017)
- Remove batch format ambiguity by always converting Arrow batches to Pandas when `batch_format="native"` is given (#21566)
- Fix leaked stats actor handle due to closure capture reference counting bug (#22156)
- Fix boolean tensor column representation and slicing (#22323)
- Fix unhandled empty block edge case in shuffle (#22367)
- Fix unserializable Arrow Partitioning spec (#22477)
- Fix incorrect `iter_epochs()` batch format (#22550)
- Fix infinite `iter_epochs()` loop on unconsumed epochs (#22572)
- Fix infinite hang on `split()` when `num_shards < num_rows` (#22559)
- Patch Parquet file fragment serialization to prevent metadata fetching (#22665)
- Don’t reuse task workers for actors or GPU tasks (#22482)
- Pin pipeline executor actors to driver node to allow for lineage-based fault tolerance for pipelines (#22715)
- Always use non-empty blocks to determine schema (#22834)
- API fix bash (#22886)
- Make `label_column` optional for `to_tf()` so it can be used for inference (#22916)
- Fix `schema()` for `DatasetPipeline`s (#23032)
- Fix equalized split when `num_splits == num_blocks` (#23191)
💫 Enhancements
- Optimize Parquet metadata serialization via batching (#21963)
- Optimize metadata read/write for Ray Client (#21939)
- Add sanity checks for memory utilization (#22642)
🏗 Architecture refactoring
- Use threadpool to submit `DatasetPipeline` stages (#22912)
RLlib
🎉 New Features
- New “AlphaStar” algorithm: A parallelized, multi-agent/multi-GPU learning algorithm, implementing league-based self-play. (#21356, #21649)
- SlateQ algorithm has been re-tested, upgraded (multi-GPU capable, TensorFlow version), and bug-fixed (added to weekly learning tests). (#22389, #23276, #22544, #22543, #23168, #21827, #22738)
- Bandit algorithms: Moved into the `agents` folder as first-class citizens, TensorFlow version, unified w/ other agents' APIs. (#22821, #22028, #22427, #22465, #21949, #21773, #21932, #22421)
- ReplayBuffer API (in progress): Allow users to customize and configure their own replay buffers and use these inside custom or built-in algorithms. (#22114, #22390, #21808)
- Datasets support for RLlib: Dataset Reader/Writer and documentation. (#21808, #22239, #21948)
🔨 Fixes
- Fixed memory leak in SimpleReplayBuffer. (#22678)
- Fixed Unity3D built-in examples: Action bounds from -inf/inf to -1.0/1.0. (#22247)
- Various bug fixes. (#22350, #22245, #22171, #21697, #21855, #22076, #22590, #22587, #22657, #22428, #23063, #22619, #22731, #22534, #22074, #22078, #22641, #22684, #22398, #21685)
🏗 Architecture refactoring
- A3C: Moved into the new `training_iteration` API (from the `execution_plan` API), leading to a ~2.7x performance increase on an Atari + CNN + LSTM benchmark. (#22126, #22316)
- Make `multiagent->policies_to_train` more flexible via a callable option (alternative to providing a list of policy IDs). (#20735) See the sketch after this list.
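A hedged sketch of the callable `policies_to_train` option in a multi-agent config; the exact callable signature may vary slightly between RLlib versions, and the policy names are illustrative.

```python
from ray.rllib.policy.policy import PolicySpec

config = {
    "multiagent": {
        "policies": {
            "learner": PolicySpec(),
            "frozen_opponent": PolicySpec(),
        },
        # Map every agent to some policy as usual (illustrative mapping).
        "policy_mapping_fn": lambda agent_id, episode, worker, **kwargs: "learner",
        # New in this release: instead of a list like ["learner"], a callable
        # can decide dynamically which policies should be updated.
        "policies_to_train": lambda policy_id, batch=None: policy_id == "learner",
    },
}
```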
💫Enhancements:
- Env pre-checking module now active by default. (#22191)
- Callbacks: Added `on_sub_environment_created` and `on_trainer_init` callback options. (#21893, #22493)
- RecSim environment wrappers: Ability to use Google's RecSim for recommender systems more easily w/ RLlib algorithms (3 RLlib-ready example environments). (#22028, #21773, #22211)
- MARWIL loss function enhancement (exploratory term for stddev). (#21493)
📖Documentation:
- Docs enhancements: Setup-dev instructions; Ray datasets integration. (#22239)
- Other doc enhancements and fixes. (#23160, #23226, #22496, #22489, #22380)
Ray Workflow
🎉 New Features:
- Support skipping checkpointing.
🔨 Fixes:
- Fix an issue where the event loop is not set.
Tune
🎉 New Features:
- Expose new checkpoint interface to users (#22741)
💫Enhancements:
- Better error message for gRPC resource exhausted errors. (#22806)
- Add CV support for XGB/LGBM Tune callbacks (#22882)
- Make sure tune.run can run inside a worker thread (#22566)
- Add Trainable.postprocess_checkpoint (#22973)
- Trainables will now know TUNE_ORIG_WORKING_DIR (#22803)
- Retry cloud sync up/down/delete on failure (#22029)
- Support functools.partial names and treat them as functions in the registry (#21518); see the sketch after this list
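A brief sketch of passing a functools.partial as a function trainable (per #21518); the training function and its extra `multiplier` argument are made up for illustration.

```python
import functools
from ray import tune

def train_fn(config, multiplier=1):
    # Function trainables report metrics via tune.report().
    for step in range(3):
        tune.report(score=config["x"] * multiplier + step)

# functools.partial trainables are now named and treated like plain functions
# in the registry, so they show up sensibly in results and logs.
analysis = tune.run(
    functools.partial(train_fn, multiplier=10),
    config={"x": tune.grid_search([1, 2])},
)
print(analysis.get_best_config(metric="score", mode="max"))
```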
🔨Fixes:
- Cleanup incorrectly formatted strings (Part 2: Tune) (#23129)
- Fix error handling for the fail_fast case. (#22982)
- Remove Trainable.update_resources (#22471)
- Bump flaml from 0.6.7 to 0.9.7 in /python/requirements/ml (#22071)
- Fix analysis without registered trainable (#21475)
- Update Lightning examples to support PTL 1.5 (#20562)
- Fix WandbTrainableMixin config for rllib trainables (#22063)
- [wandb] Use resume=False by default (#21892)
🏗 Refactoring:
- Move resource updater out of trial executor (#23178)
- Preparation for deadline schedulers (#22006)
- Single wait refactor. (#21852)
📖Documentation:
- Tune docs overhaul (first part) (#22112)
- Tune overhaul part II (#22656)
- Note TPESampler performance issues in docs (#22545)
- hyperopt notebook (#22315)
Train
🎉 New Features
- Integration with PyTorch Profiler. Easily enable the PyTorch profiler with Ray Train to profile training and visualize stats in TensorBoard (#22345). See the sketch after this list.
- Automatic pipelining of host-to-device transfer. While training is happening on one batch of data, the next batch is concurrently being moved from CPU to GPU (#22716, #22974).
- Automatic Mixed Precision. Easily enable PyTorch automatic mixed precision during training (#22227).
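A hedged sketch of a basic Ray Train loop using the Trainer API current in this release; the model, data, and hyperparameters are illustrative, and enabling the profiler or mixed precision involves additional Train utilities not shown here.

```python
import torch
from ray import train
from ray.train import Trainer
from ray.train.torch import prepare_data_loader, prepare_model

def train_func():
    model = prepare_model(torch.nn.Linear(4, 1))
    dataset = torch.utils.data.TensorDataset(torch.randn(64, 4), torch.randn(64, 1))
    # prepare_data_loader adds distributed sampling; per this release,
    # host-to-device transfer for the produced batches can be pipelined.
    loader = prepare_data_loader(torch.utils.data.DataLoader(dataset, batch_size=16))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    for epoch in range(2):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        train.report(epoch=epoch, loss=loss.item())

trainer = Trainer(backend="torch", num_workers=2)
trainer.start()
trainer.run(train_func)
trainer.shutdown()
```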
💫 Enhancements
- Add utility function to enable reproducibility for PyTorch training (#22851)
- Add initial support for metrics aggregation (#22099)
- Add support for `trainer.best_checkpoint` and `Trainer.load_checkpoint_path`. You can now directly access the best in-memory checkpoint, or load an arbitrary checkpoint path into memory. (#22306)
🔨 Fixes
- Add a utility function to turn off TF autosharding (#21887)
- Fix fault tolerance for Tensorflow training (#22508)
- Train utility methods (`train.report()`, etc.) can now be called outside of a Train session (#21969)
- Fix accuracy calculation for CIFAR example (#22292)
- Better error message for placement group time out (#22845)
📖 Documentation
- Update docs for ray.train.torch import (#22555)
- Clarify shuffle documentation in `prepare_data_loader` (#22876)
- Denote `train.torch.get_device` as a Public API (#22024)
- Minor fixes on Ray Train user guide doc (#22379)
Serve
🎉 New Features
- Deployment Graph API is now in alpha. It provides a way to build, test, and deploy complex inference graphs composed of many deployments. (#23177, #23252, #23301, #22840, #22710, #22878, #23208, #23290, #23256, #23324, #23289, #23285, #22473, #23125, #23210)
- New experimental REST API and CLI for creating and managing deployments. (#22839, #22257, #23198, #23027, #22039, #22547, #22578, #22611, #22648, #22714, #22805, #22760, #22917, #23059, #23195, #23265, #23157, #22706, #23017, #23026, #23215)
- New sets of HTTP adapters making it easy to build simple applications, as well as Ray AI Runtime model wrappers in alpha. (#22913, #22914, #22915, #22995)
- New `health_check` API for end-to-end, user-provided health checks. (#22178, #22121, #22297) See the sketch after this list.
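A hedged sketch of the user-provided health check: Serve periodically calls a `check_health()` method on each replica and replaces replicas that raise. The deployment option names and the failure condition are assumptions for illustration.

```python
from ray import serve

# Option names below (health_check_period_s / health_check_timeout_s) are
# assumed; check the Serve docs for your release.
@serve.deployment(health_check_period_s=10, health_check_timeout_s=30)
class ModelServer:
    def __init__(self):
        self.db_connected = True  # illustrative dependency state

    def check_health(self):
        # Raising marks this replica unhealthy so Serve can replace it.
        if not self.db_connected:
            raise RuntimeError("lost connection to the feature store")

    def __call__(self, request):
        return "ok"

serve.start()
ModelServer.deploy()
```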
🔨 Fixes
- The autoscaling algorithm will now relinquish most idle nodes when scaling down (#22669)
- Serve can now manage Java replicas (#22628)
- Added a hands-on self-contained MLflow and Ray Serve deployment example (#22192)
- Added `root_path` setting to `http_options` (#21090)
- Remove `shard_key`, `http_method`, and `http_headers` in `ServeHandle` (#21590)
Dashboard
🔨Fixes:
- Update CPU and memory reporting in Kubernetes. (#21688)
Thanks
Many thanks to all those who contributed to this release!
@edoakes, @pcmoritz, @jiaodong, @iycheng, @krfricke, @smorad, @kfstorm, @jjyyxx, @rodrigodelazcano, @scv119, @dmatrix, @avnishn, @fyrestone, @clarkzinzow, @wumuzi520, @gramhagen, @XuehaiPan, @iasoon, @birgerbr, @n30111, @tbabej, @Zyiqin-Miranda, @suquark, @pdames, @tupui, @ArturNiederfahrenhorst, @ashione, @ckw017, @siddgoel, @Catch-Bull, @vicyap, @spolcyn, @stephanie-wang, @mopga, @Chong-Li, @jjyao, @raulchen, @sven1977, @nikitavemuri, @jbedorf, @mattip, @bveeramani, @czgdp1807, @dependabot[bot], @Fabien-Couthouis, @willfrey, @mwtian, @SlowShip, @Yard1, @WangTaoTheTonic, @Wendi-anyscale, @kaushikb11, @kennethlien, @acxz, @DmitriGekhtman, @matthewdeng, @mraheja, @orcahmlee, @richardliaw, @dsctt, @yupbank, @Jeffwan, @gjoliver, @jovany-wang, @clay4444, @shrekris-anyscale, @jwyyy, @kyle-chen-uber, @simon-mo, @ericl, @amogkam, @jianoaix, @rkooo567, @maxpumperla, @architkulkarni, @chenk008, @xwjiang2010, @robertnishihara, @qicosmos, @sriram-anyscale, @SongGuyang, @jon-chuang, @wuisawesome, @valiantljk, @simonsays1980, @ijrsvt