Release Highlights

The streaming backend for Ray Datasets is in Developer Preview. It is designed to enable terabyte-scale ML inference and training workloads. Please contact us if you'd like to try it out on your workload, or you can find the preview guide here: https://docs.google.com/document/d/1BXd1cGexDnqHAIVoxTnV3BV0sklO9UXqPwSdHukExhY/edit
New Information Architecture (beta): We’ve restructured the Ray dashboard to be organized around user personas and workflows instead of entities.

Ray Libraries

Ray AIR

💫Enhancements:

Add set_preprocessor method to Checkpoint (#31721)
Rename Keras callback and its parameters to be more descriptive (#31627)
Deprecate MlflowTrainableMixin in favor of setup_mlflow() function (#31295)
W&B
- Have train_loop_config logged as a config (#31901)
- Allow users to exclude config values with WandbLoggerCallback (#31624)
- Rename WandB save_checkpoints to upload_checkpoints (#31582)
- Add hook to get project/group for W&B integration (#31035, 31643)
- Use Ray actors instead of multiprocessing for WandbLoggerCallback (#30847)
- Update WandbLoggerCallback example (#31625)
Predictor
- Place predictor kwargs in object store (#30932)
- Delegate BatchPredictor stage fusion to Datasets (#31585)
- Rename DLPredictor.call_model tensor parameter to inputs (#30574)
- Add use_gpu to HuggingFacePredictor (#30945)
Checkpoints
- Various Checkpoint improvements (#30948)
- Implement lazy checkpointing for same-node case (#29824)
- Automatically strip "module." from state dict (#30705)
- Allow user to pass model to TensorflowCheckpoint.get_model (#31203)

🔨 Fixes:

Fix and improve support for HDFS remote storage. (#31940)
Use specified Preprocessor configs when using stream API. (#31725)
Support nested Chain in BatchPredictor (#31407)

📖Documentation:

Restructure API References (#32535)
API Deprecations (#31777, #31867)
Various fixes to docstrings, documentation, and examples (#30782, #30791)

🏗 Architecture refactoring:

Use NodeAffinitySchedulingPolicy for scheduling (#32016)
Internal resource management refactor (#30777, #30016)

Ray Data Processing

🎉 New Features:

Lazy execution by default (#31286)
Introduce streaming execution backend (#31579)
Introduce DatasetIterator (#31470)
Add per-epoch preprocessor (#31739)
Add TorchVisionPreprocessor (#30578)
Persist Dataset statistics automatically to log file (#30557)

💫Enhancements:

Async batch fetching for map_batches (#31576)
Add informative progress bar names to map_batches (#31526)
Provide an size bytes estimate for mongodb block (#31930)
Add support for dynamic block splitting to actor pool (#31715)
Improve str/repr of Dataset to include execution plan (#31604)
Deal with nested Chain in BatchPredictor (#31407)
Allow MultiHotEncoder to encode arrays (#31365)
Allow specify batch_size when reading Parquet file (#31165)
Add zero-copy batch API for ds.map_batches() (#30000)
Text dataset should save texts in ArrowTable format (#30963)
Return ndarray dicts for single-column tabular datasets (#30448)
Execute randomize_block_order eagerly if it's the last stage for ds.schema() (#30804)

🔨 Fixes:

Don't drop first dataset when peeking DatasetPipeline (#31513)
Handle np.array(dtype=object) constructor for ragged ndarrays (#31670)
Emit warning when starting Dataset execution with no CPU resources available (#31574)
Fix the bug of eagerly clearing up input blocks (#31459)
Fix Imputer failing with categorical dtype (#31435)
Fix schema unification for Datasets with ragged Arrow arrays (#31076)
Fix Discretizers transforming ignored cols (#31404)
Fix to_tf when the input feature_columns is a list. (#31228)
Raise error message if user calls Dataset.iter (#30575)

📖Documentation:

Refactor Ray Data API documentation (#31204)
Add seealso to map-related methods (#30579)

Ray Train

🎉 New Features:

Add option for per-epoch preprocessor (#31739)

💫Enhancements:

Change default NCCL_SOCKET_IFNAME to blacklist veth (#31824)
Introduce DatasetIterator for bulk and streaming ingest (#31470)
Clarify which RunConfig is used when there are multiple places to specify it (#31959)
Change ScalingConfig to be optional for DataParallelTrainers if already in Tuner param_space (#30920)

🔨 Fixes:

Use specified Preprocessor configs when using stream API. (#31725)
Fix off-by-one AIR Trainer checkpoint ID indexing on restore (#31423)
Force GBDTTrainer to use distributed loading for Ray Datasets (#31079)
Fix bad case in ScalingConfig->RayParams (#30977)
Don't raise TuneError on fail_fast="raise" (#30817)
Report only once in SklearnTrainer (#30593)
Ensure GBDT PGFs match passed ScalingConfig (#30470)

📖Documentation:

Restructure API References (#32535)
Remove Ray Client references from Train docs/examples (#32321)
Various fixes to docstrings, documentation, and examples (#29463, #30492, #30543, #30571, #30782, #31692, #31735)

🏗 Architecture refactoring:

API Deprecations (#31763)

Ray Tune

💫Enhancements:

Improve trainable serialization error (#31070)
Add support for Nevergrad optimizer with extra parameters (#31015)
Add timeout for experiment checkpoint syncing to cloud (#30855)
Move validate_upload_dir to Syncer (#30869)
Enable experiment restore from moved cloud uri (#31669)
Save and restore stateful callbacks as part of experiment checkpoint (#31957)

🔨 Fixes:

Do not default to reuse_actors=True when mixins are used (#31999)
Only keep cached actors if search has not ended (#31974)
Fix best trial in ProgressReporter with nan (#31276)
Make ResultGrid return cloud checkpoints (#31437)
Wait for final experiment checkpoint sync to finish (#31131)
Fix CheckpointConfig validation for function trainables (#31255)
Fix checkpoint directory assignment for new checkpoints created after restoring a function trainable (#31231)
Fix AxSearch save and nan/inf result handling (#31147)
Fix AxSearch search space conversion for fixed list hyperparameters (#31088)
Restore searcher and scheduler properly on Tuner.restore (#30893)
Fix progress reporter sort_by_metric with nested metrics (#30906)
Don't raise TuneError on fail_fast="raise" (#30817)
Fix duplicate printing when trial is done (#30597)

📖Documentation:

Restructure API references (#32449)
Remove Ray Client references from Tune docs/examples (#32321)
Various fixes to docstrings, documentation, and examples (#29581, #30782, #30571, #31045, #31793, #32505)

🏗 Architecture refactoring:

Deprecate passing a custom trial executor (#31792)
Move signal handling into separate method (#31004)
Update staged resources in a fixed counter for faster lookup (#32087)
Rename overwrite_trainable argument in Tuner restore to trainable (#32059)

Ray Serve

🎉 New Features:

Serve python API to support multi application (#31589)

💫Enhancements:

Add exponential backoff when retrying replicas (#31436)
Enable Log Rotation on Serve (#31844)
Use tasks/futures for asyncio.wait (#31608)
Change target_num_ongoing_requests_per_replica to positive float (#31378)

🔨 Fixes:

Upgrade deprecated calls (#31839)
Change Gradio integration to take a builder function to avoid serialization issues (#31619)
Add initial health check before marking a replica as RUNNING (#31189)

📖Documentation:

Document end-to-end timeout in Serve (#31769)
Document Gradio visualization (#28310)

RLlib

🎉 New Features:

Gymnasium is now supported. (Notes)
Connectors are now activated by default (#31693, 30388, 31618, 31444, 31092)
Contribution of LeelaChessZero algorithm for playing chess in a MultiAgent env. (#31480)

💫Enhancements:

[RLlib] Error out if action_dict is empty in MultiAgentEnv. (#32129)
[RLlib] Upgrade tf eager code to no longer use experimental_relax_shapes (but reduce_retracing instead). (#29214)
[RLlib] Reduce SampleBatch counting complexity (#30936)
[RLlib] Use PyTorch vectorized max() and sum() in SampleBatch.init when possible (#28388)
[RLlib] Support multi-gpu CQL for torch (tf already supported). (#31466)
[RLlib] Introduce IMPALA off_policyness test with GPU (#31485)
[RLlib] Properly serialize and restore StateBufferConnector states for policy stashing (#31372)
[RLlib] Clean up deprecated concat_samples calls (#31391)
[RLlib] Better support MultiBinary spaces by treating Tuples as superset of them in ComplexInputNet. (#28900)
[RLlib] Add backward compatibility to MeanStdFilter to restore from older checkpoints. (#30439)
[RLlib] Clean up some signatures for compute_actions. (#31241)
[RLlib] Simplify logging configuration. (#30863)
[RLlib] Remove native Keras Models. (#30986)
[RLlib] Convert PolicySpec to a readable format when converting to_dict(). (#31146)
[RLlib] Issue 30394: Add proper __str__() method to PolicyMap. (#31098)
[RLlib] Issue 30840: Option to only checkpoint policies that are trainable. (#31133)
[RLlib] Deprecate (delete) contrib folder. (#30992)
[RLlib] Better behavior if user does not specify stopping condition in RLLib CLI. (#31078)
[RLlib] PolicyMap LRU cache enhancements: Swap out policies (instead of GC'ing and recreating) + use Ray object store (instead of file system). (#29513)
[RLlib] AlgorithmConfig.overrides() to replace multiagent->policies->config and evaluation_config dicts. (#30879)
[RLlib] deprecation_warning(.., error=True) should raise ValueError, not DeprecationWarning. (#30255)
[RLlib] Add gym.spaces.Text serialization. (#30794)
[RLlib] Convert MultiAgentBatch to SampleBatch in offline_rl.py. (#30668)
[RLlib; Tune] Make Algorithm.train() return Tune-style config dict (instead of AlgorithmConfig object). (#30591)

🔨 Fixes:

[RLlib] Fix waterworld example and test (#32117)
[RLlib] Change Waterworld v3 to v4 and reinstate indep. MARL test case w/ pettingzoo. (#31820)
[RLlib] Fix OPE checkpointing. Save method name in configuration dict. (#31778)
[RLlib] Fix worker state restoration. (#31644)
[RLlib] Replace ordinary pygame imports by try_import_..(). (#31332)
[RLlib] Remove crude VR checks in agent collector. (#31564)
[RLlib] Fixed the 'RestoreWeightsCallback' example script. (#31601)
[RLlib] Issue 28428: QMix not working w/ GPUs. (#31299)
[RLlib] Fix using yaml files with empty stopping conditions. (#31363)
[RLlib] Issue 31174: Move all checks into AlgorithmConfig.validate() (even simple ones) to avoid errors when using tune hyperopt objects. (#31396)
[RLlib] Fix tensorflow_probability imports. (#31331)
[RLlib] Issue 31323: BC/MARWIL/CQL do work with multi-GPU (but config validation prevents them from running in this mode). (#31393)
[RLlib] Issue 28849: DT fails with num_gpus=1. (#31297)
[RLlib] Fix PolicyMap.__del__() to also remove a deleted policy ID from the internal deque. (#31388)
[RLlib] Use get_model_v2() instead of get_model() with MADDPG. (#30905)
[RLlib] Policy mapping fn can not be called with keyword arguments. (#31141)
[RLlib] Issue 30213: Appending RolloutMetrics to sampler outputs should happen after(!) all callbacks (such that custom metrics for last obs are still included). (#31102)
[RLlib] Make convert_to_torch tensor adhere to docstring. (#31095)
[RLlib] Fix convert to torch tensor (#31023)
[RLlib] Issue 30221: random policy does not handle nested spaces. (#31025)
[RLlib] Fix crashing remote envs example (#30562)
[RLlib] Recursively look up the original space from obs_space (#30602)

📖Documentation:

[RLlib; docs] Change links and references in code and docs to "Farama foundation's gymnasium" (from "OpenAI gym"). (#32061)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

Task Events Backend: Ray aggregates all submitted task information to provide better observability (#31840, #31761, #31278, #31247, #31316, #30934, #30979, #31207, #30867, #30829, #31524, #32157). This will back up features like task state API, advanced progress bar, and Ray timeline.

💫Enhancements:

Remote generator now works for ray actors and ray clients (#31700, #31710).
Revamp default scheduling strategy, improve worker startup performance up to 8x for embarrassingly parallel workloads (#31934, #31868).
Worker code clean up and allow workers lazy bind to jobs (#31836, #31846, #30349, #31375).
A single Ray cluster can scale up to 2000 nodes and 20k actors(#32131, #30131, #31939, #30166, #30460, #30563).
Out-of-memory prevention enhancement is now GA with more robust worker killing policies and better user experiences (#32217, #32361, #32219, #31768, #32107, #31976, #31272, #31509, #31230).

🔨 Fixes:

Improve garbage collection upon job termination (#32127, #31155)
Fix opencensus protobuf bug (#31632)
Support python 3.10 for runtime_env conda (#30970)
Fix crashes and memory leaks (#31640, #30476, #31488, #31917, #30761, #31018)

📖Documentation:

Deprecation (#31845, #31140, #31528)

Ray Clusters

💫Enhancements:

[observability] Better memory formatting for ray status and autoscaler (#32337)
[autoscaler] Add flag to disable periodic cluster status log. (#31869)

🔨 Fixes:

[observability][autoscaler] Ensure pending nodes is reset to 0 after scaling (#32085)
Make ~/.bashrc optional in cluster launcher commands (#32393)

📖Documentation:

Improvements to job submission
Remove references to Ray Client

Dashboard

🎉 New Features:

New Information Architecture (beta): We’ve restructured the Ray dashboard to be organized around user personas and workflows instead of entities. For developers, the jobs and actors tab will be most useful. For infrastructure engineers, the cluster tab may be more valuable.
Advanced progress bar: Tasks visualization that allow you to see the progress of all your ray tasks
Timeline view: We’ve added a button to download detailed timeline data about your ray job. Then, one can click a link and use the perfetto open-source visualization tool to visualize the timeline data.
More metadata tables. You can now see placement groups, tasks, actors, and other information related to your jobs.

📖Documentation:

We’ve restructured the documentation to make the dashboard documentation more prominent
We’ve improved the documentation around setting up Prometheus and Grafana for metrics.

Many thanks to all those who contributed to this release!

@minerharry, @scottsun94, @iycheng, @DmitriGekhtman, @jbedorf, @krfricke, @simonsays1980, @eltociear, @xwjiang2010, @ArturNiederfahrenhorst, @richardliaw, @avnishn, @WeichenXu123, @Capiru, @davidxia, @andreapiso, @amogkam, @sven1977, @scottjlee, @kylehh, @yhna940, @rickyyx, @sihanwang41, @n30111, @Yard1, @sriram-anyscale, @Emiyalzn, @simran-2797, @cadedaniel, @harelwa, @ijrsvt, @clarng, @pabloem, @bveeramani, @lukehsiao, @angelinalg, @dmatrix, @sijieamoy, @simon-mo, @jbesomi, @YQ-Wang, @larrylian, @c21, @AndreKuu, @maxpumperla, @architkulkarni, @wuisawesome, @justinvyu, @zhe-thoughts, @matthewdeng, @peytondmurray, @kevin85421, @tianyicui-tsy, @cassidylaidlaw, @gvspraveen, @scv119, @kyuyeonpooh, @Siraj-Qazi, @jovany-wang, @ericl, @shrekris-anyscale, @Catch-Bull, @jianoaix, @christy, @MisterLin1995, @kouroshHakha, @pcmoritz, @csko, @gjoliver, @clarkzinzow, @SongGuyang, @ckw017, @ddelange, @alanwguo, @Dhul-Husni, @Rohan138, @rkooo567, @fzyzcjy, @chaokunyang, @0x2b3bfa0, @zoltan-fedor, @Chong-Li, @crypdick, @jjyao, @emmyscode, @stephanie-wang, @starpit, @smorad, @nikitavemuri, @zcin, @tbukic, @ayushthe1, @mattip

ray-project/ray ray-2.3.0 Ray-2.3.0 on GitHub

Release Highlights

Ray Libraries

Ray AIR

Ray Data Processing

Ray Train

Ray Tune

Ray Serve

RLlib

Ray Core and Ray Clusters

Ray Core

Ray Clusters

Dashboard

ray-project/ray ray-2.3.0
Ray-2.3.0

on GitHub