Release Highlights

Ray AI Runtime (AIR):
- Better support for Image-based workloads.
  - Ray Datasets read_images() API for loading data.
  - Numpy-based API for user-defined functions in Preprocessor.
- Ability to read TFRecord input.
  - Ray Datasets read_tfrecords() API to read TFRecord files.
Ray Serve:
- Add support for gRPC endpoints (alpha release). As an alternative to the HTTP server, Ray Serve now supports the gRPC protocol, and users can bring their own schema for their use case.
RLlib:
- Introduce decision transformer (DT) algorithm.
- New hook for callbacks with on_episode_created().
- Add learning rate schedule to SimpleQ and PG.
Ray Core:
- Ray OOM prevention (alpha release).
- Support dynamic generators as task return values.
Dashboard:
- Time series metrics support.
- Export default configuration files that can be used in Prometheus or Grafana instances.
- New progress bar in job detail view.

Ray Libraries

Ray AIR

💫Enhancements:

Improve readability of training failure output (#27946, #28333, #29143)
Auto-enable GPU for Predictors (#26549)
Add ability to create TorchCheckpoint from state dict (#27970)
Add ability to create TensorflowCheckpoint from saved model/h5 format (#28474)
Add attribute to retrieve URI from Checkpoint (#28731)
Add all allowable types to WandB Callback (#28888)

🔨 Fixes:

Handle nested metrics properly as scoring attribute (#27715)
Fix serializability of Checkpoints (#28387, #28895, #28935)

📖Documentation:

Miscellaneous updates to documentation and examples (#28067, #28002, #28189, #28306, #28361, #28364, #28631, #28800)

🏗 Architecture refactoring:

Deprecate Checkpoint.to_object_ref and Checkpoint.from_object_ref (#28318)
Deprecate legacy train/tune functions in favor of Session (#28856)

Ray Data Processing

🎉 New Features:

Add read_images (#29177)
Add read_tfrecords (#28430)
Add NumPy batch format to Preprocessor and BatchMapper (#28418)
Ragged tensor extension type (#27625)
Add KBinsDiscretizer Preprocessor (#28389)
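
As a rough illustration of what a bin-based discretizer does, the sketch below maps numeric values to equal-width bin indices in plain Python. It is not Ray's KBinsDiscretizer implementation and does not use its API; the function name `discretize` and the equal-width strategy are assumptions chosen for illustration.

```python
from typing import List, Sequence


def discretize(values: Sequence[float], n_bins: int) -> List[int]:
    """Map each value to the index of an equal-width bin (hypothetical helper)."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything lands in the first bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value falls into the last bin
    # instead of opening a bin of its own.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


bins = discretize([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4)
```

Here the range 0 to 10 is split into four bins of width 2.5, and the largest value is folded into the last bin.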

💫Enhancements:

Simplify to_tf interface (#29028)
Add metadata override and inference in Dataset.to_dask() (#28625)
Prune unused columns before aggregate (#28556)
Add Dataset.default_batch_format (#28434)
Add partitioning parameter to read_ functions (#28413)
Deprecate "native" batch format in favor of "default" (#28489)
Support None partition field name (#28417)
Re-enable Parquet sampling and add progress bar (#28021)
Cap the number of stats kept in StatsActor and purge in FIFO order if the limit exceeded (#27964)
Customized serializer for Arrow JSON ParseOptions in read_json (#27911)
Optimize groupby/mapgroups performance (#27805)
Improve size estimation of image folder data source (#27219)
Use detached lifetime for stats actor (#25271)
Pin _StatsActor to the driver node (#27765)
Better error message for partition filtering if no file found (#27353)
Make Concatenator deterministic (#27575)
Change FeatureHasher input schema to expect token counts (#27523)
Avoid unnecessary reads when truncating a dataset with ds.limit() (#27343)
Hide tensor extension from UDFs (#27019)
Add repr to AIR classes (#27006)
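
The stats cap mentioned above (#27964) follows a common pattern: keep at most a fixed number of entries and evict the oldest first. The following is a minimal sketch of that pattern in plain Python, not the actual StatsActor code; the class name `BoundedStatsStore` and its methods are hypothetical.

```python
from collections import OrderedDict
from typing import Any, List


class BoundedStatsStore:
    """Keeps at most max_entries items and evicts the oldest first (FIFO)."""

    def __init__(self, max_entries: int) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max_entries = max_entries
        self._stats: "OrderedDict[str, Any]" = OrderedDict()

    def record(self, key: str, value: Any) -> None:
        # Updating an existing key keeps its original insertion position.
        self._stats[key] = value
        # Purge from the front (oldest insertion) until within the limit.
        while len(self._stats) > self._max_entries:
            self._stats.popitem(last=False)

    def keys(self) -> List[str]:
        return list(self._stats)


store = BoundedStatsStore(max_entries=2)
for name in ["ds_a", "ds_b", "ds_c"]:  # hypothetical dataset identifiers
    store.record(name, {"rows": 0})
```

After the third insert, the oldest entry (`ds_a`) has been purged and only `ds_b` and `ds_c` remain.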

🔨 Fixes:

Add upper bound to pyarrow version check (#29674) (#29744)
Fix map_groups to work with different output type (#29184)
Do not filter out files by default in read_csv (#29032)
Check columns when adding rows to TableBlockBuilder (#29020)
Fix the peak memory usage calculation (#28419)
Change sampling to use same API as read Parquet (#28258)
Fix column assignment in Concatenator for Pandas 1.2. (#27531)
Do partition filtering in reader constructor (#27156)
Fix split ownership (#27149)

📖Documentation:

Clarify dataset transformation. (#28482)
Update map_batches documentation (#28435)
Improve docstring and doctest for read_parquet (#28488)
Activate dataset doctests (#28395)
Document using a different separator for read_csv (#27850)
Convert custom datetime column when reading a CSV file (#27854)
Improve preprocessor documentation (#27215)
Improve limit() and take() docstrings (#27367)
Reorganize the tensor data support docs (#26952)
Fix nyc_taxi_basic_processing notebook (#26983)

Ray Train

🎉 New Features:

Add FullyShardedDataParallel support to TorchTrainer (#28096)

💫Enhancements:

Add rich notebook repr for DataParallelTrainer (#26335)
Fast fail if training loop raises an error on any worker (#28314)
Use torch.encode_data with HorovodTrainer when torch is imported (#28440)
Automatically set NCCL_SOCKET_IFNAME to use ethernet (#28633)
Don't add Trainer resources when running on Colab (#28822)
Support large checkpoints and other arguments (#28826)

🔨 Fixes:

Fix and improve HuggingFaceTrainer (#27875, #28154, #28170, #28308, #28052)
Maintain dtype info in LightGBMPredictor (#28673)
Fix prepare_model (#29104)
Fix train.torch.get_device() (#28659)

📖Documentation:

Clarify LGBM/XGB Trainer documentation (#28122)
Improve Hugging Face notebook example (#28121)
Update Train API reference and docs (#28192)
Mention FSDP in HuggingFaceTrainer docs (#28217)

🏗 Architecture refactoring:

Improve Trainer modularity for extensibility (#28650)

Ray Tune

🎉 New Features:

Add Tuner.get_results() to retrieve results after restore (#29083)

💫Enhancements:

Exclude files in sync_dir_between_nodes, exclude temporary checkpoints (#27174)
Add rich notebook output for Tune progress updates (#26263)
Add logdir to W&B run config (#28454)
Improve readability for long column names in table output (#28764)
Add functionality to recover from latest available checkpoint (#29099)
Add retry logic for restoring trials (#29086)

🔨 Fixes:

Re-enable progress metric detection (#28130)
Add timeout to retry_fn to catch hanging syncs (#28155)
Correct PB2’s beta_t parameter implementation (#28342)
Ignore directory exists errors to tackle race conditions (#28401)
Correctly overwrite files on restore (#28404)
Disable pytorch-lightning multiprocessing by default (#28335)
Raise error if scheduling an empty PlacementGroupFactory (#28445)
Fix trial cleanup after x seconds, set default to 600 (#28449)
Fix trial checkpoint syncing after recovery from other node (#28470)
Catch empty hyperopt search space, raise better Tuner error message (#28503)
Fix and optimize sample search algorithm quantization logic (#28187)
Support tune.with_resources for class methods (#28596)
Maintain consistent Trial/TrialRunner state when pausing and resuming trial with PBT (#28511)
Raise better error for incompatible gcsfs version (#28772)
Ensure that exploited in-memory checkpoint is used by trial with PBT (#28509)
Fix Tune checkpoint tracking for minimizing metrics (#29145)
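
To make the last fix concrete: when a metric is minimized, "best" checkpoints are those with the lowest values, so ranking must respect the optimization mode. The sketch below shows that idea in plain Python. It is not Tune's checkpoint manager; the function `best_checkpoints`, the dictionary layout, and the checkpoint names are assumptions for illustration.

```python
import heapq
from typing import Any, Dict, List


def best_checkpoints(
    checkpoints: List[Dict[str, Any]], metric: str, mode: str, keep: int
) -> List[Dict[str, Any]]:
    """Return the `keep` best checkpoints, best first, for the given mode."""
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    if mode == "min":
        # Lower is better, e.g. a loss.
        return heapq.nsmallest(keep, checkpoints, key=lambda c: c[metric])
    # Higher is better, e.g. an accuracy.
    return heapq.nlargest(keep, checkpoints, key=lambda c: c[metric])


# Hypothetical checkpoint records.
records = [
    {"path": "ckpt_1", "loss": 0.9},
    {"path": "ckpt_2", "loss": 0.4},
    {"path": "ckpt_3", "loss": 0.2},
]
kept = [c["path"] for c in best_checkpoints(records, "loss", "min", keep=2)]
```

With mode "min", the two lowest-loss checkpoints are retained; switching to "max" would retain the two highest instead.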

📖Documentation:

Miscellaneous documentation fixes (#27117, #28131, #28210, #28400, #28068, #28809)
Add documentation around trial/experiment checkpoint (#28303)
Add basic parallel execution guide for Tune (#28677)
Add example PBT notebook (#28519)

🏗 Architecture refactoring:

Store SyncConfig and CheckpointConfig in Experiment and Trial (#29019)

Ray Serve

🎉 New Features:

Added gRPC direct ingress support [alpha version] (#28175)
Serve CLI can provide Kubernetes-formatted output (#28918)
Serve CLI can provide user config output without default values (#28313)

💫Enhancements:

Add more benchmarks:
  - Image inference with a ResNet50 model, including image preprocessing (#29096)
  - gRPC vs. HTTP inference performance (#28175)
Add health check metrics to reflect the replica health status (#29154)

🔨 Fixes:

Fix memory leak issues during inference (#29187)
Fix unexpected warning about omitted HTTP options when using the Serve CLI to start Ray Serve (#28257)
Fix unexpected long poll exceptions (#28612)

📖Documentation:

Add e2e fault tolerance instructions (#28721)
Add Direct Ingress instructions (#29149)
Various doc improvements on “dev workflow”, “custom resources”, “serve cli”, etc. (#29147, #28708, #28529, #28527)

RLlib

🎉 New Features:

Decision Transformer (DT) Algorithm added (#27890, #27889, #27872, #27829).
Callbacks now have a new hook on_episode_created(). (#28600)
Added learning rate schedule to SimpleQ and PG. (#28381)
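
A learning rate schedule of this kind is typically given as a list of [timestep, value] breakpoints with linear interpolation in between. The sketch below illustrates that interpolation in plain Python; it is not RLlib's implementation, and the function name `lr_at` and the example breakpoints are assumptions.

```python
from bisect import bisect_right
from typing import Sequence


def lr_at(schedule: Sequence[Sequence[float]], timestep: int) -> float:
    """Piecewise-linear interpolation over sorted [timestep, lr] breakpoints."""
    steps = [point[0] for point in schedule]
    values = [point[1] for point in schedule]
    # Outside the covered range, hold the first or last value.
    if timestep <= steps[0]:
        return values[0]
    if timestep >= steps[-1]:
        return values[-1]
    # hi is the first breakpoint strictly after timestep; lo precedes it.
    hi = bisect_right(steps, timestep)
    lo = hi - 1
    fraction = (timestep - steps[lo]) / (steps[hi] - steps[lo])
    return values[lo] + fraction * (values[hi] - values[lo])


# Hypothetical schedule: decay from 1e-3 to 1e-4 over the first 1000 timesteps.
schedule = [[0, 1e-3], [1000, 1e-4]]
halfway = lr_at(schedule, 500)
```

Halfway through the decay, the value lies midway between the two breakpoints; past the last breakpoint it stays at the final value.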

💫Enhancements:

Soft target network update is now supported by all off-policy algorithms (e.g., DQN, DDPG) (#28135)
Stop RLlib from "silently" selecting atari preprocessors. (#29011)
Improved offline RL and off-policy evaluation performance (#28837, #28834, #28593, #28420, #28136, #28013, #27356, #27161, #27451).
Escalated old deprecation warnings to errors (#28807, #28795, #28733, #28697).
Others: #27619, #27087.
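
For readers unfamiliar with soft target network updates: instead of copying the online network's weights into the target network at intervals, the target weights are moved a small fraction `tau` toward the online weights on each update. The sketch below shows the arithmetic on plain Python lists. It is not RLlib code; the function name `soft_update` and the weight layout are assumptions.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def soft_update(target: Weights, online: Weights, tau: float) -> Weights:
    """Return new target weights: tau * online + (1 - tau) * target."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must be in [0, 1]")
    return {
        name: [tau * o + (1.0 - tau) * t for o, t in zip(online[name], target[name])]
        for name in target
    }


# Hypothetical single-layer weights.
online_weights = {"w": [1.0, 2.0]}
target_weights = {"w": [0.0, 0.0]}
updated = soft_update(target_weights, online_weights, tau=0.5)
```

Setting `tau` to 1.0 reduces this to a hard copy of the online weights, while small values make the target network change slowly.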

🔨 Fixes:

Various bug fixes: #29077, #28811, #28637, #27785, #28703, #28422, #28405, #28358, #27540, #28325, #28357, #28334, #27090, #28133, #27981, #27980, #26666, #27390, #27791, #27741, #27424, #27544, #27459, #27572, #27255, #27304, #26629, #28166, #27864, #28938, #28845, #28588, #28202, #28201, #27806

📖Documentation:

Connectors. (#27528)
Training step API. (#27344)
Others: #28299, #27460

Ray Workflows

🔨 Fixes:

Fixed the object loss due to driver exit (#29092)
Change the name in step to task_id (#28151)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

Ray OOM prevention feature alpha release! If your Ray jobs suffer from OOM issues, please give it a try.
Support dynamic generators as task return values. (#29082 #28864 #28291)

💫Enhancements:

Fix spread scheduling imbalance issues (#28804 #28551)
Widen range of allowed grpcio versions (#28623)
Support encrypted Redis connection. (#29109)
Upgrade Redis from 6.x to 7.0.5. (#28936)
Batch ScheduleAndDispatchTasks calls (#28740)

🔨 Fixes:

More robust spilled object deletion (#29014)
Fix the initialization/destruction order between reference_counter_ and node change subscription (#29108)
Suppress the logging error when python exits and actor not deleted (#27300)
Mark run_function_on_all_workers as deprecated until we get rid of this (#29062)
Remove unused args for default_worker.py (#28177)
Don't include script directory in sys.path if it's started via python -m (#28140)
Handle edge cases of max_cpu_fraction argument (#27035)
Fix out-of-band deserialization of actor handle (#27700)
Allow reuse of cluster address if Ray is not running (#27666)
Fix an uncaught exception upon deallocation for actors (#27637)
Support placement_group=None in PlacementGroupSchedulingStrategy (#27370)

📖Documentation:

Ray 2.0 white paper is published.
Revamp ray core docs (#29124 #29046 #28953 #28840 #28784 #28644 #28345 #28113 #27323 #27303)
Fix cluster docs (#28056 #27062)
CLI Reference Documentation Revamp (#27862)

Ray Clusters

💫Enhancements:

Distinguish Kubernetes deployment stacks (#28490)

📖Documentation:

State intent to remove legacy Ray Operator (#29178)
Improve KubeRay migration notes (#28672)
Add FAQ for cluster multi-tenancy support (#29279)

Dashboard

🎉 New Features:

Time series metrics are now built into the dashboard
Ray now exports some default configuration files which can be used for your Prometheus or Grafana instances. This includes default metrics which show common information important to your Ray application.
A new progress bar is shown in the job detail view, so you can see how far along your Ray job is.

🔨 Fixes:

Fix Prometheus exporter producing a slightly incorrect format.
Fix several performance issues and memory leaks

📖Documentation:

Added additional documentation on the new time series and the metrics page

Many thanks to all those who contributed to this release!

@sihanwang41, @simon-mo, @avnishn, @MyeongKim, @markrogersjr, @christy, @xwjiang2010, @kouroshHakha, @zoltan-fedor, @wumuzi520, @alanwguo, @Yard1, @liuyang-my, @charlesjsun, @DevJake, @matteobettini, @jonathan-conder-sm, @mgerstgrasser, @guidj, @JiahaoYao, @Zyiqin-Miranda, @jvanheugten, @aallahyar, @SongGuyang, @clarng, @architkulkarni, @Rohan138, @heyitsmui, @mattip, @ArturNiederfahrenhorst, @maxpumperla, @vale981, @krfricke, @DmitriGekhtman, @amogkam, @richardliaw, @maldil, @zcin, @jianoaix, @cool-RR, @kira-lin, @gramhagen, @c21, @jiaodong, @sijieamoy, @tupui, @ericl, @anabranch, @se4ml, @suquark, @dmatrix, @jjyao, @clarkzinzow, @smorad, @rkooo567, @jovany-wang, @edoakes, @XiaodongLv, @klieret, @rozsasarpi, @scottsun94, @ijrsvt, @bveeramani, @chengscott, @jbedorf, @kevin85421, @nikitavemuri, @sven1977, @acxz, @stephanie-wang, @PaulFenton, @WangTaoTheTonic, @cadedaniel, @nthai, @wuisawesome, @rickyyx, @artemisart, @peytondmurray, @pingsutw, @olipinski, @davidxia, @stestagg, @yaxife, @scv119, @mwtian, @yuanchi2807, @ntlm1686, @shrekris-anyscale, @cassidylaidlaw, @gjoliver, @ckw017, @hakeemta, @ilee300a, @avivhaber, @matthewdeng, @afarid, @pcmoritz, @Chong-Li, @Catch-Bull, @justinvyu, @iycheng

ray-project/ray ray-2.1.0 Ray-2.1.0 on GitHub
