Ray-2.10.0

Release Highlights

The Ray 2.10 release brings important stability improvements and enhancements to Ray Data, which is now generally available (GA).

  • [Data] Ray Data becomes generally available, with stability improvements in streaming execution and in reading and writing data, better task concurrency control, and improved debuggability through the dashboard, logging, and metrics visualization.
  • [RLlib] “New API Stack” officially announced as alpha for PPO and SAC.
  • [Serve] Added a default autoscaling policy set via num_replicas="auto" (#42613).
  • [Serve] Added support for active load shedding via max_queued_requests (#42950).
  • [Serve] Added replica queue length caching to the DeploymentHandle scheduler (#42943).
    • This should reduce overhead in the Serve proxy and handles.
    • max_ongoing_requests (max_concurrent_queries) is also now strictly enforced (#42947).
    • If you see any issues, please report them on GitHub; you can disable this behavior by setting RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0.
  • [Serve] Renamed the following parameters. Each of the old names will be supported for another release before removal.
    • max_concurrent_queries -> max_ongoing_requests
    • target_num_ongoing_requests_per_replica -> target_ongoing_requests
    • downscale_smoothing_factor -> downscaling_factor
    • upscale_smoothing_factor -> upscaling_factor
  • [Serve] WARNING: the following default values will change in Ray 2.11:
    • Default for max_ongoing_requests will change from 100 to 5.
    • Default for target_ongoing_requests will change from 1 to 2.
  • [Core] Autoscaler v2 is in alpha and can be tried out with KubeRay. It has improved observability and stability compared to v1.
  • [Train] Added support for accelerator types via ScalingConfig(accelerator_type).
  • [Train] Revamped the XGBoostTrainer and LightGBMTrainer to no longer depend on xgboost_ray and lightgbm_ray. A new, more flexible API will arrive in a future release.
  • [Train/Tune] Refactored local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR.

Ray Libraries

Ray Data

🎉 New Features:

  • Streaming execution stability improvements to avoid memory issues, including per-operator resource reservation, streaming generator output buffer management, and better runtime resource estimation (#43026, #43171, #43298, #43299, #42930, #42504)
  • Metadata read stability improvements to avoid transient AWS errors, including retries on application-level exceptions, spreading tasks across multiple nodes, and configurable retry intervals (#42044, #43216, #42922, #42759).
  • Allow task concurrency control for read, map, and write APIs (#42849, #43113, #43177, #42637); see the sketch after this list
  • Data dashboard and statistics improvements, with more runtime metrics for each component (#43790, #43628, #43241, #43477, #43110, #43112)
  • Allow specifying application-level errors to retry for actor tasks (#42492)
  • Add num_rows_per_file parameter to file-based writes (#42694)
  • Add DataIterator.materialize (#43210)
  • Skip schema call in DataIterator.to_tf if tf.TypeSpec is provided (#42917)
  • Add option to append for Dataset.write_bigquery (#42584)
  • Deprecate legacy components and classes (#43575, #43178, #43347, #43349, #43342, #43341, #42936, #43144, #43022, #43023)
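
As a rough, minimal sketch of the new task concurrency controls and the num_rows_per_file write option (the paths and numbers below are placeholders, not recommendations):

```python
import ray

# Cap the number of concurrent read tasks (placeholder value).
ds = ray.data.read_parquet("s3://example-bucket/input/", concurrency=8)

# Cap the number of concurrent map tasks as well.
ds = ds.map_batches(lambda batch: batch, concurrency=4)

# Ask the writer to target roughly 100,000 rows per output file.
ds.write_parquet("/tmp/output/", num_rows_per_file=100_000)
```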

💫 Enhancements:

  • Restructure stdout logging for better readability (#43360)
  • Add a more performant way to read large TFRecord datasets (#42277)
  • Modify ImageDatasource to use Image.BILINEAR as the default image resampling filter (#43484)
  • Reduce internal stack trace output by default (#43251)
  • Perform incremental writes to Parquet files (#43563)
  • Warn on excessive driver memory usage during shuffle ops (#42574)
  • Distributed reads for ray.data.from_huggingface (#42599); see the sketch after this list
  • Remove Stage class and related usages (#42685)
  • Improve stability of reading JSON files to avoid PyArrow errors (#42558, #42357)
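
A minimal sketch of the distributed Hugging Face read path (assumes the Hugging Face datasets package is installed; "imdb" is only an example dataset):

```python
import ray
from datasets import load_dataset  # Hugging Face datasets

hf_ds = load_dataset("imdb", split="train")

# For publicly hosted datasets, from_huggingface can now read the underlying
# files in a distributed fashion instead of materializing everything on the driver.
ds = ray.data.from_huggingface(hf_ds)
print(ds.schema())
```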

🔨 Fixes:

  • Turn off actor locality by default (#44124)
  • Normalize block types before internal multi-block operations (#43764)
  • Fix memory metrics for OutputSplitter (#43740)
  • Fix race condition issue in OpBufferQueue (#43015)
  • Fix early stop for multiple Limit operators (#42958)
  • Fix deadlocks caused by Dataset.streaming_split that could hang jobs (#42601)

📖 Documentation:

Ray Train

🎉 New Features:

  • Add support for accelerator types via ScalingConfig(accelerator_type) for improved worker scheduling (#43090)
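
A minimal sketch of requesting an accelerator type for Train workers (the worker count and accelerator name are placeholders):

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func():
    # Per-worker training loop goes here.
    pass

trainer = TorchTrainer(
    train_func,
    # Workers are scheduled on nodes that provide the requested accelerator type.
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True, accelerator_type="A10G"),
)
result = trainer.fit()
```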

💫 Enhancements:

  • Add a backend-specific context manager for train_func for setup/teardown logic (#43209)
  • Remove DEFAULT_NCCL_SOCKET_IFNAME to simplify network configuration (#42808)
  • Colocate the Trainer with the rank 0 worker to improve scheduling behavior (#43115)

🔨 Fixes:

  • Enable scheduling workers with memory resource requirements (#42999)
  • Make path behavior OS-agnostic by using Path.as_posix over os.path.join (#42037)
  • [Lightning] Fix resuming from checkpoint when using RayFSDPStrategy (#43594)
  • [Lightning] Fix deadlock in RayTrainReportCallback (#42751)
  • [Transformers] Fix checkpoint reporting behavior when get_latest_checkpoint returns None (#42953)

📖 Documentation:

  • Enhance docstring and user guides for train_loop_config (#43691)
  • Clarify in ray.train.report docstring that it is not a barrier (#42422)
  • Improve documentation for prepare_data_loader shuffle behavior and set_epoch (#41807)

🏗 Architecture refactoring:

  • Simplify XGBoost and LightGBM Trainer integrations. Implemented XGBoostTrainer and LightGBMTrainer as DataParallelTrainer. Removed dependency on xgboost_ray and lightgbm_ray. (#42111, #42767, #43244, #43424)
  • Refactor local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to storage_path, rather than having another copy in the user’s home directory (~/ray_results). (#43369, #43403, #43689)
  • Split the overloaded ray.train.torch.get_device into a separate get_devices API for multi-GPU worker setups (#42314); see the sketch after this list
  • Refactor restoration configuration to be centered around storage_path (#42853, #43179)
  • Deprecations related to SyncConfig (#42909)
  • Remove deprecated preprocessor argument from Trainers (#43146, #43234)
  • Hard-deprecate MosaicTrainer and remove SklearnTrainer (#42814)
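
A minimal sketch of the get_device / get_devices split inside a training worker (assumes a setup where a worker may own more than one GPU):

```python
import ray.train.torch

def train_func():
    device = ray.train.torch.get_device()    # the single device assigned to this worker
    devices = ray.train.torch.get_devices()  # all devices assigned to this worker (multi-GPU case)
    print(device, devices)
```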

Ray Tune

💫 Enhancements:

  • Increase the minimum number of allowed pending trials for faster auto-scaleup (#43455)
  • Add support to TBXLogger for logging images (#37822)
  • Improve validation of Experiment(config) to handle RLlib AlgorithmConfig (#42816, #42116)

🔨 Fixes:

  • Fix reuse_actors error on actor cleanup for function trainables (#42951)
  • Make path behavior OS-agnostic by using Path.as_posix over os.path.join (#42037)

📖 Documentation:

🏗 Architecture refactoring:

  • Refactor local staging directory to remove the need for local_dir and RAY_AIR_LOCAL_CACHE_DIR. Add isolation between driver and distributed worker artifacts so that large files written by workers are not uploaded implicitly. Results are now only written to storage_path, rather than having another copy in the user’s home directory (~/ray_results). (#43369, #43403, #43689)
  • Deprecations related to SyncConfig and chdir_to_trial_dir (#42909)
  • Refactor restoration configuration to be centered around storage_path (#42853, #43179)
  • Add back NevergradSearch (#42305)
  • Clean up invalid checkpoint_dir and reporter deprecation notices (#42698)

Ray Serve

🎉 New Features:

  • Added support for active load shedding via max_queued_requests (#42950).
  • Added a default autoscaling policy set via num_replicas="auto" (#42613). See the sketch after this list.
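
A minimal sketch combining the two features above (the queue length limit is a placeholder value):

```python
from ray import serve

@serve.deployment(
    num_replicas="auto",      # opt into the default autoscaling policy
    max_queued_requests=100,  # reject further requests once 100 are queued at the handle
)
class MyModel:
    async def __call__(self, request) -> str:
        return "ok"

app = MyModel.bind()
# serve.run(app)
```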

🏗 API Changes:

  • Renamed the following parameters (see the sketch after this list). Each of the old names will be supported for another release before removal.
    • max_concurrent_queries to max_ongoing_requests
    • target_num_ongoing_requests_per_replica to target_ongoing_requests
    • downscale_smoothing_factor to downscaling_factor
    • upscale_smoothing_factor to upscaling_factor
  • WARNING: the following default values will change in Ray 2.11:
    • Default for max_ongoing_requests will change from 100 to 5.
    • Default for target_ongoing_requests will change from 1 to 2.
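
A minimal sketch using the new parameter names (the numeric values are placeholders, not recommendations):

```python
from ray import serve

@serve.deployment(
    max_ongoing_requests=5,  # formerly max_concurrent_queries
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 10,
        "target_ongoing_requests": 2,  # formerly target_num_ongoing_requests_per_replica
        "upscaling_factor": 1.0,       # formerly upscale_smoothing_factor
        "downscaling_factor": 0.5,     # formerly downscale_smoothing_factor
    },
)
class Model:
    async def __call__(self, request) -> str:
        return "ok"
```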

💫 Enhancements:

  • Add the RAY_SERVE_LOG_ENCODING environment variable to set the global logging behavior for Serve (#42781).
  • Configure Serve's gRPC proxy to allow large payloads (#43114).
  • Add a blocking flag to serve.run() (#43227); see the sketch after this list.
  • Add actor ID and worker ID to Serve structured logs (#43725).
  • Added replica queue length caching to the DeploymentHandle scheduler (#42943).
    • This should reduce overhead in the Serve proxy and handles.
    • max_ongoing_requests (max_concurrent_queries) is also now strictly enforced (#42947).
    • If you see any issues, please report them on GitHub; you can disable this behavior by setting RAY_SERVE_ENABLE_QUEUE_LENGTH_CACHE=0.
  • Autoscaling metrics (tracking ongoing and queued metrics) are now collected at deployment handles by default instead of at the Serve replicas (#42578).
    • This means you can now set max_ongoing_requests=1 for autoscaling deployments and still upscale properly, because requests queued at handles are properly taken into account for autoscaling.
    • You should expect deployments to upscale more aggressively during bursty traffic, because requests will likely queue up at handles during bursts of traffic.
    • If you see any issues, please report them on GitHub; you can switch back to the old method of collecting metrics by setting the environment variable RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0.
  • Improved the downscaling behavior of smoothing_factor for low numbers of replicas (#42612).
  • Various logging improvements (#43707, #43708, #43629, #43557).
  • During in-place upgrades or when replicas become unhealthy, Serve will no longer wait for old replicas to gracefully terminate before starting new ones (#43187). New replicas will be eagerly started to satisfy the target number of healthy replicas.
    • This new behavior is on by default and can be turned off by setting RAY_SERVE_EAGERLY_START_REPLACEMENT_REPLICAS=0
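
A minimal sketch of the new blocking flag on serve.run() (the deployment is a placeholder):

```python
from ray import serve

@serve.deployment
class Echo:
    async def __call__(self, request) -> str:
        return "hello"

# blocking=True keeps this script alive and serving until it is interrupted.
serve.run(Echo.bind(), blocking=True)
```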

🔨 Fixes:

  • Fix the deployment route prefix being overridden by the default route prefix from the serve run CLI (#43805).
  • Fixed a bug causing batch methods to hang upon cancellation (#42593).
  • Unpinned FastAPI dependency version (#42711).
  • Delay proxy marking itself as healthy until it has routes from the controller (#43076).
  • Fixed an issue where multiplexed deployments could go into infinite backoff (#43965).
  • Silence noisy KeyError on disconnects (#43713).
  • Fixed a bug where Prometheus counter metrics were emitted as gauges (#43795, #43901).
    • All Serve counter metrics are now emitted as counters with a _total suffix. The old gauge metrics are still emitted for compatibility.

📖 Documentation:

  • Update serve logging config docs (#43483).
  • Added documentation for max_replicas_per_node (#42743).

RLlib

🎉 New Features:

💫 Enhancements:

  • Old API Stack cleanups:
    • Move SampleBatch column names (e.g. SampleBatch.OBS) into a new class (Columns). (#43665)
    • Remove old exec_plan API code. (#41585)
    • Introduce OldAPIStack decorator (#43657)
    • RLModule API: Add functionality to define kernel and bias initializers via config. (#42137)
  • Learner/LearnerGroup APIs:
    • Replace Learner/LearnerGroup specific config classes (e.g. LearnerHyperparameters) with AlgorithmConfig. (#41296)
    • Learner/LearnerGroup: Allow updating from Episodes. (#41235)
  • In preparation for DQN on the new API stack (#43199, #43196)

🔨 Fixes:

  • New API Stack bug fixes: fix policy_to_train logic (#41529), fix multi-GPU for PPO on the new API stack (#44001), issue 40347 (#42090)
  • Other fixes: MultiAgentEnv would NOT call env.close() on a failed sub-env (#43664), issue 42152 (#43317), issue 42396 (#43316), issue 41518 (#42011), issue 42385 (#43313)

📖 Documentation:

  • New API Stack examples: Self-play and league-based self-play (#43276), MeanStdFilter (for both single-agent and multi-agent) (#43274), Prev-actions/prev-rewards for multi-agent (#43491)
  • Other docs fixes and enhancements: (#43438, #41472, #42117, #43458)

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • Autoscaler v2 is in alpha and can be tried out with KubeRay.
  • Introduced a subreaper to prevent leaks of subprocesses created by user code. (#42992)

💫 Enhancements:

  • Ray state API get_task() now accepts an ObjectRef (#43507); see the sketch after this list
  • Add an option to disable task tracing for tasks and actors (#42431)
  • Improved object transfer throughput. (#43434)
  • Ray Client now compares the Ray and Python versions for compatibility with the remote Ray cluster. (#42760)
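
A minimal sketch of fetching a task's state from the ObjectRef it returned (assumes a running Ray cluster):

```python
import ray
from ray.util.state import get_task

@ray.remote
def f():
    return 1

ref = f.remote()
ray.get(ref)

# get_task() now also accepts the ObjectRef directly, not just a task ID string.
print(get_task(ref))
```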

🔨 Fixes:

  • Fixed several bugs in the streaming generator (#43775, #43772, #43413)
  • Fixed a bug where Ray counter metrics were emitted as gauges (#43795)
  • Fixed a bug where tasks with empty resource requirements did not work with placement groups (#43448)
  • Fixed a bug where CPU resources were not released for a blocked worker inside a placement group (#43270)
  • Fixed GCS crashes when the placement group commit phase failed due to node failure (#43405)
  • Fixed a bug where the Ray memory monitor prematurely killed tasks (#43071)
  • Fixed a placement group resource leak (#42942)
  • Upgraded cloudpickle to 3.0, which fixes an incompatibility with dataclasses (#42730)

📖 Documentation:

  • Updated the docs for Ray accelerator support (#41849)

Ray Clusters

💫 Enhancements:

  • [spark] Add a heap_memory parameter to the setup_ray_cluster API, change the default per-worker-node config values, and change the default head node config values for the global Ray cluster (#42604)
  • [spark] Add a global mode for Ray on Spark clusters (#41153)

🔨 Fixes:

  • [vSphere] Only deploy the OVF to the first host of the cluster (#42258)

Thanks

Many thanks to all those who contributed to this release!

@ronyw7, @xsqian, @justinvyu, @matthewdeng, @sven1977, @thomasdesr, @veryhannibal, @klebster2, @can-anyscale, @simran-2797, @stephanie-wang, @simonsays1980, @kouroshHakha, @Zandew, @akshay-anyscale, @matschaffer-roblox, @WeichenXu123, @matthew29tang, @vitsai, @Hank0626, @anmyachev, @kira-lin, @ericl, @zcin, @sihanwang41, @peytondmurray, @raulchen, @aslonnie, @ruisearch42, @vszal, @pcmoritz, @rickyyx, @chrislevn, @brycehuang30, @alexeykudinkin, @vonsago, @shrekris-anyscale, @andrewsykim, @c21, @mattip, @hongchaodeng, @dabauxi, @fishbone, @scottjlee, @justina777, @surenyufuz, @robertnishihara, @nikitavemuri, @Yard1, @huchen2021, @shomilj, @architkulkarni, @liuxsh9, @Jocn2020, @liuyang-my, @rkooo567, @alanwguo, @KPostOffice, @woshiyyya, @n30111, @edoakes, @y-abe, @martinbomio, @jiwq, @arunppsg, @ArturNiederfahrenhorst, @kevin85421, @khluu, @JingChen23, @masariello, @angelinalg, @jjyao, @omatthew98, @jonathan-anyscale, @sjoshi6, @gaborgsomogyi, @rynewang, @ratnopamc, @chris-ray-zhang, @ijrsvt, @scottsun94, @raychen911, @franklsf95, @GeneDer, @madhuri-rai07, @scv119, @bveeramani, @anyscalesam, @zen-xu, @npuichigo
