Highlights:
- Python 3.10 support is now in alpha.
- Ray usage stats collection is now on by default (guarded by an opt-out prompt).
- Ray Tune can now synchronize Trial data from worker nodes via the object store (without rsync!)
- Ray Workflow comes with a new API and is integrated with Ray DAG.
Ray Autoscaler
💫Enhancements:
- CI tests for KubeRay autoscaler integration (#23365, #23383, #24195)
- Stability enhancements for KubeRay autoscaler integration (#23428)
🔨 Fixes:
- Improved GPU support in KubeRay autoscaler integration (#23383)
- Resources scheduled with the node affinity strategy are not reported to the autoscaler (#24250)
Ray Client
💫Enhancements:
- Add option to configure ray.get with >2 sec timeout (#22165)
- Return None from internal KV for non-existent keys (#24058)
🔨 Fixes:
- Fix deadlock by switching to SimpleQueue on Python 3.7 and newer in async dataclient (#23995)
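The SimpleQueue item above refers to the standard library's queue.SimpleQueue, whose put() is documented as reentrant, while queue.Queue guards its state with a non-reentrant lock. The following is a minimal sketch of a producer/consumer hand-off over a SimpleQueue; it is not Ray Client's code, and the consumer, run_demo, and _SENTINEL names are hypothetical.

```python
import queue
import threading

_SENTINEL = object()  # hypothetical end-of-stream marker


def consumer(q, out):
    """Drain items from q into out until the sentinel arrives."""
    while True:
        item = q.get()  # blocks until an item is available
        if item is _SENTINEL:
            break
        out.append(item)


def run_demo(n=5):
    """Send n integers through a SimpleQueue to a consumer thread."""
    q = queue.SimpleQueue()  # unbounded, so put() never blocks
    received = []
    worker = threading.Thread(target=consumer, args=(q, received))
    worker.start()
    for i in range(n):
        q.put(i)
    q.put(_SENTINEL)
    worker.join(timeout=5)
    return received
```

Because SimpleQueue.put() can safely be re-entered in the same thread (for example from a destructor or weakref callback), it avoids the class of self-deadlock that a lock-based queue can hit in that situation.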
Ray Core
🎉 New Features:
- Ray usage stats collection is now on by default (guarded by an opt-out prompt)
- Alpha support for Python 3.10 (on Linux and Mac)
- Node affinity scheduling strategy (#23381)
- Add metrics for disk and network I/O (#23546)
- Improve exponential backoff when connecting to Redis (#24150)
- Add the ability to inject a setup hook for customization of runtime_env on init (#24036)
- Add a utility to check GCS / Ray cluster health (#23382)
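For readers unfamiliar with the retry pattern named in the Redis backoff item above, here is a generic sketch of capped exponential backoff with jitter. It is an illustration only, not Ray's implementation; the function names, default values, and the injectable sleep parameter are all assumptions.

```python
import random
import time


def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=6):
    """Yield capped exponential delays: base, base*factor, ... up to cap."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor


def connect_with_backoff(connect, attempts=6, base=0.1, sleep=time.sleep):
    """Call connect() until it succeeds or attempts run out."""
    last_error = None
    for delay in backoff_delays(base=base, attempts=attempts):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            # Jitter: sleep a random fraction of the current delay so that
            # many clients retrying at once do not synchronize.
            sleep(random.uniform(0, delay))
    raise last_error
```

Passing sleep as a parameter keeps the helper easy to test without real waiting.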
🔨 Fixes:
- Fixed internal storage S3 bugs (#24167)
- Ensure "get_if_exists" takes effect in the decorator. (#24287)
- Reduce memory usage for Pubsub channels that do not require total memory cap (#23985)
- Add memory buffer limit in publisher for each subscribed entity (#23707)
- Use gRPC instead of socket for GCS client health check (#23939)
- Trim size of Reference struct (#23853)
- Enable debugging into pickle backend (#23854)
🏗 Architecture refactoring:
- GCS storage interfaces unification (#24211)
- Cleanup pickle5 version check (#23885)
- Simplify options handling (#23882)
- Moved function and actor importer away from pubsub (#24132)
- Replace the legacy ResourceSet & SchedulingResources at Raylet (#23173)
- Unification of AddSpilledUrl and UpdateObjectLocationBatch RPCs (#23872)
- Save task spec in separate table (#22650)
Ray Datasets
🎉 New Features:
- Performance improvement: the aggregation computation is vectorized (#23478)
- Performance improvement: bulk parquet file reading is optimized with the fast metadata provider (#23179)
- Performance improvement: more efficient move semantics for Datasets block processing (#24127)
- Supports Datasets lineage serialization (aka out-of-band serialization) (#23821, #23931, #23932)
- Supports native Tensor views in map processing for pure-tensor datasets (#24812)
- Implemented push-based shuffle (#24281)
🔨 Fixes:
- Documentation improvement: Getting Started page (#24860)
- Documentation improvement: FAQ (#24932)
- Documentation improvement: End to end examples (#24874)
- Documentation improvement: Feature guide - Creating Datasets (#24831)
- Documentation improvement: Feature guide - Saving Datasets (#24987)
- Documentation improvement: Feature guide - Transforming Datasets (#25033)
- Documentation improvement: Datasets APIs docstrings (#24949)
- Performance: fixed block prefetching (#23952)
- Fixed zip() for Pandas dataset (#23532)
🏗 Architecture refactoring:
- Refactored LazyBlockList (#23624)
- Added path-partitioning support for all content types (#23624)
- Added fast metadata provider and refactored Parquet datasource (#24094)
RLlib
🎉 New Features:
- Replay buffer API: First algorithms are using the new replay buffer API, allowing users to define and configure their own custom buffers or use RLlib’s built-in ones: SimpleQ, DQN (#24164, #22842, #23523, #23586)
🏗 Architecture refactoring:
- More algorithms moved into the training iteration function API (no longer using execution plans). Users can now more easily read, develop, and debug RLlib’s algorithms: A2C, APEX-DQN, CQL, DD-PPO, DQN, MARWIL + BC, PPO, QMIX, SAC, SimpleQ, SlateQ, Trainers defined in examples folder. (#22937, #23420, #23673, #24164, #24151, #23735, #24157, #23798, #23906, #24118, #22842, #24166, #23712). This will be fully completed and documented with Ray 2.0.
- Make RolloutWorkers (optionally) recoverable after failure via the new recreate_failed_workers=True config flag. (#23739)
- POC for new TrainerConfig objects (instead of Python config dicts): PPOConfig (for PPOTrainer) and PGConfig (for PGTrainer). (#24295, #23491)
- Hard-deprecate build_trainer() (trainer_templates.py): All custom Trainers should now sub-class from any existing Trainer class. (#23488)
💫Enhancements:
- Add support for complex observations in CQL. (#23332)
- Bandit support for tf2. (#22838)
- Make actions sent by RLlib to the env immutable. (#24262)
- Memory leak finding toolset using tracemalloc + CI memory leak tests. (#15412)
- Enable DD-PPO to run on Windows. (#23673)
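The memory leak toolset mentioned above builds on Python's standard tracemalloc module. The sketch below shows the basic snapshot-and-compare technique that module offers; it is not RLlib's toolset, and the top_growth and leaky helpers are hypothetical.

```python
import tracemalloc

_leak = []  # hypothetical leak: a module-level list that only ever grows


def leaky():
    """Append 1000 small buffers to the module-level list."""
    _leak.extend(bytearray(1024) for _ in range(1000))
    return len(_leak)


def top_growth(workload, limit=3):
    """Run workload() and return its result plus the largest allocation diffs."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Group allocations by source line; entries are sorted by the absolute
    # size difference, largest first.
    stats = after.compare_to(before, "lineno")
    return result, stats[:limit]
```

Each returned entry is a StatisticDiff whose size_diff and traceback fields point at the source line responsible for the growth.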
🔨 Fixes:
- APPO eager fix (APPOTFPolicy gets wrapped
as_eager()
twice by mistake). (#24268) - CQL gets stuck when deprecated
timesteps_per_iteration
is used (usemin_train_timesteps_per_reporting
instead). (#24345) - SlateQ runs on GPU (torch). (#23464)
- Other bug fixes: #24016, #22050, #23814, #24025, #23740, #23741, #24006, #24005, #24273, #22010, #24271, #23690, #24343, #23419, #23830, #24335, #24148, #21735, #24214, #23818, #24429
Ray Workflow
🎉 New Features:
🔨 Fixes:
- Fix one bug where max_retries is not aligned with ray core’s max_retries. (#22903)
🏗 Architecture refactoring:
- Integrate ray storage in workflow (#24120)
Tune
🎉 New Features:
- Add RemoteTask based sync client (#23605) (rsync not required anymore!)
- Chunk file transfers in cross-node checkpoint syncing (#23804)
- Also interrupt training when SIGUSR1 received (#24015)
- Enable reuse_actors by default for function trainables (#24040)
- Enable AsyncHyperband to continue training for last trials after max_t (#24222)
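As a rough illustration of the chunked-transfer idea in the checkpoint syncing item above, the sketch below streams a binary source to a destination in fixed-size pieces so that no single message has to hold the whole file. It is not Tune's sync client; the function names and the 1 MiB default chunk size are assumptions.

```python
import io


def iter_chunks(stream, chunk_size=1024 * 1024):
    """Yield successive byte chunks of at most chunk_size from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def transfer(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst chunk by chunk and return the number of chunks written."""
    count = 0
    for chunk in iter_chunks(src, chunk_size):
        dst.write(chunk)
        count += 1
    return count
```

With in-memory streams, transfer(io.BytesIO(data), io.BytesIO(), chunk_size=1000) exercises the same loop that would otherwise run over open files.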
💫Enhancements:
- Improve testing (#23229)
- Improve docstrings (#23375)
- Improve documentation (#23477, #23924)
- Simplify trial executor logic (#23396)
- Make MLflowLoggerUtil copyable (#23333)
- Use new Checkpoint interface internally (#22801)
- Beautify Optional typehints (#23692)
- Improve missing search dependency info (#23691)
- Skip tmp checkpoints in analysis and read iteration from metadata (#23859)
- Treat checkpoints with nan value as worst (#23862)
- Clean up base ProgressReporter API (#24010)
- De-clutter log outputs in trial runner (#24257)
- Hyperopt searcher now supports tune.choice([[1,2],[3,4]]). (#24181)
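The nan-related checkpoint items in this section address a general Python pitfall: every comparison with NaN is False, so sorting scores that contain NaN gives an unreliable order. A minimal sketch of ranking NaN scores last follows; it is not Tune's implementation, and the best_checkpoints helper and its (path, score) input shape are hypothetical.

```python
import math


def best_checkpoints(scored, mode="max", keep=2):
    """Return the `keep` best (path, score) pairs, ranking NaN scores last."""

    def sort_key(item):
        _, score = item
        if isinstance(score, float) and math.isnan(score):
            return (1, 0.0)  # NaN always sorts after every real score
        # For "max" mode, negate so that larger scores sort first.
        return (0, -score if mode == "max" else score)

    return sorted(scored, key=sort_key)[:keep]
```

The two-element key keeps NaN out of any numeric comparison, so the ordering of real scores stays well defined.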
🔨Fixes:
- Optuna should ignore additional results after trial termination (#23495)
- Fix PTL multi GPU link (#23589)
- Improve Tune cloud release tests for durable storage (#23277)
- Fix tensorflow distributed trainable docstring (#23590)
- Simplify experiment tag formatting, clean directory names (#23672)
- Don't include nan metrics for best checkpoint (#23820)
- Fix syncing between nodes in placement groups (#23864)
- Fix memory resources for head bundle (#23861)
- Fix empty CSV headers on trial restart (#23860)
- Fix checkpoint sorting with nan values (#23909)
- Make Timeout stopper work after restoring in the future (#24217)
- Small fixes to tune-distributed for new restore modes (#24220)
Train
Most distributed training enhancements will be captured in the new Ray AIR category!
🔨Fixes:
- Copy resources_per_worker to avoid modifying user input
- Fix train.torch.get_device() for fractional GPU or multiple GPU per worker case (#23763)
- Fix multi node horovod bug (#22564)
- Fully deprecate Ray SGD v1 (#24038)
- Improvements to fault tolerance (#22511)
- MLflow start run under correct experiment (#23662)
- Raise helpful error when required backend isn't installed (#23583)
- Warn pending deprecation for ray.train.Trainer and ray.tune DistributedTrainableCreators (#24056)
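The resources_per_worker fix above is an instance of a common defensive-copy pattern: copy a caller's dict before filling in defaults so the caller's object is never mutated. The sketch below is generic and not Ray Train's code; normalize_resources and its default values are hypothetical.

```python
import copy


def normalize_resources(resources_per_worker=None, use_gpu=False):
    """Return a new resource dict with defaults filled in; input is untouched."""
    # Deep-copy so that later edits never leak back into the caller's dict.
    resources = copy.deepcopy(resources_per_worker) if resources_per_worker else {}
    resources.setdefault("CPU", 1)
    if use_gpu:
        resources.setdefault("GPU", 1)
    return resources
```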
📖Documentation:
- Add FAQ (#22757)
Ray AIR
🎉 New Features:
- HuggingFaceTrainer & HuggingFacePredictor (#23615, #23876)
- SklearnTrainer & SklearnPredictor (#23803, #23850)
- HorovodTrainer (#23437)
- RLTrainer & RLPredictor (#23465, #24172)
- BatchMapper preprocessor (#23700)
- Categorizer preprocessor (#24180)
- BatchPredictor (#23808)
💫Enhancements:
- Add Checkpoint.as_directory() for efficient checkpoint fs processing (#23908)
- Add config to Result, extend ResultGrid.get_best_config (#23698)
- Add Scaling Config validation (#23889)
- Add tuner test. (#23364)
- Move storage handling to pyarrow.fs.FileSystem (#23370)
- Refactor _get_unique_value_indices (#24144)
- Refactor most_frequent SimpleImputer (#23706)
- Set name of Trainable to match with Trainer (#23697)
- Use checkpoint.as_directory() instead of cleaning up manually (#24113)
- Improve file packing/unpacking (#23621)
- Make Dataset ingest configurable (#24066)
- Remove postprocess_checkpoint (#24297)
🔨Fixes:
- Better exception handling (#23695)
- Do not deepcopy RunConfig (#23499)
- Reduce unnecessary stacktrace (#23475)
- Tuner should use run_config from Trainer by default (#24079)
- Use custom fsspec handler for GS (#24008)
📖Documentation:
Serve
🎉 New Features:
- Serve logging system was revamped! Access log is now turned on by default. (#23558)
- New Gradio notebook example for Ray Serve deployments (#23494)
- Serve now includes full traceback in deployment update error message (#23752)
💫Enhancements:
- Serve Deployment Graph was enhanced with performance fixes and structural clean up. (#24199, #24026, #24065, #23984)
- End to end tutorial for deployment graph (#23512, #22771, #23536)
- input_schema is now renamed to http_adapter for usability (#24353, #24191)
- Progress towards a declarative REST API (#23232, #23481)
- Code cleanup and refactoring (#24067, #23578, #23934, #23759)
- Protobuf based controller API for cross language client (#23004)
🔨Fixes:
- Handle None in ReplicaConfig's resource_dict (#23851)
- Set "memory" to None in ray_actor_options by default (#23619)
- Make serve.shutdown() shut down remote Serve applications (#23476)
- Ensure replica reconfigure runs after allocation check (#24052)
- Allow cloudpickle serializable objects as init args/kwargs (#24034)
- Use controller namespace when getting actors (#23896)
Dashboard
🔨Fixes:
- Add toggle to enable showing node disk usage on K8s (#24416, #24440)
- Add job submission id as field to job snapshot (#24303)
Thanks
Many thanks to all those who contributed to this release!
@matthewdeng, @scv119, @xychu, @iycheng, @takeshi-yoshimura, @iasoon, @wumuzi520, @thetwotravelers, @maxpumperla, @krfricke, @jgiannuzzi, @kinalmehta, @avnishn, @dependabot[bot], @sven1977, @raulchen, @acxz, @stephanie-wang, @mgelbart, @xwjiang2010, @jon-chuang, @pdames, @ericl, @edoakes, @gjoseph92, @ddelange, @bkasper, @sriram-anyscale, @Zyiqin-Miranda, @rkooo567, @jbedorf, @architkulkarni, @osanseviero, @simonsays1980, @clarkzinzow, @DmitriGekhtman, @ashione, @smorad, @andenrx, @mattip, @bveeramani, @chaokunyang, @richardliaw, @larrylian, @Chong-Li, @fwitter, @shrekris-anyscale, @gjoliver, @simontindemans, @silky, @grypesc, @ijrsvt, @daikeshi, @kouroshHakha, @mwtian, @mesjou, @sihanwang41, @PavelCz, @czgdp1807, @jianoaix, @GuillaumeDesforges, @pcmoritz, @arsedler9, @n30111, @kira-lin, @ckw017, @max0x7ba, @Yard1, @XuehaiPan, @lchu-ibm, @HJasperson, @SongGuyang, @amogkam, @liuyang-my, @WangTaoTheTonic, @jovany-wang, @simon-mo, @dynamicwebpaige, @suquark, @ArturNiederfahrenhorst, @jjyao, @KepingYan, @jiaodong, @frosk1