ray-project/ray ray-1.13.0 on GitHub

Highlights:

Python 3.10 support is now in alpha.
Ray usage stats collection is now on by default (guarded by an opt-out prompt).
Ray Tune can now synchronize Trial data from worker nodes via the object store (without rsync!)
Ray Workflow comes with a new API and is integrated with Ray DAG.

Ray Autoscaler

💫Enhancements:

CI tests for KubeRay autoscaler integration (#23365, #23383, #24195)
Stability enhancements for KubeRay autoscaler integration (#23428)

🔨 Fixes:

Improved GPU support in KubeRay autoscaler integration (#23383)
Resources scheduled with the node affinity strategy are not reported to the autoscaler (#24250)

Ray Client

💫Enhancements:

Add option to configure ray.get with >2 sec timeout (#22165)
Return None from internal KV for non-existent keys (#24058)

🔨 Fixes:

Fix deadlock by switching to SimpleQueue on Python 3.7 and newer in async dataclient (#23995)
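
As background for the SimpleQueue fix above: Python documents queue.SimpleQueue.put() as reentrant, so it is safe to call from destructors and weakref callbacks, which queue.Queue does not guarantee. The sketch below only shows the basic producer/consumer pattern with SimpleQueue. The names drain and produce_and_consume are hypothetical and are not taken from Ray's data client.

```python
import queue
import threading


def drain(q, sentinel=None):
    """Collect items from the queue until the sentinel is seen."""
    items = []
    while True:
        item = q.get()  # blocks until an item is available
        if item is sentinel:
            return items
        items.append(item)


def produce_and_consume(values):
    """Send values through a SimpleQueue from a background thread."""
    q = queue.SimpleQueue()  # unbounded FIFO queue; put() never blocks

    def producer():
        for value in values:
            q.put(value)
        q.put(None)  # sentinel marks the end of the stream

    worker = threading.Thread(target=producer)
    worker.start()
    result = drain(q)
    worker.join()
    return result
```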

Ray Core

🎉 New Features:

Ray usage stats collection is now on by default (guarded by an opt-out prompt)
Alpha support for Python 3.10 (on Linux and Mac)
Node affinity scheduling strategy (#23381)
Add metrics for disk and network I/O (#23546)
Improve exponential backoff when connecting to Redis (#24150)
Add the ability to inject a setup hook for customization of runtime_env on init (#24036)
Add a utility to check GCS / Ray cluster health (#23382)
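
For readers unfamiliar with the exponential backoff mentioned above, here is a minimal, generic sketch of the technique with a delay cap and random jitter. It is not Ray's implementation. The function names, default values, and the use of ConnectionError are assumptions made for illustration.

```python
import random
import time


def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=6):
    """Yield one capped exponential delay per attempt."""
    for attempt in range(attempts):
        yield min(cap, base * factor ** attempt)


def connect_with_backoff(connect, attempts=6, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts are exhausted."""
    delays = list(backoff_delays(attempts=attempts))
    for index, delay in enumerate(delays):
        try:
            return connect()
        except ConnectionError:
            if index == len(delays) - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random fraction of the current delay.
            sleep(random.uniform(0, delay))
```

Passing sleep as a parameter keeps the helper easy to test without real waiting.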

🔨 Fixes:

Fixed internal storage S3 bugs (#24167)
Ensure "get_if_exists" takes effect in the decorator. (#24287)
Reduce memory usage for Pubsub channels that do not require total memory cap (#23985)
Add memory buffer limit in publisher for each subscribed entity (#23707)
Use gRPC instead of socket for GCS client health check (#23939)
Trim size of Reference struct (#23853)
Enable debugging into pickle backend (#23854)

🏗 Architecture refactoring:

Gcs storage interfaces unification (#24211)
Cleanup pickle5 version check (#23885)
Simplify options handling (#23882)
Moved function and actor importer away from pubsub (#24132)
Replace the legacy ResourceSet & SchedulingResources at Raylet (#23173)
Unification of AddSpilledUrl and UpdateObjectLocationBatch RPCs (#23872)
Save task spec in separate table (#22650)

Ray Datasets

🎉 New Features:

Performance improvement: the aggregation computation is vectorized (#23478)
Performance improvement: bulk parquet file reading is optimized with the fast metadata provider (#23179)
Performance improvement: more efficient move semantics for Datasets block processing (#24127)
Supports Datasets lineage serialization (aka out-of-band serialization) (#23821, #23931, #23932)
Supports native Tensor views in map processing for pure-tensor datasets (#24812)
Implemented push-based shuffle (#24281)

🔨 Fixes:

Documentation improvement: Getting Started page (#24860)
Documentation improvement: FAQ (#24932)
Documentation improvement: End to end examples (#24874)
Documentation improvement: Feature guide - Creating Datasets (#24831)
Documentation improvement: Feature guide - Saving Datasets (#24987)
Documentation improvement: Feature guide - Transforming Datasets (#25033)
Documentation improvement: Datasets APIs docstrings (#24949)
Performance: fixed block prefetching (#23952)
Fixed zip() for Pandas dataset (#23532)

🏗 Architecture refactoring:

Refactored LazyBlockList (#23624)
Added path-partitioning support for all content types (#23624)
Added fast metadata provider and refactored Parquet datasource (#24094)

RLlib

🎉 New Features:

Replay buffer API: First algorithms are using the new replay buffer API, allowing users to define and configure their own custom buffers or use RLlib’s built-in ones: SimpleQ, DQN (#24164, #22842, #23523, #23586)

🏗 Architecture refactoring:

More algorithms moved into the training iteration function API (no longer using execution plans). Users can now more easily read, develop, and debug RLlib’s algorithms: A2C, APEX-DQN, CQL, DD-PPO, DQN, MARWIL + BC, PPO, QMIX, SAC, SimpleQ, SlateQ, Trainers defined in examples folder. (#22937, #23420, #23673, #24164, #24151, #23735, #24157, #23798, #23906, #24118, #22842, #24166, #23712). This will be fully completed and documented with Ray 2.0.
Make RolloutWorkers (optionally) recoverable after failure via the new recreate_failed_workers=True config flag. (#23739)
POC for new TrainerConfig objects (instead of python config dicts): PPOConfig (for PPOTrainer) and PGConfig (for PGTrainer). (#24295, #23491)
Hard-deprecate build_trainer() (trainer_templates.py): All custom Trainers should now sub-class from any existing Trainer class. (#23488)

💫Enhancements:

Add support for complex observations in CQL. (#23332)
Bandit support for tf2. (#22838)
Make actions sent by RLlib to the env immutable. (#24262)
Memory leak finding toolset using tracemalloc + CI memory leak tests. (#15412)
Enable DD-PPO to run on Windows. (#23673)

🔨 Fixes:

APPO eager fix (APPOTFPolicy gets wrapped as_eager() twice by mistake). (#24268)
Fix CQL getting stuck when the deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead). (#24345)
SlateQ runs on GPU (torch). (#23464)
Other bug fixes: #24016, #22050, #23814, #24025, #23740, #23741, #24006, #24005, #24273, #22010, #24271, #23690, #24343, #23419, #23830, #24335, #24148, #21735, #24214, #23818, #24429

Ray Workflow

🎉 New Features:

Workflow step is deprecated (#23796, #23728, #23456, #24210)

🔨 Fixes:

Fix a bug where max_retries was not aligned with Ray Core’s max_retries. (#22903)

🏗 Architecture refactoring:

Integrate ray storage in workflow (#24120)

Tune

🎉 New Features:

Add RemoteTask based sync client (#23605) (rsync not required anymore!)
Chunk file transfers in cross-node checkpoint syncing (#23804)
Also interrupt training when SIGUSR1 received (#24015)
Enable reuse_actors by default for function trainables (#24040)
Enable AsyncHyperband to continue training for last trials after max_t (#24222)
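
The idea behind chunked transfers listed above can be sketched in plain Python: read the source in fixed-size pieces so that no single message has to hold a whole checkpoint file. This is an illustrative sketch only, not Tune's sync client. The function names and the 1 MiB default chunk size are assumptions.

```python
import io


def iter_chunks(stream, chunk_size=1024 * 1024):
    """Yield successive byte chunks of at most chunk_size from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            return
        yield chunk


def copy_in_chunks(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst one chunk at a time and return the bytes written."""
    total = 0
    for chunk in iter_chunks(src, chunk_size):
        dst.write(chunk)
        total += len(chunk)
    return total
```

For example, copy_in_chunks(io.BytesIO(data), io.BytesIO(), chunk_size=4) moves the data four bytes at a time.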

💫Enhancements:

Improve testing (#23229)
Improve docstrings (#23375)
Improve documentation (#23477, #23924)
Simplify trial executor logic (#23396)
Make MLflowLoggerUtil copyable (#23333)
Use new Checkpoint interface internally (#22801)
Beautify Optional typehints (#23692)
Improve missing search dependency info (#23691)
Skip tmp checkpoints in analysis and read iteration from metadata (#23859)
Treat checkpoints with nan value as worst (#23862)
Clean up base ProgressReporter API (#24010)
De-clutter log outputs in trial runner (#24257)
Support tune.choice([[1,2],[3,4]]) in the HyperOpt searcher (#24181)

🔨Fixes:

Optuna should ignore additional results after trial termination (#23495)
Fix PTL multi GPU link (#23589)
Improve Tune cloud release tests for durable storage (#23277)
Fix tensorflow distributed trainable docstring (#23590)
Simplify experiment tag formatting, clean directory names (#23672)
Don't include nan metrics for best checkpoint (#23820)
Fix syncing between nodes in placement groups (#23864)
Fix memory resources for head bundle (#23861)
Fix empty CSV headers on trial restart (#23860)
Fix checkpoint sorting with nan values (#23909)
Make Timeout stopper work after restoring in the future (#24217)
Small fixes to tune-distributed for new restore modes (#24220)
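
Several Tune items above deal with NaN metrics when ranking checkpoints. A plain sort misbehaves with NaN because every comparison against NaN is false. One common way to treat NaN as the worst value is to sort on a tuple key, as in this minimal sketch. The function rank_checkpoints and its (name, score) input format are hypothetical and do not reflect Tune's internal checkpoint manager.

```python
import math


def rank_checkpoints(checkpoints, mode="max"):
    """Return (name, score) pairs ordered best first, with NaN scores last."""

    def key(item):
        _, score = item
        is_nan = math.isnan(score)
        # False sorts before True, so real scores always precede NaN scores.
        # The 0.0 placeholder keeps NaN itself out of the comparison.
        signed = 0.0 if is_nan else (-score if mode == "max" else score)
        return (is_nan, signed)

    return sorted(checkpoints, key=key)
```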

Train

Most distributed training enhancements will be captured in the new Ray AIR category!

🔨Fixes:

Copy resources_per_worker to avoid modifying user input
Fix train.torch.get_device() for fractional GPU or multiple GPU per worker case (#23763)
Fix multi node horovod bug (#22564)
Fully deprecate Ray SGD v1 (#24038)
Improvements to fault tolerance (#22511)
MLflow start run under correct experiment (#23662)
Raise helpful error when required backend isn't installed (#23583)
Warn pending deprecation for ray.train.Trainer and ray.tune DistributedTrainableCreators (#24056)

📖Documentation:

Add FAQ (#22757)

Ray AIR

🎉 New Features:

HuggingFaceTrainer & HuggingFacePredictor (#23615, #23876)
SklearnTrainer & SklearnPredictor (#23803, #23850)
HorovodTrainer (#23437)
RLTrainer & RLPredictor (#23465, #24172)
BatchMapper preprocessor (#23700)
Categorizer preprocessor (#24180)
BatchPredictor (#23808)

💫Enhancements:

Add Checkpoint.as_directory() for efficient checkpoint fs processing (#23908)
Add config to Result, extend ResultGrid.get_best_config (#23698)
Add Scaling Config validation (#23889)
Add tuner test. (#23364)
Move storage handling to pyarrow.fs.FileSystem (#23370)
Refactor _get_unique_value_indices (#24144)
Refactor most_frequent SimpleImputer (#23706)
Set name of Trainable to match with Trainer (#23697)
Use checkpoint.as_directory() instead of cleaning up manually (#24113)
Improve file packing/unpacking (#23621)
Make Dataset ingest configurable (#24066)
Remove postprocess_checkpoint (#24297)

🔨Fixes:

Better exception handling (#23695)
Do not deepcopy RunConfig (#23499)
Reduce unnecessary stack traces (#23475)
Tuner should use run_config from Trainer by default (#24079)
Use custom fsspec handler for GS (#24008)

📖Documentation:

Add distributed torch_geometric example (#23580)
GNN example cleanup (#24080)

Serve

🎉 New Features:

Serve logging system was revamped! Access log is now turned on by default. (#23558)
New Gradio notebook example for Ray Serve deployments (#23494)
Serve now includes full traceback in deployment update error message (#23752)

💫Enhancements:

Serve Deployment Graph was enhanced with performance fixes and structural clean up. (#24199, #24026, #24065, #23984)
End to end tutorial for deployment graph (#23512, #22771, #23536)
input_schema is now renamed as http_adapter for usability (#24353, #24191)
Progress towards a declarative REST API (#23232, #23481)
Code cleanup and refactoring (#24067, #23578, #23934, #23759)
Protobuf based controller API for cross language client (#23004)

🔨Fixes:

Handle None in ReplicaConfig's resource_dict (#23851)
Set "memory" to None in ray_actor_options by default (#23619)
Make serve.shutdown() shutdown remote Serve applications (#23476)
Ensure replica reconfigure runs after allocation check (#24052)
Allow cloudpickle serializable objects as init args/kwargs (#24034)
Use controller namespace when getting actors (#23896)

Dashboard

🔨Fixes:

Add toggle to enable showing node disk usage on K8s (#24416, #24440)
Add job submission id as field to job snapshot (#24303)

Thanks
Many thanks to all those who contributed to this release!
@matthewdeng, @scv119, @xychu, @iycheng, @takeshi-yoshimura, @iasoon, @wumuzi520, @thetwotravelers, @maxpumperla, @krfricke, @jgiannuzzi, @kinalmehta, @avnishn, @dependabot[bot], @sven1977, @raulchen, @acxz, @stephanie-wang, @mgelbart, @xwjiang2010, @jon-chuang, @pdames, @ericl, @edoakes, @gjoseph92, @ddelange, @bkasper, @sriram-anyscale, @Zyiqin-Miranda, @rkooo567, @jbedorf, @architkulkarni, @osanseviero, @simonsays1980, @clarkzinzow, @DmitriGekhtman, @ashione, @smorad, @andenrx, @mattip, @bveeramani, @chaokunyang, @richardliaw, @larrylian, @Chong-Li, @fwitter, @shrekris-anyscale, @gjoliver, @simontindemans, @silky, @grypesc, @ijrsvt, @daikeshi, @kouroshHakha, @mwtian, @mesjou, @sihanwang41, @PavelCz, @czgdp1807, @jianoaix, @GuillaumeDesforges, @pcmoritz, @arsedler9, @n30111, @kira-lin, @ckw017, @max0x7ba, @Yard1, @XuehaiPan, @lchu-ibm, @HJasperson, @SongGuyang, @amogkam, @liuyang-my, @WangTaoTheTonic, @jovany-wang, @simon-mo, @dynamicwebpaige, @suquark, @ArturNiederfahrenhorst, @jjyao, @KepingYan, @jiaodong, @frosk1
