Ray-1.13.0

23 months ago

Highlights:

  • Python 3.10 support is now in alpha.
  • Ray usage stats collection is now on by default (guarded by an opt-out prompt).
  • Ray Tune can now synchronize Trial data from worker nodes via the object store (without rsync!)
  • Ray Workflow comes with a new API and is integrated with Ray DAG.

Ray Autoscaler

💫 Enhancements:

  • CI tests for KubeRay autoscaler integration (#23365, #23383, #24195)
  • Stability enhancements for KubeRay autoscaler integration (#23428)

🔨 Fixes:

  • Improved GPU support in KubeRay autoscaler integration (#23383)
  • Resources scheduled with the node affinity strategy are not reported to the autoscaler (#24250)

Ray Client

💫 Enhancements:

  • Add option to configure ray.get with >2 sec timeout (#22165)
  • Return None from internal KV for non-existent keys (#24058)

🔨 Fixes:

  • Fix deadlock by switching to SimpleQueue on Python 3.7 and newer in async dataclient (#23995)
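
The deadlock fix above relies on a documented property of the standard library's queue.SimpleQueue (available since Python 3.7): its put() is reentrant, so it can be called from destructors or weakref callbacks, where a lock-based queue.Queue may deadlock. The following is a minimal producer/consumer sketch using SimpleQueue. It is not Ray's data client code, and the names drain and SENTINEL are hypothetical.

```python
import queue
import threading

SENTINEL = None  # hypothetical marker telling the consumer to stop


def drain(q, out):
    """Consume items from q until the sentinel arrives."""
    while True:
        item = q.get()  # blocks until an item is available
        if item is SENTINEL:
            break
        out.append(item)


q = queue.SimpleQueue()  # unbounded FIFO; put() never blocks
received = []
consumer = threading.Thread(target=drain, args=(q, received))
consumer.start()

for i in range(5):
    q.put(i)
q.put(SENTINEL)
consumer.join()
```

SimpleQueue trades features for safety: it has no maxsize, task_done() or join(), so it fits only where an unbounded hand-off queue is acceptable.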

Ray Core

🎉 New Features:

  • Ray usage stats collection is now on by default (guarded by an opt-out prompt)
  • Alpha support for Python 3.10 (on Linux and Mac)
  • Node affinity scheduling strategy (#23381)
  • Add metrics for disk and network I/O (#23546)
  • Improve exponential backoff when connecting to Redis (#24150)
  • Add the ability to inject a setup hook for customization of runtime_env on init (#24036)
  • Add a utility to check GCS / Ray cluster health (#23382)
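
For readers unfamiliar with the exponential backoff mentioned in the Redis connection item above, here is a generic sketch of the technique: each failed attempt waits longer than the last, up to a cap. This is not Ray's implementation. The function names and all parameters (base, factor, max_delay, attempts, jitter) are assumptions chosen for illustration.

```python
import random


def backoff_delays(base=0.1, factor=2.0, max_delay=5.0, attempts=6,
                   jitter=0.0, rng=None):
    """Yield the wait (in seconds) to apply after each failed attempt."""
    rng = rng or random.Random()
    delay = base
    for _ in range(attempts):
        # Optional jitter spreads out clients that retry at the same moment.
        yield min(delay, max_delay) + rng.uniform(0, jitter)
        delay *= factor


def connect_with_retry(connect, sleep, **kwargs):
    """Call connect() until it succeeds, sleeping between failed attempts."""
    last_error = None
    for delay in backoff_delays(**kwargs):
        try:
            return connect()
        except ConnectionError as err:
            last_error = err
            sleep(delay)
    raise ConnectionError("all connection attempts failed") from last_error
```

In real use, sleep would be time.sleep and connect would open the actual connection. Passing them in as arguments keeps the sketch testable without waiting or needing a server.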

🔨 Fixes:

  • Fixed internal storage S3 bugs (#24167)
  • Ensure "get_if_exists" takes effect in the decorator. (#24287)
  • Reduce memory usage for Pubsub channels that do not require total memory cap (#23985)
  • Add memory buffer limit in publisher for each subscribed entity (#23707)
  • Use gRPC instead of socket for GCS client health check (#23939)
  • Trim size of Reference struct (#23853)
  • Enable debugging into pickle backend (#23854)

🏗 Architecture refactoring:

  • GCS storage interfaces unification (#24211)
  • Cleanup pickle5 version check (#23885)
  • Simplify options handling (#23882)
  • Moved function and actor importer away from pubsub (#24132)
  • Replace the legacy ResourceSet & SchedulingResources at Raylet (#23173)
  • Unification of AddSpilledUrl and UpdateObjectLocationBatch RPCs (#23872)
  • Save task spec in separate table (#22650)

Ray Datasets

🎉 New Features:

  • Performance improvement: the aggregation computation is vectorized (#23478)
  • Performance improvement: bulk parquet file reading is optimized with the fast metadata provider (#23179)
  • Performance improvement: more efficient move semantics for Datasets block processing (#24127)
  • Supports Datasets lineage serialization (aka out-of-band serialization) (#23821, #23931, #23932)
  • Supports native Tensor views in map processing for pure-tensor datasets (#24812)
  • Implemented push-based shuffle (#24281)

🔨 Fixes:

  • Documentation improvement: Getting Started page (#24860)
  • Documentation improvement: FAQ (#24932)
  • Documentation improvement: End to end examples (#24874)
  • Documentation improvement: Feature guide - Creating Datasets (#24831)
  • Documentation improvement: Feature guide - Saving Datasets (#24987)
  • Documentation improvement: Feature guide - Transforming Datasets (#25033)
  • Documentation improvement: Datasets APIs docstrings (#24949)
  • Performance: fixed block prefetching (#23952)
  • Fixed zip() for Pandas dataset (#23532)

🏗 Architecture refactoring:

  • Refactored LazyBlockList (#23624)
  • Added path-partitioning support for all content types (#23624)
  • Added fast metadata provider and refactored Parquet datasource (#24094)

RLlib

🎉 New Features:

  • Replay buffer API: First algorithms are using the new replay buffer API, allowing users to define and configure their own custom buffers or use RLlib’s built-in ones: SimpleQ, DQN (#24164, #22842, #23523, #23586)

🏗 Architecture refactoring:

  • More algorithms moved into the training iteration function API (no longer using execution plans). Users can now more easily read, develop, and debug RLlib’s algorithms: A2C, APEX-DQN, CQL, DD-PPO, DQN, MARWIL + BC, PPO, QMIX, SAC, SimpleQ, SlateQ, and the Trainers defined in the examples folder. (#22937, #23420, #23673, #24164, #24151, #23735, #24157, #23798, #23906, #24118, #22842, #24166, #23712). This will be fully completed and documented with Ray 2.0.
  • Make RolloutWorkers (optionally) recoverable after failure via the new recreate_failed_workers=True config flag. (#23739)
  • POC for new TrainerConfig objects (instead of python config dicts): PPOConfig (for PPOTrainer) and PGConfig (for PGTrainer). (#24295, #23491)
  • Hard-deprecate build_trainer() (trainer_templates.py): All custom Trainers should now sub-class from any existing Trainer class. (#23488)

💫 Enhancements:

  • Add support for complex observations in CQL. (#23332)
  • Bandit support for tf2. (#22838)
  • Make actions sent by RLlib to the env immutable. (#24262)
  • Memory leak finding toolset using tracemalloc + CI memory leak tests. (#15412)
  • Enable DD-PPO to run on Windows. (#23673)

🔨 Fixes:

Ray Workflow

🎉 New Features:

🔨 Fixes:

  • Fix a bug where max_retries was not aligned with Ray Core’s max_retries. (#22903)

🏗 Architecture refactoring:

  • Integrate ray storage in workflow (#24120)

Tune

🎉 New Features:

  • Add RemoteTask based sync client (#23605) (rsync not required anymore!)
  • Chunk file transfers in cross-node checkpoint syncing (#23804)
  • Also interrupt training when SIGUSR1 received (#24015)
  • Enable reuse_actors by default for function trainables (#24040)
  • Enable AsyncHyperband to continue training for last trials after max_t (#24222)
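
The chunked file transfer item above refers to a general pattern: instead of loading a whole checkpoint into memory and sending it at once, the sender reads and forwards fixed-size pieces. The sketch below shows that pattern with in-memory streams. It is not Tune's sync client, and iter_chunks, copy_in_chunks and the chunk sizes are hypothetical.

```python
import io


def iter_chunks(stream, chunk_size=1024 * 1024):
    """Yield successive chunks of at most chunk_size bytes from a binary stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def copy_in_chunks(src, dst, chunk_size=1024 * 1024):
    """Copy src to dst one chunk at a time; return the number of chunks sent."""
    count = 0
    for chunk in iter_chunks(src, chunk_size):
        dst.write(chunk)  # in a real transfer, the chunk would cross the network here
        count += 1
    return count


payload = b"x" * 2500
source, target = io.BytesIO(payload), io.BytesIO()
num_chunks = copy_in_chunks(source, target, chunk_size=1000)
```

Peak memory use is bounded by chunk_size rather than by the size of the file, which is the main reason to chunk large checkpoint transfers.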

💫 Enhancements:

  • Improve testing (#23229)
  • Improve docstrings (#23375)
  • Improve documentation (#23477, #23924)
  • Simplify trial executor logic (#23396)
  • Make MLflowLoggerUtil copyable (#23333)
  • Use new Checkpoint interface internally (#22801)
  • Beautify Optional typehints (#23692)
  • Improve missing search dependency info (#23691)
  • Skip tmp checkpoints in analysis and read iteration from metadata (#23859)
  • Treat checkpoints with nan value as worst (#23862)
  • Clean up base ProgressReporter API (#24010)
  • De-clutter log outputs in trial runner (#24257)
  • Support tune.choice([[1,2],[3,4]]) in the hyperopt searcher. (#24181)

🔨 Fixes:

  • Optuna should ignore additional results after trial termination (#23495)
  • Fix PTL multi GPU link (#23589)
  • Improve Tune cloud release tests for durable storage (#23277)
  • Fix tensorflow distributed trainable docstring (#23590)
  • Simplify experiment tag formatting, clean directory names (#23672)
  • Don't include nan metrics for best checkpoint (#23820)
  • Fix syncing between nodes in placement groups (#23864)
  • Fix memory resources for head bundle (#23861)
  • Fix empty CSV headers on trial restart (#23860)
  • Fix checkpoint sorting with nan values (#23909)
  • Make Timeout stopper work after restoring in the future (#24217)
  • Small fixes to tune-distributed for new restore modes (#24220)
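
Several Tune items in this release concern checkpoints whose metric is nan (#23862, #23820, #23909). The underlying difficulty is that every ordering comparison with nan is False, so a naive sort can place such checkpoints anywhere. The sketch below shows one way to rank nan explicitly as worst. It is not Tune's code, and sort_checkpoints and the sample data are hypothetical.

```python
import math


def sort_checkpoints(checkpoints, mode="max"):
    """Return (path, metric) pairs ordered best-first, with NaN metrics last."""
    def key(item):
        _, metric = item
        if math.isnan(metric):
            return (1, 0.0)  # NaN always sorts after every real value
        # Negate for "max" so that an ascending sort puts the best first.
        return (0, -metric if mode == "max" else metric)
    return sorted(checkpoints, key=key)


checkpoints = [("ckpt_1", 0.7), ("ckpt_2", float("nan")), ("ckpt_3", 0.9)]
ordered = sort_checkpoints(checkpoints)
best_path = ordered[0][0]
```

Because the nan check is the first element of the sort key, a nan checkpoint is never chosen as best, whether the metric is maximized or minimized.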

Train

Most distributed training enhancements will be captured in the new Ray AIR category!

🔨 Fixes:

  • Copy resources_per_worker to avoid modifying user input
  • Fix train.torch.get_device() for fractional GPU or multiple GPU per worker case (#23763)
  • Fix multi node horovod bug (#22564)
  • Fully deprecate Ray SGD v1 (#24038)
  • Improvements to fault tolerance (#22511)
  • MLflow start run under correct experiment (#23662)
  • Raise helpful error when required backend isn't installed (#23583)
  • Warn pending deprecation for ray.train.Trainer and ray.tune DistributedTrainableCreators (#24056)

📖 Documentation:

Ray AIR

🎉 New Features:

💫 Enhancements:

  • Add Checkpoint.as_directory() for efficient checkpoint fs processing (#23908)
  • Add config to Result, extend ResultGrid.get_best_config (#23698)
  • Add Scaling Config validation (#23889)
  • Add tuner test. (#23364)
  • Move storage handling to pyarrow.fs.FileSystem (#23370)
  • Refactor _get_unique_value_indices (#24144)
  • Refactor most_frequent SimpleImputer (#23706)
  • Set name of Trainable to match with Trainer (#23697)
  • Use checkpoint.as_directory() instead of cleaning up manually (#24113)
  • Improve file packing/unpacking (#23621)
  • Make Dataset ingest configurable (#24066)
  • Remove postprocess_checkpoint (#24297)

🔨 Fixes:

  • Better exception handling (#23695)
  • Do not deepcopy RunConfig (#23499)
  • Reduce unnecessary stack traces (#23475)
  • Tuner should use run_config from Trainer by default (#24079)
  • Use custom fsspec handler for GS (#24008)

📖 Documentation:

  • Add distributed torch_geometric example (#23580)
  • GNN example cleanup (#24080)

Serve

🎉 New Features:

  • Serve logging system was revamped! Access log is now turned on by default. (#23558)
  • New Gradio notebook example for Ray Serve deployments (#23494)
  • Serve now includes full traceback in deployment update error message (#23752)

💫 Enhancements:

🔨 Fixes:

  • Handle None in ReplicaConfig's resource_dict (#23851)
  • Set "memory" to None in ray_actor_options by default (#23619)
  • Make serve.shutdown() shutdown remote Serve applications (#23476)
  • Ensure replica reconfigure runs after allocation check (#24052)
  • Allow cloudpickle serializable objects as init args/kwargs (#24034)
  • Use controller namespace when getting actors (#23896)

Dashboard

🔨Fixes:

  • Add toggle to enable showing node disk usage on K8s (#24416, #24440)
  • Add job submission id as field to job snapshot (#24303)

Thanks
Many thanks to all those who contributed to this release!
@matthewdeng, @scv119, @xychu, @iycheng, @takeshi-yoshimura, @iasoon, @wumuzi520, @thetwotravelers, @maxpumperla, @krfricke, @jgiannuzzi, @kinalmehta, @avnishn, @dependabot[bot], @sven1977, @raulchen, @acxz, @stephanie-wang, @mgelbart, @xwjiang2010, @jon-chuang, @pdames, @ericl, @edoakes, @gjoseph92, @ddelange, @bkasper, @sriram-anyscale, @Zyiqin-Miranda, @rkooo567, @jbedorf, @architkulkarni, @osanseviero, @simonsays1980, @clarkzinzow, @DmitriGekhtman, @ashione, @smorad, @andenrx, @mattip, @bveeramani, @chaokunyang, @richardliaw, @larrylian, @Chong-Li, @fwitter, @shrekris-anyscale, @gjoliver, @simontindemans, @silky, @grypesc, @ijrsvt, @daikeshi, @kouroshHakha, @mwtian, @mesjou, @sihanwang41, @PavelCz, @czgdp1807, @jianoaix, @GuillaumeDesforges, @pcmoritz, @arsedler9, @n30111, @kira-lin, @ckw017, @max0x7ba, @Yard1, @XuehaiPan, @lchu-ibm, @HJasperson, @SongGuyang, @amogkam, @liuyang-my, @WangTaoTheTonic, @jovany-wang, @simon-mo, @dynamicwebpaige, @suquark, @ArturNiederfahrenhorst, @jjyao, @KepingYan, @jiaodong, @frosk1
