Highlights
- Ray SGD v2 is now in alpha! It introduces APIs that focus on ease of use and composability. Check out the docs here, and the migration guide from v1 to v2 here.
- If you are using Ray SGD v2, we’d love to hear your feedback here!
- Ray Workflows is now in alpha! Check out the docs here and try it out for your large-scale data science, ML, and long-running business workflows. Thanks to our early adopters for the feedback so far and the ongoing contributions from IBM Research.
- We have made major enhancements to the C++ API! While we are still busy hardening the feature for production usage, please check out the docs here, try it out, and help provide feedback!
Ray Autoscaler
💫Enhancements:
- Improvement to logging and code structure #18180
- Default head node type to 0 max_workers #17757
- Modifications to accommodate custom node providers #17312
🔨 Fixes:
- Helm chart configuration fixes #17678 #18123
- GCP autoscaler config fix #18653
- Allow attaching to uninitialized head node for debugging #17688
- Syncing files with Docker head node fixed #16515
Ray Client
🎉 New Features:
- `ray.init()` args can be forwarded to the remote server (#17776) (see the sketch after this list)
- Allow multiple client connections from one driver (#17942)
- gRPC channel credentials can now be configured from ray.init (#18425, #18365)
- Ray Client will attempt to recover connections on certain gRPC failures (#18329)
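
A minimal sketch of argument forwarding through Ray Client, assuming a running cluster with the Ray Client server on its default port; the address placeholder and the `namespace` value are illustrative:

```python
import ray

# Connect through Ray Client. Keyword arguments such as `namespace`
# are forwarded to ray.init() on the remote cluster (new in this release).
# Replace the placeholder below with your own head node address.
ray.init("ray://<head-node-ip>:10001", namespace="release_demo")
```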
💫Enhancements:
- Less confusing client RPC errors (#18278)
- Use a single RPC to fetch ClientObjectRefs passed in a list (#16944)
- Increase timeout for ProxyManager.get_channel (#18350)
🔨 Fixes:
- Fix mismatched debug log ID formats (#17597)
- Fix confusing error messages when client scripts exit (#17969)
Ray Core
🎉 New Features:
- Major enhancements in the C++ API!
  - This API enables you to build a distributed system in C++ easily, just like the Python and Java APIs.
  - Run `pip install -U ray[cpp]` to install Ray with C++ API support.
  - Run `ray cpp --help` to learn how to use it.
  - For more details, check out the docs here and see the tab “C++”.
🔨 Fixes:
- Bug fixes for thread safety, reference counting, and placement groups (#18401, #18746, #18312, #17802, #18526, #17863, #18419, #18463, #18193, #17774, #17772, #17670, #17620, #18584, #18646, #17634, #17732)
- Better format for object loss errors / task & actor logs (#18742, #18577, #18105, #18292, #17971, #18166)
- Improved the ray status output for placement groups (#18289, #17892)
- Improved function export performance (#18284)
- Support more Ray core metrics such as RPC call latencies (#17578)
- Improved error messages and logging for runtime environments (#18451, #18092, #18088, #18084, #18496, #18083)
Ray Data Processing
🎉 New Features:
- Add support for reading partitioned Parquet datasets (#17716)
- Add dataset unioning (#17793)
- Add support for splitting a dataset at row indices (#17990; see the sketch after this list)
- Add from_numpy() and to_numpy() APIs (#18146)
- Add support for splitting a dataset pipeline at row indices (#18243)
- Add Modin integration (from_modin() and to_modin()) (#18122)
- Add support for datasets with tensor columns (#18301)
- Add RayDP (Spark-on-Ray) integration (from_spark() and to_spark()) (#17340)
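
A minimal sketch exercising two of the new Dataset APIs, splitting at row indices and unioning; the toy range dataset is illustrative:

```python
import ray

ray.init()

# A toy 100-row dataset.
ds = ray.data.range(100)

# New in this release: split a dataset at explicit row indices.
left, right = ds.split_at_indices([60])  # rows [0, 60) and [60, 100)

# Also new: union datasets back together.
combined = left.union(right)
assert combined.count() == 100
```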
💫Enhancements:
- Drop empty tables when reading Parquet fragments, in order to properly support filter expressions when reading partitioned Parquet datasets (#18098)
- Retry application-level errors in Datasets (#18296)
- Create a directory on write if it doesn’t exist (#18435)
- URL encode paths if they are URLs (#18440)
- Guard against a dataset pipeline being read multiple times by accident (#18682)
- Reduce working set size during random shuffles by eagerly destroying intermediate datasets (#18678)
- Add manual round-robin resource-based load balancing option to read and shuffle stages (#18678)
🔨 Fixes:
- Fix JSON writing so IO roundtrip works (#17691)
- Fix schema subsetting on column selection during Parquet reads (#18361)
- Fix Dataset.iter_batches() dropping batches when prefetching (#18441)
- Fix filesystem inference on path containing space (#18644)
🏗 Architecture refactoring:
- Port write side of IO layer to use file-based datasources (#18135)
RLlib
🎉 New Features:
- Replay buffers: Add config option to store contents in checkpoints (store_buffer_in_checkpoints=True). (#17999)
- Add support for multi-GPU to DDPG. (#17789)
💫Enhancements:
- Support for running evaluation and training in parallel, thereby only evaluating as many episodes as the training loop takes (`evaluation_num_episodes="auto"`). (#18380) See the sketch after this list.
- Enhanced stability: Started nightly multi-GPU (2) learning tests for most algos (tf + torch), including LSTM and attention net setups.
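
A minimal sketch of the parallel-evaluation setting, assuming the `evaluation_parallel_to_training` flag introduced alongside this feature; the algorithm, environment, and worker counts are illustrative:

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

trainer = PPOTrainer(
    env="CartPole-v0",
    config={
        # Run evaluation in parallel with training; "auto" evaluates as
        # many episodes as fit into one training iteration.
        "evaluation_interval": 1,
        "evaluation_num_episodes": "auto",
        "evaluation_parallel_to_training": True,
        "evaluation_num_workers": 1,
    },
)
result = trainer.train()
```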
🏗 Architecture refactoring:
- Make MultiAgentEnv inherit gym.Env to avoid direct class type manipulation (#18156)
- SampleBatch: Add support for nested data (plus docstring and API cleanups). (#17485)
- Add `policies` arg to callback `on_episode_step` (already exists in all other episode-related callbacks). (#18119)
- Add `worker` arg (optional) to `policy_mapping_fn`. (#18184)
🔨 Fixes:
- Fix Atari learning test regressions (2 bugs) and 1 minor attention net bug. (#18306)
- Fix n-step > 1 postprocessing bug (issues 17844, 18034). (#18358)
- Fix crash when using StochasticSampling exploration (most PG-style algos) w/ tf and numpy version > 1.19.5 (#18366)
- Strictly run `evaluation_num_episodes` episodes each evaluation run (no matter the other eval config settings). (#18335)
- Issue 17706: "AttributeError: 'numpy.ndarray' object has no attribute 'items'" on certain turn-based MultiAgentEnvs with Dict obs space. (#17735)
- Issue 17900: Set `seed` in single vectorized sub-envs properly, if `num_envs_per_worker > 1`. (#18110)
- Fix R2D2 (torch) multi-GPU issue. (#18550)
- Fix `final_scale`'s default value to 0.02 (see OrnsteinUhlenbeck exploration). (#18070)
- Fix Ape-X not taking the value of `prioritized_replay` into account. (#17541)
- Issue 17653: Torch multi-GPU (>1) broken for LSTMs. (#17657)
- Issue 17667: CQL-torch + GPU not working (due to `simple_optimizer=False`; must use simple optimizer!). (#17742)
- Add locking to PolicyMap in case it is accessed by a RolloutWorker and the same worker's AsyncSampler or the main LearnerThread. (#18444)
- Other fixes and enhancements: #18591, #18381, #18670, #18705, #18274, #18073, #18017, #18389, #17896, #17410, #17891, #18368, #17778, #18494, #18466, #17705, #17690, #18254, #17701, #18544, #17889, #18390, #18428, #17821, #17955, #17666, #18423, #18040, #17867, #17583, #17822, #18249, #18155, #18065, #18540, #18367, #17960, #17895, #18467, #17928, #17485, #18307, #18043, #17640, #17702, #15849, #18340
Tune
💫Enhancements:
- Usability improvements for trials that appear to be stuck in PENDING state forever because the cluster has insufficient resources. (#18611, #17957, #17533)
- Searchers and Tune Callbacks now have access to some experiment settings information. (#17724, #17794)
- Improve HyperOpt KeyError message when metric was not found. (#18549)
- Allow users to configure bootstrap for docker syncer. (#17786)
- Allow users to update trial resources on resume. (#17975)
- Add `max_concurrent_trials` argument to `tune.run` (see the sketch after this list). (#17905)
- Type hint TrialExecutor. Use Abstract Base Class. (#17584)
- Add developer/stability annotations. (#17442)
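
A minimal sketch of the new `max_concurrent_trials` argument; the toy trainable and the sampled search space are illustrative:

```python
from ray import tune

def trainable(config):
    # Toy objective: just report the sampled learning rate as the score.
    tune.report(score=config["lr"])

tune.run(
    trainable,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=20,
    # New in this release: cap how many trials run at once.
    max_concurrent_trials=4,
)
```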
🔨Fixes:
- Placement group stability issues. (#18706, #18391, #18338)
- Fix a DurableTrainable checkpointing bug. (#18318)
- Fix a trial reset bug if a RLlib algorithm with default resources is used. (#18209)
- Fix hyperopt points to evaluate for nested lists. (#18113)
- Correctly validate initial points for random search. (#17282)
- Fix local mode. Add explicit concurrency limiter for local mode. (#18023)
- Sanitize trial checkpoint filename. (#17985)
- Explicitly instantiate skopt categorical spaces. (#18005)
SGD (v2)
Ray SGD v2 is now in alpha! It introduces APIs that focus on ease of use and composability. Check out the docs here, and the migration guide from v1 to v2 here. If you are using Ray SGD v2, we’d love to hear your feedback here!
🎉 New Features:
- Ray SGD v2 (see the sketch below)
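
A minimal sketch of the v2 Trainer API; the `ray.util.sgd.v2` import path and the torch backend follow the alpha docs and are assumptions here (alpha APIs may change):

```python
from ray.util.sgd.v2 import Trainer

def train_func():
    # User-defined training function; runs on every distributed worker.
    # Real code would build a model here and run its training loop.
    return "done"

trainer = Trainer(backend="torch", num_workers=2)
trainer.start()                    # Launch the worker group.
results = trainer.run(train_func)  # Execute train_func on all workers.
trainer.shutdown()
```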
Serve
↗️Deprecation and API changes:
- `serve.start(http_host=..., http_port=..., http_middlewares=...)` has been deprecated since Ray 1.2.0. These arguments are now removed in favor of `serve.start(http_options={"host": ..., "port": ..., "middlewares": ...})` (see the sketch after this list). (#17762)
- Remove deprecated ServeRequest API (#18120)
- Remove deprecated endpoints API (#17989)
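
A minimal sketch of the replacement call form; the host and port values are illustrative:

```python
import ray
from ray import serve

ray.init()

# Old form (removed): serve.start(http_host=..., http_port=..., http_middlewares=...)
# New form: pass HTTP settings as a single http_options dict.
serve.start(http_options={"host": "127.0.0.1", "port": 8000})
```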
🔨Fixes:
- Better serve constructor failure handling (#16922, #18402)
- Fix get_handle execution from threads (#18198)
- Remove requirement to specify namespace for serve.start(detached=True) (#17470)
🏗 Architecture refactoring:
- Progress towards replica autoscaling (#18658)
Dashboard
🎉 New Features:
- Ray system events are now published in the experimental dashboard (#18330, #18698)
- Actor page will now show actors with PENDING_CREATION status (#18666)
Thanks
Many thanks to all those who contributed to this release!
@scottsun94, @hngenc, @iycheng, @asm582, @jkterry1, @ericl, @thomasdesr, @ryanlmelvin, @ellimac54, @Bam4d, @gjoliver, @juliusfrost, @simon-mo, @ashione, @RaphaelCS, @simonsays1980, @suquark, @jjyao, @lixin-wei, @77loopin, @Ivorforce, @DmitriGekhtman, @dependabot[bot], @souravraha, @robertnishihara, @richardliaw, @SongGuyang, @rkooo567, @edoakes, @jsuarez5341, @zhisbug, @clarkzinzow, @triciasfu, @architkulkarni, @akern40, @liuyang-my, @krfricke, @amogkam, @Jingyu-Peng, @xwjiang2010, @nikitavemuri, @hauntsaninja, @fyrestone, @navneet066, @ijrsvt, @mwtian, @sasha-s, @raulchen, @holdenk, @qicosmos, @Yard1, @yuduber, @mguarin0, @MissiontoMars, @stephanie-wang, @stefanbschneider, @sven1977, @AmeerHajAli, @matthewdeng, @chenk008, @jiaodong, @clay4444, @ckw017, @tchordia, @ThomasLecat, @Chong-Li, @jmakov, @jovany-wang, @tdhopper, @kfstorm, @wgifford, @mxz96102, @WangTaoTheTonic, @lada-kunc, @scv119, @kira-lin, @wuisawesome