Ray 2.4 - Generative AI and LLM support

Over the last few months, we have seen a flurry of innovative activity around generative AI models and large language models (LLMs). To continue our effort to make Ray a pivotal compute substrate for generative AI workloads and to address the challenges they pose (as explained in our blog series), we have invested engineering effort in this release to make open source LLMs and their workloads accessible to the open source community and performant with Ray.

This release includes new examples for training, batch inference, and serving with your own LLM.

Generative AI and LLM Examples

Ray Train enhancements

We're introducing the LightningTrainer, which lets you scale PyTorch Lightning training on Ray. As part of our continued effort toward seamless integration and ease of use, it enhances and replaces our existing, widely adopted ray_lightning integration and incorporates the latest changes to PyTorch Lightning.
We're releasing an AccelerateTrainer, which lets you run HuggingFace Accelerate and DeepSpeed on Ray with minimal code changes. This Trainer integrates with the rest of the Ray ecosystem, including the ability to run distributed hyperparameter tuning in which each trial is a distributed training job.

Ray Data highlights

Streaming execution is enabled by default, providing users with a more efficient data processing pipeline that can handle larger datasets and minimize memory consumption. (doc)
We've implemented asynchronous batch prefetching of Dataset.iter_batches (doc), improving performance by fetching data in parallel while the main thread continues processing, thus reducing waiting time.
Added support for reading SQL databases (doc), enabling users to seamlessly integrate relational databases into their Ray Data workflows.
Introduced support for reading WebDataset (doc), a common format for high-performance deep learning training jobs.
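
To make the idea behind asynchronous batch prefetching concrete, here is a minimal, framework-free sketch. It is not Ray Data's implementation: `prefetching_iter` and its `prefetch` parameter are hypothetical names chosen for illustration. A background thread fills a bounded buffer while the main thread consumes batches, so fetching and processing overlap.

```python
import queue
import threading
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel that marks the end of the stream


def prefetching_iter(batches: Iterable[T], prefetch: int = 2) -> Iterator[T]:
    """Yield items from `batches`, fetching up to `prefetch` ahead in a thread.

    Hypothetical helper for illustration only; not part of Ray.
    """
    buf: "queue.Queue[object]" = queue.Queue(maxsize=prefetch)
    errors: List[BaseException] = []

    def producer() -> None:
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the buffer is full
        except BaseException as exc:  # hand the failure to the consumer
            errors.append(exc)
        finally:
            buf.put(_DONE)

    # Daemon thread: it will not keep the process alive if the consumer
    # stops iterating early.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is _DONE:
            if errors:
                raise errors[0]
            return
        yield item  # type: ignore[misc]
```

A consumer wraps any batch source, for example `for batch in prefetching_iter(load_batches(), prefetch=4): ...`, where `load_batches` stands for whatever produces the batches. The bounded queue caps memory use: the producer never runs more than `prefetch` batches ahead of the consumer.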

Ray Serve highlights

Multi-app CLI & REST API support is now available, allowing users to manage multiple applications with different configurations within a single Ray Serve deployment. This simplifies deployment and scaling processes for users with multiple applications. (doc)
Enhanced logging and metrics for Serve applications, giving users better visibility into their application's performance and facilitating easier debugging and monitoring. (doc)
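
As a rough illustration of what managing multiple applications with different configurations involves, the sketch below models a multi-app config as a plain Python dictionary and checks that application names and route prefixes do not collide. The field names (`applications`, `name`, `route_prefix`, `import_path`) and the helper `check_multi_app_config` are assumptions made for this example; consult the Serve documentation for the actual config schema.

```python
from typing import Dict, List

# Assumed shape: a top-level "applications" list of per-app settings.
ServeConfig = Dict[str, List[Dict[str, str]]]


def check_multi_app_config(config: ServeConfig) -> List[str]:
    """Return the application names after checking for collisions.

    Hypothetical helper for illustration only; not a Ray Serve API.
    """
    names = set()
    prefixes = set()
    for app in config["applications"]:
        if app["name"] in names:
            raise ValueError(f"duplicate application name: {app['name']}")
        if app["route_prefix"] in prefixes:
            raise ValueError(f"duplicate route prefix: {app['route_prefix']}")
        names.add(app["name"])
        prefixes.add(app["route_prefix"])
    return [app["name"] for app in config["applications"]]


# Two independent applications, each with its own route and entry point.
example_config: ServeConfig = {
    "applications": [
        {"name": "summarizer", "route_prefix": "/summarize", "import_path": "summarizer:app"},
        {"name": "translator", "route_prefix": "/translate", "import_path": "translator:app"},
    ]
}
```

Calling `check_multi_app_config(example_config)` returns the two application names; a config that repeats a name or a route prefix raises `ValueError` instead.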

Other enhancements

Ray 2.4 is the last version that supports Python 3.6.
We've also added a brand new landing page.

Ray Libraries

Ray AIR

💫Enhancements:

Add nightly test for alpa opt 30b inference. (#33419)
Add a sanity checking release test for Alpa and ray nightly. (#32995)
Add TorchDetectionPredictor (#32199)
Add artifact_location, run_name to MLFlow integration (#33641)
Add *path properties to Result and ResultGrid (#33410)
Make Preprocessor.transform lazy by default (#32872)
Make BatchPredictor lazy (#32510, #32796)
Use a configurable ray temp directory for the TempFileLock util (#32862)
Add collate_fn to iter_torch_batches (#32412)
Allow users to pass Callable[[torch.Tensor], torch.Tensor] to TorchVisionTransform (#32383)
Automatically move DatasetIterator torch tensors to correct device (#31753)

🔨 Fixes:

Fix use_gpu with HuggingFacePredictor (#32333)
Make Keras Callback raise DeprecationWarning (#33775)
Pin framework to tf in AIR rl offline trainer example (#33750)
Fix test_tracked_actor (#33075)
Label Checkpoint.from_checkpoint as developer API (#33094)
Don't make lock files when moving dirs (#32924)
Avoid inconsistency when creating a filesystem from a URI in the HDFS case (#30611)
Fix DatasetIterator backwards compatibility (#32526)
Fix CountVectorizer failing with big data (#32351)
Fix NoneType error loading TorchCheckpoint through from_uri. (#32386)
Fix dtype type hint in DLPredictor methods (#32198)
Allow None in set_preprocessor (#33088)

📖Documentation:

Add dreambooth example + release test (#33025)
GPT-J fine tuning with DeepSpeed example (#33090)
GPT-J Serving Example (#33114)
Add object detection example (#31553)
Add computer vision guide (#32885)
Add Pytorch ResNet batch prediction example (#32470)
Large model inference examples (#32874)
Add data ingestion clarification for AIR converting existing pytorch code example (#32058)
Add BatchPredictor.from_checkpoint to docs (#32877)
Fix wording of Many model training guidance (#32319)
Add required dependencies for batch prediction notebook (#33897)
Added link to preprocessors in Ray AIR Key Concepts page (#33526)
Rewording in analyze_tuning_results.ipynb (#32671)

🏗 Architecture refactoring:

Deprecations for 2.4 (#33765)
Deprecate TensorflowCheckpoint.get_model model_definition parameter (#33776)

Ray Data Processing

🎉 New Features:

Enable streaming execution by default (#32493)
Support asynchronous batch prefetching of Dataset.iter_batches() (#33620)
Introduce Dataset.materialize() API (#34184)
Add Dataset.streaming_split() API (#32991)
Support reading SQL databases with ray.data.read_sql() (#33353)
Support reading WebDataset with ray.data.read_webdataset() (#33336)
Allow generator UDFs for Dataset.map_batches() and flat_map() (#32767)
Add collate_fn to Dataset.iter_torch_batches() (#32412)
Add support for ignore_missing_paths in reading Datasource (#33126)
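
The generator UDF feature listed above can be pictured with a small, framework-free sketch. It does not call Ray: `split_into_chunks` plays the role of a user-defined function that yields several output batches per input batch, and `apply_generator_udf` is a hypothetical stand-in for the framework side that flattens those outputs. The `"value"` column name is also an assumption for this example.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Batch = Dict[str, List[int]]


def split_into_chunks(batch: Batch, chunk_size: int = 2) -> Iterator[Batch]:
    """Generator UDF: yield several small output batches per input batch."""
    values = batch["value"]
    for start in range(0, len(values), chunk_size):
        yield {"value": values[start:start + chunk_size]}


def apply_generator_udf(
    batches: Iterable[Batch],
    udf: Callable[[Batch], Iterator[Batch]],
) -> Iterator[Batch]:
    """Hypothetical stand-in for the framework: flatten the UDF's outputs."""
    for batch in batches:
        # Each output batch is passed on as soon as the UDF yields it,
        # so the full result for one input is never held at once.
        yield from udf(batch)
```

Because the UDF yields instead of returning one large batch, an input that expands into a lot of output can be emitted piece by piece, which keeps peak memory per call bounded by the chunk size.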

💫Enhancements:

Deprecate Dataset.lazy() (#33812)
Deprecate Dataset.dataset_format() (#33437)
Deprecate Dataset.fully_executed() and Dataset.is_fully_executed() (#33342)
Deprecate Datasource.do_write() (#32015)
Add iter_rows to DatasetIterator (#33180)
Support optional tf_schema parameter in read_tfrecords() and write_tfrecords() (#32857)
Add Dataset.iter_batches(batch_format=None) support, which will yield batches in the current batch format with zero copies (#33562)
Add meta_provider parameter into read_images (#33791)
Add missing passthrough args to Dataset.read_images() (#32942)
Allow read_binary_files(output_arrow_format=True) to return Arrow format (#33780)
Improve performance of DefaultFileMetaProvider (#33117)
Improved naming of Ray Data map tasks for dashboard (#32585)
Support different numbers of blocks/rows per block in Dataset.zip() (#32795)
No preserve order by default for streaming execution (#32300)
Make write an operator as part of the execution plan (#32015)
Optimize metadata creation for RangeDatasource (#33712)
Add telemetry for Ray Data (#32896)
Make BatchPredictor lazy (#32510)
Make Preprocessor.transform lazy by default (#32872)
Allow users to pass Callable[[torch.Tensor], torch.Tensor] to TorchVisionTransform (#32383)
Automatically move DatasetIterator torch tensors to correct device (#31753)
Promote _create_strict_ragged_ndarray to public API (#31975)
Add support for string tensor columns in ArrowTensorArray and ArrowVariableShapedTensorArray (#32143)

🔨 Fixes:

Data layer performance/bug fixes and tweaks (#32744)
Clean up RAY_DATASET_FORCE_LOCAL_METADATA flag (#32483)
Fix Datasource write_results type (#33936)
Add objects GC in dataset iterator (#34030, #34141)
Fix to_pandas failure on datasets returned by from_spark() (#32968)
Fix zip stage to preserve order when executing the other side (#33649)
Fix _get_read_tasks to use NodeAffinitySchedulingStrategy (#33212)
Guard against using ipywidgets in google colab (#32841)
Fix from_items parallelism to create the expected number of blocks (#32821)
Always preserve order for the BulkExecutor (#32437)

📖Documentation:

Fix Ragged Tensor Documentation (#33029)

Ray Train

🎉 New Features:

The LightningTrainer has been revamped
- [Doc] LightningTrainer end-2-end starter example [no_early_kickoff] (#33494)
- Lightning Trainer Release tests + docstring sample test (#33323)
- Support metric logging and checkpointing for LightningTrainer (#33183)
- Add LightningTrainer to support Pytorch Lightning DDP training <Part 1>. (#33161)
- Add LightningPredictor to support batch prediction (#33196)
Implement AccelerateTrainer (#33269)
Add Trainer.restore API for train experiment-level fault tolerance (#31920)
Ray Train telemetry to collect the AIR trainer type (#33277)

💫Enhancements:

Recommend Trainer.restore on errors raised by trainer.fit() (#33610)
Improve lazy checkpointing (#32233)
Sort CUDA_VISIBLE_DEVICES (#33159)
Support returning multiple devices in train.torch.get_device() (#32893)
Set torch.distributed env vars (#32450)
Use the actual task name being executed for _RayTrainWorker__execute. (#33065)

🔨 Fixes:

Fix the import path for LightningTrainer to be compatible with Pytorch Lightning 2.0. (#34033)
Pin framework to tf in AIR rl offline trainer example (#33750)
Fix HF Trainer with DatasetIterator, handle device_map (#32955)
Fix failing test_torch_trainer (#32963)
Deflake test_gpu by sorting the devices (#33002)
(Bandaid) Mitigate OOMs on checkpointing (#33089)
Empty cache on the correct device (#33603)

📖Documentation:

[Doc] Improve AIR Lightning APIs and docstrings (#33895)
Update quickstart example to use dataloader (#33050)
add intro content from training module (#32088)
Add back important Ray Train integration methods (for Torch/TF) (#32551)
Restructure API reference (#32360)
Add Pytorch ResNet finetuning starter example (#32936)

🏗 Architecture refactoring:

Hard deprecate Backend encode_data/decode_data (#33784)

Ray Tune

🎉 New Features:

Add new experimental execution path
- Add TuneController (#33499)
- Refactor TrialRunner to separate out executor calls (#33448)
- Move trainable_kwargs generation to trial.py (#33249)
- Move experiment state/checkpoint/resume management into a separate file (#32457)
- Cache ready futures in RayTrialExecutor (#32093)
- Use generic _ObjectCache for actor reuse (#33045)
- Event manager part 2: Implementation (#31811)
Add new experimental output format
- Add new console output code path (behind feature flag) (#33609)
- Fix new output path with new execution path (#33880)
- Fix OrderedDict import. (#33709)
- Fix Ray Tune output v2 failures (#33697)
- Update wording to "Logical Resource Usage". (#33312)
- clean up tune/train result output (#32234)
- Add unit test to experimental/output.py (#33767)
Ray Tune Telemetry to collect entrypoint and searchers/scheduler usage (#33740)

💫Enhancements:

Experiment restore/resume
- Allow re-specifying param space in Tuner.restore (#32317)
- Replace reference values in a config dict with placeholders (#31927)
- Add Tuner.can_restore(path) utility for checking if an experiment exists at a path/uri (#32003)
- Update Tuner.restore usage to prepare for trainable becoming a required arg (#32912)
- Fix resuming from cloud storage (+ test) (#32504)
Syncing to cloud storage
- Sync trial artifacts to cloud (#32334)
- Fix ensure directory in bucket path sync (#33692)
- Sync less often and only wait at end of experiment (#32155)
- Unrevert "Add more comprehensive support for remote_checkpoint_dir w/ url params (#32479)" (#32576)
- Use on_experiment_end hook for the final wait of SyncCallback sync processes (#33390)
- Cleanup path-related properties in experiment classes (#33370)
- Update trainable remote_checkpoint_dir upon actor reuse (#32420)
- Add use_threads=False in pyarrow syncing (#32256)
Better support for multi-tenancy
- Prefix global object registry with job ID to avoid conflicts in multi tenancy (#33095)
- Add test for multi-tenancy workaround and documentation to FAQ (#32560)
- release test for nested air (tune) oom (#31768)
Fault tolerance improvements
- Add Tune worker fault tolerance test (#33473)
- Improve logging, unify trial retry logic, improve trial restore retry test. (#32242)
Integrations
- [wandb] Wait for WandbLoggerCallback actors to finish uploading to wandb on experiment end (#33174)
- [air] Aim logger (#32041)

🔨 Fixes:

ExperimentAnalysis: Ignore empty checkpoints but don't fail (#33770)
TrialRunner checkpointing shouldn't fail if ray.data.Dataset w/o lineage captured in trial config (#33565)
Evict object cache for actor re-use when search ended (#33593)
Raise warning (not an exception) if metadata file not found in checkpoint (#33123)
remove deep copy in trial.__getstate__ (#32624)
Fix "ValueError: I/O operation on closed file" (#31269)

📖Documentation:

Restructure API reference (#32311)
Don't recommend tune.run API in logging messages when using the Tuner (#33642)
Split "Tune stopping and resuming" into two user guides (#33495)
Remove Ray Client references from Tune and Train docs/examples (#32299)
Add tune checkpoint user guide. (#33145)
improve log_to_file doc. (#32128)
Fix broken Tune links to overview and integration (#32442)

🏗 Architecture refactoring:

Deprecation cycle
- Hard deprecate Tune MLflow/W&B mixin/callbacks (#33782)
- Fix two tests after structure refactor deprecation (#32517)
- Remove deprecated Resources class (#32490)
- Structure refactor: Raise on import of old modules (#32486)

Ray Serve

🎉 New Features:

Multi-app support for the CLI and REST API. (#33347, #33490, #33300, #33216, #33013)

💫Enhancements:

Add telemetry for lightweight config updates (#34039)
Deployment & replica info automatically exist in user customized metrics. (#33451)
Add route and request id in the ray serve log entry. (#33365)
Add telemetry for common Serve usage patterns (#33505)
Add log_to_stderr option to logger and improve internal logging (#33597)
Make http retries tunable (#32532)
Extend configurable HTTP options (#33160)
Prevent mixing single/multi-app config deployment (#33340)
Expose FastAPI docs path (#32863)
Add http request latency (#32839)

🔨 Fixes:

Recover the pending actors during the controller failures (#33890)
Fix tensorarray to numpy conversion (#34115)
Allow app rename when redeploying config (#33385)
Fix traceback string for RayTaskErrors when deploying serve app (#33120)

📖Documentation:

Add serve example documentation for object detection, stable diffusion and (#33164)

RLlib

🎉 New Features:

RLModule API is available in Alpha. See details here. PPO has been migrated to this API but in a limited mode.
Catalog API is revamped to be consistent with RLModule. See details here.

💫Enhancements:

Default framework is now torch instead of tf. (#33603)
Hard deprecate the old rllib/agent folder (#33242)

🔨 Fixes:

[RLlib] Don't serialize config in Policy states (unless needed for msgpack-type checkpoints). (#33865)
[RLlib] Fix MultiCallbacks class: To be used only with utility function that returns a class to use in the config. (#33863)
[RLlib] DM control suite wrapper fix: dtype of obs needs to be pinned to float32. (#33876)
[RLlib] Fix apex dqn deprecated add_batch call (#33814)
[RLlib] AlgorithmConfig.update_from_dict needs to work for MultiCallbacks. (#33796)
[RLlib] Add dist_inputs to action sampler fn returns in TorchPolicyV2 (#33795)

📖Documentation:

Rewritten the API documentation for better discoverability.
- [RLlib][Docs] Restructure RLModule API page (#33363)
- [RLlib][Docs] Restructure Replay buffer API page (#33359)
- [RLlib][Docs] Restructure Utils API page (#33358)
- [RLlib][Docs] Restructure Sampler's API page (#33357)
- [RLlib][Docs] Restructure Modelv2's API page (#33356)
- [RLlib][Docs] Restructure Algorithm's API page (#33345)
- [RLlib][Docs] Restructure Policy's API page (#33344)
[RLlib] Fix Getting Started example never returning (#33140)

Ray Core

🎉 New Features:

Ray officially supports scaling up to 2,000 nodes. See scalability envelope for more details.
Ray introduces an experimental API RAY_preload_python_modules to preload Python modules before tasks or actors are scheduled. This will eventually reduce startup time of Ray workloads that import large libraries. Please try it out and share feedback in #ray-preload-modules-feedback in the Ray Slack. To enable, configure the modules to preload via RAY_preload_python_modules=torch,tensorflow when starting Ray.
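
The preloading idea can be sketched in plain Python. This is not Ray's worker code: `parse_preload_list` and `preload` are hypothetical helpers that show how a comma-separated value such as `torch,tensorflow` could be split and imported eagerly, so that later imports of the same modules are served from the module cache.

```python
import importlib
import os
from typing import Dict, List, Mapping


def parse_preload_list(
    env: Mapping[str, str],
    var: str = "RAY_preload_python_modules",
) -> List[str]:
    """Split a comma-separated module list, dropping blanks and whitespace."""
    raw = env.get(var, "")
    return [name.strip() for name in raw.split(",") if name.strip()]


def preload(modules: List[str]) -> Dict[str, bool]:
    """Import each module eagerly and report which imports succeeded.

    Hypothetical helper for illustration only; not part of Ray.
    """
    loaded: Dict[str, bool] = {}
    for name in modules:
        try:
            importlib.import_module(name)
            loaded[name] = True
        except ImportError:
            # A missing module is recorded rather than treated as fatal.
            loaded[name] = False
    return loaded


if __name__ == "__main__":
    # Reads the variable from the real environment when run as a script.
    print(preload(parse_preload_list(os.environ)))
```

Paying the import cost once, before any task or actor needs the library, is what shortens startup for workloads built on heavy dependencies.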

💫Enhancements:

Mark raylet unhealthy if GCS can't recognize it. (#34216)
Improve the workflow finding Redis leader. (#34183)
Improve Redis related observability when failed. (#33842)
Improve the serialization error for tasks, actors and ray.put (#33660)
Experimental preload_python_modules flag for preloading modules in default_worker (#33572)
Remove actor deletion upon job termination (#31019)
Better support per worker gpu usage from the cluster view. (#33515)
Task backend - Profile events capping (#33321)
Fifo worker killing policy (#33430)
Write ray address even if ray node is started with --block (#32961)
Turn on light weight resource broadcasting. (#32625)
Add opt-in flag for Windows and OSX clusters, update ray start output to match docs (#32409)

🔨 Fixes:

Fix arm64 wheels builds (#34320)
Fix ray start command output (#34273)
Partially address Ray child process leaks by killing all child processes in the CoreWorker shutdown sequence. (#34181)
Ignore resource usage update from unknown node (#33619)
Fix keepalive in grpc client (#33986)
Autosummary class by default (#32983)
Fix non default dashboard_agent_listen_port not used when connected to the node (#33834)
Allow using local wheels to run release tests. (#32739)
Fix the error message when storage is not set. (#33581)
Fixing lint issue in benchmark_worker_startup (#33440)
Pin json-schema < 4.18 (#33412)
Fix demand leak when worker failed (#31175)
Remove some usage of deprecated runtime context apis (#33236)
Remove dead SchedulingResources class (#33250)
Release lock before sleeping (#33221)
Remove asyncio.ensure_future call in run_async_func_in_event_loop (#32932)
Upgrade gtest to 1.13 (#32858)
Update OpenCensus (#32553)
Remove usage_lib.LibUsageRecorder (#32806)
Fix the race condition in the new resource broadcasting. (#32798)
Task backend - disable verbose print. (#32764)
Building py37+cu118 and using cu116 in default ray-ml image (#32636)
Do not set flushing thread niceness for task backend (#32439)
Fix gRPC callback API destruction issues. (#32151)
Fix comments and a corner case in #32302 (#32323)
Script to compare perf metrics between releases (#32290)

📖Documentation:

Rewrite the placement group documentation (#34302)
Add tips of writing fault tolerant Ray applications (#32191)
Removed docs referring to ray client. (#32209)
Improve the streaming_split pydoc (#33424)
Add doc link for logs dedup (#33879)

Ray Clusters

💫Enhancements:

Added end to end release tests for example AWS cluster launcher YAML files (#32670)

Dashboard

🎉 New Features:

Ray Serve now has its own dedicated dashboard! See the documentation for more details.
You can now access the error messages from every task and actor from the Ray dashboard.
Better out of memory debugging support. See the out of memory troubleshooting guide for more details.

🔨 Fixes:

Add the OOM failure graph (#34129)
Improve the existing OOM metrics (#33453)
Task backend - increase worker side GC limit to 100k (#33563)
Add device index to the GPU metrics (#33328)
Hide failed nodes by default. (#33455)
Add worker startup & initialization time to state API + use it for many_tasks (#31916)
Fix per component metrics bugs. (#33450)
Fix the incorrect object store size from dashboard vs metrics

Many thanks to all those who contributed to this release!

@zjf2012, @christy, @fyrestone, @avnishn, @scottjlee, @sijieamoy, @jjyao, @sven1977, @jamesclark-Zapata, @cadedaniel, @jovany-wang, @pcmoritz, @MaskRay, @csivanich, @augray, @wuisawesome, @Wendi-anyscale, @maxpumperla, @shawnpanda, @DmitriGekhtman, @yuduber, @gjoliver, @ju2ez, @clarkzinzow, @brycehuang30, @iycheng, @justinvyu, @dmatrix, @edoakes, @tmbdev, @scottsun94, @jianoaix, @cool-RR, @prrajput1199, @amogkam, @ckw017, @alanwguo, @architkulkarni, @chaowanggg, @AmeerHajAli, @stephanie-wang, @bewestphal, @matthew29tang, @dbczumar, @sihanwang41, @ericl, @soumitrak, @matthewdeng, @Catch-Bull, @peytondmurray, @XiaodongLv, @bveeramani, @YQ-Wang, @Linniem, @ProjectsByJackHe, @woshiyyya, @c21, @shrekris-anyscale, @zcin, @Yard1, @can-anyscale, @kouroshHakha, @robertnishihara, @richardliaw, @krfricke, @shomilj, @ArturNiederfahrenhorst, @ijrsvt, @GokuMohandas, @jbedorf, @xwjiang2010, @anydayeol, @clarng, @davidxia, @rickyyx, @Siraj-Qazi, @kira-lin, @scv119, @chengscott, @angelinalg, @rkooo567, @rshin, @deanwampler, @gramhagen, @larrylian, @WeichenXu123, @simonsays1980

ray-project/ray ray-2.4.0 Ray-2.4.0 on GitHub
