Ray 2.4 - Generative AI and LLM support
Over the last few months, we have seen a flurry of innovative activity around generative AI models and large language models (LLMs). To continue our effort to ensure Ray provides a pivotal compute substrate for generative AI workloads and addresses their challenges (as explained in our blog series), we have invested engineering effort in this release to make these open source LLMs and workloads accessible to the open source community and performant with Ray.
This release includes new examples for training, batch inference, and serving with your own LLM.
Generative AI and LLM Examples
- GPT-J (LLM) fine-tuning with Microsoft DeepSpeed and Ray Train
- GPT-J-6B Batch Prediction with Ray Data
- GPT-J-6B Serving with Ray Serve
- Stable Diffusion (Dreambooth) fine-tuning with Ray Train
- Stable Diffusion Batch Prediction with Ray Data
- Stable Diffusion Serving with Ray Serve
Ray Train enhancements
- We're introducing the LightningTrainer, allowing you to scale your PyTorch Lightning workloads on Ray (see the sketch after this list). As part of our continued effort toward seamless integration and ease of use, we have enhanced and replaced our widely adopted ray_lightning integration to keep up with the latest changes in PyTorch Lightning.
- We're releasing an AccelerateTrainer, allowing you to run Hugging Face Accelerate and DeepSpeed on Ray with minimal code changes. This Trainer integrates with the rest of the Ray ecosystem, including the ability to run distributed hyperparameter tuning with each trial being a distributed training job.
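For a quick taste, here is a minimal sketch of the new LightningTrainer; MyLightningModule is a placeholder for your own pl.LightningModule, and the hyperparameters are illustrative only (see the examples linked above for end-to-end versions):

```python
# A minimal LightningTrainer sketch. MyLightningModule is a placeholder for
# your own pl.LightningModule (assumed to define its own dataloaders).
from ray.train.lightning import LightningTrainer, LightningConfigBuilder
from ray.air.config import ScalingConfig

lightning_config = (
    LightningConfigBuilder()
    .module(cls=MyLightningModule, lr=1e-3)    # module class + init kwargs
    .trainer(max_epochs=3, accelerator="gpu")  # args forwarded to pl.Trainer
    .build()
)

trainer = LightningTrainer(
    lightning_config=lightning_config,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```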
Ray Data highlights
- Streaming execution is enabled by default, providing users with a more efficient data processing pipeline that can handle larger datasets and minimize memory consumption. Check out the docs here: (doc)
- We've implemented asynchronous batch prefetching of Dataset.iter_batches (doc), improving performance by fetching data in parallel while the main thread continues processing, thus reducing waiting time.
- Support reading SQL databases (doc), enabling users to seamlessly integrate relational databases into their Ray Data workflows (see the sketch after this list).
- Introduced support for reading WebDataset (doc), a common format for high-performance deep learning training jobs.
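As referenced above, a minimal sketch combining SQL ingestion with prefetched iteration; the sqlite database, table name, and process() function are placeholders:

```python
# A minimal sketch: read from a SQL database, then iterate with asynchronous
# batch prefetching. example.db, the ratings table, and process() are
# placeholders.
import sqlite3

import ray

def create_connection():
    # Any DB-API 2 compliant connection works here.
    return sqlite3.connect("example.db")

ds = ray.data.read_sql("SELECT * FROM ratings", create_connection)

# With prefetching, upcoming batches are fetched in the background while the
# loop body processes the current one.
for batch in ds.iter_batches(batch_size=4096, prefetch_batches=2):
    process(batch)  # placeholder for your per-batch logic
```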
Ray Serve highlights
- Multi-app CLI & REST API support is now available, allowing users to manage multiple applications with different configurations within a single Ray Serve deployment. This simplifies deployment and scaling for users running multiple applications (doc); a config sketch follows this list.
- Enhanced logging and metrics for Serve applications, giving users better visibility into their applications' performance and facilitating easier debugging and monitoring. (doc)
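As referenced above, a hypothetical multi-app config sketch; the app names, route prefixes, and import paths are all placeholders:

```yaml
# config.yaml - a hypothetical multi-app Serve config; deploy with:
#   serve deploy config.yaml
# App names, route prefixes, and import paths are placeholders.
applications:
  - name: image_classifier
    route_prefix: /classify
    import_path: models.classifier:app
  - name: translator
    route_prefix: /translate
    import_path: models.translator:app
```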
Other enhancements
- Ray 2.4 is the last version that supports Python 3.6
- We've also added a brand new landing page
Ray Libraries
Ray AIR
💫Enhancements:
- Add nightly test for alpa opt 30b inference. (#33419)
- Add a sanity checking release test for Alpa and ray nightly. (#32995)
- Add TorchDetectionPredictor (#32199)
- Add artifact_location, run_name to MLflow integration (#33641)
- Add *path properties to Result and ResultGrid (#33410)
- Make Preprocessor.transform lazy by default (#32872)
- Make BatchPredictor lazy (#32510, #32796)
- Use a configurable Ray temp directory for the TempFileLock util (#32862)
- Add collate_fn to iter_torch_batches (#32412)
- Allow users to pass Callable[[torch.Tensor], torch.Tensor] to TorchVisionTransform (#32383)
- Automatically move DatasetIterator torch tensors to the correct device (#31753)
🔨 Fixes:
- Fix use_gpu with HuggingFacePredictor (#32333)
- Make Keras Callback raise DeprecationWarning (#33775)
- Pin framework to tf in AIR RL offline trainer example (#33750)
- Fix test_tracked_actor (#33075)
- Label Checkpoint.from_checkpoint as developer API (#33094)
- Don't make lock files when moving dirs (#32924)
- Avoid inconsistency when creating a filesystem from a URI for the HDFS case (#30611)
- Fix DatasetIterator backwards compatibility (#32526)
- Fix CountVectorizer failing with big data (#32351)
- Fix NoneType error loading TorchCheckpoint through from_uri (#32386)
- Fix dtype type hint in DLPredictor methods (#32198)
- Allow None in set_preprocessor (#33088)
📖Documentation:
- Add dreambooth example + release test (#33025)
- GPT-J fine tuning with DeepSpeed example (#33090)
- GPT-J Serving Example (#33114)
- Add object detection example (#31553)
- Add computer vision guide (#32885)
- Add PyTorch ResNet batch prediction example (#32470)
- Large model inference examples (#32874)
- Add data ingestion clarification for AIR converting existing PyTorch code example (#32058)
- Add BatchPredictor.from_checkpoint to docs (#32877)
- Fix wording of many model training guidance (#32319)
- Add required dependencies for batch prediction notebook (#33897)
- Added link to preprocessors in Ray AIR Key Concepts page (#33526)
- Rewording in analyze_tuning_results.ipynb (#32671)
🏗 Architecture refactoring:
- Deprecations for 2.4 (#33765)
- Deprecate TensorflowCheckpoint.get_model model_definition parameter (#33776)
Ray Data Processing
🎉 New Features:
- Enable streaming execution by default (#32493)
- Support asynchronous batch prefetching of Dataset.iter_batches() (#33620)
- Introduce Dataset.materialize() API (#34184)
- Add Dataset.streaming_split() API (#32991)
- Support reading SQL databases with ray.data.read_sql() (#33353)
- Support reading WebDataset with ray.data.read_webdataset() (#33336)
- Allow generator UDFs for Dataset.map_batches() and flat_map() (#32767)
- Add collate_fn to Dataset.iter_torch_batches() (#32412)
- Add support for ignore_missing_paths when reading a Datasource (#33126)
💫Enhancements:
- Deprecate Dataset.lazy() (#33812)
- Deprecate Dataset.dataset_format() (#33437)
- Deprecate Dataset.fully_executed() and Dataset.is_fully_executed() (#33342)
- Deprecate Datasource.do_write() (#32015)
- Add iter_rows to DatasetIterator (#33180)
- Support optional tf_schema parameter in read_tfrecords() and write_tfrecords() (#32857)
- Add Dataset.iter_batches(batch_format=None) support, which yields batches in the current batch format with zero copies (#33562)
- Add meta_provider parameter into read_images (#33791)
- Add missing passthrough args to Dataset.read_images() (#32942)
- Allow read_binary_files(output_arrow_format=True) to return Arrow format (#33780)
- Improve performance of DefaultFileMetaProvider (#33117)
- Improved naming of Ray Data map tasks for dashboard (#32585)
- Support different numbers of blocks/rows per block in Dataset.zip() (#32795)
- Don't preserve order by default for streaming execution (#32300)
- Make write an operator as part of the execution plan (#32015)
- Optimize metadata creation for RangeDatasource (#33712)
- Add telemetry for Ray Data (#32896)
- Make BatchPredictor lazy (#32510)
- Make Preprocessor.transform lazy by default (#32872)
- Allow users to pass Callable[[torch.Tensor], torch.Tensor] to TorchVisionTransform (#32383)
- Automatically move DatasetIterator torch tensors to the correct device (#31753)
- Promote _create_strict_ragged_ndarray to public API (#31975)
- Add support for string tensor columns in ArrowTensorArray and ArrowVariableShapedTensorArray (#32143)
🔨 Fixes:
- Data layer performance/bug fixes and tweaks (#32744)
- Clean up RAY_DATASET_FORCE_LOCAL_METADATA flag (#32483)
- Fix Datasource write_results type (#33936)
- Add objects GC in dataset iterator (#34030, #34141)
- Fix to_pandas failure on datasets returned by from_spark() (#32968)
- Fix zip stage to preserve order when executing the other side (#33649)
- Fix _get_read_tasks to use NodeAffinitySchedulingStrategy (#33212)
- Guard against using ipywidgets in google colab (#32841)
- Fix from_items parallelism to create the expected number of blocks (#32821)
- Always preserve order for the BulkExecutor (#32437)
📖Documentation:
- Fix Ragged Tensor Documentation (#33029)
Ray Train
🎉 New Features:
- The LightningTrainer has been revamped:
  - [Doc] LightningTrainer end-to-end starter example (#33494)
  - LightningTrainer release tests + docstring sample test (#33323)
  - Support metric logging and checkpointing for LightningTrainer (#33183)
  - Add LightningTrainer to support PyTorch Lightning DDP training, part 1 (#33161)
  - Add LightningPredictor to support batch prediction (#33196)
- Implement AccelerateTrainer (#33269)
- Add Trainer.restore API for train experiment-level fault tolerance (#31920)
- Ray Train telemetry to collect the AIR trainer type (#33277)
💫Enhancements:
- Recommend Trainer.restore on errors raised by trainer.fit() (#33610)
- Improve lazy checkpointing (#32233)
- Sort CUDA_VISIBLE_DEVICES (#33159)
- Support returning multiple devices in train.torch.get_device() (#32893)
- Set torch.distributed env vars (#32450)
- Use the actual task name being executed for _RayTrainWorker__execute (#33065)
🔨 Fixes:
- Fix the import path for LightningTrainer to be compatible with PyTorch Lightning 2.0 (#34033)
- Pin framework to tf in AIR RL offline trainer example (#33750)
- Fix HF Trainer with DatasetIterator, handle device_map (#32955)
- Fix failing test_torch_trainer (#32963)
- Deflake test_gpu by sorting the devices (#33002)
- Mitigate OOMs on checkpointing (bandaid fix) (#33089)
- Empty cache on the correct device (#33603)
📖Documentation:
- [Doc] Improve AIR Lightning APIs and docstrings (#33895)
- Update quickstart example to use dataloader (#33050)
- Add intro content from training module (#32088)
- Add back important Ray Train integration methods (for Torch/TF) (#32551)
- Restructure API reference (#32360)
- Add PyTorch ResNet finetuning starter example (#32936)
🏗 Architecture refactoring:
- Hard deprecate Backend encode_data/decode_data (#33784)
Ray Tune
🎉 New Features:
- Add new experimental execution path
  - Add TuneController (#33499)
  - Refactor TrialRunner to separate out executor calls (#33448)
  - Move trainable_kwargs generation to trial.py (#33249)
  - Move experiment state/checkpoint/resume management into a separate file (#32457)
  - Cache ready futures in RayTrialExecutor (#32093)
  - Use generic _ObjectCache for actor reuse (#33045)
  - Event manager part 2: Implementation (#31811)
- Add new experimental output format
  - Add new console output code path (behind feature flag) (#33609)
  - Fix new output path with new execution path (#33880)
  - Fix OrderedDict import (#33709)
  - Fix Ray Tune output v2 failures (#33697)
  - Update wording to "Logical Resource Usage" (#33312)
  - Clean up tune/train result output (#32234)
  - Add unit test to experimental/output.py (#33767)
- Ray Tune telemetry to collect entrypoint and searchers/scheduler usage (#33740)
💫Enhancements:
- Experiment restore/resume (see the sketch after this list)
  - Allow re-specifying param space in Tuner.restore (#32317)
  - Replace reference values in a config dict with placeholders (#31927)
  - Add Tuner.can_restore(path) utility for checking if an experiment exists at a path/URI (#32003)
  - Update Tuner.restore usage to prepare for trainable becoming a required arg (#32912)
  - Fix resuming from cloud storage (+ test) (#32504)
- Syncing to cloud storage
  - Sync trial artifacts to cloud (#32334)
  - Fix ensure directory in bucket path sync (#33692)
  - Sync less often and only wait at end of experiment (#32155)
  - Unrevert "Add more comprehensive support for remote_checkpoint_dir w/ url params (#32479)" (#32576)
  - Use on_experiment_end hook for the final wait of SyncCallback sync processes (#33390)
  - Clean up path-related properties in experiment classes (#33370)
  - Update trainable remote_checkpoint_dir upon actor reuse (#32420)
  - Add use_threads=False in pyarrow syncing (#32256)
- Better support for multi-tenancy
- Fault tolerance improvements
- Integrations
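As referenced above, a minimal sketch of experiment-level restoration with the updated Tuner.restore API; the experiment path and my_trainable are placeholders:

```python
# A minimal sketch of restoring an interrupted Tune experiment.
# "~/ray_results/my_experiment" and my_trainable are placeholders; passing
# the trainable explicitly prepares for it becoming a required argument.
from ray import tune

experiment_path = "~/ray_results/my_experiment"

if tune.Tuner.can_restore(experiment_path):
    tuner = tune.Tuner.restore(
        experiment_path,
        trainable=my_trainable,   # placeholder: your train function or class
        resume_errored=True,      # also retry trials that previously errored
    )
    results = tuner.fit()
```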
🔨 Fixes:
- ExperimentAnalysis: Ignore empty checkpoints but don't fail (#33770)
- TrialRunner checkpointing shouldn't fail if a ray.data.Dataset without lineage is captured in the trial config (#33565)
- Evict object cache for actor re-use when search ended (#33593)
- Raise warning (not an exception) if metadata file not found in checkpoint (#33123)
- Remove deep copy in trial.__getstate__ (#32624)
- Fix "ValueError: I/O operation on closed file" (#31269)
📖Documentation:
- Restructure API reference (#32311)
- Don't recommend tune.run API in logging messages when using the Tuner (#33642)
- Split "Tune stopping and resuming" into two user guides (#33495)
- Remove Ray Client references from Tune and Train docs/examples (#32299)
- Add Tune checkpoint user guide (#33145)
- Improve log_to_file doc (#32128)
- Fix broken Tune links to overview and integration (#32442)
🏗 Architecture refactoring:
- Deprecation cycle
Ray Serve
💫Enhancements:
- Add telemetry for lightweight config updates (#34039)
- Deployment & replica info is automatically included in user-defined custom metrics (#33451)
- Add route and request ID to Ray Serve log entries (#33365)
- Add telemetry for common Serve usage patterns (#33505)
- Add log_to_stderr option to logger and improve internal logging (#33597)
- Make HTTP retries tunable (#32532)
- Extend configurable HTTP options (#33160)
- Prevent mixing single/multi-app config deployment (#33340)
- Expose FastAPI docs path (#32863)
- Add HTTP request latency (#32839)
🔨 Fixes:
- Recover pending actors during controller failures (#33890)
- Fix tensorarray to numpy conversion (#34115)
- Allow app rename when redeploying config (#33385)
- Fix traceback string for RayTaskErrors when deploying serve app (#33120)
📖Documentation:
- Add Serve example documentation for object detection, Stable Diffusion, and more (#33164)
RLlib
🎉 New Features:
- The RLModule API is available in alpha; see the RLlib documentation for details. PPO has been migrated to this API, but in a limited mode (a minimal PPO run is sketched below).
- The Catalog API is revamped to be consistent with RLModule; see the RLlib documentation for details.
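For orientation, here is a minimal sketch of running PPO on this release; the environment and worker count are placeholders, and the experimental RLModule path is enabled via flags described in the documentation rather than shown here. Since torch is now the default framework (see Enhancements below), no explicit framework setting is needed:

```python
# A minimal PPO run; CartPole-v1 and the worker count are placeholders.
# torch is now the default framework, so no .framework() call is needed.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .rollouts(num_rollout_workers=2)
)
algo = config.build()

for _ in range(3):
    result = algo.train()
    print(result["episode_reward_mean"])
```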
💫Enhancements:
- Default framework is now torch instead of tf. (#33603)
- Hard deprecate the old rllib/agent folder (#33242)
🔨 Fixes:
- [RLlib] Don't serialize config in Policy states (unless needed for msgpack-type checkpoints). (#33865)
- [RLlib] Fix MultiCallbacks class: To be used only with utility function that returns a class to use in the config. (#33863)
- [RLlib] DM control suite wrapper fix: dtype of obs needs to be pinned to float32. (#33876)
- [RLlib] Fix apex dqn deprecated add_batch call (#33814)
- [RLlib] AlgorithmConfig.update_from_dict needs to work for MultiCallbacks. (#33796)
- [RLlib] Add dist_inputs to action sampler fn returns in TorchPolicyV2 (#33795)
📖Documentation:
- Rewrote the API documentation for better discoverability:
- [RLlib][Docs] Restructure RLModule API page (#33363)
- [RLlib][Docs] Restructure Replay buffer API page (#33359)
- [RLlib][Docs] Restructure Utils API page (#33358)
- [RLlib][Docs] Restructure Sampler's API page (#33357)
- [RLlib][Docs] Restructure Modelv2's API page (#33356)
- [RLlib][Docs] Restructure Algorithm's API page (#33345)
- [RLlib][Docs] Restructure Policy's API page (#33344)
- [RLlib] Fix Getting Started example never returning (#33140)
Ray Core
🎉 New Features:
- Ray officially supports scaling up to 2,000 nodes. See the scalability envelope for more details.
- Ray introduces an experimental API, RAY_preload_python_modules, to preload Python modules before tasks or actors are scheduled. This will eventually reduce the startup time of Ray workloads that import large libraries. Please try it out and share feedback in #ray-preload-modules-feedback in the Ray Slack. To enable it, configure the modules to preload via RAY_preload_python_modules=torch,tensorflow when starting Ray, as sketched below.
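A minimal sketch of enabling the flag for a local cluster, assuming the variable is read at startup; on a multi-node cluster, set it in the environment of ray start instead:

```python
# A minimal sketch: enable experimental module preloading for a local
# cluster. The variable must be set before Ray starts so that worker
# processes inherit it; the module list here is just an example.
import os

os.environ["RAY_preload_python_modules"] = "torch,tensorflow"

import ray

ray.init()  # workers started after this point preload torch and tensorflow
```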
💫Enhancements:
- Mark raylet unhealthy if GCS can't recognize it. (#34216)
- Improve the workflow for finding the Redis leader (#34183)
- Improve Redis-related observability on failure (#33842)
- Improve the serialization error for tasks, actors and ray.put (#33660)
- Experimental preload_python_modules flag for preloading modules in default_worker (#33572)
- Remove actor deletion upon job termination (#31019)
- Better support per worker gpu usage from the cluster view. (#33515)
- Task backend - Profile events capping (#33321)
- FIFO worker killing policy (#33430)
- Write the Ray address even if the Ray node is started with --block (#32961)
- Turn on lightweight resource broadcasting (#32625)
- Add opt-in flag for Windows and OSX clusters, update ray start output to match docs (#32409)
🔨 Fixes:
- Fix arm64 wheel builds (#34320)
- Fix ray start command output (#34273)
- Partially address Ray child process leaks by killing all child processes in the CoreWorker shutdown sequence. (#34181)
- Ignore resource usage update from unknown node (#33619)
- Fix keepalive in gRPC client (#33986)
- Autosummary class by default (#32983)
- Fix non default dashboard_agent_listen_port not used when connected to the node (#33834)
- Allow using local wheels to run release tests. (#32739)
- Fix the error message when storage is not set. (#33581)
- Fixing lint issue in benchmark_worker_startup (#33440)
- Pin json-schema < 4.18 (#33412)
- Fix demand leak when worker failed (#31175)
- Remove some usage of deprecated runtime context apis (#33236)
- Remove dead SchedulingResources class (#33250)
- Release lock before sleeping (#33221)
- Remove asyncio.ensure_future call in run_async_func_in_event_loop (#32932)
- Upgrade gtest to 1.13 (#32858)
- Update OpenCensus (#32553)
- Remove usage_lib.LibUsageRecorder (#32806)
- Fix the race condition in the new resource broadcasting. (#32798)
- Task backend - disable verbose print. (#32764)
- Building py37+cu118 and using cu116 in default ray-ml image (#32636)
- Do not set flushing thread niceness for task backend (#32439)
- Fix gRPC callback API destruction issues. (#32151)
- Fix comments and a corner case in #32302 (#32323)
- Script to compare perf metrics between releases (#32290)
📖Documentation:
- Rewrite the placement group documentation (#34302)
- Add tips of writing fault tolerant Ray applications (#32191)
- Removed docs referring to ray client. (#32209)
- Improve the streaming_split pydoc (#33424)
- Add doc link for logs dedup (#33879)
Ray Clusters
💫Enhancements:
- Added end to end release tests for example AWS cluster launcher YAML files (#32670)
Dashboard
🎉 New Features:
- Ray Serve gets its own dedicated dashboard! See the documentation for more details.
- You can now access the error messages from every task and actor from the Ray dashboard.
- Better out of memory debugging support. See the out of memory troubleshooting guide for more details.
🔨 Fixes:
- Add the OOM failure graph (#34129)
- Improve the existing OOM metrics (#33453)
- Task backend - increase worker side GC limit to 100k (#33563)
- Add device index to the GPU metrics (#33328)
- Hide failed nodes by default. (#33455)
- Add worker startup & initialization time to state API + use it for many_tasks (#31916)
- Fix per component metrics bugs. (#33450)
- Fix the inconsistent object store size reported by the dashboard vs. metrics
Many thanks to all those who contributed to this release!
@zjf2012, @christy, @fyrestone, @avnishn, @scottjlee, @sijieamoy, @jjyao, @sven1977, @jamesclark-Zapata, @cadedaniel, @jovany-wang, @pcmoritz, @MaskRay, @csivanich, @augray, @wuisawesome, @Wendi-anyscale, @maxpumperla, @shawnpanda, @DmitriGekhtman, @yuduber, @gjoliver, @ju2ez, @clarkzinzow, @brycehuang30, @iycheng, @justinvyu, @dmatrix, @edoakes, @tmbdev, @scottsun94, @jianoaix, @cool-RR, @prrajput1199, @amogkam, @ckw017, @alanwguo, @architkulkarni, @chaowanggg, @AmeerHajAli, @stephanie-wang, @bewestphal, @matthew29tang, @dbczumar, @sihanwang41, @ericl, @soumitrak, @matthewdeng, @Catch-Bull, @peytondmurray, @XiaodongLv, @bveeramani, @YQ-Wang, @Linniem, @ProjectsByJackHe, @woshiyyya, @c21, @shrekris-anyscale, @zcin, @Yard1, @can-anyscale, @kouroshHakha, @robertnishihara, @richardliaw, @krfricke, @shomilj, @ArturNiederfahrenhorst, @ijrsvt, @GokuMohandas, @jbedorf, @xwjiang2010, @anydayeol, @clarng, @davidxia, @rickyyx, @Siraj-Qazi, @kira-lin, @scv119, @chengscott, @angelinalg, @rkooo567, @rshin, @deanwampler, @gramhagen, @larrylian, @WeichenXu123, @simonsays1980