Ray-2.44.0

Release Highlights

  • This release features Ray Compiled Graph (beta). Ray Compiled Graph offers a classic Ray Core-like API, but with (1) less than 50 µs of system overhead for workloads that repeatedly execute the same task graph, and (2) native support for GPU-to-GPU communication via NCCL. Ray Compiled Graph APIs simplify high-performance multi-GPU workloads such as LLM inference and training. The beta release refines the API, improves stability, and adds or improves features such as visualization, profiling, and experimental GPU compute/communication overlap. A usage sketch follows this list. For more information, refer to the Ray documentation: https://docs.ray.io/en/latest/ray-core/compiled-graph/ray-compiled-graph.html
  • The experimental Ray Workflows library has been deprecated and will be removed in a future version of Ray. Ray Workflows has been marked experimental since its inception and hasn’t been maintained due to the Ray team focusing on other priorities. If you are using Ray Workflows, we recommend pinning your Ray version to 2.44.
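
As a quick illustration of the Compiled Graph workflow highlighted above, here is a minimal sketch based on the linked documentation. The actor and method names are placeholders; `experimental_compile` reflects the beta API as of this release.

```python
import ray
from ray.dag import InputNode

@ray.remote
class EchoActor:
    def echo(self, x):
        return x

actor = EchoActor.remote()

# Build a static task graph once...
with InputNode() as inp:
    dag = actor.echo.bind(inp)

# ...then compile it so repeated executions reuse preallocated channels
# instead of going through the regular Ray Core task submission path.
compiled_dag = dag.experimental_compile()

ref = compiled_dag.execute("hello")
assert ray.get(ref) == "hello"
```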

Ray Libraries

Ray Data

🎉 New Features:

  • Add Iceberg write support through pyiceberg (#50590)
  • [LLM] Various feature enhancements to Ray Data LLM, including LoRA support (#50804) and structured outputs (#50901); see the sketch below
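
To make the Ray Data LLM items above concrete, here is a minimal, hedged sketch of a batch-inference pipeline. The model name is a placeholder, and the configuration fields shown (model, concurrency, batch_size) should be checked against the Ray Data LLM documentation for this release.

```python
import ray
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

config = vLLMEngineProcessorConfig(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model
    concurrency=1,
    batch_size=16,
)

processor = build_llm_processor(
    config,
    # Map each input row to a chat prompt plus sampling parameters.
    preprocess=lambda row: dict(
        messages=[{"role": "user", "content": row["question"]}],
        sampling_params=dict(temperature=0.3, max_tokens=128),
    ),
    # Keep only the generated text in the output dataset.
    postprocess=lambda row: dict(answer=row["generated_text"]),
)

ds = ray.data.from_items([{"question": "What is Ray?"}])
ds = processor(ds)
print(ds.take(1))
```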

💫 Enhancements:

  • Add dataset/operator state, progress, total metrics (#50770)
  • Make chunk combination threshold configurable (#51200)
  • Store average memory use per task in OpRuntimeMetrics (#51126)
  • Avoid unnecessary conversion to Numpy when creating Arrow/Pandas blocks (#51238)
  • Append-mode API for preprocessors (#50848, #50847, #50642, #50856, #50584). Note that vectorizers and hashers now output a single column instead of one column per feature; see the sketch below. We plan to graduate preprocessors to beta in the near future.
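
Here is a rough sketch of the append-mode preprocessor usage. The output_columns argument is an assumption based on the append-mode description in these PRs; consult the preprocessor API reference for the exact parameter name.

```python
import ray
from ray.data.preprocessors import StandardScaler

ds = ray.data.from_items([{"x": 1.0}, {"x": 2.0}, {"x": 3.0}])

# Assumed append-mode usage: write scaled values to a new column
# instead of overwriting the input column.
scaler = StandardScaler(columns=["x"], output_columns=["x_scaled"])
ds = scaler.fit_transform(ds)
print(ds.take_all())
```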

🔨 Fixes:

  • Fix Map operators to avoid unconditionally overriding the generator's back-pressure configuration (#50900)
  • Fix filter expressions with equality comparisons against negative numbers (#50932)
  • Fix error message for override_num_blocks when reading from a HuggingFace Dataset (#50998)
  • Make num_blocks in repartition optional (#50997)
  • Always pin the seed when doing file-based random shuffle (#50924)
  • Fix StandardScaler to handle NaN stats (#51281)

Ray Train

💫 Enhancements:

  • Folded v2.XGBoostTrainer API into the public trainer class as an alternate constructor (#50045)
  • Created a default ScalingConfig if one is not provided to the trainer (#51093); see the sketch after this list
  • Improved TrainingFailedError message (#51199)
  • Utilize FailurePolicy factory (#51067)
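
To illustrate the default-ScalingConfig change, here is a hedged sketch. The training function is a placeholder, and the exact default that the trainer falls back to is an assumption (a minimal single-worker configuration).

```python
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_fn(config):
    # Placeholder training loop.
    pass

# Explicit scaling configuration:
trainer = TorchTrainer(train_fn, scaling_config=ScalingConfig(num_workers=2))

# With #51093, omitting scaling_config now falls back to a default
# (assumption: a minimal single-worker setup) instead of requiring one.
trainer = TorchTrainer(train_fn)
result = trainer.fit()
```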

🔨 Fixes:

  • Fixed trainer import deserialization when captured within a Ray task (#50862)
  • Fixed serialize import test for Python 3.12 (#50963)
  • Fixed RunConfig deprecation message in Tune being emitted in trainer.fit usage (#51198)

📖 Documentation:

  • [Train V2] Updated API references (#51222)
  • [Train V2] Updated persistent storage guide (#51202)
  • [Train V2] Updated user guides for metrics, checkpoints, results, and experiment tracking (#51204)
  • [Train V2] Added updated Train + Tune user guide (#51048)
  • [Train V2] Added updated fault tolerance user guide (#51083)
  • Improved HF Transformers example (#50896)
  • Improved Train DeepSpeed example (#50906)
  • Used the correct mean and standard deviation normalization values in the image tutorials (#50240)

🏗 Architecture refactoring:

  • Deprecated Torch AMP wrapper utilities (#51066)
  • Hid private functions of train context to avoid abuse (#50874)
  • Removed ray storage dependency and deprecated RAY_STORAGE env var configuration option (#50872)
  • Moved library usage tests out of core (#51161)

Ray Tune

📖 Documentation:

  • Various improvements to Tune Pytorch CIFAR tutorial (#50316)
  • Various improvements to the Ray Tune XGBoost tutorial (#50455)
  • Various enhancements to Tune Keras example (#50581)
  • Minor improvements to Hyperopt tutorial (#50697)
  • Various improvements to LightGBM tutorial (#50704)
  • Fixed non-runnable Optuna tutorial (#50404)
  • Added documentation for Asynchronous HyperBand Example in Tune (#50708)
  • Replaced reuse actors example with a fuller demonstration (#51234)
  • Fixed broken PB2/RLlib example (#51219)
  • Fixed typo and standardized equations across the two APIs (#51114)
  • Improved PBT example (#50870)
  • Removed broken links in documentation (#50995, #50996)

🏗 Architecture refactoring:

  • Removed ray storage dependency and deprecated RAY_STORAGE env var configuration option (#50872)
  • Moved library usage tests out of core (#51161)

Ray Serve

🎉 New Features:

  • Faster bulk imperative Serve Application deploys (#49168)
  • [LLM] Add gen-config (#51235)

💫 Enhancements:

  • Clean up Serve shutdown behavior (#51009)
  • Add additional_log_standard_attrs to the Serve logging config (#51144); see the sketch after this list
  • [LLM] Remove asyncache and cachetools from dependencies (#50806)
  • [LLM] Remove backoff dependency (#50822)
  • [LLM] Remove asyncio_timeout from ray[llm] deps on python<3.11 (#50815)
  • [LLM] Made JSON validator a singleton and jsonref packages lazy imported (#50821)
  • [LLM] Reuse AutoscalingConfig and DeploymentConfig from Serve (#50871)
  • [LLM] Use pyarrow FS for cloud remote storage interaction (#50820)
  • [LLM] Add usage telemetry for serve.llm (#51221)
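
A minimal sketch of the new logging option is shown below. additional_log_standard_attrs takes standard Python LogRecord attribute names; the specific attributes listed are just examples.

```python
from ray import serve
from ray.serve.schema import LoggingConfig

@serve.deployment(
    logging_config=LoggingConfig(
        # New in this release (#51144): also emit these standard
        # Python LogRecord attributes in Serve's structured logs.
        additional_log_standard_attrs=["name", "process"],
    )
)
class Hello:
    def __call__(self, request) -> str:
        return "hi"

app = Hello.bind()
# serve.run(app)
```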

🔨 Fixes:

  • Exclude redirects from request error count (#51130)
  • [LLM] Fix the wrong device_capability issue in vllm on quantized models (#51007)
  • [LLM] Add the gen-config-related data file to the package (#51347)

📖 Documentation:

  • [LLM] Fix the Serve LLM quickstart docs (#50910)
  • [LLM] Update build_openai_app to include a YAML example (#51283)
  • [LLM] Remove the old vLLM + Serve doc (#51311)

RLlib

💫 Enhancements:

  • APPO/IMPALA acceleration:
    • LearnerGroup should not pickle remote functions on each update-call; Refactor LearnerGroup and Learner APIs. (#50665)
    • EnvRunner sync enhancements. (#50918)
    • Various other speedups: #51302, #50923, #50919, #50791
  • Unify naming for the actor managers' outstanding in-flight request metrics. (#51159)
  • Add timers to env step, forward pass, and complete connector pipelines runs. (#51160)

🔨 Fixes:

  • Multi-agent env vectorization:
    • Fix MultiAgentEnvRunner env check bug. (#50891)
    • Add single_action_space and single_observation_space to VectorMultiAgentEnv. (#51096)
  • Other fixes: #51255, #50920, #51369

Ray Core and Ray Clusters

Ray Core

🎉 New Features:

  • Enhanced uv support (#51233)
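
For context, runtime environments can declare dependencies through uv rather than pip; a minimal sketch is below. The package pin is just an example, and the exact uv options supported in 2.44 should be checked in the runtime environment documentation.

```python
import ray

# Resolve and install the task's dependencies with uv instead of pip.
ray.init(runtime_env={"uv": ["emoji"]})

@ray.remote
def hello():
    import emoji
    return emoji.emojize("Ray :thumbs_up:")

print(ray.get(hello.remote()))
```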

💫 Enhancements:

  • Made infeasible task errors much more obvious (#45909)
  • Log rotation for workers, runtime env agent, and dashboard agent (#50759, #50877, #50909)
  • Support customizing gloo timeout (#50223)
  • Support torch profiling in Compiled Graph (#51022)
  • Change default tensor deserialization in Compiled Graph (#50778)
  • Use the current node ID if no node is specified for ray drain-node (#51134)

🔨 Fixes:

  • Fixed an issue where the raylet continued to have high CPU overhead after a job was terminated (#49999).
  • Fixed compiled graph buffer release issues (#50434)
  • Improved logic for ray.wait on object store objects (#50680)
  • Ray metrics now perform the same name validation as Prometheus for invalid names (#40586)
  • Make executor a long-running Python thread (#51016)
  • Fix plasma client memory leak (#51051)
  • Fix using ray.actor.exit_actor() from within an async background thread (#49451)
  • Fix UV hook to support Ray Job submission (#51150)
  • Fix resource leakage after ray job is finished (#49999)
  • Use the correct way to check whether an actor task is running (#51158)
  • Controllably destroy CUDA events in GPUFutures (Compiled Graph) (#51090)
  • Avoid creating a thread pool with 0 threads (#50837)
  • Fix the logic to calculate the number of workers based on the TPU version (#51227)

📖 Documentation:

  • Updated the error message and anti-pattern documentation about forking new processes in worker processes (#50705)
  • Compiled Graph API Documentation (#50754)
  • Added docs for Nsight and torch profiling with Compiled Graph (#51037)
  • Compiled Graph Troubleshooting Doc (#51030)
  • Completed the Compiled Graph docs (#51206)
  • Updated jemalloc profiling doc (#51031)
  • Add information about standard Python logger attributes (#51038)
  • Added a note that named placement groups require a namespace (#51285)
  • Deprecation warnings for Ray Workflows and cluster-wide storage (#51309)

Ray Clusters

🎉 New Features:

  • Add cuda 12.8 images (#51210)

💫 Enhancements:

  • Add Pod names to the output of ray status -v (#51192)

🔨 Fixes:

  • Fix autoscaler v1 crash from infeasible strict spread placement groups (#39691)

🏗 Architecture refactoring:

  • Refactor autoscaler v2 log formatting (#49350)
  • Update yaml example for CoordinatorSenderNodeProvider (#51292)

Dashboard

🎉 New Features:

  • Discover TPU logs on the Ray Dashboard (#47737)

🔨 Fixes:

  • Return the correct error message when trying to kill non-existent actors (#51341)

Many thanks to all those who contributed to this release!
@crypdick, @rueian, @justinvyu, @MortalHappiness, @CheyuWu, @GeneDer, @dayshah, @lk-chen, @matthewdeng, @co63oc, @win5923, @sven1977, @akshay-anyscale, @ShaochenYu-YW, @gvspraveen, @bveeramani, @jakac, @VamshikShetty, @raulchen, @PaulFenton, @elimelt, @comaniac, @qinyiyan, @ruisearch42, @nadongjun, @AndyUB, @israbbani, @hongpeng-guo, @laysfire, @alexeykudinkin, @Drice1999, @harborn, @scottsun94, @abrarsheikh, @martinbomio, @MengjinYan, @HollowMan6, @orcahmlee, @kenchung285, @csy1204, @noemotiovon, @jujipotle, @davidxia, @kevin85421, @hcc429, @edoakes, @kouroshHakha, @omatthew98, @alanwguo, @farridav, @aslonnie, @simonsays1980, @pcmoritz, @terraflops1048576, @JoshKarpel, @SumanthRH, @sijieamoy, @zcin, @can-anyscale, @akyang-anyscale, @angelinalg, @saihaj, @jjyao, @anmscale, @ryanaoleary, @dentiny, @jimmyxie-figma, @stephanie-wang, @khluu, @maofagui
