ray-project/ray ray-2.53.0 on GitHub

Highlights

Ray plans to drop support for Pydantic V1 starting with version 2.56.0. Please see this RFC for details.
Ray Data now has support for bounded reading from Kafka and improved Iceberg support.

Ray Data

🎉 New Features

Autoscaling: New utilization-based cluster autoscaler for Ray Data workloads (#59353, #59362, #59366). To use the new autoscaler, set RAY_DATA_CLUSTER_AUTOSCALER=V2.
Kafka Datasource: Add Kafka as a native datasource for data ingestion (#58592)
Dataset summary API: Add Dataset.summary() API for quick dataset inspection (#58862)
Iceberg support: Add Iceberg schema evolution, upsert, and overwrite support (#59210, #59335)
Graceful error handling: Add should_continue_on_error for graceful error handling in batch inference (#59212)
Datetime compute expressions: Add datetime compute expressions support (#58740)
Grouped with_column expressions: Enable expressions for grouped with_column in Ray Data (#58231)
Parallelized collation: Parallelize DefaultCollateFn, arrow_batch_to_tensors (#58821)
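
The RAY_DATA_CLUSTER_AUTOSCALER setting from the Autoscaling item above can also be applied from Python. This is a minimal sketch that assumes the setting is read from the process environment, so it is set before any Ray Data code runs; the release notes only state the name and value.

```python
import os

# Assumption: RAY_DATA_CLUSTER_AUTOSCALER is read from the process environment,
# so it is set here before any Ray Data pipeline is created or executed.
os.environ["RAY_DATA_CLUSTER_AUTOSCALER"] = "V2"

autoscaler_version = os.environ["RAY_DATA_CLUSTER_AUTOSCALER"]
print(f"Ray Data cluster autoscaler: {autoscaler_version}")
```

Exporting the variable in the shell that launches the driver should have the same effect under the same assumption.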

💫 Enhancements

Optimized Autoscaler Step Size: Optimize autoscaler to support configurable step size for actor pool scaling (#58726)
Improved Streaming Repartition: Improve streaming repartition performance (#58728)
Actor init retry: Add actor retry if there's a failure in __init__ (#59105)
Fused Repartition + MapBatches: Fuse StreamingRepartition with MapBatches operators to scale collate (#59108)
Combined repartitions: Combine consecutive repartitions for efficiency (#59145)
Prefetch buffering: Handle prefetch buffering in iter_batches (#58657)
HashShuffle block breakdown: HashShuffleAggregator breaks down blocks on finalize (#58603)
Backpressure tuning: Tune concurrency cap backpressure object store budget ratio (#58813)
Non-string ApproximateTopK: Support non-string items for ApproximateTopK aggregator (#58659)
Lance version support: Add version support to read_lance() (#58895)
Dashboard metrics: Add time_to_first_batch and get_ref_bundles metrics to data dashboard (#58912)
Iter prefetched bytes stats: Add iter_prefetched_bytes statistics tracking (#58900)
Configurable batching for iter_batches: Add configurable batching for resolve_block_refs to speed up iter_batches (#58467)
Improved dashboard metrics: Improve Ray Data dashboard metrics display (#58667)
Histogram percentiles: Update Ray Data histograms to show percentiles in data dashboard (#58650)
Deprecated API removal: Remove deprecated read_parquet_bulk API (#58970)
Block shaping option: Add an option to disable block shaping in BlockOutputBuffer (#58757)
Removed concurrency lock: Remove concurrency lock for better performance (#56798)

🔨 Fixes

Fixes to Unique: Fix support of list types for Unique aggregator (#58916)
Parquet NaN fix: Fix reading from written parquet for numpy with NaNs (#59172)
Hash Shuffle empty block: Fix empty block sort in hash shuffle operator (#58836)
Hive partitioning pushdown: Fix pushdown optimizations with Hive partitioning (#58723)
Object Store usage reporting: Fix obj_store_mem_max_pending_output_per_task reporting (#58864)
Pyarrow FileSystem serialization fix: Handle filesystem serialization issue in get_parquet_dataset (#57047)
Azure UC SAS: Handle Azure UC user delegation SAS (#59393)
Async UDF Thread Cleanup: Close threads from async UDF after actor died (#59261)
Object Locality Default: Return 0s by default for object locality instead of -1s (#58754)

📖 Documentation

Added contributing guide to Ray Data documentation (#58589)
Added download expression to key user journeys in documentation (#59417)
Added Kafka user guide (#58881)
Added unstructured data templates from Ray Summit 2025 (#57063)
Improved instructions for reading Hugging Face datasets (#58492, #58832)
Refined batch-format guidance in docs (#58971)
Exposed vision_preprocess and vision_postprocess in VLM docs (#59012)
Added instructions for upgrading huggingface_hub (#59109)
Added doc on scaling out expensive collation functions (#58993)

Ray Serve

🎉 New Features

Deployment topology visibility. Exposes deployment dependency graphs in Serve REST API, allowing users to visualize and understand the DAG structure of their applications. (#58355)
External autoscaler integration. Adds external_scaler_enabled flag to application config, enabling third-party autoscalers to control replica counts. (#57727, #57698)
Node rank and local rank support. Extends replica rank system to track node-level and per-node local ranks, enabling better distributed serving coordination for multi-node deployments. (#58477, #58479)
Custom batch size function. Allows users to define custom functions for computing logical batch sizes in @serve.batch, useful when batch items have varying weights (e.g., token counts in LLM inference). (#59059)
Stateful application-level autoscaling. Adds policy state persistence for custom autoscaling policies, allowing policies to maintain state across control-loop iterations. (#59118)
New autoscaling, batching, and routing metrics. Adds Prometheus metrics for autoscaling decisions (ray_serve_deployment_target_replicas, ray_serve_autoscaling_decision_replicas), batching statistics, and router queue latency for improved observability. (#59220, #59232, #59233)
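
To illustrate the idea behind the custom batch size function above, here is a plain-Python sketch of batching by logical size instead of item count. It is not the @serve.batch API: `split_by_logical_size`, `token_count`, and `max_batch_size` are hypothetical names chosen for this example, and whitespace splitting stands in for a real tokenizer.

```python
from typing import Callable, List, Sequence


def split_by_logical_size(
    items: Sequence[str],
    size_fn: Callable[[str], int],
    max_batch_size: int,
) -> List[List[str]]:
    """Group items so the summed logical size of each batch stays within the cap."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for item in items:
        item_size = size_fn(item)
        # Start a new batch when adding this item would exceed the cap.
        if current and current_size + item_size > max_batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += item_size
    if current:
        batches.append(current)
    return batches


def token_count(prompt: str) -> int:
    # Hypothetical weight: whitespace-separated tokens stand in for a tokenizer.
    return len(prompt.split())


prompts = ["a b c", "d e", "f g h i", "j"]
batches = split_by_logical_size(prompts, token_count, max_batch_size=5)
print(batches)
```

With a cap of 5 tokens, the four prompts end up in two batches of two prompts each, whereas counting items alone would treat every prompt as equally expensive.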

💫 Enhancements

Smarter downscaling behavior. Prioritizes stopping most recently scaled-up replicas during downscale, preserving long-lived replicas that are optimally placed and fully warmed up. (#52929)
Autoscaling performance optimizations. Short-circuits metric aggregation for single time series cases (O(n log n) → O(1)) and lazily evaluates expensive autoscaling context fields to reduce controller CPU usage. (#58962, #58963)
Route matching cleanup. Removes redundant route matching logic from replicas since correct route values are now included in RequestMetadata. Also allows multiple methods (GET, PUT) corresponding to a route. (#58927)
Deployment wrapper metadata preservation. Wrapper classes from decorators like @ingress now preserve original class metadata (__qualname__, __module__, __doc__, __annotations__). (#58478)
Improved type annotations. Enhances generic type annotations on DeploymentHandle, DeploymentResponse, and DeploymentResponseGenerator for better IDE support and type inference. Adds .result() stub to DeploymentResponseGenerator to fix static typing errors. (#59363, #58522)
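
The downscaling preference described above (stop the most recently scaled-up replicas first) can be sketched in a few lines. This is an illustration only, not the Serve controller's implementation; `Replica`, `start_time`, and `pick_replicas_to_stop` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replica:
    replica_id: str
    start_time: float  # seconds since some reference point; larger means newer


def pick_replicas_to_stop(replicas: List[Replica], num_to_stop: int) -> List[Replica]:
    """Return the newest replicas first, so long-lived replicas are kept."""
    newest_first = sorted(replicas, key=lambda r: r.start_time, reverse=True)
    return newest_first[: max(num_to_stop, 0)]


replicas = [
    Replica("a", 100.0),
    Replica("b", 250.0),
    Replica("c", 175.0),
    Replica("d", 300.0),
]
to_stop = pick_replicas_to_stop(replicas, num_to_stop=2)
print([r.replica_id for r in to_stop])
```

Here replicas "d" and "b" are selected, leaving the two oldest replicas running.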

🔨 Fixes

YAML serialization for autoscaling enums. Fixes RepresenterError when using serve build with AggregationFunction enum values in autoscaling config. (#58509)
Autoscaling context timestamp fix. Correctly sets last_scale_up_time and last_scale_down_time on autoscaling context. (#59057)
Deadlock in chained deployment responses. Fixes hang when awaiting intermediate DeploymentResponse objects in a chain of deployment calls from different event loops. (#59385)
FastAPI class-based view inheritance. Fixes make_fastapi_class_based_view to properly handle inherited methods. (#59410)

📖 Documentation

Async I/O best practices guide. New documentation covering async programming patterns and best practices for Ray Serve deployments. (#58909)
Replica scheduling guide. New documentation covering compact scheduling, placement groups, custom resources, and guidance on when to use each feature. (#59114)

Ray Train

🎉 New Features

Worker Placement with Label Selectors: Added label_selector to ScalingConfig. This allows users to control worker placement by targeting specific labeled nodes in the cluster. (#58845, #59414)
Multihost JaxTrainer on GPU: Introduced support for JaxTrainer running on GPU machines. (#58322)
Checkpoint Consistency Modes: Added CheckpointConsistencyMode to get_all_reported_checkpoints, providing options for handling checkpoint retrieval consistency. (#58271)
Per-Dataset Execution Options: DataConfig now supports setting execution_options on a per-dataset basis for finer-grained control over data loading. (#58717)
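
As a rough picture of what targeting labeled nodes means for the label_selector item above, the sketch below filters nodes by exact label match. It does not call Ray: the node names, label keys, and the `matching_nodes` helper are hypothetical, and only equality matching is modeled.

```python
from typing import Dict, List


def matching_nodes(
    nodes: Dict[str, Dict[str, str]],
    selector: Dict[str, str],
) -> List[str]:
    """Return the IDs of nodes whose labels contain every selector key/value pair."""
    return [
        node_id
        for node_id, labels in nodes.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]


# Hypothetical cluster with made-up label keys and values.
nodes = {
    "node-1": {"accelerator": "A100", "zone": "us-west-2a"},
    "node-2": {"accelerator": "T4", "zone": "us-west-2a"},
    "node-3": {"accelerator": "A100", "zone": "us-west-2b"},
}
selected = matching_nodes(nodes, {"accelerator": "A100"})
print(selected)
```

Adding a second key to the selector narrows the result further, since every pair has to match.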

💫 Enhancements

Nested Metrics Support: Result.get_best_checkpoint now supports nested metrics, allowing for more flexible metric tracking and checkpoint selection. (#58537)
Non-Blocking Checkpoint Retrieval: get_all_reported_checkpoints no longer blocks when only metrics are reported. (#58870)
Improved Resource Cleanup: Implemented eager cleanup of data resources and placement groups upon training run failures or aborts, preventing resource leaks. (#58325, #58515)
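
The nested-metrics item above amounts to selecting a checkpoint by a value that sits inside a nested metrics dictionary. The sketch below shows that lookup in plain Python; the "/"-separated path, the `get_nested_metric` helper, and the sample data are assumptions for illustration and may not match how Result.get_best_checkpoint addresses nested keys.

```python
from typing import Any, Dict


def get_nested_metric(metrics: Dict[str, Any], path: str, sep: str = "/") -> Any:
    """Walk a nested dict following a separator-delimited path."""
    value: Any = metrics
    for key in path.split(sep):
        value = value[key]
    return value


# Hypothetical (checkpoint name, reported metrics) pairs.
reported = [
    ("checkpoint_0", {"eval": {"loss": 0.9}}),
    ("checkpoint_1", {"eval": {"loss": 0.4}}),
    ("checkpoint_2", {"eval": {"loss": 0.6}}),
]

# Pick the checkpoint with the lowest nested eval loss.
best = min(reported, key=lambda entry: get_nested_metric(entry[1], "eval/loss"))
print(best[0])
```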

🔨 Fixes

MLflow Compatibility: Updated setup_mlflow API to ensure full compatibility with Ray Train V2. (#58705)
Validation for Checkpoint Uploads: A ValueError is now raised if checkpoint_upload_fn fails to return a valid checkpoint. (#58863)

📖 Documentation

New API Documentation: Added comprehensive documentation for the ray.train.get_all_reported_checkpoints method. (#58946)

Ray Tune

💫 Enhancements

Nested Metrics Support: Result.get_best_checkpoint now supports nested metrics, allowing for more flexible metric tracking and checkpoint selection. (#58537)

Ray LLM

💫 Enhancements

Cloud filesystem restructuring with provider-specific implementations (#58469)
Bump transformers to 4.57.3 (#58980)
Ray Data LLM config refactor (#58298)
Update vllm_engine.py to check for VLLM_USE_V1 attribute (#58820)
Infer VLLM_RAY_PER_WORKER_GPUS from fractional placement-group bundles automatically (#58949)

🔨 Fixes

Fix LLM DP release test configuration (#59090)

Ray RLlib

🎉 New Features

DreamerV3: Allow num_env_runners > 0 (#58495)

💫 Enhancements

🔥 MetricsLogger tweaks + Stats rewrite (#56838)
Move restart message into EnvRunner (#56750)
Make “Footsies” optionally less verbose (#58939)
Update a deprecated AlgorithmConfig argument that had incorrect behavior/semantics (#59138)
Examples/docs cleanup:
- merge tuned examples into examples/ (#58893)
- move old API examples (#59159)
- move example run scripts (#59160)
- remove Torch 2.x doc tied to removed benchmarks (#59173)
- remove rllib/benchmark(s) folder from RLlib directory (#59158)
Testing / CI & infra cleanup (part of a larger effort to organize + harden RLlib testing):
- clean up tests folder layout in favor of /component/tests (#58890)
- re-enable and fix nightly tests for APPO on Atari and MuJoCo (#58853)
- re-enable all RLlib doctests (#58974)
- add pytest reporting hook (pytest_runtest_makereport) across tests (#59003)
- add/enable RLlib Py3.10 CI lane (#59226)
- fix as-release-test silently failing (#59386)
- fix recursive imports in old test-utils location (#59435)
- Remove asv.conf.json (#58934)
- Update requirement for byod_rllib.sh (#59157)

🔨 Fixes

Fix custom model-config mismatch between EnvRunner and Learner (#58739)
MultiAgentEnvRunner: prevent double-calling connectors (#58931)
Error handling: log or raise when a case is not fully handled (#58889)
Error handling: error out when data cannot be loaded (#59002)
Assorted RLlib bugfixes (#59386)

📖 Documentation

Update APPO paper reference to link to IMPACT paper (#58935)

Ray Core

🎉 New Features

Support zero-copy serialization for read-only PyTorch tensors via RAY_ENABLE_ZERO_COPY_TORCH_TENSORS (#57639)
Add .rayignore file support for controlling cluster uploads (#58500)
Improve large-scale resource view synchronization through sync message batching (#57641)
Autoscaler with cloud resource availability awareness (#58623)
Token authentication UX improvements with new AuthenticationError exception (#58737)
Support X-Ray-Authorization fallback header for auth token in dashboard (#58819)

💫 Enhancements

Limit core worker gRPC reply threads to 2 by default via RAY_core_worker_num_server_call_thread (#58771)
Make accessor node address and liveliness cache thread safe (#58947)
Create OtlpGrpcMetricExporter wrapper to log export failures (#58929)
Print detailed exception information when failing to report events (#58953)
Simplify local/global GC logic (#58671)
Surface correct error message when get_if_exists=True for actor lookup (#58628)
Throw AuthenticationError from Python for token loading errors (#59031)
Use secrets.token_hex(32) to generate auth tokens (#58818)
Remove AUTH_MODE=token check in get-auth-token CLI (#58848)
Introduce core chaos network release tests (#58868)
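
For reference, the secrets.token_hex(32) call named above comes from the Python standard library and produces 32 random bytes encoded as 64 hexadecimal characters. The snippet only demonstrates that call; how Ray stores or distributes the resulting token is not shown here.

```python
import secrets

# 32 random bytes from the OS CSPRNG, hex-encoded into a 64-character string.
token = secrets.token_hex(32)

print(len(token))  # number of hex characters in the token
```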

🔨 Fixes

Fix grpc_authentication_server_interceptors streaming response handling (#59104)
Fix handle leak in IsProcessAlive on Windows (#59106)
Fix counter metric default branch for RAY_enable_open_telemetry (#59095)
Fix leaking metric recorder in tests (#58952)
Fix crash when using JVM HDFS by adding RAY_DISABLE_FAILURE_SIGNAL_HANDLER option (#58984)
Fix heap corruption in RayletClient causing driver crash (use-after-free) (#58660)
Use shared_ptr for pins_in_flight_ to prevent use-after-free (#58744)
Remove deprecated add_command_alias (#58719)
Remove cluster_full_of_actors_detected_* fields (unused in autoscaler v2) (#59052)

📖 Documentation

Add token-auth.md documentation page (#58829)
Update KubeRay authentication guide to use native Ray token authentication (#58729)

Dashboard

💫 Enhancements

Add time_to_first_batch and get_ref_bundles metrics to data dashboard (#58912)
Update Ray Data histograms to show percentiles grouped by operator (#58650)

Ray Wheels and Images

Upgraded rich, cupy-cuda12x, and memray (#58983)
Upgraded lxml to 6.0.2 (#58808)
Upgraded requests from 2.32.3 to 2.32.5 (#58724)
Added openlineage-python to the dependency set (#58724)

Thanks

Thank you to everyone who contributed to this release!
@xinyuangui2, @harshit-anyscale, @Sparks0219, @israbbani, @siyuanfoundation, @robertnishihara, @thomasdesr, @spencer-p, @aslonnie, @ZacAttack, @soodoshll, @marosset, @simeetnayan81, @soffer-anyscale, @abrarsheikh, @400Ping, @richo-anyscale, @as-jding, @rueian, @kshanmol, @yancanmao, @zzchun, @coqian, @matthewdeng, @Future-Outlier, @YoussefEssDS, @ykdojo, @pseudo-rnd-thoughts, @lowdy1, @ArturNiederfahrenhorst, @myandpr, @komikndr, @machichima, @RisinT96, @curiosity-hyf, @alanwguo, @CaiZhanqi, @Aydin-ab, @MengjinYan, @suzuri-lollipop, @jeffreyjeffreywang, @rushikeshadhav, @alexeykudinkin, @meAmitPatil, @zcin, @teddygood, @elliot-barn, @dayshah, @srinathk10, @XLC127, @simonsays1980, @kevin85421, @bveeramani, @kunling-anyscale, @khluu, @andrew-anyscale, @KaisennHu, @kouroshHakha, @ryankert01, @pavitrabhalla, @jjyao, @dragongu, @SolitaryThinker, @justinrmiller, @wxwmd, @Haustle-v, @TimothySeah, @goutamvenkat-anyscale, @liulehui, @raulchen, @HassamSheikh, @Priya-753, @vaishdho1, @dancingactor, @daiping8, @eloaf, @JasonLi1909, @rayci-bot, @richardliaw, @SheldonTsen, @Yicheng-Lu-llll, @ktyxx, @pschmutz, @iamjustinhsu, @ahao-anyscale, @cem-anyscale, @eicherseiji, @edoakes, @rajeshg007, @arki05, @andrewsykim, @nrghosh, @ryanaoleary, @kyuds, @Daraan, @can-anyscale, @sampan-s-nayak, @xyuzh, @owenowenisme
