github ray-project/ray ray-2.53.0
Ray-2.53.0

Highlights

  • Ray plans to drop support for Pydantic V1 starting with version 2.56.0. Please see this RFC for details.
  • Ray Data now has support for bounded reading from Kafka and improved Iceberg support.

Ray Data

🎉 New Features

  • Autoscaling: New utilization-based cluster autoscaler for Ray Data workloads (#59353, #59362, #59366). To use the new autoscaler, set RAY_DATA_CLUSTER_AUTOSCALER=V2.
  • Kafka Datasource: Add Kafka as a native datasource for data ingestion (#58592)
  • Dataset summary API: Add Dataset.summary() API for quick dataset inspection (#58862)
  • Iceberg support: Add Iceberg schema evolution, upsert, and overwrite support (#59210, #59335)
  • Graceful error handling: Add should_continue_on_error for graceful error handling in batch inference (#59212)
  • Datetime compute expressions: Add datetime compute expressions support (#58740)
  • Grouped with_column expressions: Enable expressions for grouped with_column in Ray Data (#58231)
  • Parallelized collation: Parallelize DefaultCollateFn, arrow_batch_to_tensors (#58821)
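
The opt-in flag for the new autoscaler can be set from the driver script as well as from the shell. The snippet below is a minimal sketch: it assumes the variable is read from the driver's environment and therefore has to be set before any Ray Data pipeline is built; the release notes only state the variable name and value.

```python
import os

# Opt in to the utilization-based Ray Data cluster autoscaler.
# Assumption: the flag is read from the driver process environment, so it
# is set here before Ray is imported or any Dataset is created.
os.environ["RAY_DATA_CLUSTER_AUTOSCALER"] = "V2"

# Build and run the Ray Data pipeline after this point.
```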

💫 Enhancements

  • Optimized Autoscaler Step Size: Optimize autoscaler to support configurable step size for actor pool scaling (#58726)
  • Improved Streaming Repartition: Improve streaming repartition performance (#58728)
  • Actor init retry: Add actor retry if there's a failure in __init__ (#59105)
  • Fused Repartition + MapBatches: Fuse StreamingRepartition with MapBatches operators to scale collate (#59108)
  • Combined repartitions: Combine consecutive repartitions for efficiency (#59145)
  • Prefetch buffering: Handle prefetch buffering in iter_batches (#58657)
  • HashShuffle block breakdown: HashShuffleAggregator breaks down blocks on finalize (#58603)
  • Backpressure tuning: Tune concurrency cap backpressure object store budget ratio (#58813)
  • Non-string ApproximateTopK: Support non-string items for ApproximateTopK aggregator (#58659)
  • Lance version support: Add version support to read_lance() (#58895)
  • Dashboard metrics: Add time_to_first_batch and get_ref_bundles metrics to data dashboard (#58912)
  • Iter prefetched bytes stats: Add iter_prefetched_bytes statistics tracking (#58900)
  • Configurable batching for iter_batches: Add configurable batching for resolve_block_refs to speed up iter_batches (#58467)
  • Improved dashboard metrics: Improve Ray Data dashboard metrics display (#58667)
  • Histogram percentiles: Update Ray Data histograms to show percentiles in data dashboard (#58650)
  • Deprecated API removal: Remove deprecated read_parquet_bulk API (#58970)
  • Block shaping option: Add disable block shaping option to BlockOutputBuffer (#58757)
  • Removed concurrency lock: Remove concurrency lock for better performance (#56798)

🔨 Fixes

  • Fixes to Unique: Fix support of list types for Unique aggregator (#58916)
  • Parquet NaN fix: Fix reading from written parquet for numpy with NaNs (#59172)
  • Hash Shuffle empty block: Fix empty block sort in hash shuffle operator (#58836)
  • Hive partitioning pushdown: Fix pushdown optimizations with Hive partitioning (#58723)
  • Object Store usage reporting: Fix obj_store_mem_max_pending_output_per_task reporting (#58864)
  • Pyarrow FileSystem serialization fix: Handle filesystem serialization issue in get_parquet_dataset (#57047)
  • Azure UC SAS: Handle Azure UC user delegation SAS (#59393)
  • Async UDF Thread Cleanup: Close threads from async UDF after actor died (#59261)
  • Object Locality Default: Return 0s by default for object locality instead of -1s (#58754)

📖 Documentation

  • Added contributing guide to Ray Data documentation (#58589)
  • Added download expression to key user journeys in documentation (#59417)
  • Added Kafka user guide (#58881)
  • Added unstructured data templates from Ray Summit 2025 (#57063)
  • Improved instructions for reading Hugging Face datasets (#58492, #58832)
  • Refined batch-format guidance in docs (#58971)
  • Exposed vision_preprocess and vision_postprocess in VLM docs (#59012)
  • Added upgrading huggingface_hub instruction (#59109)
  • Added scaling out expensive collation functions doc (#58993)

Ray Serve

🎉 New Features

  • Deployment topology visibility. Exposes deployment dependency graphs in Serve REST API, allowing users to visualize and understand the DAG structure of their applications. (#58355)
  • External autoscaler integration. Adds external_scaler_enabled flag to application config, enabling third-party autoscalers to control replica counts. (#57727, #57698)
  • Node rank and local rank support. Extends replica rank system to track node-level and per-node local ranks, enabling better distributed serving coordination for multi-node deployments. (#58477, #58479)
  • Custom batch size function. Allows users to define custom functions for computing logical batch sizes in @serve.batch, useful when batch items have varying weights (e.g., token counts in LLM inference). (#59059)
  • Stateful application-level autoscaling. Adds policy state persistence for custom autoscaling policies, allowing policies to maintain state across control-loop iterations. (#59118)
  • New autoscaling, batching, and routing metrics. Adds Prometheus metrics for autoscaling decisions (ray_serve_deployment_target_replicas, ray_serve_autoscaling_decision_replicas), batching statistics, and router queue latency for improved observability. (#59220, #59232, #59233)
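
To illustrate what a logical batch size means when items have varying weights, here is a minimal, library-free sketch. It does not use the Serve API: the helper batch_by_logical_size and its parameters are hypothetical names chosen for this example, and whitespace-separated words stand in for token counts.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def batch_by_logical_size(
    items: Iterable[T],
    size_fn: Callable[[T], int],
    max_batch_size: int,
) -> List[List[T]]:
    """Greedily group items so each batch's summed logical size stays
    within max_batch_size (an oversized single item gets its own batch)."""
    batches: List[List[T]] = []
    current: List[T] = []
    current_size = 0
    for item in items:
        item_size = size_fn(item)
        # Close the current batch if adding this item would exceed the cap.
        if current and current_size + item_size > max_batch_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(item)
        current_size += item_size
    if current:
        batches.append(current)
    return batches


# Hypothetical weighting: count whitespace-separated words as "tokens".
prompts = ["a b c", "d e", "f g h i", "j"]
batches = batch_by_logical_size(
    prompts, size_fn=lambda p: len(p.split()), max_batch_size=5
)
```

With a cap of 5, the four prompts end up in two batches of two, whereas counting items alone would treat a one-word prompt and a four-word prompt as equal.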

💫 Enhancements

  • Smarter downscaling behavior. Prioritizes stopping most recently scaled-up replicas during downscale, preserving long-lived replicas that are optimally placed and fully warmed up. (#52929)
  • Autoscaling performance optimizations. Short-circuits metric aggregation for single time series cases (O(n log n) → O(1)) and lazily evaluates expensive autoscaling context fields to reduce controller CPU usage. (#58962, #58963)
  • Route matching cleanup. Removes redundant route matching logic from replicas since correct route values are now included in RequestMetadata. Also allows multiple HTTP methods (for example, GET and PUT) to map to a single route. (#58927)
  • Deployment wrapper metadata preservation. Wrapper classes from decorators like @ingress now preserve original class metadata (__qualname__, __module__, __doc__, __annotations__). (#58478)
  • Improved type annotations. Enhances generic type annotations on DeploymentHandle, DeploymentResponse, and DeploymentResponseGenerator for better IDE support and type inference. Adds .result() stub to DeploymentResponseGenerator to fix static typing errors. (#59363, #58522)
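
The downscaling preference described above can be pictured as a newest-first selection. The following is a simplified sketch, not the Serve controller's code: the Replica dataclass, its fields, and pick_replicas_to_stop are hypothetical names, and the real scheduler may weigh other factors as well.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replica:
    replica_id: str    # hypothetical identifier
    start_time: float  # seconds since epoch when the replica started


def pick_replicas_to_stop(replicas: List[Replica], num_to_stop: int) -> List[Replica]:
    """Return the most recently started replicas first, so long-lived
    (warmed-up) replicas are kept when scaling down."""
    newest_first = sorted(replicas, key=lambda r: r.start_time, reverse=True)
    return newest_first[: max(num_to_stop, 0)]


replicas = [
    Replica("r1", start_time=100.0),
    Replica("r2", start_time=200.0),
    Replica("r3", start_time=300.0),
]
to_stop = pick_replicas_to_stop(replicas, num_to_stop=2)
stopped_ids = [r.replica_id for r in to_stop]
```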

🔨 Fixes

  • YAML serialization for autoscaling enums. Fixes RepresenterError when using serve build with AggregationFunction enum values in autoscaling config. (#58509)
  • Autoscaling context timestamp fix. Correctly sets last_scale_up_time and last_scale_down_time on autoscaling context. (#59057)
  • Deadlock in chained deployment responses. Fixes hang when awaiting intermediate DeploymentResponse objects in a chain of deployment calls from different event loops. (#59385)
  • FastAPI class-based view inheritance. Fixes make_fastapi_class_based_view to properly handle inherited methods. (#59410)

📖 Documentation

  • Async I/O best practices guide. New documentation covering async programming patterns and best practices for Ray Serve deployments. (#58909)
  • Replica scheduling guide. New documentation covering compact scheduling, placement groups, custom resources, and guidance on when to use each feature. (#59114)

Ray Train

🎉 New Features

  • Worker Placement with Label Selectors: Added label_selector to ScalingConfig. This allows users to control worker placement by targeting specific labeled nodes in the cluster. (#58845, #59414)
  • Multihost JaxTrainer on GPU: Introduced support for JaxTrainer running on GPU machines. (#58322)
  • Checkpoint Consistency Modes: Added CheckpointConsistencyMode to get_all_reported_checkpoints, providing options for handling checkpoint retrieval consistency. (#58271)
  • Per-Dataset Execution Options: DataConfig now supports setting execution_options on a per-dataset basis for finer-grained control over data loading. (#58717)

💫 Enhancements

  • Nested Metrics Support: Result.get_best_checkpoint now supports nested metrics, allowing for more flexible metric tracking and checkpoint selection. (#58537)
  • Non-Blocking Checkpoint Retrieval: get_all_reported_checkpoints no longer blocks when only metrics are reported. (#58870)
  • Improved Resource Cleanup: Implemented eager cleanup of data resources and placement groups upon training run failures or aborts, preventing resource leaks. (#58325, #58515)
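
As a rough picture of what selecting a checkpoint by a nested metric involves, the sketch below resolves a path such as eval/loss inside a nested metrics dictionary. It is library-free and makes assumptions: the "/" separator, the helper names, and the (name, metrics) tuples are illustrative only and are not taken from the Result.get_best_checkpoint implementation.

```python
from typing import Any, Dict, List, Tuple


def get_nested_metric(metrics: Dict[str, Any], key: str, sep: str = "/") -> Any:
    """Walk a nested dict following a path like 'eval/loss'."""
    value: Any = metrics
    for part in key.split(sep):
        value = value[part]
    return value


def best_checkpoint(
    reported: List[Tuple[str, Dict[str, Any]]], metric: str, mode: str = "max"
) -> str:
    """Return the name of the checkpoint with the best nested metric."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    choose = max if mode == "max" else min
    return choose(reported, key=lambda item: get_nested_metric(item[1], metric))[0]


# Hypothetical reported checkpoints with nested metrics.
reported = [
    ("checkpoint_000", {"eval": {"loss": 0.9, "accuracy": 0.61}}),
    ("checkpoint_001", {"eval": {"loss": 0.4, "accuracy": 0.83}}),
    ("checkpoint_002", {"eval": {"loss": 0.6, "accuracy": 0.88}}),
]
best_by_loss = best_checkpoint(reported, "eval/loss", mode="min")
best_by_accuracy = best_checkpoint(reported, "eval/accuracy", mode="max")
```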

🔨 Fixes

  • MLflow Compatibility: Updated setup_mlflow API to ensure full compatibility with Ray Train V2. (#58705)
  • Validation for Checkpoint Uploads: A ValueError is now raised if checkpoint_upload_fn fails to return a valid checkpoint. (#58863)

📖 Documentation

  • New API Documentation: Added comprehensive documentation for the ray.train.get_all_reported_checkpoints method. (#58946)

Ray Tune

💫 Enhancements

  • Nested Metrics Support: Result.get_best_checkpoint now supports nested metrics, allowing for more flexible metric tracking and checkpoint selection. (#58537)

Ray LLM

💫 Enhancements

  • Cloud filesystem restructuring with provider-specific implementations (#58469)
  • Bump transformers to 4.57.3 (#58980)
  • Ray Data LLM config refactor (#58298)
  • Update vllm_engine.py to check for VLLM_USE_V1 attribute (#58820)
  • Infer VLLM_RAY_PER_WORKER_GPUS from fractional placement-group bundles automatically (#58949)

🔨 Fixes

  • Fix LLM DP release test configuration (#59090)

Ray RLlib

🎉 New Features

  • DreamerV3: allow num_env_runners > 0 (#58495)

💫 Enhancements

  • 🔥 MetricsLogger tweaks + Stats rewrite (#56838)
  • move restart message into EnvRunner (#56750)
  • make “Footsies” less verbose (optionally) (#58939)
  • update an AlgorithmConfig deprecated argument with incorrect behavior/semantics (#59138)
  • Examples/docs cleanup:
    • merge tuned examples into examples/ (#58893)
    • move old API examples (#59159)
    • move example run scripts (#59160)
    • remove Torch 2.x doc tied to removed benchmarks (#59173)
    • remove rllib/benchmark(s) folder from RLlib directory (#59158)
  • Testing / CI & infra cleanup (part of a larger effort to organize + harden RLlib testing):
    • clean up tests folder layout in favor of /component/tests (#58890)
    • re-enable and fix nightly tests for APPO on Atari and MuJoCo (#58853)
    • re-enable all RLlib doctests (#58974)
    • add pytest reporting hook (pytest_runtest_makereport) across tests (#59003)
    • add/enable RLlib Py3.10 CI lane (#59226)
    • fix as-release-test silently failing (#59386)
    • fix recursive imports in old test-utils location (#59435)
    • Remove asv.conf.json (#58934)
    • Update requirement for byod_rllib.sh (#59157)

🔨 Fixes

  • Fix custom model-config mismatch between EnvRunner and Learner (#58739)
  • MultiAgentEnvRunner: prevent double-calling connectors (#58931)
  • Error handling: log or raise when a case is not fully handled (#58889)
  • Error handling: error out when data cannot be loaded (#59002)
  • Assorted RLlib bugfixes (#59386)

📖 Documentation

  • Update APPO paper reference to link to IMPACT paper (#58935)

Ray Core

🎉 New Features

  • Support zero-copy serialization for read-only PyTorch tensors via RAY_ENABLE_ZERO_COPY_TORCH_TENSORS (#57639)
  • Add .rayignore file support for controlling cluster uploads (#58500)
  • Improve large-scale resource view synchronization through sync message batching (#57641)
  • Autoscaler with cloud resource availability awareness (#58623)
  • Token authentication UX improvements with new AuthenticationError exception (#58737)
  • Support X-Ray-Authorization fallback header for auth token in dashboard (#58819)
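
The fallback-header behavior can be sketched as a lookup that prefers the standard header and falls back to X-Ray-Authorization. This is an illustration under assumptions, not the dashboard's code: the function name is hypothetical, and the Bearer scheme and the header precedence are assumed here rather than stated in the notes above.

```python
from typing import Mapping, Optional

# Assumed precedence: standard header first, then the fallback header.
_AUTH_HEADERS = ("authorization", "x-ray-authorization")


def extract_auth_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the bearer token from the first header that carries one."""
    # HTTP header names are case-insensitive, so normalize the keys.
    normalized = {name.lower(): value for name, value in headers.items()}
    for name in _AUTH_HEADERS:
        value = normalized.get(name)
        if not value:
            continue
        scheme, _, token = value.partition(" ")
        # Assumption: tokens are sent as "Bearer <token>". A header with a
        # different scheme is skipped so the fallback can still be used.
        if scheme.lower() == "bearer" and token.strip():
            return token.strip()
    return None


token_from_fallback = extract_auth_token({"X-Ray-Authorization": "Bearer xyz"})
```

A fallback of this kind is useful when the standard Authorization header is already occupied by another credential in the request path.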

💫 Enhancements

  • Limit core worker gRPC reply threads to 2 by default via RAY_core_worker_num_server_call_thread (#58771)
  • Make accessor node address and liveliness cache thread safe (#58947)
  • Create OtlpGrpcMetricExporter wrapper to log export failures (#58929)
  • Print detailed exception information when failing to report events (#58953)
  • Simplify local/global GC logic (#58671)
  • Surface correct error message when get_if_exists=True for actor lookup (#58628)
  • Throw AuthenticationError from Python for token loading errors (#59031)
  • Use secrets.token_hex(32) to generate auth tokens (#58818)
  • Remove AUTH_MODE=token check in get-auth-token CLI (#58848)
  • Introduce core chaos network release tests (#58868)
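
For reference, the standard-library call named in the auth token item above behaves as follows. This is a small standalone sketch of secrets.token_hex itself; how Ray stores or distributes the resulting token is not shown and not assumed.

```python
import secrets
import string

# token_hex(32) draws 32 random bytes from the OS CSPRNG and returns them
# as a hexadecimal string, i.e. 64 characters (256 bits of entropy).
token = secrets.token_hex(32)

is_hex = all(ch in string.hexdigits for ch in token)
```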

🔨 Fixes

  • Fix grpc_authentication_server_interceptors streaming response handling (#59104)
  • Fix handle leak in IsProcessAlive on Windows (#59106)
  • Fix counter metric default branch for RAY_enable_open_telemetry (#59095)
  • Fix leaking metric recorder in tests (#58952)
  • Fix crash when using JVM HDFS by adding RAY_DISABLE_FAILURE_SIGNAL_HANDLER option (#58984)
  • Fix heap corruption in RayletClient causing driver crash (use-after-free) (#58660)
  • Use shared_ptr for pins_in_flight_ to prevent use-after-free (#58744)
  • Remove deprecated add_command_alias (#58719)
  • Remove cluster_full_of_actors_detected_* fields (unused in autoscaler v2) (#59052)

📖 Documentation

  • Add token-auth.md documentation page (#58829)
  • Update KubeRay authentication guide to use native Ray token authentication (#58729)

Dashboard

💫 Enhancements

  • Add time_to_first_batch and get_ref_bundles metrics to data dashboard (#58912)
  • Update Ray Data histograms to show percentiles grouped by operator (#58650)

Ray Wheels and Images

  • Upgraded rich, cupy-cuda12x, and memray (#58983)
  • Upgraded lxml to 6.0.2 (#58808)
  • Upgraded requests from 2.32.3 to 2.32.5 (#58724)
  • Added openlineage-python in the dependency set (#58724)

Thanks

Thank you to everyone who contributed to this release!
@xinyuangui2, @harshit-anyscale, @Sparks0219, @israbbani, @siyuanfoundation, @robertnishihara, @thomasdesr, @spencer-p, @aslonnie, @ZacAttack, @soodoshll, @marosset, @simeetnayan81, @soffer-anyscale, @abrarsheikh, @400Ping, @richo-anyscale, @as-jding, @rueian, @kshanmol, @yancanmao, @zzchun, @coqian, @matthewdeng, @Future-Outlier, @YoussefEssDS, @ykdojo, @pseudo-rnd-thoughts, @lowdy1, @ArturNiederfahrenhorst, @myandpr, @komikndr, @machichima, @RisinT96, @curiosity-hyf, @alanwguo, @CaiZhanqi, @Aydin-ab, @MengjinYan, @suzuri-lollipop, @jeffreyjeffreywang, @rushikeshadhav, @alexeykudinkin, @meAmitPatil, @zcin, @teddygood, @elliot-barn, @dayshah, @srinathk10, @XLC127, @simonsays1980, @kevin85421, @bveeramani, @kunling-anyscale, @khluu, @andrew-anyscale, @KaisennHu, @kouroshHakha, @ryankert01, @pavitrabhalla, @jjyao, @dragongu, @SolitaryThinker, @justinrmiller, @wxwmd, @Haustle-v, @TimothySeah, @goutamvenkat-anyscale, @liulehui, @raulchen, @HassamSheikh, @Priya-753, @vaishdho1, @dancingactor, @daiping8, @eloaf, @JasonLi1909, @rayci-bot, @richardliaw, @SheldonTsen, @Yicheng-Lu-llll, @ktyxx, @pschmutz, @iamjustinhsu, @ahao-anyscale, @cem-anyscale, @eicherseiji, @edoakes, @rajeshg007, @arki05, @andrewsykim, @nrghosh, @ryanaoleary, @kyuds, @Daraan, @can-anyscale, @sampan-s-nayak, @xyuzh, @owenowenisme
