github ray-project/ray ray-2.51.0
Ray-2.51.0

one day ago

Release Highlights

Ray Train:

  • Ray Train v2 is now enabled by default! Ray Train v2 provides usability and stability improvements, as well as new features. For more details, see the REP and Migration Guide. To disable Ray Train v2, set the environment variable RAY_TRAIN_V2_ENABLED=0.
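
As a minimal sketch of the opt-out, the flag can also be set from Python rather than the shell. The variable name comes from the note above. The helper name `disable_train_v2` is hypothetical, and it is an assumption that the flag must be set before Ray Train is imported and that worker processes need it propagated separately (for example through a runtime environment).

```python
import os


def disable_train_v2() -> None:
    """Opt this process out of Ray Train v2 (flag name taken from the release notes)."""
    os.environ["RAY_TRAIN_V2_ENABLED"] = "0"


disable_train_v2()

# Assumption: the flag is read early, so import Ray Train only after setting it.
# import ray.train
```

Setting `RAY_TRAIN_V2_ENABLED=0` in the shell before launching the driver has the same effect for that process.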

Ray Serve:

  • Application-level autoscaling: Introduces custom autoscaling policies that operate across all deployments in an application, enabling coordinated scaling decisions based on aggregate metrics. Unlike per-deployment autoscaling, a single policy can manage resources for the whole application.
  • Enhanced autoscaling capabilities with replica-level metrics: AutoscalingContext now exposes total_running_requests, total_queued_requests, and total_num_requests, and autoscaling metrics can be aggregated with min, max, and time-weighted average functions. Together these give custom autoscaling policies fine-grained, real-time workload metrics to act on.
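
To illustrate what a coordinated, application-wide scaling decision might look like, here is a self-contained sketch that does not import Ray. `FakeContext` is a hypothetical stand-in that carries only the three metric fields named above, and `app_level_policy`, `target_per_replica`, and `replica_budget` are illustrative names, not part of the Ray Serve API. The real policy signature and return type should be taken from the Ray Serve documentation.

```python
from dataclasses import dataclass
from math import ceil
from typing import Dict


@dataclass
class FakeContext:
    """Hypothetical stand-in exposing only the fields named in the release notes."""
    total_running_requests: float
    total_queued_requests: float
    total_num_requests: float


def app_level_policy(
    contexts: Dict[str, FakeContext],
    target_per_replica: float = 10.0,
    replica_budget: int = 16,
) -> Dict[str, int]:
    """Pick replica counts for every deployment under one shared budget."""
    # Step 1: what each deployment would ask for on its own.
    desired = {
        name: max(1, ceil(ctx.total_num_requests / target_per_replica))
        for name, ctx in contexts.items()
    }
    total = sum(desired.values())
    if total <= replica_budget:
        return desired
    # Step 2: over budget, so shrink every deployment proportionally,
    # keeping at least one replica each.
    scale = replica_budget / total
    return {name: max(1, int(n * scale)) for name, n in desired.items()}


contexts = {
    "preprocess": FakeContext(30, 10, 40),
    "model": FakeContext(5, 0, 5),
}
print(app_level_policy(contexts))                    # within budget
print(app_level_policy(contexts, replica_budget=4))  # budget forces a trade-off
```

The point of the sketch is the second step: a per-deployment policy cannot see that the combined request exceeds the budget, while an application-level policy can.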

Ray Libraries

Ray Data

🎉 New Features:

  • Added enhanced support for Unity Catalog integration (#57954, #58049)
  • New expression evaluator infrastructure for improved query optimization (#57778, #57855)
  • Support for SaveMode in write operations (#57946)
  • Added approximate quantile aggregator (#57598)
  • MCAP datasource support for robotics data (#55716)
  • Callback-based stat computation for preprocessors and ValueCounter (#56848)
  • Support for multiple download URIs with improved error handling (#57775)

💫 Enhancements:

  • Improved projection pushdown handling with renamed columns (#58033, #58037, #58040, #58071)
  • Enhanced hash-shuffle performance with better retry policies (#57572)
  • Streamlined concurrency parameter semantics (#57035)
  • Improved execution progress rendering (#56992)
  • Better handling of empty columns in pandas blocks (#57740)
  • Enhanced support for complex data types and column operations (#57271)
  • Reduced memory usage with improved streaming generator backpressure (#57688)
  • Enhanced preemption testing and utilities (#57883)
  • Improved Download operator display names (#57773)
  • Better handling of variable-shaped tensors and tensor columns (#57240)
  • Optimized aggregator execution with out-of-order processing by default (#57753)

🔨 Fixes:

  • Fixed renamed columns to be appropriately dropped from output (#58040, #58071)
  • Fixed handling of renames in projection pushdown (#58033, #58037)
  • Fixed vLLMEngineStage field name inconsistency for images (#57980)
  • Fixed driver hang during streaming generator block metadata retrieval (#56451)
  • Fixed retry policy for hash-shuffle tasks (#57572)
  • Fixed prefetch loop to avoid blocking on fetches (#57613)
  • Fixed empty projection handling (#57740)
  • Fixed errors with concatenation of mixed pyarrow native and extension types (#56811)

📖 Documentation:

  • Updated document embedding benchmark to use canonical Ray Data API (#57977)
  • Improved concurrency-related documentation (#57658)
  • Updated preprocessing and data handling examples

Ray Train

🎉 New Features:

  • Turn on Train v2 by default (#57857)
  • Top-level ray.train aliases for public APIs (#57758)

💫 Enhancements:

  • Raise clear errors when mixing v1/v2 APIs (#57570)
  • JAX backend: add jax.distributed.shutdown() for JaxBackend (#57802)
  • Update TrainingFailedError module (#57865)
  • Improve deprecation handling when ray.train methods are called from ray.tune (#57810)
  • Enable deprecation warnings for legacy XGBoost/LightGBM trainers (#57280)

🔨 Fixes:

  • Fix ControllerError triggered by after_worker_group_poll_status errors (#57869)
  • Fix iter_torch_batches use of ray.train.torch.get_device outside Train (#57816)
  • Fix exception-queue race condition in ThreadRunner (#57249)

📖 Documentation:

  • Add validation and details to checkpoint docs (#57065)

🏗 Architecture / tests

  • Enable Train v2 across test suites; migrate remaining tests and isolate/disable stragglers (#56868, #57256, #57534, #57722, #57764)
  • Isolate circular-dependency tests and resolve circular imports (#57710, #56921)
  • Replace Checkpoint Manager Pydantic v2 APIs with v1 (#57147)
  • Bump test timeouts (test_util, torch_trainer) (#57939, #57873)

Ray Tune

💫 Enhancements:

  • Updated release tests to import from tune (#57956)
  • Better integration with Train v2 backend

Ray Serve

🎉 New Features:

  • Application-level autoscaling. Introduces support for custom autoscaling policies that operate across all deployments in an application, enabling coordinated scaling decisions based on aggregate metrics. (#57535, #57548, #57637, #57756)
  • Autoscaling metrics aggregation functions. Adds support for min, max, and time-weighted average aggregation over timeseries data, providing more flexible autoscaling control. (#56871)
  • Enhanced autoscaling context with replica-level metrics. Wires up AutoscalingContext constructor arguments to expose total_running_requests, total_queued_requests, and total_num_requests for use in custom autoscaling policies. (#57202)
  • Multiple task consumers in a single application. Ray Serve applications can now run multiple task consumer deployments concurrently. (#56618)
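
To show why a time-weighted average can differ from min, max, or a plain mean, here is a small sketch over timeseries samples. It assumes each sample's value holds until the next sample arrives (a step function). Ray Serve's actual aggregation may differ, and `time_weighted_average` is an illustrative name, not a Ray API.

```python
from typing import List, Tuple


def time_weighted_average(samples: List[Tuple[float, float]], window_end: float) -> float:
    """Average of (timestamp, value) samples, weighting each value by how long it held.

    Samples must be sorted by timestamp; the last value holds until window_end.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    total_time = window_end - samples[0][0]
    if total_time <= 0:
        # Degenerate window: fall back to the latest value.
        return samples[-1][1]
    # Pair each sample with the timestamp at which its value stops applying.
    end_times = [t for t, _ in samples[1:]] + [window_end]
    weighted = sum(v * (end - t) for (t, v), end in zip(samples, end_times))
    return weighted / total_time


# 10 requests in flight for 8 seconds, then a 2-second spike to 50.
samples = [(0.0, 10.0), (8.0, 50.0)]
values = [v for _, v in samples]
print(min(values), max(values), time_weighted_average(samples, window_end=10.0))
```

Here min is 10 and max is 50, while the time-weighted average is 18 because the spike covers only a fifth of the window.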

💫 Enhancements:

  • Reconfigure invoked on replica rank changes. The reconfigure method now receives both user_config and rank parameters when ranks change, enabling replicas to adapt their configuration dynamically. (#57091)
  • Celery adapter configuration improvements. Added default serializer and new configuration fields to enhance Celery integration flexibility. (#56707)
  • AutoscalingContext promoted to public API. The autoscaling context is now officially part of the public API with comprehensive documentation. (#57600)
  • Async inference telemetry. Added telemetry tracking to monitor the number of replicas using asynchronous inference. (#57665)
  • Rank logging verbosity reduced. Changed seven rank-related INFO logs to DEBUG level, reducing log noise during normal operations. (#57831)
  • Controller logging optimized. Removed expensive debug logs from the controller that were costly in large clusters. (#57813)
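
The reconfigure change above can be pictured with a plain Python class, run here without Ray. This is a sketch under assumptions: the parameter names `user_config` and `rank` follow the note above, while treating `rank` as an integer, the class name `ShardedModel`, and the `num_shards` config key are all hypothetical. A real deployment would be decorated with Ray Serve's deployment decorator and invoked by Serve rather than called directly.

```python
class ShardedModel:
    """Hypothetical replica that picks a data shard from its rank."""

    def __init__(self) -> None:
        self.rank = None
        self.shard = None

    def reconfigure(self, user_config: dict, rank: int) -> None:
        # Assumption: rank is an integer that is unique among running replicas.
        num_shards = user_config.get("num_shards", 1)
        self.rank = rank
        self.shard = rank % num_shards


# Simulate what happens when a replica's rank changes from 5 to 2.
replica = ShardedModel()
replica.reconfigure({"num_shards": 4}, rank=5)
first_shard = replica.shard
replica.reconfigure({"num_shards": 4}, rank=2)
print(first_shard, replica.shard)
```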

🔨 Fixes:

  • Max constructor retry count test fixed for Windows. Adjusted test resource requirements to account for Windows process creation overhead compared to Linux forking. (#57541)
  • Streaming test stability improvements. Added synchronization mechanisms to prevent chunk coalescing and rechunking, eliminating test flakiness. (#57592, #57728)
  • Autoscaling test deflaking. Fixed race conditions in application-level autoscaling tests and removed flaky min aggregation test scenario. (#57784, #57967)
  • State API usage test corrected. Fixed a unit test that was broken but not running in CI. (#56948)
  • Controller recovery logging condition fixed. Updated test condition to properly verify debug and JSON logs after controller recovery. (#57568)

📖 Documentation:

  • Custom autoscaling documentation. Added comprehensive guide for implementing custom autoscaling policies with examples and best practices. (#57600)
  • Replica ranks documentation. Documented the replica rank feature, including how ranks are assigned and how to use them in reconfigure methods. (#57649)
  • Application-level autoscaling guide. Added documentation explaining how to configure and use application-level autoscaling policies. (#57756)
  • Autoscaling documentation improvements. Updated serve autoscaling docs with clearer explanations and examples. (#57652)
  • Performance flags documentation. Documented performance-related configuration flags for Ray Serve. (#57845)
  • Metrics documentation fix. Corrected ray_serve_deployment_queued_queries metric name discrepancy in documentation. (#57629)
  • AutoscalingContext import added to examples. Fixed missing import statement in custom autoscaling policy example. (#57876)
  • App builder guide typo corrected. Fixed command syntax error in typed application builder example. (#57634)
  • Celery filesystem broker note. Added warning about using filesystem as a broker in Celery workers. (#57686)
  • Async inference alpha stage warning. Added notice that async inference is in alpha stage. (#57268)

🏗 Architecture refactoring:

  • Autoscaling control moved to application state. Migrated autoscaling control loop from deployment state to application state, preparing for application-level autoscaling. (#57548)
  • Async capability enum removed. Cleaned up unused async capability enum from codebase. (#57666)

Ray Serve/Data LLM

🎉 New Features:

  • Updated vLLM to 0.11.0 and Nixl to 0.6.0 (#57201)
  • Video processor support for multimodal pipelines (#56785)
  • Enhanced callback API for engine customization (#57257)
  • Unified and extended builder configuration for LLM deployments (#57724)

💫 Enhancements:

  • Protocol-based typing improvements and cleaner inheritance structure (#57743)
  • Better engine metrics enabled by default (#57615)
  • Simplified NIXL dependency management in ray-llm images (#57706)
  • Per-stage map kwargs for LLM processor preprocessing/postprocessing (#57826)
  • Improved architecture documentation (#57830)
  • Better code structure alignment with architectural design (#57889)
  • Enhanced multimodal support with Deepseek compatibility (#56906)

🔨 Fixes:

  • Fixed NIXL limitations with proper exception handling (#58159)
  • Improved runai_streamer for vLLM 0.10.2+ integration (#56906)

📖 Documentation:

  • Added comprehensive architecture documentation for Ray Serve LLM (#57830)
  • Reorganized LLM documentation with improved navigation (#57787)
  • Added benchmark page for performance reference (#57960)
  • Converted quick-start guide to MyST Markdown (#57782)
  • Better organization of Ray Serve LLM documentation (#57181?)

RLlib

🎉 New Features:

  • Prometheus metrics support for selected RLlib components (#57932)
  • Enhanced support for complex observations in SingleAgentEpisode (#57017)

💫 Enhancements:

  • Linting improvements: enabled ruff imports for rllib/utils (#56737)
  • Better type hints for learner_connector (#57673)
  • Improved throughput metrics to avoid biasing (#57215)

🔨 Fixes:

  • Fixed segment_tree.py edge case (#57599)
  • Fixed small bug in type hints (#57673)

Ray Core

🎉 New Features:

  • Enhanced Ray Direct Transport (RDT) with improved NIXL integration and garbage collection (#57671, #57603, #58159)
  • Cgroups support improvements with better system resource management (#57776, #57864, #57731, #58017, #58028, #58064)
  • Fault-tolerant RPC improvements for better distributed reliability (#57786, #57861)
  • Exponential backoff for retryable gRPCs (#56568)
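
For readers unfamiliar with the retry pattern named in the last item, here is a generic sketch of exponential backoff in Python. It is not Ray's implementation, which lives in the C++ RPC layer. The names `backoff_delays` and `call_with_backoff`, the default values, and the use of `ConnectionError` as a stand-in for a retryable gRPC status are all assumptions.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(base: float, factor: float, cap: float, attempts: int) -> List[float]:
    """Delay before each retry: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor**i) for i in range(attempts)]


def call_with_backoff(
    fn: Callable[[], T],
    *,
    sleep: Callable[[float], object] = time.sleep,
    base: float = 0.1,
    factor: float = 2.0,
    cap: float = 5.0,
    attempts: int = 5,
) -> T:
    """Call fn, retrying on ConnectionError with growing delays."""
    for delay in backoff_delays(base, factor, cap, attempts):
        try:
            return fn()
        except ConnectionError:
            sleep(delay)
    # Final attempt: let any exception propagate to the caller.
    return fn()


print(backoff_delays(base=1.0, factor=2.0, cap=10.0, attempts=5))
```

Production implementations commonly add random jitter to each delay so that many clients do not retry in lockstep; it is omitted here to keep the output deterministic.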

💫 Enhancements:

  • Migrated from STATS to metric interface in RPC components (#57926)
  • Improved histogram metrics midpoint calculation (#57948)
  • Made FreeObjects non-fatal for better error handling (#57550)
  • Enhanced ReleaseUnusedBundles fault tolerance (#57786)
  • Made DrainRaylet and ShutdownRaylet fault tolerant (#57861)
  • Better error handling for metric and event exporter agent (#57925)
  • Improved raylet shutdown process and file organization (#57817)
  • Reporter agent can now get PID via RPC to raylet (#57004)
  • Enhanced ray.get thread safety (#57911)
  • Configurable proto naming during event JSON conversion (#57705)
  • Better handling of detached actor restarts (#57931)
  • Improved lease rescheduling in local lease manager during node draining (#57834)

🔨 Fixes:

  • Fixed "RayEventRecorder::StartExportingEvents() should be called only once" error (#57917)
  • Fixed deadlock when cancelling stale requests on in-order actors (#57746)
  • Fixed raylet shutdown races (#57198)
  • Fixed log monitor seeking bug after log rotation (#56902)
  • Deflaked multiple test suites for better CI reliability
  • Fixed various memory and resource management issues
  • Better handling of actor and task failures

📖 Documentation:

  • Added JaxTrainer API overview to Ray docs (#57182)
  • Fixed various typos and documentation issues
  • Updated autoscaling and system configuration guides
  • Enhanced SLURM documentation with symmetric-run support (#56775)

🏗 Architecture refactoring:

  • Dashboard API server subprocesses moved into system cgroup (#57864)
  • Driver moved into workers cgroup for better isolation (#57776)
  • Improved worker-raylet interface separation (#57804)
  • Better plasma store provider architecture

Dashboard

💫 Enhancements:

  • Added percentage usage graphs for resources (#57549)
  • Introduced sub-tabs with full Grafana dashboard embeds on Metrics tab (#57561)
  • Added queued blocks to operator panels (#57739)
  • Improved operator metrics logging for better clarity (#57702)
  • Better filtering and display in job lists

🔨 Fixes:

  • Fixed filtering issue in job list (#56946)
  • Fixed incomplete card content on overview page (#56947)
  • Filtered out ANSI escape codes from logs (#53370)
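
The ANSI filtering fix can be illustrated with a short regular expression. This is a Python sketch of the general technique, not the dashboard's own code; `strip_ansi` is an illustrative name, and the pattern covers only CSI sequences (the ones used for colors and cursor movement).

```python
import re

# ESC [ followed by parameter bytes, intermediate bytes, and one final byte.
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(line: str) -> str:
    """Remove ANSI CSI escape sequences such as color codes from a log line."""
    return ANSI_CSI.sub("", line)


print(strip_ansi("\x1b[31mERROR\x1b[0m worker died"))
```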

Autoscaler

🎉 New Features:

  • KubeRay autoscaling support with top-level Resources and Labels fields (#57260)
  • Bundle label selector support in request_resources SDK (#54843)
  • Application Gateway for Containers as ingress for Ray clusters on Azure

💫 Enhancements:

  • Azure improvements: cluster teardown now cleans up extra resources (MSI, VNET, NSG) (#57610)
  • Updated defaults for Azure cluster templates (#57716)
  • Better availability zone support for Azure node pools (#55532)
  • Hello world release tests for Azure and GCE (#57597, #57695)
  • Improved cluster resource state handling to fix over-provisioning (#57130)

🔨 Fixes:

  • Fixed autoscaler state synchronization issues (#57010)
  • Better handling of node state information (#57130)
  • Improved timeout handling for patch requests (#56605)

Thank you to everyone who contributed to this release!
Special thanks to all the contributors who helped make Ray 2.51.0 possible through bug fixes, features, documentation improvements, and testing efforts.
