github ray-project/ray ray-2.52.0
Ray-2.52.0

Release Highlights

Ray Core:

  • End of Life for Python 3.9 Support: Starting with this release, Ray no longer publishes Python 3.9 wheels.
  • Token authentication: Ray now supports built-in token authentication across all components including the dashboard, CLI, API clients, and internal services. This provides an additional layer of security for production deployments to reduce the risk of unauthorized code execution. Token authentication is currently off by default. For more information, see: https://docs.ray.io/en/latest/ray-security/token-auth.html
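
The general idea behind token authentication can be illustrated with a small, library-independent sketch. This is not Ray's implementation: the helper names generate_token and is_authorized, and the "Authorization: Bearer" header convention, are assumptions made for illustration only. See the linked documentation for how to enable the feature in Ray.

```python
import hmac
import secrets


def generate_token(num_bytes: int = 32) -> str:
    """Create a random, URL-safe shared secret (hypothetical helper)."""
    return secrets.token_urlsafe(num_bytes)


def is_authorized(headers: dict, expected_token: str) -> bool:
    """Check a bearer token from request headers against the expected token.

    The comparison is constant-time so that response timing does not
    reveal how many leading characters of the token were correct.
    """
    value = headers.get("Authorization", "")
    scheme, _, presented = value.partition(" ")
    if scheme != "Bearer" or not presented:
        return False
    return hmac.compare_digest(presented.encode(), expected_token.encode())


token = generate_token()
print(is_authorized({"Authorization": f"Bearer {token}"}, token))  # True
print(is_authorized({"Authorization": "Bearer wrong"}, token))     # False
print(is_authorized({}, token))                                    # False
```

Requests without a token, or with a mismatched one, are rejected before any work is scheduled, which is the property that reduces the risk of unauthorized code execution.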

Ray Data:

  • We’ve added a number of improvements for Iceberg, including upserts, predicate and projection pushdown, and overwrite.
  • We’ve added significant improvements to our expressions framework, including temporal, list, tensor, and struct datatype expressions.

Ray Libraries

Ray Data

🎉 New Features:

  • Added predicate pushdown rule that pushes filter predicates past eligible operators (#58150, #58555)
  • Iceberg support for upsert tables, schema updates, and overwrite operations (#58270)
  • Iceberg support for predicate and projection pushdown (#58286)
  • Iceberg writes now produce data files in write() and then commit (#58601)
  • Enhanced Unity Catalog integration (#57954)
  • Namespaced expressions that expose PyArrow functions (#58465)
  • Added version argument to read_delta_lake (#54976)
  • Generator UDF support for map_groups (#58039)
  • ApproximateTopK aggregator (#57950)
  • Serialization framework for preprocessors (#58321)
  • Support for temporal, list, tensor, and struct datatypes (#58225)
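
As a rough illustration of what a predicate pushdown rule does, the sketch below moves a filter earlier in a toy logical plan whenever the operator in front of it is "eligible", here meaning a projection that keeps the filtered column unchanged. The classes Read, Project, and Filter and the function push_filters_down are hypothetical and are not Ray Data's operators or optimizer API.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Read:
    source: str


@dataclass
class Project:
    columns: List[str]  # columns kept as-is (no renames in this sketch)


@dataclass
class Filter:
    column: str
    predicate: Callable[[object], bool]


Op = Union[Read, Project, Filter]


def push_filters_down(plan: List[Op]) -> List[Op]:
    """Swap each Filter with a preceding Project that keeps its column."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(1, len(plan)):
            prev, cur = plan[i - 1], plan[i]
            eligible = isinstance(prev, Project) and isinstance(cur, Filter)
            if eligible and cur.column in prev.columns:
                plan[i - 1], plan[i] = cur, prev
                changed = True
    return plan


plan = [Read("events"), Project(["user", "ts"]), Filter("ts", lambda t: t > 100)]
optimized = push_filters_down(plan)
print([type(op).__name__ for op in optimized])  # ['Read', 'Filter', 'Project']
```

Running the filter closer to the read means fewer rows flow through later operators. A data source that understands predicates can go one step further and skip non-matching data entirely.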

💫 Enhancements:

  • Use approximate quantile for RobustScaler preprocessor (#58371)
  • Map batches support for limit pushdown (#57880)
  • Make all map operations zero-copy by default (#58285)
  • Use tqdm_ray for progress reporting from workers (#58277)
  • Improved concurrency cap backpressure tuning (#58163, #58023, #57996)
  • Sample finalized partitions randomly to avoid lens effect (#58456)
  • Allow file extensions starting with '.' (#58339)
  • Set default file_extensions for read_parquet (#56481)
  • URL decode values in parse_hive_path (#57625)
  • Streaming partition enforces row_num per block (#57984)
  • Streaming repartition combines small blocks (#58020)
  • Lower DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR to 2 (#58262)
  • Set udf-modifying-row-count default to false (#58264)
  • Cache PyArrow schema operations (#58583)
  • Explain optimized plans (#58074)
  • Ranker interface (#58513)
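
To make the Hive path change concrete, here is a minimal standard-library sketch of extracting key=value partition segments from a path and URL-decoding the values. The function parse_hive_partitions and the example path are hypothetical; this is not the code of Ray's parse_hive_path.

```python
from urllib.parse import unquote


def parse_hive_partitions(path: str) -> dict:
    """Collect key=value path segments and URL-decode each value."""
    partitions = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = unquote(value)
    return partitions


path = "s3://bucket/table/country=United%20States/date=2024-01-01/part-0.parquet"
print(parse_hive_partitions(path))
# {'country': 'United States', 'date': '2024-01-01'}
```

Without the decoding step, the partition value would surface as the literal string "United%20States".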

🔨 Fixes:

  • Fixed renamed columns to be appropriately dropped from output (#58040, #58071)
  • Fixed handling of renames in projection pushdown (#58033, #58037)
  • Fixed broken LogicalOperator abstraction barrier in predicate pushdown rule (#58683)
  • Fixed file size ordering in download partitioning with multiple URI columns (#58517)
  • Fixed HTTP streaming file download by using open_input_stream (#58542)
  • Fixed expression mapping for Pandas (#57868)
  • Fixed reading from zipped JSON (#58214)
  • Fixed MCAP datasource import for better compatibility (#57964)
  • Avoid slicing block when total_pending_rows < target (#58699)
  • Clear queue for manually marked execution_finished operators (#58441)
  • Add exception handling for invalid URIs in download operation (#58464)
  • Fixed progress bar name display (#58451)

📖 Documentation:

  • Documentation for Ray Data metrics (#58610)
  • Simplify and add Ray Data LLM quickstart example (#58330)
  • Convert rST-style to Google-style docstrings (#58523)

🏗 Architecture:

  • Removed stats update thread (#57971)
  • Refactor histogram metrics (#57851)
  • Revisit OpResourceAllocator to make data flow explicit (#57788)
  • Create unit test directory for fast, isolated tests (#58445)
  • Dump verbose ResourceManager telemetry into ray-data.log (#58261)

Ray Train

🎉 New Features:

  • Result::from_path implementation in v2 (#58216)

💫 Enhancements:

  • Exit actor and log appropriately when poll_workers is in terminal state (#58287)
  • Set JAX_PLATFORMS environment variable based on ScalingConfig (#57783)
  • Default to disabling Ray Train collective util timeouts (#58229)
  • Add SHUTTING_DOWN TrainControllerState and improve logging (#57882)
  • Improved error message when calling training function utils outside Ray Train worker (#57863)
  • FSDP2 template: Resume from previous epoch when checkpointing (#57938)
  • Clean up checkpoint config and trainer param deprecations (#58022)
  • Update failure policy log message (#58274)

📖 Documentation:

  • Ray Train Metrics documentation page (#58235)
  • Local mode user guide (#57751)
  • Recommend tree_learner="data_parallel" in examples for distributed LightGBM training (#58709)

Ray Serve

🎉 New Features:

  • Custom request routing with runtime environment support. Users can now define custom request router classes that are safely imported and serialized using the application's runtime environment, enabling advanced routing logic with custom dependencies. (#56855)
  • Custom autoscaling policies with enhanced logging. Deployment-level and application-level autoscaling policies now display their custom policy names in logs, making it easier to debug and monitor autoscaling behavior. (#57878)
  • Audio transcription support in vLLM backend. Ray Serve now supports transcription tasks through the vLLM engine, expanding multimodal capabilities. (#57194)
  • Data parallel attention public API. Introduced a public API for data parallel attention, enabling efficient distributed attention mechanisms for large-scale inference workloads. (#58301)
  • Route pattern tracking in proxy metrics. Proxy metrics now expose actual route patterns (e.g., /api/users/{user_id}) instead of just route prefixes, enabling granular endpoint monitoring without high cardinality issues. Performance impact is minimal (~1% RPS decrease). (#58180)
  • Replica dependency graph construction. Added list_outbound_deployments() method to discover downstream deployment dependencies, enabling programmatic analysis of service topology for both stored and dynamically-obtained handles. (#58345, #58350)
  • Multi-dimensional replica ranking. Introduced ReplicaRank schema with global, node-level, and local ranks to support advanced coordination scenarios like tensor parallelism and model sharding across nodes. (#58471, #58473)
  • Proxy readiness verification. Added a check to ensure proxies are ready to serve traffic before serve.run() completes, improving deployment reliability. (#57723)
  • IPv6 socket support. Ray Serve now supports IPv6 networking for socket communication. (#56147)
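
The cardinality benefit of tagging metrics with route patterns rather than raw paths can be seen in a small standalone sketch. The helpers compile_pattern and match_route are hypothetical and do not reflect how the Serve proxy resolves routes internally.

```python
import re
from collections import Counter
from typing import List, Optional


def compile_pattern(pattern: str) -> re.Pattern:
    """Turn '/api/users/{user_id}' into a regex; each placeholder matches one segment."""
    parts = re.split(r"(\{[^/{}]+\})", pattern)
    regex = "".join("[^/]+" if p.startswith("{") else re.escape(p) for p in parts)
    return re.compile(f"^{regex}$")


def match_route(path: str, patterns: List[str]) -> Optional[str]:
    """Return the first pattern that matches the path, or None."""
    for pattern in patterns:
        if compile_pattern(pattern).match(path):
            return pattern
    return None


patterns = ["/api/users/{user_id}", "/api/health"]
requests_by_route = Counter()
for path in ["/api/users/1", "/api/users/2", "/api/health", "/unknown"]:
    requests_by_route[match_route(path, patterns) or "<unmatched>"] += 1

print(dict(requests_by_route))
# {'/api/users/{user_id}': 2, '/api/health': 1, '<unmatched>': 1}
```

Both user requests land under a single label, so the number of metric series grows with the number of routes rather than with the number of distinct user IDs.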

💫 Enhancements:

  • Selective throughput optimization flag overrides. Users can now override individual flags set by RAY_SERVE_THROUGHPUT_OPTIMIZED without manually configuring all flags, improving flexibility for performance tuning. (#58057)
  • OpenTelemetry metrics enabled by default. Ray now uses OpenTelemetry as the default metrics backend, with updated metric names (ray_serve_*) and improved observability infrastructure. (#56432)
  • Cleaner long-poll communication. Removed actor handles from RunningReplicaInfo objects passed in long-poll updates, avoiding complex reference counting patterns. (#58174)
  • Improved replica config handling. Excluded IMPLICIT_RESOURCE_PREFIX from ReplicaConfig.ray_actor_options to prevent internal resource annotations from leaking into user-visible configurations. (#58275)
  • Custom autoscaling telemetry. Added telemetry tracking for custom autoscaling policy usage. (#58336)
  • Proxy target group control. Added from_proxy_manager argument to get_target_groups() for finer control over returned routing targets. (#57620)

🔨 Fixes:

  • Fixed default deployment name in async inference. Corrected the default deployment name which was changed to _TaskConsumerWrapper during async inference implementation. (#57664)
  • Fixed proxy location handling in CLI and Python API. serve run now respects proxy_location from config files instead of hardcoding EveryNode, and serve.start() no longer defaults to HeadOnly when http_options are provided without an explicit location. (#57622)
  • Fixed deprecated Stable Diffusion model in example. Updated documentation example to use a current model after stabilityai/stable-diffusion-2 was deprecated on Hugging Face. (#58609)

📖 Documentation:

  • KV-cache offloading user guide. Added comprehensive documentation for KV-cache offloading in LLM deployments. (#58025)
  • Model loading documentation. Documented best practices and options for loading models in Ray Serve. (#57922)
  • Cross-node tensor/pipeline parallelism examples. Added examples and documentation for running TP/PP across multiple nodes. (#57715)
  • Data parallel attention documentation. Created user guide for data parallel attention with architecture diagrams. (#58301, #58543)
  • Custom autoscaling policy examples. Added missing imports and improved clarity in autoscaling policy examples. (#57896, #58170)
  • Async inference documentation improvements. Added notes about task consumer replica configurations and fixed the end-to-end example. (#58493)
  • Callback documentation. Added documentation for using callbacks in Ray Serve. (#58713)
  • Monitoring and troubleshooting improvements. Enhanced monitoring section with links to Anyscale troubleshooting resources. (#58472)
  • Minor documentation fixes. Fixed spelling errors and improved docstring alignment. (#58172, #58233)

🏗 Architecture:

  • Replica rank management refactoring. Extracted generic RankManager class with type-safe ReplicaRank representation, creating a cleaner foundation for future multi-level rank support. (#58471, #58473)
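
A multi-level rank can be pictured as a small immutable record. The sketch below assigns global, node-level, and local ranks to replicas in arrival order. The names RankSketch and assign_ranks, the field names, and the assignment order are assumptions for illustration; the actual ReplicaRank schema and RankManager behavior may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RankSketch:
    global_rank: int  # position among all replicas
    node_rank: int    # index of the node hosting the replica
    local_rank: int   # position among replicas on the same node


def assign_ranks(replicas: List[Tuple[str, str]]) -> Dict[str, RankSketch]:
    """Assign ranks to (replica_id, node_id) pairs in the order given."""
    node_ranks: Dict[str, int] = {}
    local_counts: Dict[str, int] = {}
    ranks: Dict[str, RankSketch] = {}
    for global_rank, (replica_id, node_id) in enumerate(replicas):
        # First time a node is seen, it gets the next free node rank.
        node_rank = node_ranks.setdefault(node_id, len(node_ranks))
        local_rank = local_counts.get(node_id, 0)
        local_counts[node_id] = local_rank + 1
        ranks[replica_id] = RankSketch(global_rank, node_rank, local_rank)
    return ranks


ranks = assign_ranks([("r0", "nodeA"), ("r1", "nodeA"), ("r2", "nodeB")])
print(ranks["r1"])  # RankSketch(global_rank=1, node_rank=0, local_rank=1)
print(ranks["r2"])  # RankSketch(global_rank=2, node_rank=1, local_rank=0)
```

In a tensor-parallel setup, a local rank of this kind is what typically selects the device on a node, while the global rank identifies the shard.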

Ray Tune

💫 Enhancements:

  • Updated jobs test to use tune module (#57995)
  • Add pydantic to Ray Tune requirements (#58354)

RLlib

🎉 New Features:

  • Support for vectorize modes in SingleAgentEnvRunner.make_env (#58410)
  • Support for composed spaces in Offline RL (#58594)
  • Enhanced support for complex observations in SingleAgentEpisode (#57017)
  • Prometheus metrics support for selected components (#57932)

🔨 Fixes:

  • Resolve bug that fails to propagate model_config to MultiAgentRLModule instances (#58243)
  • Fixed access to self._minibatch_size (#58595)
  • Fixed broken restore from remote by adding the missing FileSystem argument (#58324)
  • Fixed deterministic sampling and training documentation link (#58494)
  • Corrected typo in pyspiel import error message (#54618)

📖 Documentation:

  • Add reinforcement learning example illustrating GPU-to-GPU RDT and GRPO (#57961)

Ray Core

💫 Enhancements:

  • Fault-tolerant RPCs: KillActor, CancelRemoteTask, NotifyGCSRestart, and ReleaseUnusedBundles (#57648, #57945, #57965)
  • Use graceful actor shutdown when GCS polling detects actor ref deleted (#58605)
  • Use graceful shutdown path when actor OUT_OF_SCOPE (del actor) (#57090)
  • Improved actor kill logs (#58544)
  • Scheduling detached actor with placement group not recommended (#57726)
  • Better handling of detached actor restarts (#57931)
  • Enhanced ray.get thread safety (#57911)
  • Making concurrent ray.get requests for the same object thread-safe (#58606)
  • Move request ID creation to worker to address plasma get perf regression (#58390)
  • Make GlobalState lazy initialization thread-safe (#58182)
  • Reporter agent can get PID via RPC to raylet (#57004)
  • Add tee logging for subprocess exit codes in ray start --block (#57982)
  • Add entrypoint log for jobs (#58300)
  • Cleaner error message for exceeding list actors limit (#58255)
  • Clean up NODE_DIED task error message (#58638)
  • Improved histogram metrics midpoint calculation (#57948)
  • Migrated from STATS to metric interface in RPC components (#57926)
  • Kill STATS in core worker component (#58060)
  • Kill STATS in object manager component (#57974)
  • Improve scheduler_placement_time_s metric (#58217)
  • Refactor OpenTelemetry environment variable handling (#57910)
  • Add option to disable OpenTelemetry SDK error logs (#58257)
  • Improved cgroups support (#57776, #57864, #57731, #58017, #58028, #58059, #58064, #58577)
  • Use GetNodeAddressAndLiveness in raylet client pool (#58576)
  • Ray Direct Transport improvements with NIXL integration (#57671, #58550, #58548, #56783, #58263)
  • Fix symmetric-run (#58337)
  • Make worker connection timeout parameters configurable (#58372)
  • Define env for controlling UVloop (#58442)
  • Allow 60 seconds for dashboard to start (#58341)
  • Report driver stats (#58045)
  • Fix idle node termination on object pulling (#57928)
  • Check if temp_dir is subdir of virtualenv to prevent runtime virtualenv problems (#58084)
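
The temp_dir check in the last item amounts to a path containment test. A minimal sketch of such a test follows. The function is_inside is hypothetical, and what Ray does when the check matches is not stated in these notes.

```python
from pathlib import Path


def is_inside(child: str, parent: str) -> bool:
    """Return True if `child` is `parent` or lies somewhere beneath it."""
    child_path = Path(child).resolve()
    parent_path = Path(parent).resolve()
    return child_path == parent_path or parent_path in child_path.parents


print(is_inside("/opt/venv/tmp/ray", "/opt/venv"))   # True
print(is_inside("/opt/venv2/tmp/ray", "/opt/venv"))  # False
```

Comparing resolved path components rather than raw string prefixes avoids treating /opt/venv2 as being inside /opt/venv. Inside an active virtualenv, sys.prefix gives the environment root that a temp directory could be compared against.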

🔨 Fixes:

  • Fixed use-after-free in RayletClient (#58747)
  • Fixed deadlock when cancelling stale requests on in-order actors (#57746)
  • Fixed "RayEventRecorder::StartExportingEvents() should be called only once" error (#57917)
  • Fixed raylet shutdown races (#57198)
  • Fixed incorrect usage of gRPC streaming API in ray syncer (#58307)
  • Fixed log monitor seeking bug after log rotation (#56902)
  • Fixed idempotency issues in RequestWorkerLease for scheduled leases (#58265)
  • Fixed RAY_CHECK(inserted) inside reference counter (#58092)
  • Fixed static type hints for ActorClass when setting options (#58439)
  • Fixed exception type for accelerator ID visibility check (#58269)
  • Fixed transport type handling in DAG node initialization (#57987)
  • Fixed RAY_NODE_TYPE_NAME handling when autoscaler is in read-only mode (#58460)
  • Ensure client_call_manager_ outlives metrics_agent_client_ in core worker (#58315)
  • Fixed header validation in dashboard tests (#58648)
  • Fixed validation of Ray-on-Spark-on-YARN mode so that it can run (#58335)

📖 Documentation:

  • Fix pattern_async_actor demo typo (#58486)
  • Add limitations of RDT documentation (#58063)
  • Add actor+job+node event to ray event export documentation (#57930)
  • Remove implementation details from get_runtime_context docstring (#58212)
  • Improved monitoring section with links (#58472)

🏗 Architecture:

  • Refactor ActorInfoAccessor in gcs_client to be mockable (#57241)
  • Refactor reference_counter out of memory store and plasma store (#57590)
  • Remove reference counter mock for real reference counter in testing (#57178)
  • Split raylet cython file into multiple files (#56575)
  • Move ray_syncer to top level directory (#58316)
  • Move python_callbacks to common (#57909)
  • Consolidate find_free_port to network_utils (#58304)
  • Implement event merge logic at export time (#58070)
  • Feature flag for enabling ray export event (#57999)
  • Add comments explaining ray_syncer_ channels in Raylet (#58342)
  • Integration tests for task event generation (#57636)

Dashboard

💫 Enhancements:

  • Added percentage usage graphs for resources (#57549)
  • Sub-tabs with full Grafana dashboard embeds on Metrics tab (#57561)
  • Added queued blocks to operator panels (#57739)
  • Improved operator metrics logging (#57702)
  • Make do_reply accept status_code instead of success bool (#58384)
  • Add denial of fetch headers (#58553)

🔨 Fixes:

  • Fixed broken Ray Data per node metrics due to unsupported operator filter (#57970)
  • Filtered out ANSI escape codes from logs (#53370)
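
For readers unfamiliar with the ANSI fix, the sketch below strips CSI escape sequences (colors, cursor movement) from a log line with a regular expression. The pattern and the name strip_ansi are illustrative assumptions, not the dashboard's actual filter.

```python
import re

# CSI sequence: ESC [ , parameter bytes, intermediate bytes, one final byte.
# Examples: "\x1b[31m" (red text), "\x1b[0m" (reset), "\x1b[2K" (erase line).
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_ansi(text: str) -> str:
    """Remove ANSI CSI escape sequences from a string."""
    return ANSI_CSI.sub("", text)


print(strip_ansi("\x1b[31mERROR\x1b[0m task failed"))  # ERROR task failed
```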

📖 Documentation:

  • Expose dashboard URL when deploying on YARN using Skein (#57793)

Autoscaler + KubeRay

🎉 New Features:

  • KubeRay autoscaling support with top-level Resources and Labels fields (#57260)
  • Bundle label selector support in request_resources SDK (#54843)

💫 Enhancements:

  • Azure VM launcher release test (#57921)
  • Azure CLI added to base-extra image (#58012)

📖 Documentation:

  • Label selector guide (#58157)
  • Add minimum version requirement on kai-scheduler (#58161)
  • Mention RayJob gang scheduling for Yunikorn (#58375)
  • Add Volcano RayJob gang scheduling example (#58320)
  • Add KAI scheduler integration documentation (#54857)
  • KubeRay sidecar mode (#58273)
  • Update RayJob documentation with new DeletionStrategy (#58306)
  • Add guidance for RayService initialization timeout (#58238)
  • Update version to 1.5.0 (#58452)
  • Add output example of CLI commands (#58078)
  • Fix invalid syntax in label_selector (#58352)

Thank You to all the Contributors!
@marosset, @curiosity-hyf, @bveeramani, @Future-Outlier, @saihaj, @ZacAttack, @ArthurBook, @crypdick, @Aydin-ab, @elliot-barn, @Kunchd, @justinvyu, @jjyao, @gangsf, @sunsetxh, @Daraan, @justinyeh1995, @MatthewCWeston, @kyuds, @daiping8, @sauravvenkat, @omatthew98, @CowKeyMan, @morotti, @israbbani, @goutamvenkat-anyscale, @fscnick, @Zakelly, @xyuzh, @kouroshHakha, @owenowenisme, @Qiaolin-Yu, @czgdp1807, @shen-shanshan, @wph95, @iamjustinhsu, @MengjinYan, @jugalshah291, @Yicheng-Lu-llll, @ryanaoleary, @nadongjun, @xinyuangui2, @ideal, @my-vegetable-has-exploded, @lucaschadwicklam97, @tianyi-ge, @ahao-anyscale, @abrarsheikh, @Blaze-DSP, @rueian, @thomasdesr, @CaiZhanqi, @harshit-anyscale, @jeffreyjeffreywang, @TimothySeah, @codope, @sampan-s-nayak, @andrewsykim, @xingsuo-zbz, @aslonnie, @OneSizeFitsQuorum, @ryankert01, @Sparks0219, @soffer-anyscale, @akyang-anyscale, @alanwguo, @chrisfellowes-anyscale, @richo-anyscale, @alexeykudinkin, @JasonLi1909, @ruisearch42, @EkinKarabulut, @MarcoGorelli, @SolitaryThinker, @srinathk10, @dayshah, @richardliaw, @pseudo-rnd-thoughts, @win5923, @axreldable, @matthewdeng, @ArturNiederfahrenhorst, @can-anyscale, @khluu, @landscapepainter, @kevin85421, @seanlaii, @edoakes, @nrghosh, @eicherseiji, @Artimislyy, @cem-anyscale, @coqian, @chiayi, @liulehui