Ray-2.55.0

Ray Data

🎉 New Features

  • Add DataSourceV2 API with scanner/reader framework, file listing, and file partitioning (#61220, #61615, #61997)
  • Support GPU shuffle with rapidsmpf 26.2 (#61371, #62062)
  • Add Kafka datasink, migrate to confluent-kafka, support datetime offsets (#60307, #61284, #60909)
  • Add Turbopuffer datasink (#58910)
  • Add 2-phase commit checkpointing with trie recovery and load method (#61821, #60951)
  • Queue-based autoscaling policy integrated with task consumers (#59548, #60851)
  • Enable autoscaling for GPU stages (#61130)
  • Expressions: add random(), uuid(), cast, and map namespace support (#59656, #60695, #59879)
  • Add support for Arrow native fixed-shape tensor type (#56284)
  • Support writing tensors to tfrecords (#60859)
  • Add pathlib.Path support to read_* functions (#61126); see the sketch after this list
  • Add cudf as a batch_format (#61329)
  • Allow ActorPoolStrategy for read_datasource() via compute parameter (#59633)
  • Introduce ExecutionCache for streamlined caching (#60996)
  • Support strict=False mode for StreamingRepartition (#60295)
  • Port changes from lance-ray into Ray Data (#60497)
  • Enable PyArrow compute-to-expression conversion for predicate pushdown (#61617)
  • Add vLLM metrics export and Data LLM Grafana dashboard (#60385)
  • Include logical memory in resource manager scheduling decisions (#60774)
  • Add monotonically increasing ID support (#59290)
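
Below is a minimal sketch of the new pathlib.Path support in the read_* functions (#61126); the file path is a placeholder.

```python
from pathlib import Path

import ray

# read_* functions now accept pathlib.Path objects directly,
# in addition to plain string paths. The path is a placeholder.
ds = ray.data.read_parquet(Path("/data/events.parquet"))
print(ds.schema())
```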

💫 Enhancements

  • Performance: cache _map_task args, heap-based actor ranking, actor pool map improvements (#61996, #62114, #61591)
  • Optimize concat tables and PyArrow schema hashing (#61315, #62108)
  • Reduce default DownstreamCapacityBackpressurePolicy threshold to 50% (#61890)
  • Improve reproducibility for random APIs (#59662); see the seeded-shuffle sketch after this list
  • Clamp batch size to fall within C++ 32-bit int range (#62242)
  • Account for external consumer object store usage in resource manager budget (#62117)
  • Make get_parquet_dataset configurable in number of fragments to scan (#61670)
  • Consolidate schema inference and make all preprocessors implement SerializablePreprocessorBase (#61213, #61341)
  • Disable hanging issue detection by default (#62405)
  • Make execution callback dataflow explicit to prevent state leakage (#61405)
  • Log DataContext in JSON format at execution start for traceability (#61150, #61428)
  • Autoscaler: configurable traceback, Prometheus gauges, relaxed constraints (#62210, #62209, #61917, #61385)
  • Add metrics for task scheduling time, output backpressure, and logical memory (#61192, #61007, #61436)
  • Prevent operators from dominating entire shared object store budget (#61605)
  • Eliminate generators to avoid intermediate state pinning (#60598)
  • Default log encoding to UTF-8 on Windows (#61143)
  • Remove legacy BlockList, locality_with_output, old callback API, PyArrow 9.0 checks (#60575, #61044, #62055, #61483)
  • Upgrade to pyiceberg 0.11.0; cap pandas to <3 (#61062, #60406)
  • Refactor logical operators to frozen dataclasses (#61059, #61308, #61348, #61349, #61351, #61364, #61481)
  • Prevent aggregator head node scheduling (#61288)
  • Add error for local:// paths with a zero-resource head node (#60709)
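
As a quick illustration of the random-API reproducibility improvements (#59662), a fixed seed should now produce stable results across runs; a minimal sketch:

```python
import ray

ds = ray.data.range(100)

# With the same seed, repeated shuffles yield the same order.
first = ds.random_shuffle(seed=42).take(5)
second = ds.random_shuffle(seed=42).take(5)
assert first == second
```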

🔨 Fixes

  • Fix a remote code execution (RCE) vulnerability in Arrow extension type deserialization from Parquet (#62056)
  • Fix StreamingSplitDataIterator.schema() (#62057)
  • Fix ParquetDatasource handling of FileSystemFactory.inspect (#62065)
  • Fix read_parquet file-extension filtering for versioned object-store URIs (#61376)
  • Fix wide_schema_pipeline_tensors cloudpickle deserialization (#62149)
  • Fix OpBufferQueue race condition (#60828)
  • Fix scheduling metrics computation (#62031)
  • Fix OneHotEncoder max_categories to use global top-k instead of per-partition (#60790); see the sketch after this list
  • Fix ReservationOpResourceAllocator resource borrowing for ActorPoolMapOperator (#60882)
  • Fix DatabricksUCDatasource schema() shadowing by schema string attribute (#61282)
  • Fix AliasExpr structural equality to respect rename flag (#60711)
  • Fix _align_struct_fields failure with unaligned scalar fields (#58364)
  • Fix min_scheduling_resources fallback to incremental_resource_usage (#60997)
  • Fix output backpressure unblocking sequence for terminal ops (#60798)
  • Fix multi-input operator object store memory attribution (#61208)
  • Fix reference cycle by moving to module scope (#61934)
  • Fix autoscaler logging: reduce verbose output and move traceback to debug (#61989, #62126)
  • Fix double counting ref_bundle + input_files (#61774)
  • Replace on_exit hook with __ray_shutdown__ to fix UDF cleanup race (#61700)
  • Prevent Limit from getting pushed past map_groups (#60881)
  • Propagate schema in empty _shuffle_block to fix ColumnNotFound in chained left joins (#61507)
  • Fix unclear metadata warning and incorrect operator name logging (#61380)
  • Clamp rolling utilization averages to zero (#61543)
  • Fix floating point errors in TimeWindowAverageCalculator (#61580)
  • Remove default task-level timeout and clamp end_offset in Kafka datasource (#61476)
  • Avoid redundant reads in train_test_split (#60274)
  • Return None when no outputs have been produced (#62029)
  • Replace bare raise with TypeError in string concatenation (#60795)
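
A minimal sketch of the OneHotEncoder fix (#60790): max_categories now picks the top-k categories globally across the dataset instead of per partition.

```python
import ray
from ray.data.preprocessors import OneHotEncoder

ds = ray.data.from_items(
    [{"color": c} for c in ["red"] * 5 + ["blue"] * 3 + ["green"]]
)

# Keep only the 2 most frequent categories. Counts are now aggregated
# globally, so "red" and "blue" are kept regardless of how the data
# is split into blocks.
encoder = OneHotEncoder(columns=["color"], max_categories={"color": 2})
encoded = encoder.fit_transform(ds)
print(encoded.take(3))
```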

📖 Documentation

  • Add job-level checkpointing documentation (#60921)
  • Update exclude_resources docs for Train autoscaling changes (#61990)
  • Add locality_with_output migration instructions (#61151)
  • Document max_tasks_in_flight_per_actor vs max_concurrent_batches (#60477)
  • Add missing MOD operation docs; improve ray.data.Datasource docs (#60803, #59654)
  • Add polars usage instructions (#60029)

Ray Serve

🎉 New Features

  • Added end-to-end gRPC client and bidirectional streaming support, including public APIs, proxy handling, proto updates, and developer docs, so Serve apps can handle streaming workloads natively instead of building custom transport layers. (#60767, #60768, #60769, #60770, #60771)
  • Introduced HAProxy-based serving with fallback proxy support and load-balancer tunables, giving operators a higher-throughput ingress path and more control over traffic behavior in production. (#60586, #61180, #61271, #61468, #61988)
  • Added queue-based autoscaling for async inference and Taskiq-backed workloads, so scaling decisions can account for both HTTP in-flight load and queued tasks. (#59548, #60851, #60977, #61008)
  • Rolled out gang scheduling support across validation, core scheduling, fault tolerance, downscaling, autoscaling, rolling updates, and migration, enabling coordinated multi-replica placement for tightly coupled workloads. (#60944, #61205, #61206, #61207, #61215, #61467, #61216, #61659)
  • Introduced deployment-scoped actors with config/schema, lifecycle management, public API, and controller health checks, making it easier to run durable per-deployment sidecar-like logic inside Serve. (#61639, #61648, #61664, #61833, #62161)

💫 Enhancements

  • Added first-class tracing support for Serve, including inter-deployment gRPC propagation and richer streaming-path attributes, improving end-to-end observability across distributed request flows. (#61230, #61089, #61451)
  • Expanded operational metrics with replica utilization, richer error labeling, and client IP logging in access logs, helping teams diagnose bottlenecks and user-impacting issues faster. (#60758, #61092, #60967)
  • Improved autoscaling extensibility with class-based policies and policy_kwargs, so advanced users can package reusable autoscaling logic without custom forks; see the sketch after this list. (#60964)
  • Reduced controller overhead with broad algorithmic improvements (indexing, cache reuse, and avoiding repeated per-tick work), which improves scalability as deployment and replica counts grow. (#60810, #60829, #60830, #60838, #60842, #60843, #60844, #60832, #60806)
  • Improved throughput-oriented operation controls by adding environment-based tuning and explicit throughput optimization logging, making performance behavior easier to configure and audit. (#60757, #62146)
  • Upgraded Serve internals to Pydantic v2 and refined time-series aggregation behavior for more predictable metric accuracy under high load. (#61061, #61403)
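
A rough sketch of the class-based autoscaling policies with policy_kwargs (#60964). The policy class, its constructor interface, and the exact AutoscalingConfig field names below are assumptions for illustration, not the confirmed API.

```python
# Hypothetical sketch: the policy interface and the `policy` /
# `policy_kwargs` field names are assumptions, not the confirmed API.
from ray import serve
from ray.serve.config import AutoscalingConfig


class QueueDepthPolicy:  # hypothetical user-defined policy class
    def __init__(self, target_queue_depth: int):
        self.target_queue_depth = target_queue_depth


@serve.deployment(
    autoscaling_config=AutoscalingConfig(
        min_replicas=1,
        max_replicas=10,
        policy=QueueDepthPolicy,  # assumed field
        policy_kwargs={"target_queue_depth": 16},  # assumed field
    )
)
class MyModel:
    async def __call__(self, request) -> str:
        return "ok"
```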

🔨 Fixes

  • Fixed a direct-ingress shutdown bug where replicas could hang indefinitely while draining stuck requests, ensuring bounded shutdown behavior in failure scenarios. (#60754)
  • Fixed HAProxy reliability issues, including config race conditions, draining guards, and platform compatibility edge cases, improving stability in production rollouts. (#61120, #60955)
  • Fixed autoscaling correctness issues that could cause runaway scaling or delayed reactions, including feedback-loop regressions, streaming scale-down behavior, and wall-clock delay handling. (#61731, #61920, #62331, #61844, #60613)
  • Fixed high-percentile latency regression in request routing and queue-length accounting, reducing tail-latency spikes under load. (#61755)
  • Fixed replica-state and health-state edge cases during migration and ingress transitions, preventing false errors and unhealthy/healthy misreporting. (#60365, #61818, #62213)
  • Fixed chained upstream actor-failure handling so request failures are attributed correctly and no longer hang when upstream deployments die mid-chain. (#61758, #62147)
  • Fixed HTTP status classification for client disconnects after successful responses, improving accuracy of error-rate monitoring and alerting. (#61396)

📖 Documentation

  • Added AsyncInferenceAutoscalingPolicy documentation and clarified Serve performance guidance for HAProxy and inter-deployment gRPC use cases. (#61086, #61386)
  • Updated scheduling and configuration docs, including replica scheduling guidance and a catalog of Serve environment variables, so operators can tune deployments with less guesswork. (#60922, #60807)
  • Clarified multiplexing and async behavior docs (including model pre-warming constraints and request-cancel semantics) to prevent common integration mistakes. (#61842, #62280)

🏗 Architecture refactoring

  • Refactored deployment-state execution to skip unnecessary steady-state per-tick work, lowering control-loop churn and creating cleaner hooks for future scheduling logic. (#60840)
  • Moved autoscaling metric aggregation into Cython-backed paths and added focused controller benchmarking, giving a stronger performance baseline for future Serve controller changes. (#58892, #61368)
  • Simplified internal structure by migrating shared internals away from private modules and consolidating replica abstractions, reducing coupling and maintenance complexity. (#60849, #61363, #60198)

Ray Train

🎉 New Features

  • Elastic training: core capability, user guide, release tests, multi-host TPU, telemetry (#60721, #61115, #61133, #61299, #61267)
  • Add HF TRL (Transformer Reinforcement Learning) example (#61627)
  • Add Tensor Parallel templates for DeepSpeed AutoTP and DTensor (#60160, #60158)
  • Add status attribute to ReportedCheckpoint (#61684)
  • Richer Train run metadata (#59186)
  • Add timers for Train worker initialization (#60870)
  • Configure torchft environment (#61156)

💫 Enhancements

  • Register training resources with AutoscalingCoordinator in FixedScalingPolicy (#61703)
  • Decouple datasets field from TrainRunContext (#61953)
  • Log warning for checkpoint_upload_fn when slow (#61720)
  • Fix StateManagerCallback to accept datasets explicitly (#62042)
  • Make train run abortable during before_controller_shutdown (#61816)
  • Graceful abort catches all RayActorError (#61375)
  • Refactor checkpoint and sync_actor to use wait_with_logging (#61063)
  • Unwrap UserExceptionWithTraceback in WorkerGroupError.worker_failures (#61153)

🔨 Fixes

  • Fix v2 PlacementGroupCleaner zombie actor (#61756)
  • Fix checkpoint paths for multinode run (#61471)
  • Abort cancels validation tasks with deterministic resumption (#61510)
  • Fix deepspeed finetune release test (#61266)

📖 Documentation

  • Add section on async validation with experiment tracking (#62104)
  • Add section on when to use async validation (#61702)

Ray Tune

💫 Enhancements

  • Remove deprecated Logger interface and logger_creator (#61181)

🔨 Fixes

  • Fix PBT trial order when NaN values are present (#57160)

Ray LLM

🎉 New Features

  • Replace PDProxyServer with decode-as-orchestrator PD architecture (#62076)
  • Introduce DP group fault tolerance for WideEP deployments (#61480)
  • SGLang engine: streaming chat/completions, tokenize/detokenize, embeddings, multi-GPU TP/PP (#61236, #61446, #61159, #61201, #62221)
  • Add bundle_per_worker config for simpler placement group setup (#59903)
  • Separate Data and Serve LLM dashboards with improved panel visibility (#61037, #62069)

💫 Enhancements

  • Promote Data LLM and Serve LLM APIs to beta (#61249, #62054, #62223)
  • Upgrade vLLM to 0.16.0, 0.17.0, and 0.18.0 (#61389, #61598, #61952)
  • Upgrade NIXL to v1.0.0 and fix tensor transport issues (#61991)
  • Unify duplicated PlacementGroup config schemes (#62241)
  • Decouple Serve LLM ingress from vLLM protocol models (#61931)
  • Set download task num_cpus=0 to reduce contention on low-CPU machines (#61191)
  • SGLangServer cleanup and replace format_messages_to_prompt with _build_chat_messages (#61117, #61372)

🔨 Fixes

  • Fix duplicate data: [DONE] in streaming SSE responses (#62246)
  • Fix enable_log_requests=False not forwarded to vLLM AsyncLLM (#60824)
  • Fix OpenAiIngress scale-to-zero when all models set min_replicas=0 (#60836)
  • Handle missing state attributes from vLLM's task-conditional init_app_state (#60812)
  • Fix NIXL side channel host for cross-node P/D disaggregation (#60817)
  • Fix trust_remote_code download (#60344)
  • Avoid deprecated TRANSFORMERS_CACHE; treat HuggingFace config load failure as non-fatal (#60854)
  • Fix sequential batch processing in SGLangServer (#61189)

📖 Documentation

  • Update data parallel attention documentation (#61706)
  • Add custom tokenizer example (#61098)
  • Add C/C++ binaries incompatibility workaround (#62110)

Ray RLlib

💫 Enhancements

  • Connector/batching optimizations: ndarray fast paths, direct env step pipeline, batch reuse (#61320, #61255, #61256, #61259, #61144)
  • Unify default encoders for all algorithms (#60302)
  • Toggle eval/train mode in TorchRLModule forward passes (#61985)
  • Clean up offline prelearner and unit testing (#60632)
  • Remove duplicate assignments in AlgorithmConfig (#61233)
  • Remove legacy RLlib release tests (#59288)
  • Add APPO example with Footsies environment (#59006)

🔨 Fixes

  • Support custom eval functions returning zero eval_results, env_steps, or agent_steps (#61563)
  • Fix PrioritizedEpisodeReplayBuffer bug (#60065)
  • Fix missing LayerNorm in RLModuleSpec (#61025)
  • Fix evaluation in parallel to training (#60777)
  • Fix MultiAgentEpisode.env_t_to_agent_t (#60319)
  • Fix default metric during eval (#61590)
  • Fix incorrect log value of environment steps sampled/trained (#56599)
  • Prevent torch_learner.py crash under parameter-freezing edge cases (#62158)

Ray Core

🎉 New Features

  • Resource isolation: pressure-based memory monitor, time-based killing, cgroup constraints (#61361, #61323, #61097, #61210, #61297, #59365, #59368, #60752)
  • IPPR (in-place pod resizing): add ResizeRayletResourceInstances to GCS/Python client, schema/status models, KubeRay provider (#61654, #61666, #61803, #61814)
  • Add PlatformEvent proto and placement group events in one-event framework (#61701, #60449)
  • Add Nvidia B300 support (#60753)
  • Add UV support for Ray Client mode (#60868)
  • Add Percentile metric type backed by quadratic histogram (#61148)
  • Expose fallback_strategy in TaskInfoEntry and ActorTableData (#60659)
  • Add submission job proto changes (#60857)
  • Add TPU util for ready multi-host slice count; simplify elastic TPU scaling (#61300, #62141)
  • Introduce per-node level temp-dir (#60761)
  • Make ray.put() generic: put(value: R) -> ObjectRef[R] (#60995); see the typing sketch after this list
  • Add Python 3.14 support for recursion limit handling (#58459)
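
The generic ray.put() signature (#60995) lets type checkers infer the element type of the returned reference; a minimal sketch:

```python
import ray
from ray import ObjectRef

# put() is now generic: put(value: R) -> ObjectRef[R], so a type
# checker infers ObjectRef[int] here without the explicit annotation.
ref: ObjectRef[int] = ray.put(42)
value: int = ray.get(ref)
assert value == 42
```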

💫 Enhancements

  • Upgrade cloudpickle to 3.1.2, gRPC to v1.58.0, protobuf to 3.20.3 (#60317, #61499, #60736)
  • Multiple gRPC connections for improved object transfer throughput, enabled by default (#61121, #61440)
  • Improve pg.ready() performance via async GCS RPC; fix deadlocks (#60657, #62086)
  • RDT (Ray Direct Transport): non-torch transfers, PyTorch storage caching, metadata caching, NIXL agent reuse (#61081, #60999, #60689, #60602)
  • Cache ActorHandle.__hash__ and fix __eq__ correctness (#61638)
  • Cache find_gcs_addresses (#61065)
  • Optimize worker listener thread (#61353)
  • Eliminate Python GCS client from state manager get_all_node_info (#61232)
  • Loosen restriction on worker thread count (#62279)
  • Sequence in-order actor tasks per concurrency group instead of globally (#61082); see the concurrency-group sketch after this list
  • Prioritize killing workers that occupy large memory in OOM killer (#60330)
  • Cap exponential backoff attempt number to prevent integer overflow (#61003)
  • Replace deprecated threading APIs (getName/setDaemon) (#62153)
  • Improve error handling for @ray.remote/@ray.method with num_returns (#59286)
  • Convert StopIteration on non-generator functions to RuntimeError (#60521)
  • Surface warnings for scheduling rate limits slowing task ramp-up (#61004)
  • Periodically reload service account tokens; use AuthenticationValidator in sync server (#60778, #60779)
  • Remove support for local_mode (#60647)
  • Allow matching worker_process_setup_hook on re-entry (#61473)
  • Reduce default event aggregator buffer size to avoid OOM (#60826)
  • Suppress autoscaler action logs for read-only provider (#61732)
  • Lazy subscription to node changes on non-driver workers (#61118)
  • Tighten export symbol allowlists to prevent non-ray symbol leakage (#61298)
  • Approximate USS from memory_info instead of calling memory_full_info (#60000)
  • Dedicated IO context for NodeManager and InternalKVManager (#61002)
  • Print gRPC peer address on GCS HandleUnregisterNode/HandleDrainNode (#62226, #62112)
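
A minimal sketch of the per-concurrency-group ordering change (#61082): in-order actor tasks are now sequenced within each concurrency group rather than across the whole actor, so one group's queue no longer blocks another's.

```python
import asyncio

import ray


@ray.remote(concurrency_groups={"io": 1, "compute": 1})
class Worker:
    @ray.method(concurrency_group="io")
    async def fetch(self) -> str:
        await asyncio.sleep(0.1)
        return "fetched"

    @ray.method(concurrency_group="compute")
    async def crunch(self) -> str:
        return "crunched"


w = Worker.remote()
# "compute" tasks stay ordered among themselves but no longer wait
# behind queued "io" tasks.
print(ray.get([w.fetch.remote(), w.crunch.remote()]))
```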

🔨 Fixes

  • Fix task stuck when pop worker repeatedly fails (#60104)
  • Fix bool env var parsing for RAY_CGRAPH_overlap_gpu_communication (#61421)
  • Fix negative RUNNING task metric (#62070)
  • Fix OnNodeDead to destroy all owned actors when owner node dies (#60669)
  • Fix actor task queue blocked after cancelling head task (#60850)
  • Fix TASK_PROFILE_EVENT aggregation for multiple phases (#61559)
  • Fix double-counting in WorkerPool::WarnAboutSize() (#61246)
  • Fix TaskLifecycleEvent.node_id using emitting node instead of executor (#61478)
  • Fix publisher_id type mismatch in GCS pubsub (#61518)
  • Fix dataclass.asdict with None in dashboard list_jobs API (#61033)
  • Fix dashboard node head API dead node cache (#61185)
  • Fix dashboard event agent for events without HTTP scheme (#60811)
  • Fix Ray Actor typing for async methods (#60682); see the sketch after this list
  • Fix autoscaler retry during k8s exceptions (#60658)
  • Fix ReadOnlyProvider.terminate() signature mismatch (#62251)
  • Fix set/get env races in OtlpGrpcMetricExporterOptions and metrics exporter init (#61034, #61281)
  • Clean up node processes on version mismatch during ray start (#61837)
  • Retry node discovery upon ray.init() (#61029)
  • Ensure Node._node_labels initializes regardless of connect_only (#61618)
  • Avoid reentrant locking in worker context (#61925)
  • Java Local Mode type confusion with multiple Actor types (#61858)
  • Recover from WrongClusterID on head restart (#60860)
  • Fix Azure: do not delete shared MSI when tearing down clusters (#61811)
  • Configure TLS/mTLS for OpenTelemetry OTLP gRPC exporter (#60745)
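
Relating to the async-method typing fix (#60682), a minimal sketch of what should now type-check cleanly; the actor below is illustrative only.

```python
import ray


@ray.remote
class Counter:  # illustrative actor
    def __init__(self) -> None:
        self.n = 0

    async def incr(self) -> int:
        self.n += 1
        return self.n


c = Counter.remote()
# The remote call on an async method should now be typed as
# ObjectRef[int] rather than an ObjectRef of the coroutine type.
ref = c.incr.remote()
print(ray.get(ref))  # -> 1
```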

Dashboard

🎉 New Features

  • Add Queued Blocks metric to Ray Data Dashboard (#61716)
  • Add Logical Memory Usage panel (#60772)
  • Add running tasks by node, update Ray Data active tasks panel (#61641)
  • Add NIXL KV transfer metrics to Serve LLM Grafana dashboard (#60819)
  • Add GPU power and temperature graphs (#60942)
  • Support log links in Grafana dashboard (#60896)
  • Support autoscaler v2 for cluster-level node metrics (#60504)
  • Add middleware proxy for history server (#61295)
  • Forward **kwargs through JobSubmissionClient to cluster info resolvers (#61902)
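
For the JobSubmissionClient change (#61902), keyword arguments passed to the client are now forwarded to the cluster info resolvers; a minimal sketch using the existing headers keyword (address and header values are placeholders):

```python
from ray.job_submission import JobSubmissionClient

# Extra keyword arguments given here are now forwarded through to the
# cluster info resolvers. Address and headers are placeholders.
client = JobSubmissionClient(
    "http://127.0.0.1:8265",
    headers={"Authorization": "Bearer <token>"},
)
job_id = client.submit_job(entrypoint="python -c 'print(42)'")
print(client.get_job_status(job_id))
```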

Ray Wheels and Images

  • Upgrade Bazel from 6.5.0 to 7.5.0 (#61601)
  • Bump torch to 2.7.0+cu128 and torchvision (#61328)
  • Upgrade jackson-databind 2.16.1 -> 2.18.6 (GHSA-72hv-8253-57qq) (#61808)
  • Upgrade CI containers from Ubuntu 20.04 to 22.04; Forge from clang-12 to clang-14 (#61533, #61662)
  • Add CUDA 13 images for ray-llm/core-gpu and release test configs (#61497, #61637)
  • Add py312+CUDA 12.9 and py312+CUDA 13 depsets for Ray LLM (#61116, #61149, #61496)
  • Add TPU Docker images to CI build and publish pipeline (#61172, #61173, #61174, #61175)
  • Add Python 3.14 to Linux wheel verification (#62127)
  • Fix Windows base build (#62415)
  • Add build-image.sh and CLI for local Docker image builder (#61042, #61338)
  • Support repeated execution of setup-dev.py (#61357)

Documentation

  • Add Ray History Server user guide (#62030)
  • Add RAY_BACKEND_LOG_JSON environment variable documentation (#59962)
  • Add user guide for Ray token auth with Kubernetes RBAC (#61644)
  • Add warning about token authentication in untrusted networks (#62248)
  • KubeRay: prerunning deadline docs, version 1.6.0 refs, GKE/cgroups cross-refs (#61552, #61865, #62140)
  • Use RayCluster name as ServiceAccount name for RBAC authentication (#61785)
  • Remove outdated note on labels in local RayCluster (#61719)
  • List TPUs as fully tested/supported (#61634)
  • Restructure development.rst with image build, wheel paths, and cross-references (#61500, #61501, #61504, #61596)
  • Remove incorrect warning for placement groups (#61176)
  • Add multi-agent A2A example (#61193)
  • Add object spill internal doc (#60930)

Thanks

Many thanks to all those who contributed to this release!

@justinyeh1995, @marwan116, @jddqd, @MkDev11, @mjd3, @XuQianJin-Stars, @elliot-barn, @DeborahOlaboye, @aaronscalene, @rayhhome, @ayushk7102, @bj-son, @nadongjun, @Daraan, @xinyuangui2, @Sparks0219, @justinvyu, @suppagoddo, @akyang-anyscale, @ambicuity, @Aydin-ab, @mickeyyliu, @MatthewCWeston, @vaishdho1, @jinbum-kim, @eicherseiji, @kouroshHakha, @karticam, @JasonLi1909, @ArturNiederfahrenhorst, @moktamd, @nrghosh, @dragongu, @andrewsykim, @mgchoi239, @ruoliu2, @harshit-anyscale, @Chong-Li, @pseudo-rnd-thoughts, @lee1258561, @khluu, @daiping8, @SolitaryThinker, @jonalee99, @yancanmao, @SohamRajpure, @rueian, @VitaliyEroshin, @Future-Outlier, @nehiljain, @JiangJiaWei1103, @Yicheng-Lu-llll, @KaisennHu, @jeffreywang-anyscale, @aslonnie, @alanwguo, @machichima, @limarkdcunha, @codope, @sampan-s-nayak, @kyuds, @thjung123, @abrarsheikh, @wingkitlee0, @preneond, @7ckingBest, @slfan1989, @win5923, @kaori-seasons, @israbbani, @andrew-anyscale, @zestze, @owenowenisme, @edoakes, @laysfire, @pushpavanthar, @tohtana, @leewyang, @liulehui, @Hyunoh-Yeo, @eureka0928, @ryanaoleary, @947132885, @Kunchd, @simonsays1980, @dpj135, @bveeramani, @raulchen, @Partth101, @dubin555, @richabanker, @bittoby, @sai-miduthuri, @RedGrey1993, @kamil-kaczmarek, @TimothySeah, @myandpr, @rishic3, @justinrmiller, @HassamSheikh, @chiayi, @petern48, @carolynwang, @MrKWatkins, @400Ping, @summaryzb, @peterxcli, @RocMarshal, @coqian, @yuhuan130, @ryankert01, @dayshah, @Anarion-zuo, @ZacAttack, @weimingdiit, @iamjustinhsu, @matthewdeng, @goutamvenkat-anyscale, @KeeProMise, @Sanskarzz, @yuchen-ecnu, @praneethkaturi, @rajeshg007, @ankur-anyscale, @Art0white, @xyuzh, @dancingactor, @MengjinYan, @dengkliu92, @alexeykudinkin
