Ray Data
🎉 New Features
- Add `DataSourceV2` API with scanner/reader framework, file listing, and file partitioning (#61220, #61615, #61997)
- Support GPU shuffle with `rapidsmpf` 26.2 (#61371, #62062)
- Add Kafka datasink, migrate to `confluent-kafka`, support `datetime` offsets (#60307, #61284, #60909)
- Add Turbopuffer datasink (#58910)
- Add 2-phase commit checkpointing with trie recovery and load method (#61821, #60951)
- Queue-based autoscaling policy integrated with task consumers (#59548, #60851)
- Enable autoscaling for GPU stages (#61130)
- Expressions: add `random()`, `uuid()`, `cast`, and map namespace support (#59656, #60695, #59879)
- Add support for Arrow native fixed-shape tensor type (#56284)
- Support writing tensors to tfrecords (#60859)
- Add `pathlib.Path` support to `read_*` functions (#61126)
- Add `cudf` as a `batch_format` (#61329)
- Allow `ActorPoolStrategy` for `read_datasource()` via `compute` parameter (#59633)
- Introduce `ExecutionCache` for streamlined caching (#60996)
- Support `strict=False` mode for `StreamingRepartition` (#60295)
- Port changes from lance-ray into Ray Data (#60497)
- Enable PyArrow compute-to-expression conversion for predicate pushdown (#61617)
- Add vLLM metrics export and Data LLM Grafana dashboard (#60385)
- Include logical memory in resource manager scheduling decisions (#60774)
- Add monotonically increasing ID support (#59290)
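The queue-based autoscaling work above sizes a stage by queued work rather than only in-flight tasks. A minimal sketch of that idea in plain Python (the function and parameter names here are illustrative, not Ray's API):

```python
import math

def desired_actors(queued_tasks: int, in_flight_tasks: int,
                   target_tasks_per_actor: int,
                   min_actors: int, max_actors: int) -> int:
    """Size an actor pool from total outstanding work (queued + running),
    clamped to the pool's configured bounds."""
    outstanding = queued_tasks + in_flight_tasks
    want = math.ceil(outstanding / target_tasks_per_actor) if outstanding else 0
    return max(min_actors, min(max_actors, want))
```

For example, with 90 queued and 10 running tasks at a target of 20 tasks per actor, the policy asks for 5 actors; with an empty queue it falls back to the pool minimum.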
💫 Enhancements
- Performance: cache `_map_task` args, heap-based actor ranking, actor pool map improvements (#61996, #62114, #61591)
- Optimize concat tables and PyArrow schema hashing (#61315, #62108)
- Reduce default `DownstreamCapacityBackpressurePolicy` threshold to 50% (#61890)
- Improve reproducibility for random APIs (#59662)
- Clamp batch size to fall within C++ 32-bit int range (#62242)
- Account for external consumer object store usage in resource manager budget (#62117)
- Make `get_parquet_dataset` configurable in number of fragments to scan (#61670)
- Consolidate schema inference and make all preprocessors implement `SerializablePreprocessorBase` (#61213, #61341)
- Disable hanging issue detection by default (#62405)
- Make execution callback dataflow explicit to prevent state leakage (#61405)
- Log `DataContext` in JSON format at execution start for traceability (#61150, #61428)
- Autoscaler: configurable traceback, Prometheus gauges, relaxed constraints (#62210, #62209, #61917, #61385)
- Add metrics for task scheduling time, output backpressure, and logical memory (#61192, #61007, #61436)
- Prevent operators from dominating entire shared object store budget (#61605)
- Eliminate generators to avoid intermediate state pinning (#60598)
- Default log encoding to UTF-8 on Windows (#61143)
- Remove legacy `BlockList`, `locality_with_output`, old callback API, PyArrow 9.0 checks (#60575, #61044, #62055, #61483)
- Upgrade to `pyiceberg` 0.11.0; cap `pandas` to <3 (#61062, #60406)
- Refactor logical operators to frozen dataclasses (#61059, #61308, #61348, #61349, #61351, #61364, #61481)
- Prevent aggregator head node scheduling (#61288)
- Add error for `local://` paths with a zero-resource head node (#60709)
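The heap-based actor ranking mentioned among the performance items avoids fully re-sorting the actor pool on each scheduling decision. A standalone sketch of the technique (not Ray Data's code; names are illustrative):

```python
import heapq

def pick_least_loaded(actor_loads: dict, k: int) -> list:
    """Rank actors by in-flight task count and return the k least loaded.

    heapq.nsmallest runs in O(n log k), cheaper than sorting the whole
    pool when only the top few candidates are needed per decision.
    """
    pairs = ((load, actor) for actor, load in actor_loads.items())
    return [actor for _, actor in heapq.nsmallest(k, pairs)]
```

Given loads `{"a": 3, "b": 1, "c": 2}`, asking for two candidates returns `["b", "c"]`.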
🔨 Fixes
- Fix RCE in Arrow extension type deserialization from Parquet (#62056)
- Fix `StreamingSplitDataIterator.schema()` (#62057)
- Fix `ParquetDatasource` handling of `FileSystemFactory.inspect` (#62065)
- Fix `read_parquet` file-extension filtering for versioned object-store URIs (#61376)
- Fix `wide_schema_pipeline_tensors` cloudpickle deserialization (#62149)
- Fix `OpBufferQueue` race condition (#60828)
- Fix scheduling metrics computation (#62031)
- Fix `OneHotEncoder` `max_categories` to use global top-k instead of per-partition (#60790)
- Fix `ReservationOpResourceAllocator` resource borrowing for `ActorPoolMapOperator` (#60882)
- Fix `DatabricksUCDatasource` `schema()` shadowing by schema string attribute (#61282)
- Fix `AliasExpr` structural equality to respect rename flag (#60711)
- Fix `_align_struct_fields` failure with unaligned scalar fields (#58364)
- Fix `min_scheduling_resources` fallback to `incremental_resource_usage` (#60997)
- Fix output backpressure unblocking sequence for terminal ops (#60798)
- Fix multi-input operator object store memory attribution (#61208)
- Fix reference cycle by moving to module scope (#61934)
- Fix autoscaler logging: reduce verbose output and move traceback to debug (#61989, #62126)
- Fix double counting `ref_bundle` + `input_files` (#61774)
- Replace `on_exit` hook with `__ray_shutdown__` to fix UDF cleanup race (#61700)
- Prevent `Limit` from getting pushed past `map_groups` (#60881)
- Propagate schema in empty `_shuffle_block` to fix `ColumnNotFound` in chained left joins (#61507)
- Fix unclear metadata warning and incorrect operator name logging (#61380)
- Clamp rolling utilization averages to zero (#61543)
- Fix floating point errors in `TimeWindowAverageCalculator` (#61580)
- Remove default task-level timeout and clamp `end_offset` in Kafka datasource (#61476)
- Avoid redundant reads in `train_test_split` (#60274)
- Return `None` when no outputs have been produced (#62029)
- Replace bare `raise` with `TypeError` in string concatenation (#60795)
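The `OneHotEncoder` fix above replaces per-partition top-k with a global top-k: category counts are merged across partitions first and only then truncated. A self-contained sketch of the corrected aggregation (pure Python, not the Ray Data implementation):

```python
from collections import Counter

def global_top_k_categories(partition_counts, k: int) -> list:
    """Merge per-partition category counts, then take the global top-k.

    Taking top-k inside each partition and merging afterwards can drop a
    category that is moderately frequent everywhere but never a local
    leader -- the failure mode the fix addresses.
    """
    total = Counter()
    for counts in partition_counts:
        total.update(counts)
    return [category for category, _ in total.most_common(k)]
```

For counts `[Counter(a=3, b=2), Counter(c=3, b=2)]`, category `b` (global count 4) wins the top slot even though it leads neither partition.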
📖 Documentation
- Add job-level checkpointing documentation (#60921)
- Update `exclude_resources` docs for Train autoscaling changes (#61990)
- Add `locality_with_output` migration instructions (#61151)
- Document `max_tasks_in_flight_per_actor` vs `max_concurrent_batches` (#60477)
- Add missing `MOD` operation docs; improve `ray.data.Datasource` docs (#60803, #59654)
- Add `polars` usage instructions (#60029)
Ray Serve
🎉 New Features:
- Added end-to-end gRPC client and bidirectional streaming support, including public APIs, proxy handling, proto updates, and developer docs, so Serve apps can handle streaming workloads natively instead of building custom transport layers. (#60767, #60768, #60769, #60770, #60771)
- Introduced HAProxy-based serving with fallback proxy support and load-balancer tunables, giving operators a higher-throughput ingress path and more control over traffic behavior in production. (#60586, #61180, #61271, #61468, #61988)
- Added queue-based autoscaling for async inference and Taskiq-backed workloads, so scaling decisions can account for both HTTP in-flight load and queued tasks. (#59548, #60851, #60977, #61008)
- Rolled out gang scheduling support across validation, core scheduling, fault tolerance, downscaling, autoscaling, rolling updates, and migration, enabling coordinated multi-replica placement for tightly coupled workloads. (#60944, #61205, #61206, #61207, #61215, #61467, #61216, #61659)
- Introduced deployment-scoped actors with config/schema, lifecycle management, public API, and controller health checks, making it easier to run durable per-deployment sidecar-like logic inside Serve. (#61639, #61648, #61664, #61833, #62161)
💫 Enhancements:
- Added first-class tracing support for Serve, including inter-deployment gRPC propagation and richer streaming-path attributes, improving end-to-end observability across distributed request flows. (#61230, #61089, #61451)
- Expanded operational metrics with replica utilization, richer error labeling, and client IP logging in access logs, helping teams diagnose bottlenecks and user-impacting issues faster. (#60758, #61092, #60967)
- Improved autoscaling extensibility with class-based policies and `policy_kwargs`, so advanced users can package reusable autoscaling logic without custom forks. (#60964)
- Reduced controller overhead with broad algorithmic improvements (indexing, cache reuse, and avoiding repeated per-tick work), which improves scalability as deployment and replica counts grow. (#60810, #60829, #60830, #60838, #60842, #60843, #60844, #60832, #60806)
- Improved throughput-oriented operation controls by adding environment-based tuning and explicit throughput optimization logging, making performance behavior easier to configure and audit. (#60757, #62146)
- Upgraded Serve internals to Pydantic v2 and refined time-series aggregation behavior for more predictable metric accuracy under high load. (#61061, #61403)
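Class-based policies with `policy_kwargs` let autoscaling logic ship as a reusable, parameterized object. The sketch below only illustrates the packaging pattern; the class, method, and keyword names are hypothetical and do not reproduce Serve's actual policy interface:

```python
class ThresholdPolicy:
    """Scale up when average queue length per replica crosses a threshold."""

    def __init__(self, upscale_threshold: float = 2.0,
                 downscale_threshold: float = 0.5):
        self.upscale_threshold = upscale_threshold
        self.downscale_threshold = downscale_threshold

    def decide(self, current_replicas: int, avg_queue_len: float) -> int:
        # Add or remove one replica at a time based on observed load.
        if avg_queue_len > self.upscale_threshold:
            return current_replicas + 1
        if avg_queue_len < self.downscale_threshold and current_replicas > 1:
            return current_replicas - 1
        return current_replicas

# The kwargs dict is what a config file would carry; the framework
# instantiates the class with it.
policy = ThresholdPolicy(**{"upscale_threshold": 4.0})
```

Keeping tunables in `policy_kwargs` means the same class can back many deployments with different thresholds, without forking the policy code.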
🔨 Fixes:
- Fixed a direct-ingress shutdown bug where replicas could hang indefinitely while draining stuck requests, ensuring bounded shutdown behavior in failure scenarios. (#60754)
- Fixed HAProxy reliability issues, including config race conditions, draining guards, and platform compatibility edge cases, improving stability in production rollouts. (#61120, #60955)
- Fixed autoscaling correctness issues that could cause runaway scaling or delayed reactions, including feedback-loop regressions, streaming scale-down behavior, and wall-clock delay handling. (#61731, #61920, #62331, #61844, #60613)
- Fixed high-percentile latency regression in request routing and queue-length accounting, reducing tail-latency spikes under load. (#61755)
- Fixed replica-state and health-state edge cases during migration and ingress transitions, preventing false errors and unhealthy/healthy misreporting. (#60365, #61818, #62213)
- Fixed chained upstream actor-failure handling so request failures are attributed correctly and no longer hang when upstream deployments die mid-chain. (#61758, #62147)
- Fixed HTTP status classification for client disconnects after successful responses, improving accuracy of error-rate monitoring and alerting. (#61396)
📖 Documentation:
- Added `AsyncInferenceAutoscalingPolicy` documentation and clarified Serve performance guidance for HAProxy and inter-deployment gRPC use cases. (#61086, #61386)
- Updated scheduling and configuration docs, including replica scheduling guidance and a catalog of Serve environment variables, so operators can tune deployments with less guesswork. (#60922, #60807)
- Clarified multiplexing and async behavior docs (including model pre-warming constraints and request-cancel semantics) to prevent common integration mistakes. (#61842, #62280)
🏗 Architecture refactoring:
- Refactored deployment-state execution to skip unnecessary steady-state per-tick work, lowering control-loop churn and creating cleaner hooks for future scheduling logic. (#60840)
- Moved autoscaling metric aggregation into Cython-backed paths and added focused controller benchmarking, giving a stronger performance baseline for future Serve controller changes. (#58892, #61368)
- Simplified internal structure by migrating shared internals away from private modules and consolidating replica abstractions, reducing coupling and maintenance complexity. (#60849, #61363, #60198)
Ray Train
🎉 New Features
- Elastic training: core capability, user guide, release tests, multi-host TPU, telemetry (#60721, #61115, #61133, #61299, #61267)
- Add HF TRL (Transformer Reinforcement Learning) example (#61627)
- Add Tensor Parallel templates for DeepSpeed AutoTP and DTensor (#60160, #60158)
- Add `status` attribute to `ReportedCheckpoint` (#61684)
- Richer Train run metadata (#59186)
- Add timers for Train worker initialization (#60870)
- Configure `torchft` environment (#61156)
💫 Enhancements
- Register training resources with `AutoscalingCoordinator` in `FixedScalingPolicy` (#61703)
- Decouple `datasets` field from `TrainRunContext` (#61953)
- Log warning for `checkpoint_upload_fn` when slow (#61720)
- Fix `StateManagerCallback` to accept datasets explicitly (#62042)
- Make train run abortable during `before_controller_shutdown` (#61816)
- Graceful abort catches all `RayActorError` (#61375)
- Refactor checkpoint and `sync_actor` to use `wait_with_logging` (#61063)
- Unwrap `UserExceptionWithTraceback` in `WorkerGroupError.worker_failures` (#61153)
🔨 Fixes
- Fix v2 `PlacementGroupCleaner` zombie actor (#61756)
- Fix checkpoint paths for multinode run (#61471)
- Abort cancels validation tasks with deterministic resumption (#61510)
- Fix deepspeed finetune release test (#61266)
📖 Documentation
- Add section on async validation with experiment tracking (#62104)
- Add section on when to use async validation (#61702)
Ray Tune
💫 Enhancements
- Remove deprecated `Logger` interface and `logger_creator` (#61181)
🔨 Fixes
- Fix PBT trial order when `NaN` values are present (#57160)
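The PBT fix concerns ordering trials when some scores are `NaN`: `NaN` compares false against everything, so a naive sort can interleave failed trials unpredictably. A common remedy, shown here as an illustrative sketch rather than Tune's code, is a sort key that pushes `NaN` scores to the bottom explicitly:

```python
import math

def rank_trials(scores: dict) -> list:
    """Order trial ids best-score-first, with NaN scores always last."""
    def key(trial_id):
        s = scores[trial_id]
        # First element: NaN sorts after non-NaN. Second element: among
        # valid scores, higher is better (so negate for ascending sort).
        return (math.isnan(s), -s if not math.isnan(s) else 0.0)
    return sorted(scores, key=key)
```

Given `{"a": 0.1, "b": float("nan"), "c": 0.9}`, the ranking is `["c", "a", "b"]`: the `NaN` trial is deterministically last instead of landing wherever the comparison happens to leave it.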
Ray LLM
🎉 New Features
- Replace `PDProxyServer` with decode-as-orchestrator PD architecture (#62076)
- Introduce DP group fault tolerance for WideEP deployments (#61480)
- SGLang engine: streaming chat/completions, tokenize/detokenize, embeddings, multi-GPU TP/PP (#61236, #61446, #61159, #61201, #62221)
- Add `bundle_per_worker` config for simpler placement group setup (#59903)
- Separate Data and Serve LLM dashboards with improved panel visibility (#61037, #62069)
💫 Enhancements
- Promote Data LLM and Serve LLM APIs to beta (#61249, #62054, #62223)
- Upgrade vLLM to 0.16.0, 0.17.0, and 0.18.0 (#61389, #61598, #61952)
- Upgrade NIXL to v1.0.0 and fix tensor transport issues (#61991)
- Unify duplicated `PlacementGroup` config schemes (#62241)
- Decouple Serve LLM ingress from vLLM protocol models (#61931)
- Set download task `num_cpus=0` to reduce contention on low-CPU machines (#61191)
- SGLangServer cleanup and replace `format_messages_to_prompt` with `_build_chat_messages` (#61117, #61372)
🔨 Fixes
- Fix duplicate `data: [DONE]` in streaming SSE responses (#62246)
- Fix `enable_log_requests=False` not forwarded to vLLM `AsyncLLM` (#60824)
- Fix `OpenAiIngress` scale-to-zero when all models set `min_replicas=0` (#60836)
- Handle missing state attributes from vLLM's task-conditional `init_app_state` (#60812)
- Fix NIXL side channel host for cross-node P/D disaggregation (#60817)
- Fix `trust_remote_code` download (#60344)
- Avoid deprecated `TRANSFORMERS_CACHE`; treat HuggingFace config load failure as non-fatal (#60854)
- Fix sequential batch processing in SGLangServer (#61189)
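The SSE fix above enforces the invariant that the stream terminator `data: [DONE]` appears exactly once. A minimal generator sketch of that invariant (illustrative only, not the Ray LLM code):

```python
DONE = "data: [DONE]\n\n"

def terminate_once(events):
    """Pass SSE events through, dropping any DONE markers the upstream
    emits, then append exactly one terminator when the stream ends."""
    for event in events:
        if event != DONE:
            yield event
    yield DONE
```

Even if an upstream engine emits the sentinel twice (for example once per internal stage), the client sees a single, well-placed terminator.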
📖 Documentation
- Update data parallel attention documentation (#61706)
- Add custom tokenizer example (#61098)
- Add C/C++ binaries incompatibility workaround (#62110)
Ray RLlib
💫 Enhancements
- Connector/batching optimizations: ndarray fast paths, direct env step pipeline, batch reuse (#61320, #61255, #61256, #61259, #61144)
- Unify default encoders for all algorithms (#60302)
- Toggle eval/train mode in `TorchRLModule` forward passes (#61985)
- Clean up offline prelearner and unit testing (#60632)
- Remove duplicate assignments in `AlgorithmConfig` (#61233)
- Remove legacy RLlib release tests (#59288)
- Add APPO example with Footsies environment (#59006)
🔨 Fixes
- Support custom eval functions returning zero `eval_results`, `env_steps`, or `agent_steps` (#61563)
- Fix `PrioritizedEpisodeReplayBuffer` bug (#60065)
- Fix missing `LayerNorm` in `RLModuleSpec` (#61025)
- Fix evaluation in parallel to training (#60777)
- Fix `MultiAgentEpisode.env_t_to_agent_t` (#60319)
- Fix default metric during eval (#61590)
- Fix incorrect log value of environment steps sampled/trained (#56599)
- Prevent `torch_learner.py` crash under parameter-freezing edge cases (#62158)
Ray Core
🎉 New Features
- Resource isolation: pressure-based memory monitor, time-based killing, cgroup constraints (#61361, #61323, #61097, #61210, #61297, #59365, #59368, #60752)
- IPPR: add `ResizeRayletResourceInstances` to GCS/Python client, schema/status models, KubeRay provider (#61654, #61666, #61803, #61814)
- Add `PlatformEvent` proto and placement group events in one-event framework (#61701, #60449)
- Add Nvidia B300 support (#60753)
- Add UV support for Ray Client mode (#60868)
- Add `Percentile` metric type backed by quadratic histogram (#61148)
- Expose `fallback_strategy` in `TaskInfoEntry` and `ActorTableData` (#60659)
- Add submission job proto changes (#60857)
- Add TPU util for ready multi-host slice count; simplify elastic TPU scaling (#61300, #62141)
- Introduce per-node level temp-dir (#60761)
- Make `ray.put()` generic: `put(value: R) -> ObjectRef[R]` (#60995)
- Add Python 3.14 support for recursion limit handling (#58459)
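The new `Percentile` metric type is described as backed by a quadratic histogram: bucket boundaries grow quadratically, so small values keep fine resolution while the total bucket count stays bounded. A toy sketch of estimating a percentile from such buckets (the bucketing scheme and all names here are assumptions for illustration, not Ray's implementation):

```python
import bisect

def quadratic_bounds(n_buckets: int, scale: float = 1.0) -> list:
    """Bucket upper bounds b_i = scale * i^2 for i = 1..n_buckets."""
    return [scale * i * i for i in range(1, n_buckets + 1)]

class QuadraticHistogram:
    def __init__(self, n_buckets: int = 100, scale: float = 1.0):
        self.bounds = quadratic_bounds(n_buckets, scale)
        self.counts = [0] * (n_buckets + 1)  # final slot catches overflow

    def record(self, value: float) -> None:
        # bisect finds the first bucket whose upper bound covers the value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def percentile(self, q: float) -> float:
        """Return the upper bound of the bucket holding the q-th percentile."""
        total = sum(self.counts)
        target = q / 100.0 * total
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= target and count:
                return self.bounds[min(i, len(self.bounds) - 1)]
        return self.bounds[-1]
```

The estimate is quantized to a bucket boundary; the quadratic spacing trades a little accuracy at large values for constant memory per metric.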
💫 Enhancements
- Upgrade `cloudpickle` to 3.1.2, gRPC to v1.58.0, protobuf to 3.20.3 (#60317, #61499, #60736)
- Multiple gRPC connections for improved object transfer throughput, enabled by default (#61121, #61440)
- Improve `pg.ready()` performance via async GCS RPC; fix deadlocks (#60657, #62086)
- RDT: non-torch transfers, PyTorch storage caching, metadata caching, NIXL agent reuse (#61081, #60999, #60689, #60602)
- Cache `ActorHandle.__hash__` and fix `__eq__` correctness (#61638)
- Cache `find_gcs_addresses` (#61065)
- Optimize worker listener thread (#61353)
- Eliminate Python GCS client from state manager `get_all_node_info` (#61232)
- Loosen restriction on worker thread count (#62279)
- Sequence in-order actor tasks per concurrency group instead of globally (#61082)
- Prioritize killing workers that occupy large memory in OOM killer (#60330)
- Cap exponential backoff attempt number to prevent integer overflow (#61003)
- Replace deprecated threading APIs (`getName`/`setDaemon`) (#62153)
- Improve error handling for `@ray.remote`/`@ray.method` with `num_returns` (#59286)
- Convert `StopIteration` on non-generator functions to `RuntimeError` (#60521)
- Surface warnings for scheduling rate limits slowing task ramp-up (#61004)
- Periodically reload service account tokens; use `AuthenticationValidator` in sync server (#60778, #60779)
- Remove support for `local_mode` (#60647)
- Allow matching `worker_process_setup_hook` on re-entry (#61473)
- Reduce default event aggregator buffer size to avoid OOM (#60826)
- Suppress autoscaler action logs for read-only provider (#61732)
- Lazy subscription to node changes on non-driver workers (#61118)
- Tighten export symbol allowlists to prevent non-ray symbol leakage (#61298)
- Approximate USS from `memory_info` instead of calling `memory_full_info` (#60000)
- Dedicated IO context for `NodeManager` and `InternalKVManager` (#61002)
- Print gRPC peer address on GCS `HandleUnregisterNode`/`HandleDrainNode` (#62226, #62112)
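Among the enhancements above, the exponential-backoff fix caps the attempt number before exponentiation so `2**attempt` cannot overflow a fixed-width delay counter. The standard pattern looks like this (an illustrative Python sketch, not the raylet's C++ code; names are made up):

```python
def backoff_delay_ms(attempt: int, base_ms: int = 100,
                     max_delay_ms: int = 60_000, max_exponent: int = 16) -> int:
    """Exponential backoff with the exponent clamped before use.

    Clamping `attempt` keeps 2**exponent small enough that the
    multiplication can never overflow a 64-bit counter, independent of
    the final min() against max_delay_ms.
    """
    exponent = min(attempt, max_exponent)
    return min(base_ms * (2 ** exponent), max_delay_ms)
```

Without the clamp, a long-retrying task eventually computes `2**attempt` for an attempt number in the hundreds, which wraps or saturates a fixed-width integer before the ceiling is ever applied.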
🔨 Fixes
- Fix task stuck when pop worker repeatedly fails (#60104)
- Fix `bool` env var parsing for `RAY_CGRAPH_overlap_gpu_communication` (#61421)
- Fix negative RUNNING task metric (#62070)
- Fix `OnNodeDead` to destroy all owned actors when owner node dies (#60669)
- Fix actor task queue blocked after cancelling head task (#60850)
- Fix `TASK_PROFILE_EVENT` aggregation for multiple phases (#61559)
- Fix double-counting in `WorkerPool::WarnAboutSize()` (#61246)
- Fix `TaskLifecycleEvent.node_id` using emitting node instead of executor (#61478)
- Fix `publisher_id` type mismatch in GCS pubsub (#61518)
- Fix `dataclass.asdict` with `None` in dashboard `list_jobs` API (#61033)
- Fix dashboard node head API dead node cache (#61185)
- Fix dashboard event agent for events without HTTP scheme (#60811)
- Fix Ray Actor typing for async methods (#60682)
- Fix autoscaler retry during k8s exceptions (#60658)
- Fix `ReadOnlyProvider.terminate()` signature mismatch (#62251)
- Fix set/getenv races in `OtlpGrpcMetricExporterOptions` and metrics exporter init (#61034, #61281)
- Clean up node processes on version mismatch during `ray start` (#61837)
- Retry node discovery upon `ray.init()` (#61029)
- Ensure `Node._node_labels` initializes regardless of `connect_only` (#61618)
- Avoid reentrant locking in worker context (#61925)
- Fix Java Local Mode type confusion with multiple Actor types (#61858)
- Recover from `WrongClusterID` on head restart (#60860)
- Fix Azure: do not delete shared MSI when tearing down clusters (#61811)
- Configure TLS/mTLS for OpenTelemetry OTLP gRPC exporter (#60745)
Dashboard
🎉 New Features
- Add Queued Blocks metric to Ray Data Dashboard (#61716)
- Add Logical Memory Usage panel (#60772)
- Add running tasks by node, update Ray Data active tasks panel (#61641)
- Add NIXL KV transfer metrics to Serve LLM Grafana dashboard (#60819)
- Add GPU power and temperature graphs (#60942)
- Support log links in Grafana dashboard (#60896)
- Support autoscaler v2 for cluster-level node metrics (#60504)
- Add middleware proxy for history server (#61295)
- Forward `**kwargs` through `JobSubmissionClient` to cluster info resolvers (#61902)
Ray Wheels and Images
- Upgrade Bazel from 6.5.0 to 7.5.0 (#61601)
- Bump `torch` to 2.7.0+cu128 and `torchvision` (#61328)
- Upgrade `jackson-databind` 2.16.1 -> 2.18.6 (GHSA-72hv-8253-57qq) (#61808)
- Upgrade CI containers from Ubuntu 20.04 to 22.04; Forge from clang-12 to clang-14 (#61533, #61662)
- Add CUDA 13 images for ray-llm/core-gpu and release test configs (#61497, #61637)
- Add py312+CUDA 12.9 and py312+CUDA 13 depsets for Ray LLM (#61116, #61149, #61496)
- Add TPU Docker images to CI build and publish pipeline (#61172, #61173, #61174, #61175)
- Add Python 3.14 to Linux wheel verification (#62127)
- Windows base build fix (#62415)
- Add `build-image.sh` and CLI for local Docker image builder (#61042, #61338)
- Support repeated execution of `setup-dev.py` (#61357)
Documentation
- Add Ray History Server user guide (#62030)
- Add `RAY_BACKEND_LOG_JSON` environment variable documentation (#59962)
- Add user guide for Ray token auth with Kubernetes RBAC (#61644)
- Add warning about token authentication in untrusted networks (#62248)
- KubeRay: prerunning deadline docs, version 1.6.0 refs, GKE/cgroups cross-refs (#61552, #61865, #62140)
- Use `RayCluster` name as `ServiceAccount` name for RBAC authentication (#61785)
- Remove outdated note on labels in local `RayCluster` (#61719)
- List TPUs as fully tested/supported (#61634)
- Restructure `development.rst` with image build, wheel paths, and cross-references (#61500, #61501, #61504, #61596)
- Remove incorrect warning for placement groups (#61176)
- Add multi-agent A2A example (#61193)
- Add object spill internal doc (#60930)
Thanks
Many thanks to all those who contributed to this release!
@justinyeh1995, @marwan116, @jddqd, @MkDev11, @mjd3, @XuQianJin-Stars, @elliot-barn, @DeborahOlaboye, @aaronscalene, @rayhhome, @ayushk7102, @bj-son, @nadongjun, @Daraan, @xinyuangui2, @Sparks0219, @justinvyu, @suppagoddo, @akyang-anyscale, @ambicuity, @Aydin-ab, @mickeyyliu, @MatthewCWeston, @vaishdho1, @jinbum-kim, @eicherseiji, @kouroshHakha, @karticam, @JasonLi1909, @ArturNiederfahrenhorst, @moktamd, @nrghosh, @dragongu, @andrewsykim, @mgchoi239, @ruoliu2, @harshit-anyscale, @Chong-Li, @pseudo-rnd-thoughts, @lee1258561, @khluu, @daiping8, @SolitaryThinker, @jonalee99, @yancanmao, @SohamRajpure, @rueian, @VitaliyEroshin, @Future-Outlier, @nehiljain, @JiangJiaWei1103, @Yicheng-Lu-llll, @KaisennHu, @jeffreywang-anyscale, @aslonnie, @alanwguo, @machichima, @limarkdcunha, @codope, @sampan-s-nayak, @kyuds, @thjung123, @abrarsheikh, @wingkitlee0, @preneond, @7ckingBest, @slfan1989, @win5923, @kaori-seasons, @israbbani, @andrew-anyscale, @zestze, @owenowenisme, @edoakes, @laysfire, @pushpavanthar, @tohtana, @leewyang, @liulehui, @Hyunoh-Yeo, @eureka0928, @ryanaoleary, @947132885, @Kunchd, @simonsays1980, @dpj135, @bveeramani, @raulchen, @Partth101, @dubin555, @richabanker, @bittoby, @sai-miduthuri, @RedGrey1993, @kamil-kaczmarek, @TimothySeah, @myandpr, @rishic3, @justinrmiller, @HassamSheikh, @chiayi, @petern48, @carolynwang, @MrKWatkins, @400Ping, @summaryzb, @peterxcli, @RocMarshal, @coqian, @yuhuan130, @ryankert01, @dayshah, @Anarion-zuo, @ZacAttack, @weimingdiit, @iamjustinhsu, @matthewdeng, @goutamvenkat-anyscale, @KeeProMise, @Sanskarzz, @yuchen-ecnu, @praneethkaturi, @rajeshg007, @ankur-anyscale, @Art0white, @xyuzh, @dancingactor, @MengjinYan, @dengkliu92, @alexeykudinkin