Highlights
- Ray Data Stability: In this Ray release, we've added a variety of stability improvements, including running multiple datasets in a cluster, adding automatic batch size selection to CPU-based `map_batches`, and default logical memory configuration to prevent OOMs. We've also tightened `iter_batches` stability by reducing hidden buffering and shutting down the executor when consumers exit early (#63660, #63682, #62949). This reduces object-store spilling for common training workloads.
- Ray Serve: We re-architected Ray Serve LLM by decoupling request handling from the token streaming response path (#62667, #62680, #62668, #62669, #63167), resulting in significant LLM serving performance improvements. We've also introduced new routing policies such as session-sticky routing via consistent hashing with `ConsistentHashRouter` (#62905, #63096, #62906) and `CapacityQueueRouter` (#62323), which is beneficial for supply-constrained workloads.
- Ray Core: We've added GPU-domain-aware placement groups using label locality (#61442, #61614, #62487, #62533). This enables placement groups to pack bundles onto nodes that share a `ray.io/gpu-domain` label instead of only packing at the single-node level. We've also added initial Kubernetes in-place pod resizing support for Autoscaler v2 (#55961, #62369, #62215), enabling Ray clusters to resize CPU and memory on existing worker pods before scaling out new pods.
Ray Data
🎉 New Features
- Support multiple datasets per cluster via subcluster labels and resource partitioning (#63331, #63375, #63982)
- Add `Dataset.mix()` public API and `MixOperator` for weighted dataset mixing (#63168, #62450)
- New DataSourceV2 framework: `ParquetDatasourceV2`, chunked reader, predicate splitting, listing/scanner infra (#63113, #63454, #63163, #62975, #63027, #62182)
- Add `batch_size='auto'` to `map_batches` to derive batch row count from target row batch size (#62648)
- Implement distributed upsert for Iceberg using task-based merge algorithm, preventing performance bottleneck on driver (#63482)
- Add `include_row_hash` to `read_parquet` (#61408)
- Add JAX data iterator (#61630)
- Expose flag to run read tasks on isolated worker processes via `isolate_read_workers` (#63490)
- Expose flag to set default logical memory for map operators via `default_map_logical_memory_enabled` (#63814)
- Support predicate pushdown for Lance format (#61400)
- Support per-partition `start_offset` and `end_offset` for `read_kafka` (#61620)
- Add obstore async download backend for download operator (#61735)
- Support UDF retries on transient exceptions (#63023)
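To make "weighted dataset mixing" concrete, the following is a minimal plain-Python sketch of the idea: draw the next item from one of several sources with probability proportional to a weight. It is not Ray's `Dataset.mix()` or `MixOperator` implementation, and the helper name `mix_iterators` and its parameters are hypothetical names chosen for illustration.

```python
import random
from typing import Iterable, Iterator, Sequence, TypeVar

T = TypeVar("T")


def mix_iterators(
    sources: Sequence[Iterable[T]],
    weights: Sequence[float],
    seed: int = 0,
) -> Iterator[T]:
    """Hypothetical helper: interleave `sources` by weighted random choice.

    Each step picks one still-active source with probability proportional
    to its weight. Exhausted sources are dropped, and iteration stops once
    every source is exhausted.
    """
    if len(sources) != len(weights):
        raise ValueError("sources and weights must have the same length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")

    rng = random.Random(seed)  # seeded so the interleaving is reproducible
    active = [(iter(s), w) for s, w in zip(sources, weights)]
    while active:
        idx = rng.choices(range(len(active)), weights=[w for _, w in active])[0]
        try:
            yield next(active[idx][0])
        except StopIteration:
            # This source is empty; remove it and keep sampling the rest.
            active.pop(idx)


if __name__ == "__main__":
    mixed = list(mix_iterators([["a"] * 3, ["b"] * 9], weights=[1.0, 3.0]))
    print(mixed)
```

Every input item appears exactly once in the output; the weights only influence the order in which sources are drained.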
💫 Enhancements
- Fix `iter_batches` spilling by replacing `make_async_gen` with `iter_threaded` and reducing buffered batches (#63660, #63682)
- Gate `restore_original_order` in `iter_batches` behind `preserve_order` (#63792)
- Convert `drop_columns` to a `Project` logical operator when input schema is known (#63813)
- Make `ConcatAggregation` and `TurbopufferDatasink` use `polars` for sorting (#61904)
- Boost and vectorize `hash_partition` with `sort_indices`, zero-copy slices, and pandas (#63498, #62757, #63152, #62587)
- Enable `GPU_SHUFFLE` in `grouped_data.py` (#62410)
- Eager `StarExpr` expansion, schema inference for non-black-box UDFs, and Expressions struct support (#63776, #63387, #62560)
- Make logging configurable via `RAY_DATA_LOG_LEVEL` and log `RAY_DATA` env vars at execution start (#63487, #63380)
- Display and track logical memory in progress bar (#63379)
- Honor `compute=` in `filter(expr=...)` and deprecate `concurrency=` (#63576)
- Enable filter pushdown through `StreamingRepartition` and read stage column-rename removal (#62347, #63384, #63582)
- Cache deserialized Arrow schemas in `BlockMetadataWithSchema` (#63462)
- Track scheduling-loop step duration (p50/p90/max), peak USS/object-store memory, and task block locality (#63586, #63345, #63489, #63418, #62249)
- Replace `TaskDurationStats` and Timer with `DistributionTracker` (#63488, #63530, #63825)
- Introduce `BlockEntry` on `RefBundle` in place of `(ref, metadata)` tuples (#63654)
- Pre-resolve filesystem in threaded download to avoid IMDS herd (#62898)
- Convert logical operators to frozen dataclasses and consolidate operator base/repr (#62593, #62568, #62400, #63137, #63140, #63108)
- Non-blocking default autoscaling coordinator and resource-aware auto-downscaling (#62725, #62574)
- Release pinned blocks after dataset execution and shut down executor on early `DataIterator` exit (#62456, #62949)
- Optimize local shuffle with incremental index and configurable compaction threshold (#62539)
- Speed up checkpoint filter and reduce memory usage (#60294)
- Preserve Arrow types through pandas roundtrip and reorder block columns by name before schema ops (#63017, #63582)
- Block pickle object columns when reading untrusted Parquet and gate unsafe WebDataset deserialization (#63470, #63469)
- Move backpressure escape hatch across all policies (#63539)
- Update `pandas`, `modin`, and `pyarrow` minimum versions (#62899)
- Add utilization monitoring and correct logical resource usage for `ActorPool` (#61987, #61528)
- Deprecate `ConcurrencyCapBackpressurePolicy`, `DataIterator.to_torch`, and pandas UDF batches (#63392, #62540, #61733)
- Rank actors per node in a heap and avoid re-exporting actor class via `.options` (#62309, #62722)
- `read_delta` reads from preconfigured `pyarrow` dataset (#61721)
- Include column name and target type in `ArrowConversionError`; reduce arrow conversion warning verbosity (#62407, #61486, #62521)
- Show external consumer bytes in verbose operator progress log (#63728)
- Disable `DataSourceV2` by default after earlier enabling (#63674, #63326)
🔨 Fixes
- Rename subcluster label key from `__subcluster__` to `ray-subcluster` (#63982)
- Fix `get_or_create_stats_actor` crash in Ray Client mode (#63402)
- Fix datasource pushdown crashes for generic `UDFExpr` filter predicates (#63781)
- Fix hash-shuffle aggregator memory estimation: metadata propagation, node-size clamp, column pruning (#63809)
- Fix `CheckpointConfig` `FileNotFoundError` on Azure Blob Storage (#63606)
- Fix silent credential drop for fsspec-S3 in download expression (#62897)
- Fix missing f-string prefix in `_concatenate_extension_column` (#62939)
- Fix `HashAggregate` duplicate group rows for `AggregateFnV2` (#63066)
- Fix JSONL read retry with advanced file cursor (#63233)
- Fix `read_parquet` `ArrowNotImplementedError` for nested column types exceeding ~2GB row group (#61824)
- Fix `read_parquet` nested-type fallback and parquet scanner memory accumulation (#63175, #62745)
- Fix memory leak in `DataIterator.to_torch()` by switching to `PyArrow` (#60966)
- Fix `ZipOperator` freeing shared blocks via `_split_at_indices` (#62665)
- Fix concurrent writes race condition in `write_parquet` (#62377)
- Fix GPU shuffle output ordering when using `ShuffleStrategy.GPU_SHUFFLE` (#62351)
- Fix incorrect `DatasetStat` uuid propagation (#62255)
- Fix none issue when `DATA_ENABLE_OP_RESOURCE_RESERVATION=False` (#61718)
- Fix filesystem compatibility check for fsspec-wrapped `PyFileSystem` (#61850)
- Forward `try_create_dir` to `pyarrow.dataset.write_dataset` (#58302)
- Fix autoscaler bug blocking timely release of leased resources (#62592)
- Ensure consistent `nan_is_null`/nans-as-nulls semantics in encoder (#62623, #62618)
- Skip unconditional null strip in `find_partition_index` (#62594)
- V1 `_split_predicate_by_columns` correctness fix (#63176)
- Avoid importing cudf in `_is_cudf_dataframe` when cudf not loaded (#62302)
- Revert raw-modulo hash partition fast path (#63097)
- Remove `tfx-bsl` support from `read_tfrecords` (#63245)
📖 Documentation
- Document `isolate_read_workers` for `read_parquet` (#63816)
- Remove docs recommending increased object store memory proportion (#63389)
- Update docs minimum version for `build_processor` and `"auto"` batch size (#61757, #62790)
- Remove outdated limitation of `DefaultClusterAutoscalerV2` and stale object-store-memory warnings (#62385, #62387)
Ray Serve
🎉 New Features:
- Add custom ingress request router app interfaces and HAProxy ingress dispatch path (#62680, #62668, #62669, #62667)
- Expose `choose_replica`/`dispatch` on deployment handles and `AsyncioRouter` with replica-side slot reservation (#63255, #63254, #63252)
- Introduce experimental round robin router and `ConsistentHashRouter` for session-sticky routing (#63238, #62906, #63096, #62905)
- Central capacity queue for token-based request routing via `CapacityQueueRouter` (#62323)
- Add experimental `ray-haproxy` support behind `RAY_SERVE_EXPERIMENTAL_PIP_HAPROXY` (#62589)
- Add deployment actor context API and broadcast API for deployment handles (#62532, #61472)
- Add `ControllerOptions` for configurable controller `runtime_env` (#63352)
- Make rolling update percentage configurable (#62160)
- Support per-request timeout and disconnect in HTTP proxy path (#62867)
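For readers unfamiliar with session-sticky routing via consistent hashing, here is a minimal, self-contained sketch of the general technique. It is not the `ConsistentHashRouter` implementation: the `HashRing` class, its `vnodes` parameter, and the replica names are assumptions made for illustration only.

```python
import bisect
import hashlib
from typing import Dict, List


class HashRing:
    """Hypothetical consistent-hash ring mapping session IDs to replicas."""

    def __init__(self, replicas: List[str], vnodes: int = 64) -> None:
        # Each replica owns `vnodes` points on the ring to even out the load.
        self._ring: Dict[int, str] = {}
        for replica in replicas:
            for i in range(vnodes):
                self._ring[self._hash(f"{replica}#{i}")] = replica
        self._points = sorted(self._ring)

    @staticmethod
    def _hash(key: str) -> int:
        # Stable 64-bit hash; Python's built-in hash() is salted per process.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def pick(self, session_id: str) -> str:
        """Return the replica owning the first ring point after the key."""
        if not self._points:
            raise LookupError("no replicas available")
        idx = bisect.bisect_right(self._points, self._hash(session_id))
        return self._ring[self._points[idx % len(self._points)]]


if __name__ == "__main__":
    ring = HashRing(["replica-1", "replica-2", "replica-3"])
    # The same session ID always maps to the same replica.
    print(ring.pick("session-42"), ring.pick("session-42"))
```

The property that makes this useful for stickiness is that removing one replica only remaps the sessions that replica owned; sessions on the remaining replicas keep their assignment.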
💫 Enhancements:
- HAProxy stability improvements: wait for old workers before drain, redirect stdout/stderr, redispatch+retry-on, coalesce broadcasts, quarantine released ports (#63620, #63621, #63622, #63623, #63628)
- Bind direct ingress ports to `0.0.0.0` for cross-node HAProxy routing (#62515)
- HAProxy ingress request router metrics, enable splice by default, `TCP_NODELAY` default 1, optional retry knobs, `RAY_SERVE_HAPROXY_STATS_PORT` (#63356, #63531, #63353, #63415, #62979)
- Resolve bundled ray-haproxy binary before `RAY_SERVE_HAPROXY_BINARY_PATH`; HAProxy abspath env var (#63829, #62610)
- Replace socat subprocess with Python socket for HAProxy admin communication; bump HAProxy to avoid CVE-2025-11230 (#61897, #62585)
- Expose controller health metrics via `/api/serve/applications/` API; add `max_replicas_per_node` to response (#63556, #63234)
- Run health check on user execution path to detect request-serving stalls (#61621)
- Mark widely-used APIs as stable (#62932)
- Retain recently-stopped replica logs in the dashboard (#63678)
- Add observability logs for pack scheduling decisions (#63603)
- Gate ingress request router body forwarding behind escape hatch (#63183)
- Avoid rolling replicas for no-op config overrides (#63034)
- Gate replica/deployment creation during shutdown (#62761)
- Defer PG creation for TPU Serve deployments to accelerator backend (#62941)
- Expose `DeploymentStateManager` APIs for controller access (#62950)
- Add tracing support for Windows and gRPC tracing improvements (#62821, #63833)
- Split node vs requested resources in deployment scheduler (#62778)
- Defer `DEPLOYMENT_TARGETS` broadcast while replicas are RECOVERING (#62751)
- Evict per-deployment `LongPollHost` state on deployment delete; enable logs when client stops its event loop (#62820, #63028)
- Add metrics: max replica processing latency, objref resolution latency, `serve_autoscaling_target_ongoing_requests` (#62381, #62355, #62421)
- Filter stale bootstrap observations from `serve_long_poll_latency_ms` (#62868)
- Retry `build_serve_application` task on failure (#62987)
- Scale down non-matching primary-label replicas first (#61488)
- Refactor internal autoscaling policy state extraction into a single helper (#62452)
- Catalog Ray Serve env vars (#62006)
- Remove or raise clear error for deprecated deployment items; remove deprecated `DeploymentMode` (#63548, #63510)
🔨 Fixes:
- Fix orphaned actors on controller crash during shutdown; drop and replace replicas surviving a controller crash without rank assignment (#62823, #63139)
- Fix deployment actors creating 15K OS threads for sync actor classes (#62661)
- Fix gang scheduling PG leak when deployment actors are starting (#62469)
- Fix app-level autoscaling policy state cross-deployment contamination and state loss for skipped deployments (#62484)
- Fix Serve autoscaling delay to use wall-clock time (#62144)
- Fix race condition in multiplex LRU cache update using `move_to_end()` (#62548)
- Normalize multiplexed model ID header to support proxy-transformed names (#61869)
- Fix `AttributeError` when `request_router` is None in `update_deployment_config` (#63180)
- Fix potential `UnboundLocalError` in `ActorReplicaWrapper.check_stopped()` (#63339)
- Fail loud when ingress request router dispatch fails (#63215)
- Fix stale `_global_client` cache across driver sessions (#62368)
- Fix `start_metrics_pusher` crash when deployment has `record_autoscaling_stats` but no autoscaling config (#62123)
- Fix high-cardinality namespace tag on long poll metrics (#62386)
- Fix Java long poll timeout serialization (#61875)
- Avoid destructor error when FastAPI ingress init fails (#62172)
- Avoid proxy readiness future timeout race (#62194)
- Avoid self-cause on non-gRPC replica exceptions (#62412)
- Fix HAProxy startup timeout propagation (#61752)
- Include `ingress_request_router.lua.tmpl` in `package_data` (#63145)
- Revert support for `root_path` parameter across uvicorn versions (#62529)
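The multiplex LRU fix above mentions `move_to_end()`. As a generic illustration of that idiom (not the Serve multiplexing code), the sketch below marks an entry as most recently used in a single `OrderedDict.move_to_end()` call rather than deleting and re-inserting the key. The `LRUCache` class, its `loader` callback, and the lock are assumptions for this example.

```python
import threading
from collections import OrderedDict
from typing import Callable, Generic, Hashable, List, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


class LRUCache(Generic[K, V]):
    """Hypothetical LRU cache; least recently used entries are evicted first."""

    def __init__(self, capacity: int, loader: Callable[[K], V]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._loader = loader
        self._items: "OrderedDict[K, V]" = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key: K) -> V:
        with self._lock:
            if key in self._items:
                # Refresh recency in one step; the key never leaves the dict,
                # so there is no window in which it appears to be missing.
                self._items.move_to_end(key)
                return self._items[key]
            value = self._loader(key)
            self._items[key] = value
            if len(self._items) > self._capacity:
                # Evict the least recently used entry (front of the dict).
                self._items.popitem(last=False)
            return value

    def keys(self) -> List[K]:
        with self._lock:
            return list(self._items)


if __name__ == "__main__":
    cache = LRUCache(capacity=2, loader=str.upper)
    for model_id in ["a", "b", "a", "c"]:
        cache.get(model_id)
    print(cache.keys())
```

In the usage above, `"b"` is evicted when `"c"` is loaded because `"a"` was touched more recently.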
📖 Documentation:
- Add round robin and consistent hashing router documentation (#63636)
- Introduce gang scheduling documentation (#61737)
- Add deployment scope actor docs (#62735)
- Add KubeRay guide for RayService with HAProxy and High Throughput mode (#62408)
- Add Ray Serve office hours invite into documentation (#62176)
Ray Train
🎉 New Features
- Add `LoggingConfig` for configuring the `ray.train` logger on controller and workers (#61550)
- Allow `DataParallelTrainer`'s `train_fn` to return data (#62021)
- Add async checkpointing/validation with Torch Lightning (#62370)
💫 Enhancements
- Report time spent syncing and transferring checkpoints to storage in `ray.train.report(checkpoint)` (#62027)
- Block until `create_or_update_train_run` completes on Train initialization (#63432)
- Implement `DatasetManager` (#63309)
- Forward `label_selector` to `AutoscalingCoordinator` (#63287)
- Add log line before launching training function (#62911)
- Allow `contextlib.redirect_stdout()` to bypass print redirect to logs (#61075)
- Add timeouts to validation functions of `ray.train.report` (#62916)
- `ray.train.report` does not hang across replica group restarts; Ray Train manages replica group restarts (#62651, #61475)
- Swallow `RayTaskError` during `BackendSetupCallback` shutdown (#63143)
- Improve `JaxTrainer` TPU multi-slice fault tolerance and reservation ergonomics (#62893)
- Export default data execution options (#62784)
- Consolidate Train run metadata sanitization and improve readability (#63182)
- Fix `PlacementGroupCleaner` race condition: drain queue before cleanup on controller death (#62754)
- Harden against unsafe pickle deserialization (#62807)
- Raise error when checkpoint is within experiment directory and `delete_local_checkpoint_after_upload=True` (#62555)
- Add `timeout_s` to `ray.train.get_all_reported_checkpoints` (#61761)
- Change remaining `pytorch_lightning` imports (#61291)
- Make controller resilient to errors in all lifecycle hooks (#60900)
- Remove `Predictor` from Train v1 (#63461)
🔨 Fixes
- Fix missing comma in `DataBatchType` Union type (#63872)
- Handle Arrow-backed pandas dtypes in LightGBM examples (#63427)
- Fix `exclude_resources` regression for V1 Train + V2 cluster autoscaler (#62827)
- Add missing `%s` to `logger.debug` (#63039)
- Increase `get_actor` timeout (#62516)
📖 Documentation
- Document S3-compatible storage (#63103)
- Add Azure Files to persistent storage docs (#63406)
- Uncomment `Result.from_path` in docs (#62887)
- Document how to tune async validation (#62227)
- Document why validation runs need unique names (#62224)
Ray Tune
💫 Enhancements
- Fix Tune search for Python 3.14 (#63575)
- Modernize `AxSearch` for Ax Platform 1.0.0+ (#60522)
- Use built-in `inspect` for argument capture (#60049)
🔨 Fixes
- Fix import count in CIFAR PyTorch tutorial (#62756)
Ray LLM
🎉 New Features
- Major Ray Serve LLM performance improvement with direct streaming (#63167, #63468, #63779)
- TPU support: Add `topology` field to `LLMConfig` for multi-host TPU support (#61906)
- Add per-host bundles default and fix fractional TPUs for `TPUAccelerator` (#63177)
- Enable Ray Serve LLM session-stickiness routing policy via `RAY_SERVE_SESSION_ID_HEADER_KEY` (#63362)
💫 Enhancements
- Upgrade `vLLM` to 0.22.0 (#63730, #63396, #62970, #62349)
- Co-locate DP rank 0 worker with advertised master address (#63803)
- Add pick-only fast path to `AsyncioRouter` for LLM ingress (#63517)
- Replace LLM ingress router replica selection with `choose_replica`; don't fetch `LLMConfig` from replicas at startup (#63280, #63065)
- Promote `max_tasks_in_flight_per_actor` to a first-class config field and adjust defaults (#63214)
- Validate `accelerator_type` against CPU-only configs; replace `GPUType` alias with `AcceleratorType` (#62139, #62978)
- Add rate-limiter for per-request traceback spam (#62440)
- Promote SGLang integration to user guide and move engine to `_internal` (#62570)
- Lazy-load batch stage/processor submodules and make boto3/botocore imports lazy (#62861, #62383)
- LLM telemetry bugfixes (#63782)
🔨 Fixes
- Fix flaky GPU-0 worker and NIXL port collisions (#63810)
- Fix P/D direct streaming OpenAI routing (#63679)
- Remove `guided_decoding`, `truncate_prompt_tokens`, `build_llm_processor` (#63569)
- Fix misleading `ImportError` when `vLLM` is installed but fails to import (#63305)
- Fix `max_pending_requests` default to track `vLLM`'s GPU-dependent `max_num_seqs` (#62918)
- Fix HF config loading for models with custom `rope_scaling` (#62464)
- Wait for request router init in `LLMRouter` constructor (#63206)
- Materialize chat completion message `content` in sanitizer (#63119)
- Fix `lora_request` not forwarded to `vLLM` engine + add regression tests (#62609)
- Fix `SGLangEngineProcessor` telemetry for `trust_remote_code` models (#62102)
- Fix `TOKENIZER_ONLY` downloads missing `chat_template` for S3-backed models (#62121)
- Fix SGLang chat tokenize to respect `add_generation_prompt` (#61688)
- Fix bool serialization in `benchmark_vllm` CLI builder (#63516)
📖 Documentation
- Document multimodal pixel-budget gotchas and `vLLM` compatibility (#63593)
- Add tokenization disaggregation documentation (#62494)
- Add benchmark docs and refactor into submodules (#62204)
- Remove `VLLM_USE_V1` from docs and examples (#63001)
- Fix wrong documented default for `max_tasks_in_flight_per_actor` (#62917)
Ray RLlib
🎉 New Features
- Add `custom_resources_per_learner` config and `custom_resources_for_main_process` to `AlgorithmConfig` (#63303, #62475)
- Add Importance Sampling `APPO` metrics to the torch learner (#63675)
💫 Enhancements
- Put only one copy of weights into the object store (#63529)
- Handle the all-evaluation-workers-unhealthy case uniformly across modes (#63128)
- Stop `IMPALA`/`APPO` learner thread gracefully to avoid misleading error messages (#62763)
- Improve invalid input error messages (#62324)
🔨 Fixes
- Fix two substantial edge cases in `PPO`'s value target calculation (#59958)
- Fix `EnvRunner` crash loops (#62884)
- Fix extra model outputs hanging val indexing (#62960)
- Fix `ValueError` in `MultiAgentEpisode.get_rewards()` when an agent is inactive for all requested env steps (#62907)
- Preserve Torch optimizer param-group scalar types on restore (#61937)
- Fix wrong assert variable in `_update_env_seed_if_necessary` (#61823)
- Maintain value in `EMAStat` (#63064)
📖 Documentation
- Clarify extra model output docstrings (#63524)
Ray Core
🎉 New Features
- Add support for Furiosa AI NPU (#63035) and `register_collective_backend` API for custom collective backends (#60701)
- In-place pod resizing (IPPR) on Kubernetes 1.35: initial implementation and standalone KubeRay IPPR provider (#55961, #62369, #62215)
- Label locality support: GPU-domain-aware placement groups, autoscaler proto changes, and state API observability (#61442, #61614, #62487, #62533)
- Publish platform events via Ray Event Recorder and support single-event emission in the Python layer (#63329, #60858)
- Autoscaler v2: priority-based worker group selection (#62997) and `noDriverTimeoutSeconds` for KubeRay cluster termination (#63465)
- RDT: concurrent one-sided transfers for multiple `ObjectRef`s in `ray.get` (#61773), retry support (#62842), and NIXL memory deregistration via `deregister_nixl_memory` (#62341)
- Support `.tar.gz` archives for remote `working_dir` URIs (#62813)
- Add IPv6 localhost and all-interfaces support (#60023)
💫 Enhancements
- Resource isolation: event-based memory monitor, multi-memory-monitor factory, time-based group killing policy, idle-worker prioritization, system/user slice bounds, and OOM policy tuning (#62060, #62705, #62643, #62378, #62168, #63521, #63324, #63067, #62957)
- Compute per-component memory usage in MiB (#63932) and add host vs container memory distinction to memory panels (#63111)
- Consider cgroup limit when fetching CPU (#63685) and correct worker OOM score adjustment logic (#62470)
- Replace `NodeAffinitySchedulingStrategy` with Label Selector API when `soft=False` (#54940)
- Improve `SlicePlacementGroup` lifecycle and support explicit `bundle_label_selector` for TPUs (#63171); add TPU head resource for Ironwood TPU (#62786), `chips_per_vm` arg (#62526), and v6e single-host fixes (#62306)
- Batch placement group bundle removal RPCs (#63839); remove PG resource deduction from GCS in favor of resource broadcast (#63723)
- Migrate Raylet/GCS timing logic to a shared `ClockInterface` with a fake clock for testing (#62562, #62502, #62476)
- Refactor asio build targets and add `IOContextMonitor`; run GCS health check on `io_service` (#63042, #63166, #62608, #62374)
- Autoscaler v2 performance: skip serializations for debug logs (#63778); accept fractional resource values in `request_resources` (#63306)
- Reduce traffic: halve task arg pubsub by skipping redundant raylet pull (#62583), avoid extra memcpy when spilling fused objects (#63653), and resolve task dependencies synchronously when objects exist (#62561)
- Improve `inspect_serializability` messages and traversal context (#63501, #63373, #63258); better worker startup error messages (#63714)
- Warn when `runtime_env` package approaches upload size limit (#63404); harden zip extraction path containment (#63786, #62813)
- Include owner node ID in `OwnerDiedError` (#63727); add dependency info to taskspec debug string (#62316)
- Add unexpected worker failure metric and dashboard panel (#62297); group observability APIs in `ray` CLI help (#62748)
- Normalize OTel metric labels before Prometheus export (#63744) and retry/log when Prometheus queries fail (#63578); add GPU usage instance filter (#62214)
- Move observability and control-plane pubsub to dedicated services and rename `InternalPubSub*` to `ControlPlanePubSub*` (#62806, #63044, #62461)
- AMD GPU: replace `rocm-smi` ctypes binding with `amd-smi` Python interface (#62393); detect NVIDIA Blackwell consumer GPUs (#63322)
- Add task retry delay for `ACTOR_UNAVAILABLE` retries (#62330); improve State API filter key handling (#63638)
- Patch `setproctitle` to skip launch services IPC calls (#63366); add timeout for first redis probe (#63148)
- Clarify head node commands in `ray up` output (#63409); pass `logging_config` through Ray Client `ray.init` (#62192)
- Print subprocess log tails with exit codes on unexpected exit (#61905); add warning log when GPU profiling command times out (#63706)
- Add unique suffix to log filenames (#62365); disable profiling endpoints by default (#62531)
- Remove pydantic v1 support (#62716); update Starlette to v1.0.1 (#63722)
- Deprecate `DAGNode.execute()` (#63716); remove experimental `_owner` support for `ray.put` (#63520)
🔨 Fixes
- Fix `ray.get` hanging forever when an object's owner dies during pull (#63694); resolve `ReferenceCounter` race on `WORKER_REF_REMOVED_CHANNEL` (#60495)
- Fix resource leaks in subprocess management (#63878) and `runtime_env` cache not detecting changes in `-r`-referenced requirements files (#63403)
- Fix replica actor zombie process after GCS restart (#63764); fix actor creation race condition (#62994); fix actor state counter bug (#63647)
- Fix placement groups with label domain stuck on the infeasible queue (#62483); log status for failed PG `PrepareResources`/`CommitResources` (#62836)
- Fix env var expansion in `ray job submit` CLI via `shlex.join` (#63797) and `--working-dir` for local zip files and `http://` URLs (#62843)
- Surface WebSocket close codes and errors in job log streaming (#63364); fix `ray stop` failing to terminate dashboard/runtime_env agents on Windows (#62428)
- Fix `ray down` not stopping Docker containers on worker nodes for local clusters (#62169); fix delayed/missing worker logs in Jupyter by flushing stdout/stderr (#63599)
- Fix Python log monitor handling for same-inode truncated files (#63720); avoid `os.getcwd()` on import by lazily evaluating `scratch_dir` (#63040)
- Fix accelerator detection on NVIDIA Blackwell consumer GPUs (#63322); avoid `FabricManager` stall on NVLink systems in `GpuProfilingManager` (#63312)
- Fix POSIX semaphore crash in experimental mutable objects (#62328); fix overflow on exponential backoff multiplication (#62366)
- Fix OOM kill message wrong threshold with resource isolation (#62948); fix `OpenTelemetryMetricRecorder` singleton init guard (#63081)
- Fix `MarkFootprintAsBusy` clearing saved idle state for unrelated items (#62588); fix `HandleIsLocalWorkerDead` for drivers (#62688)
- Fix `AttributeError` on `trace` in client mode (#62955); fix `IndexError` in legacy post-mortem debugging (#61479)
- Keep strong references to fire-and-forget asyncio tasks (#63291); validate `JobConfig` `code_search_path` type (#62499)
- Fix `uv` existence check in `UVProcessor` (#62818); fix invalid default stats factory in `ClusterStatus` (#62934)
- Fix autoscaler v2 `instance_type_name` in autoscaling state (#62101) and stopped-node metric double counting (#62026)
- Fix `ReadOnlyProviderConfigReader` `max_workers` counting bug (#62819); fix circular import in `ray_print_logs` thread (#63410)
- Fix wrong container in spill-fusion threshold check (#63605); avoid emitting idle worker failure for unregistered failed workers (#62789)
- Avoid `return` in `finally` block (Python 3.14 `SyntaxWarning`) (#63742); fix typos and replace `type()` checks with `isinstance()` (#62154)
📖 Documentation
- Add "bring your own transport" docs page for RDT (#60308); doc changes for label locality support (#62551)
- Fix misleading docstrings on `drain_node` APIs (#62942); update outdated description for `max_direct_call_object_size` (#63164)
Dashboard
🎉 New Features
- Add Platform Events module with K8s event ingestion/caching and frontend UI (#62314, #63332)
- Show TPU stats on the Cluster tab (#63774)
💫 Enhancements
- Add `py-spy` `--idle` and `--subprocesses` flags to profiling endpoints (#63852)
- Pass `Grafana` cluster filter to Serve metrics URLs (#63211)
- Show last data load time (#63618)
- Add `Name` column to Jobs view from `job_name` metadata (#62257)
- Mask password arguments in `get_entrypoint_name()` to prevent password exposure (#61995)
🔨 Fixes
- Fix TPU metrics (#63998)
- Guard against zero `num_cpus` in `k8s_utils.cpu_percent` (#63729)
- Fix invalid `PromQL` when `global_filters` is empty in `Grafana` dashboard generation (#63687)
- Fix unexpected log line details pop-up in log viewer UI (#62637)
Ray Wheels and Images
- Bumped the Ray version for the 2.56.0 release.
- Bumped the minimum Python version in `pyproject.toml` (#62569).
- Added TPU release images (#62113) and updated the TPU Docker image base dependencies (#63006).
- Added a `torchft` image for Torch trainer tests (#63361) and ran `apt-get upgrade` for slim base images (#62666).
- Numerous dependency lockfile and CI image updates (raydepsets migration, depset regeneration across core/ML/RLlib/docs/macOS CI images).
Documentation
- Established `doc/redirects/current.yaml` as the redirects source of truth with legacy-version 404 redirect coverage (#63367, #63880).
- Added an agent context guide for Ray documentation and an `ipython3` lexer hook for notebook shell/magic cells (#63227, #63515).
- Added Sphinx `/llms.txt` and `/llms-full.txt` generation, excluding Jupyter notebooks (#63130, #63228).
- Upgraded doc toolchain: `pydata-sphinx-theme` 0.17.1, `myst-nb` 1.4.0, added `sphinxext-opengraph`, unpinned yanked `tf-keras` (#63344, #63360, #63343, #63358).
- Banned new `.rst` files under `doc/source/` and added CI to skip RTD builds for PRs that don't touch docs (#63057, #63431).
- Added meta descriptions to ray-contribute pages and anonymized personal paths in Tune notebook outputs (#63832, #63464).
- Tune: updated deprecated `sample_from` examples to config-dict style and documented `time_attr` scheduler values (#63804, #32467).
- RLlib: clarified DQN `hiddens` as dueling-only, removed a broken parametric-actions link, fixed broken doc links (#43051, #54671, #47146).
- Ray Data: added a `map_batches` shuffle section, streaming generator docs, and fixed a broken README link (#62576, #63791, #63412).
- Ray Train: documented `iter_jax_batches` for `JaxTrainer` and updated TPU scaling config docs (#63294, #62584).
- Kubernetes/TPU: added a GKE Gateway ingress example, fixed the GKE TPU guide, and replaced deprecated example images (#63546, #63209, #63019).
- Added a RayCronJob quick-start guide and clarified KAI Scheduler RayJob submission modes (#62151, #61332).
- Added a Slurm guide for running Ray inside Docker containers (#63221).
- Documented `AutoscalingConfig` replica/target fields and corrected `max_calls` default docs (#48601, #63894).
Dependencies
This is the last Ray release to support the dependency versions listed below. For the 2.57.0 release, Ray will raise its minimum required versions for several core dependencies. If your environment pins any of these packages below the new minimums, plan to upgrade before moving to the next Ray release.
| Dependency | Last supported in this release | New minimum (next release) |
|---|---|---|
| numpy | < 2.1 | >= 2.1 |
| protobuf | < 5.26 | >= 5.26 |
| pandas | < 2.2.3 | >= 2.2.3 |
| pyarrow | < 18.0.0 | >= 18.0.0 |
| pydantic | < 2.9 | >= 2.9 |
| grpcio | < 1.66 | >= 1.66 |
| scipy | (previously unpinned) | >= 1.14.1 |
Most users on recent releases of these packages are unaffected.
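To check an environment against the new minimums in the table above, a small script along the following lines may help. It is a rough sketch using only the standard library: the helper names are hypothetical, and the version comparison looks only at the leading numeric release segment (so a pre-release such as `2.1.0rc1` is treated like `2.1.0`). A full comparison would use a proper version parser.

```python
import re
from importlib import metadata
from typing import Dict, Optional, Tuple

# New minimums for the next release, copied from the table above.
NEW_MINIMUMS: Dict[str, str] = {
    "numpy": "2.1",
    "protobuf": "5.26",
    "pandas": "2.2.3",
    "pyarrow": "18.0.0",
    "pydantic": "2.9",
    "grpcio": "1.66",
    "scipy": "1.14.1",
}


def parse_release(version: str) -> Tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '2.2.3rc1' -> (2, 2, 3)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release tuples, padding with zeros so '2.1' equals '2.1.0'."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


def check_environment() -> Dict[str, Optional[bool]]:
    """Map each package to True/False, or None when it is not installed."""
    report: Dict[str, Optional[bool]] = {}
    for name, minimum in NEW_MINIMUMS.items():
        try:
            report[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        status = "not installed" if ok is None else ("ok" if ok else "needs upgrade")
        print(f"{name:10s} >= {NEW_MINIMUMS[name]:8s} {status}")
```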
Thanks
Many thanks to all those who contributed to this release!
@khluu, @Krishnachaitanyakc, @leewyang, @ssam18, @christian-pinto, @Hyunoh-Yeo, @hango880623, @yuanzhuoyang1-bit, @marwan116, @aaronscalene, @tianyi-ge, @TriNguyen1208, @andrewsykim, @leonaIee, @OneSizeFitsQuorum, @AksodFlare, @limarkdcunha, @dayshah, @jade710, @pedrojeronim0, @dev-miro26, @DonPalius, @TimothySeah, @abrarsheikh, @nathon-lee, @prince8273, @Bye-legumes, @rayhhome, @Yunnglin, @spencer-p, @ryanaoleary, @herin049, @stephanie-wang, @liulehui, @slxswaa1993, @psaikaushik, @cyhapun, @tdat1465, @akyang-anyscale, @chenshi5012, @zzchun, @ryankert01, @EagleLo, @mzjp2, @justinvyu, @petern48, @YuangGao, @sjp611, @wingkitlee0, @AndySung320, @dstrodtman, @Accurio, @JasonLi1909, @peterjc123, @eicherseiji, @kyuds, @Chong-Li, @joaquinhuigomez, @IrvinFan, @XuQianJin-Stars, @AJamesPhillips, @harshit-anyscale, @claytonlin1110, @nhquana2, @Rruop, @win5923, @raulchen, @rohankmr414, @andrew-anyscale, @YoyinZyc, @doanxem99, @liujp, @dancingactor, @Evelynn-V, @SohamRajpure, @dragongu, @ShockYoungCHN, @ljstrnadiii, @WFY123wfy, @axreldable, @pseudo-rnd-thoughts, @H4ck2, @mvcb, @xinyuangui2, @edoakes, @ankushbbbr, @ps2181, @dominikkawka, @vinhuytran0810-cell, @siyuanfoundation, @MengjinYan, @Chronostasys, @jeffreywang88, @lalitc375, @sampan-s-nayak, @ArturNiederfahrenhorst, @srini047, @ChangyuWang, @adam360x, @Yicheng-Lu-llll, @thakoreh, @Aydin-ab, @manhld0206, @oab24413gmai, @ayushk7102, @tycao0338-cpu, @slfan1989, @myandpr, @rueian, @ans9868, @Ziy1-Tan, @elliot-barn, @as-jding, @daiping8, @robertnishihara, @MatthewCWeston, @Cursx, @laysfire, @karticam, @Mr-Neutr0n, @jjyao, @zent1n0, @aslonnie, @DenBuzz, @michael-pryor, @goanpeca, @nadongjun, @ronny-anyscale, @GoparapukethaN, @werkt, @carolynwang, @kamil-kaczmarek, @madiyar-wayve, @peterxcli, @pqkzzz, @Future-Outlier, @iamjustinhsu, @micah-yong-ai, @wxwmd, @owenowenisme, @sai-miduthuri, @lonexreb, @prassanna-ravishankar, @wanadzhar913, @kouroshHakha, @tobby168, @johntaylor-cell, @richabanker, @Kunchd, @vincere-mori, @vaishdho1, 
@wenhaozhao011-cmd, @bveeramani, @bittoby, @Phucvt123, @aschuh-hf, @RudrenduPaul, @xyuzh, @Sparks0219, @yancanmao, @eureka0928, @yuhuan130, @goutamvenkat-anyscale, @Zerui18, @machichima, @Lucas61000, @weimingdiit, @xi377266, @EmaFerrao, @awen11123, @Lawson-Darrow, @suppagoddo