Highlights
- Ray plans to drop support for Pydantic V1 starting with version 2.56.0. Please see this RFC for details.
- Ray Data now has support for bounded reading from Kafka and improved Iceberg support.
Ray Data
🎉 New Features
- Autoscaling: New utilization-based cluster autoscaler for Ray Data workloads (#59353, #59362, #59366). To use this new autoscaler set RAY_DATA_CLUSTER_AUTOSCALER=V2.
- Kafka Datasource: Add Kafka as a native datasource for data ingestion (#58592)
- Dataset summary API: Add `Dataset.summary()` API for quick dataset inspection (#58862)
- Iceberg support: Add Iceberg schema evolution, upsert, and overwrite support (#59210, #59335)
- Graceful error handling: Add `should_continue_on_error` for graceful error handling in batch inference (#59212)
- Datetime compute expressions: Add datetime compute expressions support (#58740)
- Grouped `with_column` expressions: Enable expressions for grouped `with_column` in Ray Data (#58231)
- Parallelized collation: Parallelize `DefaultCollateFn`, `arrow_batch_to_tensors` (#58821)
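The autoscaler switch mentioned in this list is an environment variable, so one way to set it from a driver script is sketched below. This assumes the variable is read when Ray Data initializes, which these notes do not state; exporting it in the shell before launching the job is the more conservative option.

```python
import os

# Assumption: the variable must be visible before Ray Data reads its
# configuration, so set it before importing ray in the driver process.
os.environ["RAY_DATA_CLUSTER_AUTOSCALER"] = "V2"

# import ray  # import Ray only after the variable is set (not run in this sketch)
```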
💫 Enhancements
- Optimized Autoscaler Step Size: Optimize autoscaler to support configurable step size for actor pool scaling (#58726)
- Improved Streaming Repartition: Improve streaming repartition performance (#58728)
- Actor init retry: Add actor retry if there's a failure in `__init__` (#59105)
- Fused Repartition + MapBatches: Fuse StreamingRepartition with MapBatches operators to scale collate (#59108)
- Combined repartitions: Combine consecutive repartitions for efficiency (#59145)
- Prefetch buffering: Handle prefetch buffering in `iter_batches` (#58657)
- HashShuffle block breakdown: `HashShuffleAggregator` breaks down blocks on finalize (#58603)
- Backpressure tuning: Tune concurrency cap backpressure object store budget ratio (#58813)
- Non-string ApproximateTopK: Support non-string items for `ApproximateTopK` aggregator (#58659)
- Lance version support: Add version support to `read_lance()` (#58895)
- Dashboard metrics: Add `time_to_first_batch` and `get_ref_bundles` metrics to data dashboard (#58912)
- Iter prefetched bytes stats: Add `iter_prefetched_bytes` statistics tracking (#58900)
- Configurable batching for `iter_batches`: Add configurable batching for `resolve_block_refs` to speed up `iter_batches` (#58467)
- Improved dashboard metrics: Improve Ray Data dashboard metrics display (#58667)
- Histogram percentiles: Update Ray Data histograms to show percentiles in data dashboard (#58650)
- Deprecated API removal: Remove deprecated `read_parquet_bulk` API (#58970)
- Block shaping option: Add disable block shaping option to BlockOutputBuffer (#58757)
- Removed concurrency lock: Remove concurrency lock for better performance (#56798)
🔨 Fixes
- Fixes to Unique: Fix support of list types for Unique aggregator (#58916)
- Parquet NaN fix: Fix reading from written parquet for numpy with NaNs (#59172)
- Hash Shuffle empty block: Fix empty block sort in hash shuffle operator (#58836)
- Hive partitioning pushdown: Fix pushdown optimizations with Hive partitioning (#58723)
- Object Store usage reporting: Fix `obj_store_mem_max_pending_output_per_task` reporting (#58864)
- Pyarrow FileSystem serialization fix: Handle filesystem serialization issue in `get_parquet_dataset` (#57047)
- Azure UC SAS: Handle Azure UC user delegation SAS (#59393)
- Async UDF Thread Cleanup: Close threads from async UDF after actor died (#59261)
- Object Locality Default: Default return 0s for object locality instead of -1s (#58754)
📖 Documentation
- Added contributing guide to Ray Data documentation (#58589)
- Added download expression to key user journeys in documentation (#59417)
- Added Kafka user guide (#58881)
- Added unstructured data templates from Ray Summit 2025 (#57063)
- Improved instructions for reading Hugging Face datasets (#58492, #58832)
- Refined batch-format guidance in docs (#58971)
- Exposed `vision_preprocess` and `vision_postprocess` in VLM docs (#59012)
- Added upgrading `huggingface_hub` instruction (#59109)
- Added scaling out expensive collation functions doc (#58993)
Ray Serve
🎉 New Features
- Deployment topology visibility. Exposes deployment dependency graphs in Serve REST API, allowing users to visualize and understand the DAG structure of their applications. (#58355)
- External autoscaler integration. Adds `external_scaler_enabled` flag to application config, enabling third-party autoscalers to control replica counts. (#57727, #57698)
- Node rank and local rank support. Extends replica rank system to track node-level and per-node local ranks, enabling better distributed serving coordination for multi-node deployments. (#58477, #58479)
- Custom batch size function. Allows users to define custom functions for computing logical batch sizes in `@serve.batch`, useful when batch items have varying weights (e.g., token counts in LLM inference). (#59059)
- Stateful application-level autoscaling. Adds policy state persistence for custom autoscaling policies, allowing policies to maintain state across control-loop iterations. (#59118)
- New autoscaling, batching, and routing metrics. Adds Prometheus metrics for autoscaling decisions (`ray_serve_deployment_target_replicas`, `ray_serve_autoscaling_decision_replicas`), batching statistics, and router queue latency for improved observability. (#59220, #59232, #59233)
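To illustrate what a "logical batch size" means for the custom batch size function item, here is a minimal pure-Python sketch that fills a batch by summed item weight instead of item count. It does not call the Serve API: `token_count` and `take_batch` are hypothetical names, and the actual parameter that `@serve.batch` accepts is not shown in these notes.

```python
from typing import Callable, List, Sequence


def token_count(prompt: str) -> int:
    """Hypothetical weight function: number of whitespace-separated tokens."""
    return len(prompt.split())


def take_batch(
    queue: Sequence[str],
    max_logical_size: int,
    size_fn: Callable[[str], int],
) -> List[str]:
    """Take items from the front of the queue until the summed weight would exceed the cap."""
    batch: List[str] = []
    total = 0
    for item in queue:
        weight = size_fn(item)
        # Always admit the first item so one oversized request cannot stall the queue.
        if batch and total + weight > max_logical_size:
            break
        batch.append(item)
        total += weight
    return batch


prompts = ["a b c", "d e", "f g h i", "j"]
batch = take_batch(prompts, max_logical_size=6, size_fn=token_count)
```

With a cap of 6 tokens, the first two prompts (3 + 2 tokens) fit and the third (4 tokens) is left for the next batch, whereas a count-based cap would treat all prompts as equal.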
💫 Enhancements
- Smarter downscaling behavior. Prioritizes stopping most recently scaled-up replicas during downscale, preserving long-lived replicas that are optimally placed and fully warmed up. (#52929)
- Autoscaling performance optimizations. Short-circuits metric aggregation for single time series cases (O(n log n) → O(1)) and lazily evaluates expensive autoscaling context fields to reduce controller CPU usage. (#58962, #58963)
- Route matching cleanup. Removes redundant route matching logic from replicas since correct route values are now included in RequestMetadata. Also allows multiple methods (`GET`, `PUT`) corresponding to a route. (#58927)
- Deployment wrapper metadata preservation. Wrapper classes from decorators like `@ingress` now preserve original class metadata (`__qualname__`, `__module__`, `__doc__`, `__annotations__`). (#58478)
- Improved type annotations. Enhances generic type annotations on `DeploymentHandle`, `DeploymentResponse`, and `DeploymentResponseGenerator` for better IDE support and type inference. Adds `.result()` stub to `DeploymentResponseGenerator` to fix static typing errors. (#59363, #58522)
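The wrapper metadata item describes a general Python technique: copying identifying attributes from the decorated class onto the wrapper class. The sketch below shows that technique in isolation. It is not Serve's implementation, and `preserve_metadata` and `hypothetical_ingress` are made-up names used only for illustration.

```python
# Attributes named in the release note above.
PRESERVED_ATTRS = ("__qualname__", "__module__", "__doc__", "__annotations__")


def preserve_metadata(wrapper_cls: type, original_cls: type) -> type:
    """Copy identifying metadata from the original class onto the wrapper class."""
    for attr in PRESERVED_ATTRS:
        if hasattr(original_cls, attr):
            setattr(wrapper_cls, attr, getattr(original_cls, attr))
    return wrapper_cls


def hypothetical_ingress(cls: type) -> type:
    """Stand-in for a class decorator that returns a wrapper subclass."""

    class Wrapper(cls):
        pass

    return preserve_metadata(Wrapper, cls)


@hypothetical_ingress
class Greeter:
    """Says hello."""

    name: str
```

Without the copy step, the decorated class would report a qualified name ending in `Wrapper` and carry no docstring of its own, which makes logs and generated documentation harder to read.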
🔨 Fixes
- YAML serialization for autoscaling enums. Fixes `RepresenterError` when using `serve build` with `AggregationFunction` enum values in autoscaling config. (#58509)
- Autoscaling context timestamp fix. Correctly sets `last_scale_up_time` and `last_scale_down_time` on autoscaling context. (#59057)
- Deadlock in chained deployment responses. Fixes hang when awaiting intermediate `DeploymentResponse` objects in a chain of deployment calls from different event loops. (#59385)
- FastAPI class-based view inheritance. Fixes `make_fastapi_class_based_view` to properly handle inherited methods. (#59410)
📖 Documentation
- Async I/O best practices guide. New documentation covering async programming patterns and best practices for Ray Serve deployments. (#58909)
- Replica scheduling guide. New documentation covering compact scheduling, placement groups, custom resources, and guidance on when to use each feature. (#59114)
Ray Train
🎉 New Features
- Worker Placement with Label Selectors: Added `label_selector` to `ScalingConfig`. This allows users to control worker placement by targeting specific labeled nodes in the cluster. (#58845, #59414)
- Multihost JaxTrainer on GPU: Introduced support for `JaxTrainer` running on GPU machines. (#58322)
- Checkpoint Consistency Modes: Added `CheckpointConsistencyMode` to `get_all_reported_checkpoints`, providing options for handling checkpoint retrieval consistency. (#58271)
- Per-Dataset Execution Options: `DataConfig` now supports setting `execution_options` on a per-dataset basis for finer-grained control over data loading. (#58717)
💫 Enhancements
- Nested Metrics Support: `Result.get_best_checkpoint` now supports nested metrics, allowing for more flexible metric tracking and checkpoint selection. (#58537)
- Non-Blocking Checkpoint Retrieval: `get_all_reported_checkpoints` no longer blocks when only metrics are reported. (#58870)
- Improved Resource Cleanup: Implemented eager cleanup of data resources and placement groups upon training run failures or aborts, preventing resource leaks. (#58325, #58515)
🔨 Fixes
- MLflow Compatibility: Updated `setup_mlflow` API to ensure full compatibility with Ray Train V2. (#58705)
- Validation for Checkpoint Uploads: A `ValueError` is now raised if `checkpoint_upload_fn` fails to return a valid checkpoint. (#58863)
📖 Documentation
- New API Documentation: Added comprehensive documentation for the `ray.train.get_all_reported_checkpoints` method. (#58946)
Ray Tune
💫 Enhancements
- Nested Metrics Support: `Result.get_best_checkpoint` now supports nested metrics, allowing for more flexible metric tracking and checkpoint selection. (#58537)
Ray LLM
💫 Enhancements
- Cloud filesystem restructuring with provider-specific implementations (#58469)
- Bump `transformers` to 4.57.3 (#58980)
- Ray Data LLM config refactor (#58298)
- Update `vllm_engine.py` to check for `VLLM_USE_V1` attribute (#58820)
- Infer `VLLM_RAY_PER_WORKER_GPUS` from fractional placement-group bundles automatically (#58949)
🔨 Fixes
- Fix LLM DP release test configuration (#59090)
Ray RLlib
🎉 New Features
- DreamerV3: allow `num_env_runners > 0` (#58495)
💫 Enhancements
- 🔥 `MetricsLogger` tweaks + Stats rewrite (#56838)
- move restart message into `EnvRunner` (#56750)
- make “Footsies” less verbose (optionally) (#58939)
- update an `AlgorithmConfig` deprecated argument with incorrect behavior/semantics (#59138)
- Examples/docs cleanup:
- Testing / CI & infra cleanup (part of a larger effort to organize + harden RLlib testing):
  - clean up tests folder layout in favor of `/component/tests` (#58890)
  - re-enable and fix nightly tests for APPO on Atari and MuJoCo (#58853)
  - re-enable all RLlib doctests (#58974)
  - add pytest reporting hook (`pytest_runtest_makereport`) across tests (#59003)
  - add/enable RLlib Py3.10 CI lane (#59226)
  - fix as-release-test silently failing (#59386)
  - fix recursive imports in old test-utils location (#59435)
  - Remove `asv.conf.json` (#58934)
  - Update requirement for `byod_rllib.sh` (#59157)
🔨 Fixes
- Fix custom model-config mismatch between EnvRunner and Learner (#58739)
- MultiAgentEnvRunner: prevent double-calling connectors (#58931)
- Error handling: log or raise when a case is not fully handled (#58889)
- Error handling: error out when data cannot be loaded (#59002)
- Assorted RLlib bugfixes (#59386)
📖 Documentation
- Update APPO paper reference to link to IMPACT paper (#58935)
Ray Core
🎉 New Features
- Support zero-copy serialization for read-only PyTorch tensors via `RAY_ENABLE_ZERO_COPY_TORCH_TENSORS` (#57639)
- Add `.rayignore` file support for controlling cluster uploads (#58500)
- Improve large-scale resource view synchronization through sync message batching (#57641)
- Autoscaler with cloud resource availability awareness (#58623)
- Token authentication UX improvements with new `AuthenticationError` exception (#58737)
- Support `X-Ray-Authorization` fallback header for auth token in dashboard (#58819)
💫 Enhancements
- Limit core worker gRPC reply threads to 2 by default via `RAY_core_worker_num_server_call_thread` (#58771)
- Make accessor node address and liveliness cache thread safe (#58947)
- Create `OtlpGrpcMetricExporter` wrapper to log export failures (#58929)
- Print detailed exception information when failing to report events (#58953)
- Simplify local/global GC logic (#58671)
- Surface correct error message when `get_if_exists=True` for actor lookup (#58628)
- Throw `AuthenticationError` from Python for token loading errors (#59031)
- Use `secrets.token_hex(32)` to generate auth tokens (#58818)
- Remove `AUTH_MODE=token` check in `get-auth-token` CLI (#58848)
- Introduce core chaos network release tests (#58868)
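For reference, the `secrets.token_hex(32)` call named in this list is part of the Python standard library; the short sketch below shows what it returns. How Ray stores or transmits the resulting token is not covered by these notes and is not shown here.

```python
import secrets

# token_hex(32) draws 32 random bytes from the OS CSPRNG and
# hex-encodes them, giving a 64-character string (256 bits of entropy).
token = secrets.token_hex(32)

print(len(token))
```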
🔨 Fixes
- Fix `grpc_authentication_server_interceptors` streaming response handling (#59104)
- Fix handle leak in `IsProcessAlive` on Windows (#59106)
- Fix counter metric default branch for `RAY_enable_open_telemetry` (#59095)
- Fix leaking metric recorder in tests (#58952)
- Fix crash when using JVM HDFS by adding `RAY_DISABLE_FAILURE_SIGNAL_HANDLER` option (#58984)
- Fix heap corruption in `RayletClient` causing driver crash (use-after-free) (#58660)
- Use `shared_ptr` for `pins_in_flight_` to prevent use-after-free (#58744)
- Remove deprecated `add_command_alias` (#58719)
- Remove `cluster_full_of_actors_detected_*` fields (unused in autoscaler v2) (#59052)
📖 Documentation
- Add `token-auth.md` documentation page (#58829)
- Update KubeRay authentication guide to use native Ray token authentication (#58729)
Dashboard
💫 Enhancements
- Add `time_to_first_batch` and `get_ref_bundles` metrics to data dashboard (#58912)
- Update Ray Data histograms to show percentiles grouped by operator (#58650)
Ray Wheels and Images
- Upgraded `rich`, `cupy-cuda12x`, and `memray` (#58983)
- Upgraded `lxml` to 6.0.2 (#58808)
- Upgraded `requests` from 2.32.3 to 2.32.5 (#58724)
- Added `openlineage-python` in the dependency set (#58724)
Thanks
Thank you to everyone who contributed to this release!
@xinyuangui2, @harshit-anyscale, @Sparks0219, @israbbani, @siyuanfoundation, @robertnishihara, @thomasdesr, @spencer-p, @aslonnie, @ZacAttack, @soodoshll, @marosset, @simeetnayan81, @soffer-anyscale, @abrarsheikh, @400Ping, @richo-anyscale, @as-jding, @rueian, @kshanmol, @yancanmao, @zzchun, @coqian, @matthewdeng, @Future-Outlier, @YoussefEssDS, @ykdojo, @pseudo-rnd-thoughts, @lowdy1, @ArturNiederfahrenhorst, @myandpr, @komikndr, @machichima, @RisinT96, @curiosity-hyf, @alanwguo, @CaiZhanqi, @Aydin-ab, @MengjinYan, @suzuri-lollipop, @jeffreyjeffreywang, @rushikeshadhav, @alexeykudinkin, @meAmitPatil, @zcin, @teddygood, @elliot-barn, @dayshah, @srinathk10, @XLC127, @simonsays1980, @kevin85421, @bveeramani, @kunling-anyscale, @khluu, @andrew-anyscale, @KaisennHu, @kouroshHakha, @ryankert01, @pavitrabhalla, @jjyao, @dragongu, @SolitaryThinker, @justinrmiller, @wxwmd, @Haustle-v, @TimothySeah, @goutamvenkat-anyscale, @liulehui, @raulchen, @HassamSheikh, @Priya-753, @vaishdho1, @dancingactor, @daiping8, @eloaf, @JasonLi1909, @rayci-bot, @richardliaw, @SheldonTsen, @Yicheng-Lu-llll, @ktyxx, @pschmutz, @iamjustinhsu, @ahao-anyscale, @cem-anyscale, @eicherseiji, @edoakes, @rajeshg007, @arki05, @andrewsykim, @nrghosh, @ryanaoleary, @kyuds, @Daraan, @can-anyscale, @sampan-s-nayak, @xyuzh, @owenowenisme