This is the Training Operator v1.9.0-rc.0 pre-release.

Breaking Changes

Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
Update the name of PVC in train API (#2187 by @helenxie-bit)
Remove support for MXJob (#2150 by @tariq-hasan)
Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
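
Because MXJob support is removed in this release, it can help to check existing manifests before upgrading. The following is a minimal sketch, assuming the manifests have already been loaded into Python dictionaries. The helper name `find_removed_jobs` and the sample data are hypothetical and not part of the Training Operator or its SDK.

```python
from typing import Any, Dict, List

# Kinds removed in this release, per the Breaking Changes list above.
REMOVED_KINDS = {"MXJob"}


def find_removed_jobs(manifests: List[Dict[str, Any]]) -> List[str]:
    """Return "kind/namespace/name" for each manifest whose kind was removed."""
    hits = []
    for manifest in manifests:
        kind = manifest.get("kind")
        if kind in REMOVED_KINDS:
            meta = manifest.get("metadata") or {}
            namespace = meta.get("namespace", "default")
            name = meta.get("name", "<unnamed>")
            hits.append(f"{kind}/{namespace}/{name}")
    return hits


# Hypothetical sample data, for illustration only.
sample_manifests = [
    {"kind": "PyTorchJob", "metadata": {"name": "bert", "namespace": "ml"}},
    {"kind": "MXJob", "metadata": {"name": "legacy-mx", "namespace": "ml"}},
]
print(find_removed_jobs(sample_manifests))
```

Any job reported by such a check would need to be migrated or deleted before moving to this release.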

New Features

Distributed JAX

Add JAX controller (#2194 by @sandipanpanda)
Add JAX API (#2163 by @sandipanpanda)
JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)
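
As a rough illustration of what a JAXJob resource might look like, the sketch below builds a manifest as a plain Python dictionary. The field names (`jaxReplicaSpecs`, the `Worker` replica type, the `jax` container name) follow the pattern of the other v1 job kinds and are assumptions here. The image name is a placeholder. Check the JAX API added in #2163 for the authoritative schema.

```python
import json
from typing import Any, Dict


def build_jaxjob(name: str, image: str, replicas: int,
                 namespace: str = "default") -> Dict[str, Any]:
    """Build a JAXJob manifest as a dict (field names are assumptions)."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "JAXJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumed to mirror pytorchReplicaSpecs / tfReplicaSpecs.
            "jaxReplicaSpecs": {
                "Worker": {
                    "replicas": replicas,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [{"name": "jax", "image": image}]
                        }
                    },
                }
            }
        },
    }


# Placeholder image name, for illustration only.
jaxjob = build_jaxjob("jax-demo", "example.com/jax-train:latest", replicas=2)
print(json.dumps(jaxjob, indent=2))
```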

New Examples

FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
Add DeepSpeed Example with PyTorch Operator (#2235 by @Syulin7)

Control Plane Updates

Validate that PyTorchJob workers are configured when elasticPolicy is configured (#2320 by @tarat44)
[Feature] Support managed by external controller (#2203 by @mszadkow)
Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
ARM64 supported in PyTorch examples (#2116 by @danielsuh05)
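
The elasticPolicy validation in #2320 can be pictured with a small Python sketch. The operator's own validation is not shown in these notes, so this is only an illustrative restatement of the rule in the PR title: when an elastic policy is set, worker replicas must also be configured. The function name and the error message are hypothetical.

```python
from typing import Any, Dict, List


def validate_elastic_pytorchjob(spec: Dict[str, Any]) -> List[str]:
    """Return validation errors for a PyTorchJob spec given as a dict."""
    errors = []
    if spec.get("elasticPolicy") is not None:
        replica_specs = spec.get("pytorchReplicaSpecs") or {}
        if replica_specs.get("Worker") is None:
            # Hypothetical message; the real wording may differ.
            errors.append(
                "spec.pytorchReplicaSpecs.Worker must be set "
                "when spec.elasticPolicy is configured"
            )
    return errors


# Elastic policy without workers: rejected in this sketch.
invalid_spec = {
    "elasticPolicy": {"minReplicas": 1, "maxReplicas": 3},
    "pytorchReplicaSpecs": {"Master": {"replicas": 1}},
}
# Elastic policy with workers: accepted.
valid_spec = {
    "elasticPolicy": {"minReplicas": 1, "maxReplicas": 3},
    "pytorchReplicaSpecs": {"Worker": {"replicas": 2}},
}
print(validate_elastic_pytorchjob(invalid_spec))
print(validate_elastic_pytorchjob(valid_spec))
```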

SDK Updates

[SDK] Adding env vars (#2285 by @tarekabouzeid)
[SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
[SDK] Move env var to constants.py (#2268 by @varshaprasad96)
[SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
[SDK] Read namespace from the current context (#2255 by @andreyvelich)
[SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
[SDK] Explain Python version support cycle (#2144 by @andreyvelich)
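
To illustrate the idea behind reading the namespace from the current context (#2255), the sketch below looks up the namespace in an already-parsed kubeconfig. It relies only on the standard kubeconfig layout (`current-context` plus a `contexts` list). How the SDK itself performs the lookup is not described in these notes, and the function name and sample data are hypothetical.

```python
from typing import Any, Dict


def namespace_from_kubeconfig(kubeconfig: Dict[str, Any],
                              fallback: str = "default") -> str:
    """Return the namespace of the current context, or the fallback."""
    current = kubeconfig.get("current-context")
    for entry in kubeconfig.get("contexts") or []:
        if entry.get("name") == current:
            context = entry.get("context") or {}
            # A context may omit the namespace; use the fallback then.
            return context.get("namespace") or fallback
    return fallback


# Hypothetical kubeconfig content, already parsed into a dict.
sample_kubeconfig = {
    "current-context": "dev",
    "contexts": [
        {"name": "prod", "context": {"cluster": "c1", "user": "u1"}},
        {"name": "dev", "context": {"cluster": "c2", "user": "u2",
                                    "namespace": "team-ml"}},
    ],
}
print(namespace_from_kubeconfig(sample_kubeconfig))
```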

Kubeflow Training V2

KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
Always update TrainJob status on errors (#2352 by @astefanutti)
Fix TrainJob status comparison and update (#2353 by @astefanutti)
Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration tests (#2304 by @tenzen-y)
KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
[v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
KEP-2170: Make API specification more restrictive (#2198 by @tenzen-y)
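
For orientation, a V2 TrainJob that points at a runtime through `runtimeRef` might be assembled as in the sketch below. The `runtimeRef` field and its `apiGroup` come from the PR titles above. The `apiVersion` value, the `ClusterTrainingRuntime` kind, and the runtime name are assumptions and may not match the alpha API in this pre-release. See KEP-2170 for the actual schema.

```python
import json
from typing import Any, Dict


def build_trainjob(name: str, runtime_name: str,
                   namespace: str = "default",
                   api_version: str = "kubeflow.org/v2alpha1",  # assumption
                   runtime_kind: str = "ClusterTrainingRuntime",  # assumption
                   ) -> Dict[str, Any]:
    """Build a TrainJob manifest as a dict (schema details are assumptions)."""
    return {
        "apiVersion": api_version,
        "kind": "TrainJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "runtimeRef": {
                "apiGroup": "kubeflow.org",
                "kind": runtime_kind,
                "name": runtime_name,
            }
        },
    }


# "torch-distributed" is a placeholder runtime name, for illustration only.
trainjob = build_trainjob("demo-trainjob", "torch-distributed")
print(json.dumps(trainjob, indent=2))
```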

Bug Fixes

[release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
Pin accelerate package version in trainer (#2340 by @gavrissh)
[fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
[SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
[SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
[Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
[SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
[SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
Fix Volcano PodGroup update issue (#2079 by @ckyuto)
[SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)

Misc

[release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
Add e2e test for train API (#2199 by @helenxie-bit)
Fix broken buildx link (#2356 by @Veer0x1)
Upgrade helm/kind-action to v1.11.0 (#2357 by @astefanutti)
Upgrade Go version to v1.23 (#2302 by @tenzen-y)
Ensure code generation dependencies are downloaded (#2339 by @astefanutti)
Add test for create-pytorchjob.ipynb Python notebook (#2274 by @saileshd1402)
Remove zw0610 from approvers (#2343 by @zw0610)
Upgrade kustomization files to Kustomize v5 (#2326 by @oksanabaza)
Add openapi-generator CLI option to skip SDK v2 test generation (#2338 by @astefanutti)
Refine the server-side apply installation args (#2337 by @tenzen-y)
Ignore cache exporting errors in the image building workflows (#2336 by @tenzen-y)
Pin Gloo repository in JAX Dockerfile to a specific commit (#2329 by @sandipanpanda)
Update TFJob examples to TensorFlow v2 (#2270 by @YosiElias)
Remove Prometheus Monitoring doc (#2301 by @sophie0730)
Upgrade DeepSpeed demo dependencies (#2294 by @Syulin7)
[SDK] test: add unit test for list_jobs method of the training_client (#2267 by @seanlaii)
[SDK] Training Client Conditions related unit tests (#2253 by @Bobbins228)
[SDK] test: add unit test for get_job_logs method of the training_client (#2275 by @seanlaii)
[SDK] test: add unit test for get_job method of the training_client (#2205 by @Bobbins228)
[SDK] test: add unit tests for delete_job() method (#2232 by @Bobbins228)
[SDK] Add UTs for wait_for_job_conditions (#2196 by @Electronic-Waste)
[SDK] Unit tests for TrainingClient APIs - get_job_pod_names and update_job (#2192 by @YosiElias)
[SDK] Add more unit tests for TrainingClient APIs - get_job_pods (#2175 by @YosiElias)
Update JAX image to use image published by Kubeflow (#2264 by @sandipanpanda)
Update README and out-of-date docs (#2252 by @andreyvelich)
Clean up Go modules (#2238 by @tenzen-y)
Change isort profile to black for full compatibility (#2234 by @Ygnas)
Enhance pre-commit hooks with flake8 linting (#2195 by @Ygnas)
Implement pre-commit hooks (#2184 by @droctothorpe)
Add command to re-run GitHub Actions tests (#2167 by @andreyvelich)
Update JAX integration proposal (#2165 by @sandipanpanda)
Update release document (#2153 by @andreyvelich)
Update Volcano to v1.9.0 (#2148 by @lowang-bh)
Update Slack Invitation (#2142 by @andreyvelich)
Refine the integration tests for the immutable PyTorchJob queueName (#2130 by @tenzen-y)
Add GitHub Issue Template (#2129 by @andreyvelich)
Update the images to the latest tag in master branch (#2128 by @johnugeorge)
Update GitHub Actions workflows as per issue #2117 (#2123 by @hkiiita)
Change package name to flake8 to fix pytest pip install (#2109 by @ChristopheBrown)
chore(fix): isort xgboost (#2098 by @harshithbelagur)
Fix isort on examples/pytorch (#2094 by @marcmaliar)

kubeflow/training-operator v1.9.0-rc.0 release on GitHub
