This is the Training Operator v1.9.0-rc.0 pre-release.
Breaking Changes
- Upgrade Kubernetes to v1.31.3 (#2330 by @astefanutti)
- Upgrade Kubernetes to v1.30.7 (#2332 by @astefanutti)
- Update the name of PVC in train API (#2187 by @helenxie-bit)
- Remove support for MXJob (#2150 by @tariq-hasan)
- Support Python 3.11 and Drop Python 3.7 (#2105 by @tenzen-y)
New Features
Distributed JAX
- Add JAX controller (#2194 by @sandipanpanda)
- Add JAX API (#2163 by @sandipanpanda)
- JAX Integration Enhancement Proposal (#2125 by @sandipanpanda)
New Examples
- FSDP Example for T5 Fine-Tuning and PyTorchJob (#2286 by @andreyvelich)
- Add DeepSpeed Example with Pytorch Operator (#2235 by @Syulin7)
Control Plane Updates
- Validate pytorchjob workers are configured when elasticpolicy is configured (#2320 by @tarat44)
- [Feature] Support managed by external controller (#2203 by @mszadkow)
- Update trainer to ensure type consistency for train_args and lora_config (#2181 by @helenxie-bit)
- Support ARM64 platform in TensorFlow examples (#2119 by @akhilsaivenkata)
- Feat: Support ARM64 platform in XGBoost examples (#2114 by @tico88612)
- ARM64 supported in PyTorch examples (#2116 by @danielsuh05)
SDK Updates
- [SDK] Adding env vars (#2285 by @tarekabouzeid)
- [SDK] Use torchrun to create PyTorchJob from function (#2276 by @andreyvelich)
- [SDK] move env var to constants.py (#2268 by @varshaprasad96)
- [SDK] Allow customising base trainer and storage images in Train API (#2261 by @varshaprasad96)
- [SDK] Read namespace from the current context (#2255 by @andreyvelich)
- [SDK] Sync Transformers version for train API (#2146 by @andreyvelich)
- [SDK] Explain Python version support cycle (#2144 by @andreyvelich)
Kubeflow Training V2
- KEP-2170: Kubeflow Training V2 API (#2171 by @andreyvelich)
- KEP-2170: Update V2 KEP with MPI Runtime info (#2345 by @andreyvelich)
- Always update TrainJob status on errors (#2352 by @astefanutti)
- Fix TrainJob status comparison and update (#2353 by @astefanutti)
- Add required RBAC on TrainJob finalizer sub-resources (#2350 by @astefanutti)
- KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK (#2324 by @andreyvelich)
- KEP-2170: Add Torch Distributed Runtime (#2328 by @andreyvelich)
- KEP-2170: Add TrainJob conditions (#2322 by @tenzen-y)
- KEP-2170: Add the TrainJob state transition design (#2298 by @tenzen-y)
- KEP-2170: Implement Initializer builders in the JobSet plugin (#2316 by @andreyvelich)
- KEP-2170: Implement JobSet, PlainML, and Torch Plugins (#2308 by @andreyvelich)
- KEP-2170: Create model and dataset initializers (#2303 by @andreyvelich)
- KEP-2170: Generate Python SDK for Kubeflow Training V2 (#2310 by @andreyvelich)
- KEP-2170: Initialize runtimes before the manager starts (#2306 by @tenzen-y)
- KEP-2170: Strictly verify the CRD marker validation and defaulting in the integration testings (#2304 by @tenzen-y)
- KEP-2170: Decouple JobSet from TrainJob (#2296 by @tenzen-y)
- KEP-2170: Implement TrainJob Reconciler to manage objects (#2295 by @tenzen-y)
- KEP-2170: Add manifests for Kubeflow Training V2 (#2289 by @andreyvelich)
- KEP-2170: Adding CEL validations on v2 TrainJob CRD (#2260 by @akshaychitneni)
- KEP-2170: Rename TrainingRuntimeRef to RuntimeRef API (#2283 by @andreyvelich)
- KEP-2170: Implement runtime framework (#2248 by @tenzen-y)
- [v2alpha] Move GV related codebase (#2281 by @varshaprasad96)
- KEP-2170: Generate clientset, openapi spec for the V2 APIs (#2273 by @varshaprasad96)
- KEP-2170: Implement skeleton webhook servers (#2251 by @tenzen-y)
- KEP-2170: Initial Implementations for v2 Manager (#2236 by @tenzen-y)
- KEP-2170: Generate CRD manifests for v2 CustomResources (#2237 by @tenzen-y)
- KEP-2170: Update Training V2 APIs in the KEP (#2240 by @andreyvelich)
- KEP-2170: Add TrainJob and TrainingRuntime APIs (#2223 by @andreyvelich)
- KEP-2170: Bind repository into the build environment instead of filecopy (#2222 by @tenzen-y)
- KEP-2170: Add directories for the V2 APIs (#2221 by @andreyvelich)
- KEP-2170: Add the apiGroup to the TrainingRuntimeRef (#2201 by @tenzen-y)
- KEP-2170: Make API specification more restricting (#2198 by @tenzen-y)
Bug Fixes
- [release-1.9] V1: Fix versions in HuggingFace dataset initializer (#2370 by @andreyvelich)
- Pin accelerate package version in trainer (#2340 by @gavrissh)
- [fix] Resolve v2alpha API exceptions (#2317 by @varshaprasad96)
- [SDK] Minor fix in wait_for_job_conditions with job_kind python training API (#2265 by @saileshd1402)
- [SDK] Fix typo of "get_pvc_spec" (#2250 by @helenxie-bit)
- [Bug] Finish CleanupJob early if the job is suspended. (#2243 by @mszadkow)
- [SDK] Fix trainer error: Update the version of base image and add "num_labels" for downloading pretrained models (#2230 by @helenxie-bit)
- Update huggingface_hub Version in the storage initializer to fix ImportError (#2180 by @helenxie-bit)
- [SDK] Fix Failed condition in wait Job API (#2160 by @andreyvelich)
- fix volcano podgroup update issue (#2079 by @ckyuto)
- [SDK] Fix Incorrect Events in get_job_logs API (#2122 by @andreyvelich)
Misc
- [release-1.9] Add release branch to the image push trigger (#2377 by @andreyvelich)
- Add e2e test for train API (#2199 by @helenxie-bit)
- buildx link was broken (#2356 by @Veer0x1)
- Upgrade helm/kind-action to v1.11.0 (#2357 by @astefanutti)
- Upgrade Go version to v1.23 (#2302 by @tenzen-y)
- Ensure code generation dependencies are downloaded (#2339 by @astefanutti)
- Added test for create-pytorchjob.ipynb python notebook (#2274 by @saileshd1402)
- Remove zw0610 from approvers (#2343 by @zw0610)
- Upgrade kustomization files to Kustomize v5 (#2326 by @oksanabaza)
- Add openapi-generator CLI option to skip SDK v2 test generation (#2338 by @astefanutti)
- Refine the server-side apply installation args (#2337 by @tenzen-y)
- Ignore cache exporting errors in the image building workflows (#2336 by @tenzen-y)
- Pin Gloo repository in JAX Dockerfile to a specific commit (#2329 by @sandipanpanda)
- Update tf job examples to tf v2 (#2270 by @YosiElias)
- Remove Prometheus Monitoring doc (#2301 by @sophie0730)
- Upgrade Deepspeed demo dependencies (#2294 by @Syulin7)
- [SDK] test: add unit test for list_jobs method of the training_client (#2267 by @seanlaii)
- [SDK] Training Client Conditions related unit tests (#2253 by @Bobbins228)
- [SDK] test: add unit test for get_job_logs method of the training_client (#2275 by @seanlaii)
- [SDK] test: add unit test for get_job method of the training_client (#2205 by @Bobbins228)
- [SDK] test: add unit tests for delete_job() method (#2232 by @Bobbins228)
- [SDK] Add UTs for wait_for_job_conditions (#2196 by @Electronic-Waste)
- [SDK] Unit tests for TrainingClient APIs - get_job_pod_names and update_job (#2192 by @YosiElias)
- [SDK] Add more unit tests for TrainingClient APIs - get_job_pods (#2175 by @YosiElias)
- Update JAX image to use image published by Kubeflow (#2264 by @sandipanpanda)
- Update README and out-of-date docs (#2252 by @andreyvelich)
- Clean up Go modules (#2238 by @tenzen-y)
- Change isort profile to black for full compatibility (#2234 by @Ygnas)
- Enhance pre-commit hooks with flake8 linting (#2195 by @Ygnas)
- Implement pre-commit hooks (#2184 by @droctothorpe)
- Add command to re-run GitHub Actions tests (#2167 by @andreyvelich)
- Update JAX integration proposal (#2165 by @sandipanpanda)
- Update release document (#2153 by @andreyvelich)
- update volcano to v1.9.0 (#2148 by @lowang-bh)
- Update Slack Invitation (#2142 by @andreyvelich)
- Refine the integration tests for the immutable PyTorchJob queueName (#2130 by @tenzen-y)
- Add GitHub Issue Template (#2129 by @andreyvelich)
- Update the images to the latest tag in master branch (#2128 by @johnugeorge)
- Updated Github Action Workflows as per issue #2117 (#2123 by @hkiiita)
- changed package name to flake8 to fix pytests pip install (#2109 by @ChristopheBrown)
- chore(fix): isort xgboost (#2098 by @harshithbelagur)
- Fix isort on examples/pytorch (#2094 by @marcmaliar)