github kubeflow/trainer v2.1.0

one day ago

This is the Kubeflow Trainer v2.1.0 release.

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.1.0"

$ kubectl get pods -n kubeflow-system

NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s

kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.1.0"

You can now install the controller manager with Helm charts 🚀

helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.1.0

Install the Kubeflow Python SDK:

pip install -U kubeflow

For more information, please see the Kubeflow Trainer docs.

Breaking Changes

  • feat(api): Replace deprecated PodSpecOverrides API with PodTemplateOverrides in TrainJob (#2785 by @xigang)
  • feat(operator): Replace TrainJob controller settings with the Config API (#2879 by @kapil27)
  • chore(operator): Upgrade JobSet to v0.10.1 (#2875 by @astefanutti)
  • chore(operator): Upgrade Kubernetes to v1.34 (#2804 by @astefanutti)
  • Upgrade Kubernetes to v1.33 (#2756 by @astefanutti)

New Features

Distributed AI Data Cache

Stream data directly to your GPU nodes with zero-copy transfers from an in-memory cache cluster powered by Apache Arrow and Apache DataFusion. This allows users to load massive tabular datasets efficiently, maximize GPU utilization, and minimize I/O for large-scale pre- or post-training distributed AI workloads.

Explore more about data cache in:

LLM Post-Training

Kueue Enhancements

Check out the official Kueue docs.

Volcano Scheduler

  • feat: KEP-2437 - PodGroup Creation for Volcano Scheduler (#2729 by @Doris-xm)
  • feat(docs): KEP-2437-Support Volcano Scheduler in Kubeflow Trainer V2 (#2672 by @Doris-xm)

API Updates

  • feat(runtimes): add support for launcher resource allocation in MPI jobs (#2653 by @jskswamy)
  • feat: Add PodTemplateOverrides into TrainJob V2 API (#2882 by @xigang)
  • feat(api): Sync TrainJob JobsStatus from JobSet ReplicatedJobsStatus (#2802 by @astefanutti)
  • feat: support imagePullSecrets in TrainJob pod spec overrides (#2806 by @toVersus)
  • feat(operator): enforce RFC 1035 validation for TrainJob name (#2767 by @juniemariam)
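Because TrainJob names are now validated against RFC 1035, it can help to check a name client-side before submitting. The sketch below assumes the rule matches the DNS label convention Kubernetes commonly applies (lowercase letter first, letter or digit last, only lowercase letters, digits, and hyphens in between, at most 63 characters). The helper name `is_rfc1035_label` is hypothetical and not part of the Kubeflow SDK; the operator's actual check may differ in detail.

```python
import re

# Assumed DNS label rule in the RFC 1035 style used by Kubernetes:
# first character is a lowercase letter, last is a letter or digit,
# everything in between is a lowercase letter, digit, or hyphen.
_DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
_MAX_LABEL_LENGTH = 63  # maximum label length for a DNS label


def is_rfc1035_label(name: str) -> bool:
    """Return True if `name` looks like a valid RFC 1035 label.

    Hypothetical client-side helper; the authoritative validation
    happens in the Kubeflow Trainer operator.
    """
    if len(name) > _MAX_LABEL_LENGTH:
        return False
    return _DNS1035_LABEL.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["llama-finetune", "1st-job", "My-Job", "job-"]:
        print(candidate, is_rfc1035_label(candidate))
```

Under these assumptions, a name such as `llama-finetune` passes, while names that start with a digit, contain uppercase letters, or end with a hyphen are rejected.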

Bug Fixes

Misc
