github kubeflow/trainer v2.2.0

This is the Kubeflow Trainer v2.2.0 release 🚀

You can now deploy the Trainer control plane and runtimes with a single Helm install:

helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
    --namespace kubeflow-system \
    --create-namespace \
    --version 2.2.0 \
    --set runtimes.defaultEnabled=true

Install the Kubeflow Python SDK:

pip install kubeflow

For more information, please see the Kubeflow Trainer docs.

Breaking Changes

New Features

XGBoost & JAX Runtimes

Flux Runtime for MPI and HPC Workloads

  • feat: support for flux framework as hpc manager (#3188 by @vsoch)
  • feat: KEP 2841 Flux Policy to support Flux Framework (#2909 by @vsoch)

TrainJob Lifecycle

  • feat(api): Set RuntimePatch.Time field automatically during admission (#3363 by @astefanutti)
  • feat: add support for tracking TrainJob progress and training metrics (#3227 by @robert-bell)
  • feat(docs): KEP-2779: Track TrainJob progress and expose training metrics (#2905 by @robert-bell)
  • feat: add activeDeadlineSeconds (#3258 by @XploY04)
  • feat(docs): proposal for adding TTLSecondsAfterFinished and ActiveDeadlineSeconds fields to TrainJob CRD (#3068 by @XploY04)
  • chore: upstream istio support - superseding 3189 (#3259 by @sameerdattav)
  • feat(runtimes): Use JobSet VolumeClaimPolicies APIs for LLM Runtimes (#3150 by @andreyvelich)
  • feat(cache): KEP-2655 - Supporting readiness probes on cache nodes (#2904 by @akshaychitneni)
  • feat(initializer): add s3 model and dataset initializers (#2728 by @rudeigerc)
  • feat(api): Add securityContext support to PodTemplateSpecOverride in TrainJob (#3066 by @Sanskarzz)
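
As a rough illustration of the new activeDeadlineSeconds field (#3258), the sketch below builds a TrainJob manifest as a plain Python dict and prints it as JSON. It is not taken from the release itself: the helper name build_trainjob, the job name, the torch-distributed runtime name, and the placement of activeDeadlineSeconds directly under spec are all assumptions made for the example. Check the TrainJob API reference for the authoritative schema before relying on it.

```python
import json


def build_trainjob(name: str, runtime: str, deadline_seconds: int) -> dict:
    """Return a TrainJob manifest as a dict.

    Hypothetical helper for illustration only; it is not part of the
    Kubeflow SDK.
    """
    if deadline_seconds <= 0:
        raise ValueError("deadline_seconds must be a positive integer")
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            # Assumed runtime name; use a runtime installed in your cluster.
            "runtimeRef": {"name": runtime},
            # Assumption: the field sits directly under spec, mirroring the
            # Kubernetes Job field of the same name.
            "activeDeadlineSeconds": deadline_seconds,
        },
    }


manifest = build_trainjob("example-trainjob", "torch-distributed", 3600)
print(json.dumps(manifest, indent=2))
```

Because JSON is also valid YAML, the printed manifest could be saved to a file and passed to kubectl apply -f once the field names have been confirmed against the installed CRD.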

Bug Fixes

Misc

Dependencies Upgrade

  • chore(deps): Bump Torch to 2.10 version (#3320 by @andreyvelich)
  • chore(deps): bump transformers from 5.2.0 to 5.3.0 in /cmd/runtimes/deepspeed (#3297 by @dependabot[bot])
  • chore(deps): bump datasets from 4.6.1 to 4.7.0 in /cmd/runtimes/mlx (#3291 by @dependabot[bot])
  • chore(deps): bump datasets from 4.5.0 to 4.7.0 in /cmd/runtimes/deepspeed (#3298 by @dependabot[bot])
  • chore(deps): update huggingface-hub requirement from <1.5,>=0.27.0 to >=0.27.0,<1.7 in /cmd/initializers/dataset (#3296 by @dependabot[bot])
  • chore(deps): bump mlx-lm from 0.30.7 to 0.31.0 in /cmd/runtimes/mlx (#3295 by @dependabot[bot])
  • chore(deps): bump docker/login-action from 3 to 4 (#3293 by @dependabot[bot])
  • chore(deps): update huggingface-hub requirement from <1.5,>=0.27.0 to >=0.27.0,<1.7 in /cmd/initializers/model (#3292 by @dependabot[bot])
  • chore(deps): bump aquasecurity/trivy-action from 0.34.2 to 0.35.0 (#3290 by @dependabot[bot])
  • chore(deps): bump rust from 1.93-bullseye to 1.94-bullseye in /cmd/data_cache (#3289 by @dependabot[bot])
  • chore(deps): bump clap from 4.5.59 to 4.5.60 in /pkg/data_cache/test (#3249 by @dependabot[bot])
  • chore(deps): bump actions/upload-artifact from 6 to 7 (#3267 by @dependabot[bot])
  • chore(deps): bump quinn-proto from 0.11.13 to 0.11.14 in /pkg/data_cache (#3305 by @dependabot[bot])
  • chore(deps): bump tokio from 1.49.0 to 1.50.0 in /pkg/data_cache/test (#3288 by @dependabot[bot])
  • chore(deps): bump tokio from 1.49.0 to 1.50.0 in /pkg/data_cache (#3299 by @dependabot[bot])
  • chore(deps): bump deepspeed from 0.18.6 to 0.18.7 in /cmd/runtimes/deepspeed (#3294 by @dependabot[bot])
  • chore(deps): bump the kubernetes group across 1 directory with 9 updates (#3287 by @dependabot[bot])
  • chore(deps): bump datasets from 4.5.0 to 4.6.1 in /cmd/runtimes/mlx (#3272 by @dependabot[bot])
  • chore(deps): bump aquasecurity/trivy-action from 0.34.1 to 0.34.2 (#3268 by @dependabot[bot])
  • chore(deps): Bump Trivy version to v0.69.2 (#3265 by @andreyvelich)
  • chore(deps): bump arrow-flight from 57.3.0 to 58.0.0 in /pkg/data_cache/test (#3248 by @dependabot[bot])
  • chore(deps): bump tonic from 0.14.3 to 0.14.5 in /pkg/data_cache/test (#3246 by @dependabot[bot])
  • chore(deps): bump mpioperator/base from v0.7.0 to v0.8.0 in /cmd/runtimes/deepspeed (#3243 by @dependabot[bot])
  • chore(deps): bump actions/setup-go from 5 to 6 (#3245 by @dependabot[bot])
  • chore(deps): bump mpioperator/base from v0.7.0 to v0.8.0 in /cmd/runtimes/mlx (#3244 by @dependabot[bot])
  • chore(deps): bump futures from 0.3.31 to 0.3.32 in /pkg/data_cache/test (#3211 by @dependabot[bot])
  • chore(deps): bump aquasecurity/trivy-action from 0.33.1 to 0.34.0 in /.github/workflows (#3222 by @dependabot[bot])
  • chore(deps): bump futures from 0.3.31 to 0.3.32 in /pkg/data_cache (#3214 by @dependabot[bot])
  • chore(deps): bump deepspeed from 0.18.5 to 0.18.6 in /cmd/runtimes/deepspeed (#3212 by @dependabot[bot])
  • chore(deps): bump transformers from 4.57.6 to 5.2.0 in /cmd/runtimes/deepspeed (#3210 by @dependabot[bot])
  • chore(deps): bump clap from 4.5.57 to 4.5.59 in /pkg/data_cache/test (#3206 by @dependabot[bot])
  • chore(deps): update huggingface-hub requirement from <1.4,>=0.27.0 to >=0.27.0,<1.5 in /cmd/initializers/dataset (#3194 by @dependabot[bot])
  • chore(deps): bump the kubernetes group with 7 updates (#3204 by @dependabot[bot])
  • feat: Add the manager field to the podTemplateOverride object (#3020 by @kaisoz)
  • chore(deps): bump mlx[cuda] from 0.30.5 to 0.30.6 in /cmd/runtimes/mlx (#3196 by @dependabot[bot])
  • chore(deps): update huggingface-hub requirement from <1.4,>=0.27.0 to >=0.27.0,<1.5 in /cmd/initializers/model (#3198 by @dependabot[bot])
  • chore(deps): bump mlx-lm from 0.30.5 to 0.30.6 in /cmd/runtimes/mlx (#3195 by @dependabot[bot])
  • chore(deps): bump clap from 4.5.56 to 4.5.57 in /pkg/data_cache/test (#3193 by @dependabot[bot])
  • chore(deps): bump sigs.k8s.io/structured-merge-diff/v6 from 6.3.2-0.20260122202528-d9cc6641c482 to 6.3.2 in the kubernetes group (#3190 by @dependabot[bot])
  • chore(deps): bump arrow-flight from 57.2.0 to 57.3.0 in /pkg/data_cache/test (#3192 by @dependabot[bot])
  • chore(deps): bump tonic from 0.14.2 to 0.14.3 in /pkg/data_cache/test (#3163 by @dependabot[bot])
  • chore(deps): bump golang.org/x/crypto from 0.47.0 to 0.48.0 in the golang group (#3191 by @dependabot[bot])
  • chore(deps): bump mlx[cuda] from 0.30.3 to 0.30.5 in /cmd/runtimes/mlx (#3162 by @dependabot[bot])
  • chore(deps): bump time from 0.3.44 to 0.3.47 in /pkg/data_cache (#3180 by @dependabot[bot])
  • chore(deps): bump deepspeed from 0.18.4 to 0.18.5 in /cmd/runtimes/deepspeed (#3161 by @dependabot[bot])
  • chore(deps): bump github.com/onsi/gomega from 1.39.0 to 1.39.1 (#3159 by @dependabot[bot])
  • chore(deps): bump clap from 4.5.54 to 4.5.56 in /pkg/data_cache/test (#3160 by @dependabot[bot])
  • chore(deps): bump github.com/onsi/ginkgo/v2 from 2.27.5 to 2.28.1 (#3158 by @dependabot[bot])
  • chore(deps): bump bytes from 1.11.0 to 1.11.1 in /pkg/data_cache (#3170 by @dependabot[bot])
  • chore(deps): bump bytes from 1.11.0 to 1.11.1 in /pkg/data_cache/test (#3169 by @dependabot[bot])
  • chore(deps): bump nvidia/cuda from 13.1.0-devel-ubuntu22.04 to 13.1.1-devel-ubuntu22.04 in /cmd/runtimes/deepspeed (#3131 by @dependabot[bot])
  • chore(deps): bump nvidia/cuda from 13.1.0-devel-ubuntu22.04 to 13.1.1-devel-ubuntu22.04 in /cmd/runtimes/mlx (#3129 by @dependabot[bot])
  • chore(deps): Bump JobSet v0.11.0 and LWS v0.8.0 (#3144 by @andreyvelich)
  • chore(deps): bump tower from 0.5.2 to 0.5.3 in /pkg/data_cache (#3137 by @dependabot[bot])
  • chore(deps): bump rust from 1.92-bullseye to 1.93-bullseye in /cmd/data_cache (#3132 by @dependabot[bot])
  • chore(deps): Bump Go 1.25, k8s v1.35, and controller-runtime v0.23.1 (#3127 by @andreyvelich)
  • chore(deps): bump mlx-lm from 0.30.4 to 0.30.5 in /cmd/runtimes/mlx (#3134 by @dependabot[bot])
  • chore(deps): bump tokio from 1.48.0 to 1.49.0 in /pkg/data_cache (#3138 by @dependabot[bot])
  • chore(deps): bump datasets from 4.4.2 to 4.5.0 in /cmd/runtimes/deepspeed (#3105 by @dependabot[bot])
  • chore(deps): bump datasets from 4.4.2 to 4.5.0 in /cmd/runtimes/mlx (#3108 by @dependabot[bot])
  • chore(deps): bump mlx[cuda] from 0.30.1 to 0.30.3 in /cmd/runtimes/mlx (#3107 by @dependabot[bot])
  • chore(deps): bump transformers from 4.57.3 to 4.57.6 in /cmd/runtimes/deepspeed (#3106 by @dependabot[bot])
  • chore(deps): bump mlx-lm from 0.30.2 to 0.30.4 in /cmd/runtimes/mlx (#3109 by @dependabot[bot])
  • chore(runtimes): Bump Torch to 2.9.1 version (#3093 by @andreyvelich)
  • chore(deps): bump axum from 0.7.9 to 0.8.8 in /pkg/data_cache (#3072 by @dependabot[bot])
  • chore(deps): bump tonic from 0.12.3 to 0.14.2 in /pkg/data_cache/test (#3054 by @dependabot[bot])
  • chore(deps): bump tower from 0.4.13 to 0.5.2 in /pkg/data_cache (#3074 by @dependabot[bot])
  • chore(deps): update huggingface-hub requirement from <1.2,>=0.27.0 to >=0.27.0,<1.4 in /cmd/initializers/dataset (#3090 by @dependabot[bot])
  • chore(deps): bump clap from 4.5.53 to 4.5.54 in /pkg/data_cache/test (#3070 by @dependabot[bot])
  • chore(deps): bump github.com/onsi/gomega from 1.38.3 to 1.39.0 (#3085 by @dependabot[bot])
  • chore(deps): bump tokio from 1.48.0 to 1.49.0 in /pkg/data_cache/test (#3069 by @dependabot[bot])
  • chore(deps): update huggingface-hub requirement from <1.2,>=0.27.0 to >=0.27.0,<1.4 in /cmd/initializers/model (#3091 by @dependabot[bot])
  • chore(deps): bump mlx-lm from 0.30.0 to 0.30.2 in /cmd/runtimes/mlx (#3089 by @dependabot[bot])
  • chore(deps): bump deepspeed from 0.18.3 to 0.18.4 in /cmd/runtimes/deepspeed (#3088 by @dependabot[bot])
  • chore(deps): bump arrow-flight from 57.1.0 to 57.2.0 in /pkg/data_cache/test (#3087 by @dependabot[bot])
  • chore(deps): bump github.com/onsi/ginkgo/v2 from 2.27.3 to 2.27.5 (#3086 by @dependabot[bot])
  • chore(deps): bump golang.org/x/crypto from 0.46.0 to 0.47.0 in the golang group (#3084 by @dependabot[bot])
  • chore(deps): bump mlx[cuda] from 0.30.0 to 0.30.1 in /cmd/runtimes/mlx (#3053 by @dependabot[bot])
  • chore(deps): bump tracing from 0.1.41 to 0.1.44 in /pkg/data_cache/test (#3051 by @dependabot[bot])
  • chore(deps): bump arrow-flight from 55.2.0 to 57.1.0 in /pkg/data_cache/test (#3055 by @dependabot[bot])
  • chore(deps): bump datasets from 4.4.1 to 4.4.2 in /cmd/runtimes/mlx (#3052 by @dependabot[bot])
  • chore(deps): bump mlx-lm from 0.28.4 to 0.30.0 in /cmd/runtimes/mlx (#3050 by @dependabot[bot])
  • chore(deps): bump datasets from 4.4.1 to 4.4.2 in /cmd/runtimes/deepspeed (#3049 by @dependabot[bot])
  • chore(deps): bump bincode from 2.0.1 to 3.0.0 in /pkg/data_cache/test (#3048 by @dependabot[bot])
  • chore(deps): bump sigs.k8s.io/kind from 0.30.0 to 0.31.0 in the kubernetes group (#3047 by @dependabot[bot])
  • chore(deps): bump transformers from 4.57.2 to 4.57.3 in /cmd/runtimes/deepspeed (#3031 by @dependabot[bot])
  • chore(deps): bump nvidia/cuda from 13.0.2-devel-ubuntu22.04 to 13.1.0-devel-ubuntu22.04 in /cmd/runtimes/mlx (#3036 by @dependabot[bot])
  • chore(deps): bump the kubernetes group with 6 updates (#3035 by @dependabot[bot])
  • chore(deps): bump actions/upload-artifact from 5 to 6 (#3038 by @dependabot[bot])
  • chore(deps): bump nvidia/cuda from 13.0.2-devel-ubuntu22.04 to 13.1.0-devel-ubuntu22.04 in /cmd/runtimes/deepspeed (#3037 by @dependabot[bot])
  • chore(deps): bump rust from 1.91-bullseye to 1.92-bullseye in /cmd/data_cache (#3040 by @dependabot[bot])
  • chore(deps): bump deepspeed from 0.18.2 to 0.18.3 in /cmd/runtimes/deepspeed (#3039 by @dependabot[bot])
  • chore(deps): bump mlx-lm from 0.28.3 to 0.28.4 in /cmd/runtimes/mlx (#3029 by @dependabot[bot])
  • chore(deps): bump github.com/onsi/ginkgo/v2 from 2.27.2 to 2.27.3 (#3026 by @dependabot[bot])
  • chore(deps): bump github.com/onsi/gomega from 1.38.2 to 1.38.3 (#3027 by @dependabot[bot])
  • chore(deps): bump golang.org/x/crypto from 0.45.0 to 0.46.0 in the golang group (#3025 by @dependabot[bot])
  • chore(deps): bump bytes from 1.10.1 to 1.11.0 in /pkg/data_cache (#3001 by @dependabot[bot])
  • chore(deps): bump go.uber.org/zap from 1.27.0 to 1.27.1 (#2998 by @dependabot[bot])
  • chore(deps): bump clap from 4.5.52 to 4.5.53 in /pkg/data_cache/test (#3004 by @dependabot[bot])
  • chore(deps): bump arrow-flight from 57.0.0 to 57.1.0 in /pkg/data_cache/test (#3003 by @dependabot[bot])
  • chore(deps): bump transformers from 4.57.1 to 4.57.2 in /cmd/runtimes/deepspeed (#3002 by @dependabot[bot])
  • chore(deps): bump actions/checkout from 5 to 6 (#3000 by @dependabot[bot])
  • chore(deps): bump mlx[cuda] from 0.29.4 to 0.30.0 in /cmd/runtimes/mlx (#2999 by @dependabot[bot])
  • chore(deps): bump sigs.k8s.io/structured-merge-diff/v6 from 6.3.0 to 6.3.1 in the kubernetes group (#2996 by @dependabot[bot])
  • chore(deps): bump github.com/open-policy-agent/cert-controller from 0.14.0 to 0.15.0 (#2997 by @dependabot[bot])
  • chore(deps): bump golang.org/x/crypto from 0.44.0 to 0.45.0 (#2994 by @dependabot[bot])
  • chore(deps): bump golang.org/x/crypto from 0.43.0 to 0.44.0 in the golang group (#2985 by @dependabot[bot])
  • chore(deps): bump clap from 4.5.51 to 4.5.52 in /pkg/data_cache/test (#2990 by @dependabot[bot])
  • chore(deps): bump async-trait from 0.1.88 to 0.1.89 in /pkg/data_cache (#2988 by @dependabot[bot])
  • chore(deps): bump pytorch/pytorch from 2.9.0-cuda12.8-cudnn9-runtime to 2.9.1-cuda12.8-cudnn9-runtime in /cmd/trainers/torchtune (#2986 by @dependabot[bot])
  • chore(deps): bump the kubernetes group with 6 updates (#2984 by @dependabot[bot])
  • chore(deps): bump bytes from 1.10.1 to 1.11.0 in /pkg/data_cache/test (#2989 by @dependabot[bot])
  • chore(deps): bump mlx-lm from 0.26.3 to 0.28.3 in /cmd/runtimes/mlx (#2950 by @dependabot[bot])
  • chore(deps): update huggingface-hub requirement from <0.28,>=0.27.0 to >=0.27.0,<1.2 in /cmd/initializers/model (#2957 by @dependabot[bot])
  • chore(deps): update huggingface-hub requirement from <0.28,>=0.27.0 to >=0.27.0,<1.2 in /cmd/initializers/dataset (#2955 by @dependabot[bot])
  • chore(deps): bump datasets from 4.0.0 to 4.4.1 in /cmd/runtimes/deepspeed (#2944 by @dependabot[bot])
  • chore(deps): bump mlx[cuda] from 0.28.0 to 0.29.3 in /cmd/runtimes/mlx (#2956 by @dependabot[bot])
  • chore(deps): bump transformers from 4.55.0 to 4.57.1 in /cmd/runtimes/deepspeed (#2961 by @dependabot[bot])
  • chore(deps): bump deepspeed from 0.17.4 to 0.18.2 in /cmd/runtimes/deepspeed (#2954 by @dependabot[bot])
  • chore(deps): bump nvidia/cuda from 12.8.1-devel-ubuntu22.04 to 13.0.2-devel-ubuntu22.04 in /cmd/runtimes/deepspeed (#2939 by @dependabot[bot])
  • chore(deps): bump pytorch/pytorch from 2.7.1-cuda12.8-cudnn9-runtime to 2.9.0-cuda12.8-cudnn9-runtime in /cmd/trainers/torchtune (#2934 by @dependabot[bot])
  • chore(deps): bump datasets from 4.0.0 to 4.4.1 in /cmd/runtimes/mlx (#2943 by @dependabot[bot])
  • chore(deps): bump nvidia/cuda from 12.8.1-devel-ubuntu22.04 to 13.0.2-devel-ubuntu22.04 in /cmd/runtimes/mlx (#2932 by @dependabot[bot])
  • chore(deps): bump mpi4py from 4.1.0 to 4.1.1 in /cmd/runtimes/deepspeed (#2958 by @dependabot[bot])
  • chore(deps): bump bincode from 1.3.3 to 2.0.1 in /pkg/data_cache/test (#2949 by @dependabot[bot])
  • chore(deps): bump tonic from 0.12.3 to 0.14.2 in /pkg/data_cache/test (#2962 by @dependabot[bot])
  • chore(deps): bump serde from 1.0.225 to 1.0.228 in /pkg/data_cache/test (#2959 by @dependabot[bot])
  • chore(deps): bump actions/checkout from 4 to 5 (#2974 by @dependabot[bot])
  • chore(deps): bump serde from 1.0.215 to 1.0.228 in /pkg/data_cache (#2978 by @dependabot[bot])
  • chore(deps): bump actions/setup-go from 5 to 6 (#2975 by @dependabot[bot])
  • chore(deps): bump amannn/action-semantic-pull-request from 5.5.3 to 6.1.1 (#2976 by @dependabot[bot])
  • chore(deps): bump arrow-flight from 55.2.0 to 57.0.0 in /pkg/data_cache/test (#2973 by @dependabot[bot])
  • chore(deps): bump actions/setup-python from 5 to 6 (#2977 by @dependabot[bot])
  • chore(deps): bump python from 3.11-slim-bookworm to 3.14-slim-bookworm in /cmd/initializers/model (#2951 by @dependabot[bot])
  • chore(deps): bump python from 3.11-slim-bookworm to 3.14-slim-bookworm in /cmd/initializers/dataset (#2941 by @dependabot[bot])
  • chore(deps): bump sentencepiece from 0.2.0 to 0.2.1 in /cmd/runtimes/deepspeed (#2948 by @dependabot[bot])
  • chore(deps): bump tokio from 1.47.1 to 1.48.0 in /pkg/data_cache/test (#2963 by @dependabot[bot])
  • chore(deps): bump clap from 4.5.43 to 4.5.51 in /pkg/data_cache/test (#2965 by @dependabot[bot])
  • chore(deps): bump tokio from 1.46.1 to 1.48.0 in /pkg/data_cache (#2966 by @dependabot[bot])
  • chore(deps): bump aquasecurity/trivy-action from 0.28.0 to 0.33.1 (#2947 by @dependabot[bot])
  • chore(deps): bump actions/stale from 9 to 10 (#2942 by @dependabot[bot])
  • chore(deps): bump mpioperator/base from v0.6.0 to v0.7.0 in /cmd/runtimes/deepspeed (#2938 by @dependabot[bot])
  • chore(deps): bump golang from 1.24 to 1.25 in /cmd/trainer-controller-manager (#2935 by @dependabot[bot])
  • chore(deps): bump actions/github-script from 7 to 8 (#2937 by @dependabot[bot])
  • chore(deps): bump actions/upload-artifact from 4 to 5 (#2936 by @dependabot[bot])
  • chore(deps): bump mpioperator/base from v0.6.0 to v0.7.0 in /cmd/runtimes/mlx (#2933 by @dependabot[bot])
  • chore(deps): bump rust from 1.85-bullseye to 1.91-bullseye in /cmd/data_cache (#2931 by @dependabot[bot])
  • chore(deps): bump github/codeql-action from 3 to 4 (#2953 by @dependabot[bot])
  • chore(deps): bump github.com/onsi/ginkgo/v2 from 2.25.3 to 2.27.2 (#2952 by @dependabot[bot])
  • chore(deps): bump sigs.k8s.io/controller-runtime from 0.22.3 to 0.22.4 in the kubernetes group (#2940 by @dependabot[bot])
  • chore(deps): bump golang.org/x/crypto from 0.41.0 to 0.43.0 in the golang group (#2945 by @dependabot[bot])
