kubeflow/training-operator v1.6.0-rc.1 on GitHub

Note: Because scheduler-plugins has changed its API from sigs.k8s.io to x-k8s.io, future releases of the training operator (v1.7+) will not support scheduler-plugins v0.24.x or lower.

Merged pull requests:

[SDK] pod has no metadata attr anymore in the get_job_logs() … #1760 (yaobaiwei)
Fix Python installation in CI #1759 (tenzen-y)
fix infinite loop in init-pytorch container #1756 (kidddddddddddddddddddddd)
Update mpijob_controller.go #1755 (yshalabi)
Set the default value of CleanPodPolicy to None #1754 (Syulin7)
Fix the success condition of the job in PyTorchJob's Elastic mode. #1752 (Syulin7)
Update join Slack link #1750 (Syulin7)
Add validation for verifying that the CustomJob (e.g., TFJob) name meets DNS1035 #1748 (tenzen-y)
Update latest operator image #1742 (johnugeorge)
Run E2E with various Python versions to verify Python SDK #1741 (tenzen-y)
[SDK] Use Training Client without Kube Config #1740 (andreyvelich)
Add Yuki to reviewer group #1739 (johnugeorge)
Fix XGBoost conditions bug #1737 (tenzen-y)
Add E2E test for gang-scheduling #1736 (tenzen-y)
Trim down CRD descriptions #1735 (tenzen-y)
To fix scaledown error, upgrade PyTorch version to v1.13.1 in echo example #1733 (tenzen-y)
Add CI to build example images #1731 (tenzen-y)
Fix predicates of paddlepaddle-controller for scheduling.volcano.sh/v1beta1 PodGroup #1730 (tenzen-y)
Fix indents on examples for tensorflow #1726 (tenzen-y)
Adopting coscheduling plugin #1724 (tenzen-y)
docs: Update Kubernetes requirement and version matrix #1721 (terrytangyuan)
[SDK] Create Unify Training Client #1719 (andreyvelich)
chore: Update the use of MultiWorkerMirroredStrategy in TF #1715 (terrytangyuan)
Configure controller worker threads #1707 (HeGaoYuan)
Validation Spec consistency #1705 (HeGaoYuan)
Removing deprecated Job Labels #1702 (johnugeorge)
HPA support for PyTorch Elastic #1701 (johnugeorge)
fix: Mac M1 compatible Dockerfile and bump TF version #1700 (terrytangyuan)
Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator #1699 (dependabot[bot])
Fix status lost #1697 (ggaaooppeenngg)
Adding support for linux/ppc64le in github actions for training-operator #1692 (amitmukati-2604)
Add myself to reviewer. #1689 (kuizhiqing)
Upgrade the envtest version #1687 (tenzen-y)
[chore] Upgrade some actions version #1686 (tenzen-y)
Upgrade Golangci-lint #1685 (johnugeorge)
Support for k8s v1.25 in CI #1684 (johnugeorge)
Make a generic logger instead of the nil logger on dependent update #1680 (ggaaooppeenngg)
[SDK] Remove Final Keyword from constants #1676 (andreyvelich)
[PaddlePaddle] support paddlejob #1675 (kuizhiqing)
Removed GOARCH dependency for multiarch support #1674 (pranavpandit1)
Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator #1669 (dependabot[bot])
Update deployment.yaml #1668 (OmriShiv)
Upgrade kubernetes version for test #1667 (tenzen-y)
Add PodGroup as controller watch source #1666 (ggaaooppeenngg)
Upgrade Go version to v1.19 #1663 (tenzen-y)
style: Refine name and signature of 2 replicaName functions #1660 (houz42)
Create TFJob and PyTorchJob from Function APIs in the Training SDK #1659 (andreyvelich)
Update the cmd to support MPI operator in README #1656 (denkensk)
Update training operator sdk version to 1.5.0 #1651 (johnugeorge)
handle all restart policies #1649 (abin-thomas-by)
[chore] fix typo #1648 (tenzen-y)
Add finalizers to cluster-role #1646 (ArangoGutierrez)
fix: support MxNet single host training when update mxJob status #1644 (PeterChg)
fix: fix mxnet failed to update StartTime and CompletionTime #1643 (PeterChg)
Fix the default LeaderElectionID and make it an argument #1639 (goyalankit)
fix: fix wrong parameter for resolveControllerRef #1583 (fighterhit)
fix: tfjob with restartPolicy=ExitCode not work #1562 (cheimu)
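
For readers unfamiliar with the DNS-1035 rule referenced in #1748, the check can be sketched as follows. This is an illustrative Python sketch, not the operator's actual Go implementation; the helper name `is_dns1035_label` and the constant names are hypothetical. It assumes the standard DNS-1035 label rule: at most 63 characters, lowercase alphanumerics or '-', starting with a letter and ending with a letter or digit.

```python
import re

# DNS-1035 label: starts with a lowercase letter, ends with a lowercase
# letter or digit, and contains only lowercase letters, digits, and '-'.
_DNS1035_LABEL = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")
_DNS1035_MAX_LENGTH = 63


def is_dns1035_label(name: str) -> bool:
    """Return True if name is a valid DNS-1035 label (hypothetical helper)."""
    if len(name) > _DNS1035_MAX_LENGTH:
        return False
    # fullmatch requires the whole string to satisfy the pattern,
    # so the empty string and names with trailing '-' are rejected.
    return _DNS1035_LABEL.fullmatch(name) is not None
```

Under this rule a name such as `pytorch-job-1` passes, while `1-job` (leading digit) and `Job` (uppercase) do not. The operator may apply further constraints beyond this label check.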

Closed issues:

The default value for CleanPodPolicy is inconsistent. #1753
HPA support for PyTorch Elastic #1751
Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state #1745
paddle-operator can not get podgroup status(inqueue) with volcano when enable gang #1729
*job API (master) is not compatible with old job #1725
Support coscheduling plugin #1722
Number of worker threads used by the controller can't be configured #1706
Conformance: Training tests #1698
PyTorch and MPI Operator pulls hardcoded initContainer #1696
PaddlePaddle Training: why can't find pods #1694
Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 #1693
[SDK] Create unify client for all Training Job types #1691
Support Kubernetes v1.25 #1682
panic happened when add podgroup watch #1679
OnDependentUpdateFunc for Job will panic when enable volcano scheduler #1678
There is no clusterrole of "MPI Jobs" in kubeflow 1.5. #1670
Change Kubernetes version for test #1665
Support for multiplatform container image (amd64 and arm64) #1664
Training Operator pod failed to start on OCP 4.10.30 with error "memory limit too low" #1661
After setting hostNetwork to true, mpi does not work #1657
What is the purpose of /examples/pytorch/elastic/etcd.yaml #1655
When will MPIJob support v2beta1 version? #1653
Kubernetes HPA doesn't work with elastic PytorchJob #1645
training-operator can not get podgroup status(inqueue) with volcano when enable gang #1630
Training operator fails to create HPA for TorchElastic jobs #1626
Release v1.5.0 tracking #1622
upgrade client-go #1599
training-operator may need to monitor PodGroup #1574
Error: invalid memory address or nil pointer dereference #1553
The pytorchJob training is slow #1532
pytorch elastic scheduler error #1504
