kubeflow/training-operator v1.4.0 on GitHub

Merged pull requests:

extends path in __init__.py for SDK correctly #1531 (cakeislife100)
Update manifests with latest image tag #1527 (johnugeorge)
add option for mpi kubectl delivery #1525 (zw0610)
restore option namespace in launch arguments #1524 (zw0610)
remove unused scripts #1521 (zw0610)
remove ChanYiLin from approvers #1513 (ChanYiLin)
add StacktraceLevel for zapr #1512 (qiankunli)
add unit tests for tensorflow controller #1511 (zw0610)
add the example of MPIJob #1508 (hackerboy01)
Added 2022 roadmap and migrated previous roadmap from kubeflow/common #1500 (terrytangyuan)
Fix a typo in mpi controller log #1495 (LuBingtan)
feat(pytorch): Add init container config to avoid DNS lookup failure #1493 (gaocegege)
chore: Fix GitHub Actions script #1491 (tenzen-y)
chore: Fix missspell in tfjob #1490 (tenzen-y)
chore: Update OWNERS #1489 (gaocegege)
Bump jinja2 from 2.10.1 to 2.11.3 in /py/kubeflow/tf_operator #1487 (dependabot[bot])
fix comments for mpi-controller #1485 (hackerboy01)
add expectation-related functions for other resources used in mpi-controller #1484 (zw0610)
Add MPI job to README now that it's supported #1480 (terrytangyuan)
add mpi doc #1477 (zw0610)
Set Go version of base image to 1.17 #1476 (tenzen-y)
update label for tf-controller #1474 (zw0610)
Add Akuity to the list of adopters #1473 (terrytangyuan)
Add PR template with doc checklist #1470 (andreyvelich)
Add e2e failure debugging guidance #1469 (Jeffwan)
chore: Add .gitattributes to ignore Jsonnet test code for linguist #1463 (terrytangyuan)
Migrate additional examples from xgboost-operator #1461 (terrytangyuan)
Minor edits to README.md #1460 (terrytangyuan)
add mpi-operator(v1) to the unified operator #1457 (hackerboy01)
fix tfjob status when enableDynamicWorker set true #1455 (zw0610)
feat(pytorch): Support elastic training #1453 (gaocegege)
fix: generate printer columns for job crds #1451 (henrysecond1)
Fix README typo #1450 (davidxia)
consistent naming for better readability #1449 (pramodrj07)
Fix set scheduler error #1448 (qiankunli)
Add CI to run the tests for Go #1440 (tenzen-y)
fix: Add missing retrying package that failed the import #1439 (terrytangyuan)
Generate a single swagger.json file for all frameworks #1437 (alembiewski)
Update links and files with the new URL #1434 (andreyvelich)
chore: update CHANGELOG.md #1432 (Jeffwan)
Add acknowledgement section in README to credit all contributors #1422 (terrytangyuan)
Add Cisco to Adopters List #1421 (andreyvelich)
Add Python SDK for Kubeflow Training Operator #1420 (alembiewski)
docs: Move myself to approvers #1419 (terrytangyuan)
fix hyperlinks in the 'overview' section #1418 (pramodrj07)
docs: Migrate adopters of all operators to this repo #1417 (terrytangyuan)
Feature/support pytorchjob set queue of volcano #1415 (qiankunli)
Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta #1409 (Jeffwan)
Update scripts to generate sdk for all frameworks #1389 (Jeffwan)

Closed issues:

Question: What is the recommended way for Data Scientists to run a distributed training job #1535
Restore KUBEFLOW_NAMESPACE options #1522
Improve test coverage #1497
swagger.json missing Pytorchjob.Spec.ElasticPolicy #1483
[bug] Missing init container in PyTorchJob #1482
PytorchJob DDP training will stop if I delete a worker pod #1478
Write down e2e failure debug process #1467
How can i add the Priorityclass to the TFjob？ #1466
github.com/go-logr/zapr.(*zapLogger).Error #1444
Display coverage % in GitHub actions list #1442
Add Go test to CI #1436
Podgroup is constantly created and deleted after tfjob is success or failure #1426
Cut official release of 1.3.0 #1425
Add "not maintained" notice to other operator repos #1423
Fail to install tf-operator in minikube because of the version of kubectl/kustomize #1381
Python SDK for Kubeflow Training Operator #1380
Rename this repo #1348
Universal Operator Phase III: Graduate operator to production grade #1318