This is a large official release following v0.5.3. Please send us your feedback. Thanks to all contributors.
Features
- feat: Remove k8s.io/kubernetes (#1235, @gaocegege)
- Migrate to public ECR (#1256, @PatrickXYS)
- feat: Add API Documentation WIP (#1249, @gaocegege)
- feat: Update developers guide and readme (#1244, @gaocegege)
- Move TF Operator e2e tests to AWS Prow (#1204, @ChanYiLin)
- crd definition support multiple evaluator (#1240, @oikomi)
- support multiple evaluators (#1239, @oikomi)
- feat: Change the message for running condition (#1230, @gaocegege)
- feat(server): Use apiextension client to check if crd exists (#1228, @gaocegege)
- checkCRDExists func return true when k8s cluster is not connected (#1207, @oikomi)
- feat: Add CD using GitHub Actions (#1196, @gaocegege)
- Migrate controller implementation to kubeflow/common fashion (#1171, @ChanYiLin)
- Support success policy for TFJob (#1165, @terrytangyuan)
- add distributed training example of using TF 2.1 Strategy API (#1164, @jazzsir)
- Set completion time when job exceeds specified deadline. (#1150, @SimonCqk)
- Support ClusterSpec Propagation Feature in TF 1.14 (#1149, @zhujl1991)
- Add watch function for TFJob python Client API (#1122, @jinchihe)
- Enhance tfjobs sdk docs (#1114, @jinchihe)
- Generate TFJob Python SDK (#1103, @jinchihe)
- feat: Support pprof when monitoring is specified (#1102, @gaocegege)
- feat: Use kubeflow/common (#1088, @gaocegege)
- Add support for aarch64 (#1098, @MrXinWang)
- feat: Do not set TF_CONFIG for local training (#1080, @gaocegege)
- feat: Replace gometalinter with golangci-lint (#1081, @gaocegege)
- Add controller-name label for Pod and service (#1067, @hougangliu)
- Add qps and burst options (#1063, @ScorpioCPH)
- Avoid unnecessary update when tfjob is complete (#1051, @cheyang)
- set annotation automatically when EnableGangScheduling is set to true (#1032, @ChanYiLin)
- feat(pod): Support custom gang scheduler via CLI argument (#1050, @gaocegege)
Bug fixes
- Fix kubeflow overlay (#1260, @PatrickXYS)
- fix: Do not validate evaluator (#1238, @gaocegege)
- fix: Remove default resync period (#1237, @gaocegege)
- fix: Observe the creation when failed to create the pod (#1236, @gaocegege)
- fix: Remove vendor cp command (#1232, @gaocegege)
- Fix completion time setting bug (#1226, @shaowei-su)
- feat(deploy): Add standalone deployment yaml (#1218, @gaocegege)
- Fix updateStatus no worker Crashoff (#1215, @kuikuikuizzZ)
- fix: Fix the log message (#1203, @gaocegege)
- Fix the typo (#1178, @pingsutw)
- Fix setup cluster issue and Pylint issue in CI tests (#1179, @jinchihe)
- Fix the link to run_e2e_workflow.py script (#1154, @terrytangyuan)
- Fix evaluator runconfig (#1146, @richardsliu)
- Fix sdk test issue that's caused by Kubernetes client bug. (#1143, @jinchihe)
- fix(controller): calculate satisfied with && instead of || (#1120, @GuoHaiqing)
- fix comment, add +optional flag to comment. (#1137, @EDGsheryl)
- fix(ConvertTFJobToUnstructured): ConvertTFJobToUnstructured uses function ToUnstructured to convert TFJob to Unstructured (#1118, @leileiwan)
- fix the reconcile flow (#1111, @ChanYiLin)
- Fix example Mnist With Summaries (#1073, @andreyvelich)
- fix bug: When executing `tf-operator.v1 -version`, GitSHA is always 'not provided' (#1046, @asdfsx)
- fix(UI): show correct namespace and name when deleting job through dashboard (#1044, @gbin10533)
- Minor fix to add CoreV1 to scheme (#1037, @johnugeorge)
- fix(docs): Fix link for simple_TFJob_test (#1038, @gaocegege)
- fix: Remove dup code (#1022, @gaocegege)
Chores
- tf-operator: Consolidate manifests (#1255, @yanniszark)
- TFJob Operator: Move manifests development upstream (#1247, @yanniszark)
- Update vendor as kubeflow/common is updated. (#1252, @jiangkaihua)
- docs: Add Ant Group to ADOPTERS.md (#1243, @terrytangyuan)
- chore: Add tencent cloud (#1234, @gaocegege)
- add vip (#1233, @oikomi)
- chore: Update changelog (#1227, @gaocegege)
- Update kubeflow common to 0.3.2 (#1225, @shaowei-su)
- chore: Remove useless expectation (#1217, @gaocegege)
- chore: Update codegen (#1211, @gaocegege)
- add Evaluator type for CRD example (#1209, @oikomi)
- add err log for create client set failed and code minor optimization (#1210, @oikomi)
- chore: Remove the kanban update workflow (#1201, @gaocegege)
- chore: Refactor cmd (#1199, @gaocegege)
- bugfix for multi_worker_strategy-with-keras.py (#1198, @jiaqianjing)
- Fix error when `conditions` is empty. (#1185, @Corea)
- b/168938304 - Inclusive Language Fix-It, repo has non-inclusive language (#1190, @sculd)
- chore: Update OWNERS (#1177, @gaocegege)
- Update developer_guide.md (#1176, @pingsutw)
- Update swagger-codegen-cli URL (#1172, @jinchihe)
- Use go mod (#1144, @xychu)
- Make tf_operator use static compilation in container (#1160, @MrXinWang)
- Update tf_job_client.py remove unused variable. (#1157, @NikeNano)
- Update e2e_testing.md (#1155, @NikeNano)
- Disable istio sidecar injection in simple tfjob test (#1148, @Bobgy)
- OWNERS: Add ChanYiLin as approver (#1147, @ChanYiLin)
- Remove unused function arg (#1145, @zhujl1991)
- docs: Add roadmap (#1140, @gaocegege)
- simple_tfjob_tests py3 version (#1134, @gabrielwen)
- add tf-operator test in py3 (#1133, @gabrielwen)
- Distroless image for TF operator (#1124, @krishnadurai)
- SDK support getting the TFJob training logs (#1130, @jinchihe)
- Copy third party vendor source code to Docker image (#1128, @richardsliu)
- Add third party licenses (#1127, @richardsliu)
- remove tfjob dashboard (#1119, @ChanYiLin)
- Update checking status API name (#1117, @jinchihe)
- Add more APIs for TFJob done (#1116, @jinchihe)
- feat: Add adopters in README (#1092, @gaocegege)
- Support for ppc64le (#1082, @zoyun)
- use multi-stage build to build tf-operator image (#1072, @hmtai)
- add ppc64le support for the example dist-mnist (#1084, @alongzhi)
- add the dockerfile for ppc64le (#1083, @alongzhi)
- Updating issue bot configs (#1074, @rbrishabh)
- Delete v1beta2 api (#1075, @johnugeorge)
- add ldflag version (#1052, @yeya24)
- Add verify-codegen in travis CI (#1070, @ohmystack)
- Set tfjob defaults in test utils (#1071, @ohmystack)
- Update codegen (#1069, @ohmystack)
- rewrite dockerfile (#1062, @hmtai)
- Renaming labels to common types (#1064, @johnugeorge)
- add total suffix in counter metrics (#1055, @yeya24)
- Update k8s libraries to 1.12.3 (#1054, @johnugeorge)
- add flag kubeconfig (#1049, @yeya24)
- Easily detect the GOPATH in current development environment. (#1047, @xauthulei)
- Update gang scheduler name (#1028, @goodluckbot)
- Set worker 0 completed if pod's phase goto succeeded (#1042, @ScorpioCPH)
- Removing unnecessary Rbac authorization (#1036, @johnugeorge)
- refactor: add GenPodGroupName method to extract podGroupName in diffe… (#1034, @zlcnju)
- update release script (#1040, @kunmingg)
- Update image base to UBI8 GA (#1023, @pdmack)