kubeflow/training-operator v1.0.1-rc.0 on GitHub

v1.0.1-rc.0 (2020-12-22)

Full Changelog

Closed issues:

tf-operator panic without worker role #1192
TFJob completion with active services/endpoints resources #1191
Having trouble viewing logs using Kubernetes dashboard #1189
[feature] Support SuccessPolicy/FailurePolicy Based on % of Succeeded/Failed Workers #1188
TFJob cannot utilize GPUs in the node. #1184
[bug] With Python SDK, TFJob won't stop running #1183
[bug] [Python SDK] tfjob_client.get_logs broken #1182
How to create a python sdk for mxnet-operator #1181
[feature] python sdk should report errors in created TFJobs #1180
Could not introduce k8s.io/kube-openapi@master #1174
can tf-operator used in distribute scene, such as Multi-node #1173
Multi-worker training with Keras only use one GPU #1169
NCCL WARN Failed to open libibverbs.so[.1] #1168
tf-job-operator pod restarts #1167
swagger-codegen-cli-2.4.6.jar not found #1166
Cut release for tf-operator project #1163
Replace reconciler implementation with kubeflow/common JobController #1161
Error while replicating mnist_with_summaries #1159
Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory #1158
TFjob pods hang without explanation #1156
[Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141
evaluator should be set in TF_CONFIG when using Estimator distribute strategy #1139
Is there any case to run the different command in tfReplicaSpecs? #1138
should gpu resource be released when tfjob failed because of image pull problem? #1136
tf-job-operator CrashLoopBackOff #1135
How to change the log level of tf-job-operator #1132
Support getting the training process via Python SDK #1129
Podgroup is not created automatically. #1121
TFConfig should be demonstrated more specifically. #1115
[chore] Remove tfjob dashboard #1113
read TF_CONFIG env from configMap #1112
Long job names result in jobs stuck forever #1101
[Question] can't the base image "registry.access.redhat.com/ubi8/ubi:latest" in Dockerfile be replaced with "debian:buster" ? #1099
can i install tf-operator alone without kubeflow? #1096
c #1095
TFJob test is failing on master and v0.7 branch for kubeflow/kubeflow #1094
TFJob tests should use pytest #1093
Multiple Evaluator replicas gives InvalidTFJobSpec #1091
Java client for current version of TFjob #1090
[enhancement] Replace common with kubeflow/common #1087
Lack of documents for deployment #1086
Performance problem about pod informer #1079
[bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS #1078
Separate cluster scoped and namespace scoped resources #1077
TFJob 1.0 #1076
[bug] Keep tf-job-role as deprecated label in this version #1068
GenLabels may select wrong Pods #1066
Can I create a tf-operator pod without using GO? #1065
tf-job-dashboard cannot work #1060
[discussion] Should We Add CleanPodPolicy PS? #1059
Refactor dockerfile #1058
remove v1beta1 in v0.5.3 cause incompatible issue when using go mod #1057
Invalid value: "v1beta1": must appear in spec.versions #1056
Example on EKS: Device or resource busy #1053
can we add PriorityClassName when we create TF-job Podgroup? #1048
TFjob still running while chief pod is completed #1045
Is there any document for how to run TFJob in AllReduce Strategy #1039
tf-operator version conflicts #1035
Add E2E test for gang-scheduling #1033
gang schedule annotation #1031
[feature] Can we use one headless service for one job? #1030
Will tf-operator upgrading k8s to 1.13? #1029
no error log for create tfjob fail #1026
Creating tfjob in dashboard usability issues #1024
Deleting tf-job through the dashboard is not working #1019
Create common CRD validate and mutating webhook for all operator #1016
error with kubeflow installation #996
Shall we consider upgrading k8s to 1.11.3 #985
TFJob Dashboard is not support pvc #980
ERROR handle object: patching object from cluster: merging object with existing state: unable to recognize "/var/folders/tl/zzfcr4zs53vgnpqqjq4n08sh0000gn/T/ksonnet-mergepatch020443124": no matches for kind "TFJob" in version "kubeflow.org/v1beta1" #976
Create CRD conversion webhook #967
Performance issue when there is a lot of completed jobs #965
Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob #964
Proposal for a Common Operator #960
Delete pod with unknown status in reconcilePods #956
Create distributed training example for TF 2.0 #953
Consider using KubeBuilder to reduce boilerplate code #925
e2e test for dashboard/backend/handler/api_handler.go #921
Use pod group instead of PDB for gang scheduling #916
shareProcessNamespace not working with TFJob #902
[feasibility-research] Handle machine failure #900
Should limit the size of logs of tf_operator container #888
Log message severity isn't properly reported in stackdriver #864
E2E test for invalid spec errors #810
[v1alpha2] Delete resources according to cleanuppolicy exactly once #804
refactor the code of TFJobController for unittest #757
e2e test for cleanupTFJob #756
[build] Replace Python with Make or Bazel #739
Export TF/Tensorboard/TF Summaries to prometheus #722
[discussion] Maintain Helm Chart #716
[discussion] Capacity planning #708
[v1alpha2] Generate CRD validation in Kubernetes 1.11 #622
Set labels and annotations for svc created by tf_operator #609
mnist test isn't part of CI #597
[v1alpha2] Push the example docker image to google or dockerhub registry #590
feat: use fake client-set and informer add controller unittest. #540
Run submit_release_job.sh in CI #519
Add environment name in ControllerConfig #450
[dashboard] How to handle storage? #449
[dashboard] GPU limits are not taken into account #448
[dashboard] Ability to create a TensorBoard instance #447
[examples] Add termination policy in examples/tf_job.yaml #438
add boilerplate header #430
[logging] Extra flag problem #427
[CI] Add hack/verify-codegen.sh in Travis CI #426
E2E workflows should ignore failures #423
[enhancement] Add OWNERS in subdirectories #415
[enhancement] Fix the warnings reported by goreportcard.com #394
[discussion] Separate the operator and UI dashboard #389
[enhancement] Separate release image and test image #385
[enhancement][CI] Replace Travis CI with Prow #382
use Python3 for all python code? #377
What to do about example TFJob YAML specs? #375
E2E test for non-default namespace #170
OpenAPI Client Generation for Java, Python #167
Prevent scheduling deadlocks #165
TfDebugger support #132
Refactor code in py into a proper python package #114
Update instructions and code to work with Kubernetes 1.8 #108
Build sample container as part of release process #81
Run lint (Python, Go) as a presubmit test #53
Optimize scheduling of TF Processes #35
E2E test that verifies invalid jobs are failed #30
E2E test(s) to verify that permanent and retryable errors are handled correctly. #29

Merged pull requests:

chore: Remove the kanban update workflow #1201 (gaocegege)
chore: Refactor cmd #1199 (gaocegege)
bugfix for multi_worker_strategy-with-keras.py #1198 (jiaqianjing)
feat: Add CD using GitHub Actions #1196 (gaocegege)
b/168938304 - Inclusive Language Fix-It, repo has non-inclusive language #1190 (sculd)
Fix error when conditions is empty. #1185 (Corea)
Fix setup cluster issue and Pylint issue in CI tests #1179 (jinchihe)
Fix the typo #1178 (pingsutw)
chore: Update OWNERS #1177 (gaocegege)
Update developer_guide.md #1176 (pingsutw)
Update swagger-codegen-cli URL #1172 (jinchihe)
Migrate controller implementation to kubeflow/common fashion #1171 (ChanYiLin)
Support success policy for TFJob #1165 (terrytangyuan)
add distributed training example of using TF 2.1 Strategy API #1164 (jazzsir)
Make tf_operator use static compilation in container #1160 (MrXinWang)
Update tf_job_client.py remove unused variable. #1157 (NikeNano)
Update e2e_testing.md #1155 (NikeNano)
Fix the link to run_e2e_workflow.py script #1154 (terrytangyuan)
Set completion time when job exceed specified deadline. #1150 (SimonCqk)
Support ClusterSpec Propagation Feature in TF 1.14 #1149 (zhujl1991)
Disable istio sidecar injection in simple tfjob test #1148 (Bobgy)
OWNERS: Add ChanYiLin as approver #1147 (ChanYiLin)
Fix evaluator runconfig #1146 (richardsliu)
Remove unused function arg #1145 (zhujl1991)
Use go mod #1144 (xychu)
Fix sdk test issue that's caused by Kubernetes Client bug. #1143 (jinchihe)
docs: Add roadmap #1140 (gaocegege)
fix comment, add +optional flag to comment. #1137 (EDGsheryl)
simple_tfjob_tests py3 version #1134 (gabrielwen)
add tf-operator test in py3 #1133 (gabrielwen)
SDK support getting the TFJob training logs #1130 (jinchihe)
Copy third party vendor source code to Docker image #1128 (richardsliu)
Add third party licenses #1127 (richardsliu)
Distroless image for TF operator #1124 (krishnadurai)
Add watch function for TFJob python Client API #1122 (jinchihe)
fix(controller): calculate satisfied with && instead of || #1120 (GuoHaiqing)
remove tfjob dashboard #1119 (ChanYiLin)
fix(ConvertTFJobToUnstructured): ConvertTFJobToUnstructured uses function ToUnstructured to convert TFJob to Unstructured #1118 (leileiwan)
Update checking status API name #1117 (jinchihe)
Add more APIs for TFJob done #1116 (jinchihe)
Enhance tfjobs sdk docs #1114 (jinchihe)
fix the reconcile flow #1111 (ChanYiLin)
Generate TFJob Python SDK #1103 (jinchihe)
feat: Support pprof when monitoring is specified #1102 (gaocegege)
Add support for aarch64 #1098 (MrXinWang)
feat: Add adopters in README #1092 (gaocegege)
feat: Use kubeflow/common #1088 (gaocegege)
add ppc64le support for the example dist-mnist #1084 (alongzhi)
add the dockerfile for ppc64le #1083 (alongzhi)
Support for ppc64le #1082 (zoyun)
feat: Replace gometalinter with golangci-lint #1081 (gaocegege)
feat: Do not set TF_CONFIG for local training #1080 (gaocegege)
Delete v1beta2 api #1075 (johnugeorge)
Updating issue bot configs #1074 (rbrishabh)
Fix example Mnist With Summaries #1073 (andreyvelich)
use multi-stage build to build tf-operator image #1072 (hmtai)
Set tfjob defaults in test utils #1071 (ohmystack)
Add verify-codegen in travis CI #1070 (ohmystack)
Update codegen #1069 (ohmystack)
Add controller-name label for Pod and service #1067 (hougangliu)
Renaming labels to common types #1064 (johnugeorge)
Add qps and burst options #1063 (ScorpioCPH)
rewrite dockerfile #1062 (hmtai)
add total suffix in counter metrics #1055 (yeya24)
Update k8s libraries to 1.12.3 #1054 (johnugeorge)
add ldflag version #1052 (yeya24)
Avoid unnecessary update when tfjob is complete #1051 (cheyang)
feat(pod): Support custom gang scheduler via CLI argument #1050 (gaocegege)
add flag kubeconfig #1049 (yeya24)
Easily detect the GOPATH in current development environment. #1047 (xauthulei)
fix bug: When executing tf-operator.v1 -version, GitSHA is always 'not provided' #1046 (asdfsx)
fix(UI): show correct namespace and name when deleting job through dashboard #1044 (gbin10533)
Set worker 0 completed if pod's phase goto succeeded #1042 (ScorpioCPH)
update release script #1040 (kunmingg)
fix(docs): Fix link for simple_TFJob_test #1038 (gaocegege)
Minor fix to add CoreV1 to scheme #1037 (johnugeorge)
Removing unnecessary Rbac authorization #1036 (johnugeorge)
refactor: add GenPodGroupName method to extract podGroupName in diffe… #1034 (zlcnju)
Update gang scheduler name #1028 (goodluckbot)