v1.0.1-rc.0 (2020-12-22)
Closed issues:
- tf-operator panic without worker role #1192
- TFJob completion with active services/endpoints resources #1191
- Having trouble viewing logs using Kubernetes dashboard #1189
- [feature] Support SuccessPolicy/FailurePolicy Based on % of Succeeded/Failed Workers #1188
- TFJob cannot utilize GPUs in the node. #1184
- [bug] With Python SDK, TFJob won't stop running #1183
- [bug] [Python SDK] tfjob_client.get_logs broken #1182
- How to create a python sdk for mxnet-operator #1181
- [feature] python sdk should report errors in created TFJobs #1180
- Could not introduce k8s.io/kube-openapi@master #1174
- can tf-operator used in distribute scene, such as Multi-node #1173
- Multi-worker training with Keras only use one GPU #1169
- NCCL WARN Failed to open libibverbs.so[.1] #1168
- tf-job-operator pod restarts #1167
- swagger-codegen-cli-2.4.6.jar not found #1166
- Cut release for tf-operator project #1163
- Replace reconciler implementation with kubeflow/common JobController #1161
- Error while replicating mnist_with_summaries #1159
- Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory #1158
- TFjob pods hang without explanation #1156
- [Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141
- evaluator should be set in TF_CONFIG when using Estimator distribute strategy #1139
- Is there any case to run the different command in tfReplicaSpecs? #1138
- should gpu resource be released when tfjob failed because of image pull problem? #1136
- tf-job-operator CrashLoopBackOff #1135
- How to change the log level of tf-job-operator #1132
- Support getting the training process via Python SDK #1129
- Popgroup is not created automatically. #1121
- TFConfig should be demonstrated more specifically. #1115
- [chore] Remove tfjob dashboard #1113
- read TF_CONFIG env from configMap #1112
- Long job names result in jobs stuck forever #1101
- [Question] can't the base image "registry.access.redhat.com/ubi8/ubi:latest" in Dockerfile be replaced with "debian:buster" ? #1099
- can i install tf-operator alone without kubeflow? #1096
- c #1095
- TFJob test is failing on master and v0.7 branch for kubeflow/kubeflow #1094
- TFJob tests should use pytest #1093
- Multiple Evaluator replicas gives InvalidTFJobSpec #1091
- Java client for current version of TFjob #1090
- [enhancement] Replace common with kubeflow/common #1087
- Lack of documents for deployment #1086
- Performance problem about pod informer #1079
- [bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS #1078
- Separate cluster scoped and namespace scoped resources #1077
- TFJob 1.0 #1076
- [bug] Keep tf-job-role as deprecated label in this version #1068
- GenLabels may select wrong Pods #1066
- Can I create a tf-operator pod without using GO? #1065
- tf-job-dashboard cannot work #1060
- [discussion] Should We Add CleanPodPolicy PS? #1059
- Refactor dockerfile #1058
- remove v1beta1 in v0.5.3 cause incompatible issue when using go mod #1057
- Invalid value: "v1beta1": must appear in spec.versions #1056
- Example on EKS: Device or resource busy #1053
- can we add PriorityClassName when we create TF-job Podgroup? #1048
- TFjob still running while chief pod is completed #1045
- Is there any document for how to run TFJob in AllReduce Strategy #1039
- tf-operator version conflicts #1035
- Add E2E test for gang-scheduling #1033
- gang schedule annotation #1031
- [feature] Can we use one headless service for one job? #1030
- Will tf-operator upgrading k8s to 1.13? #1029
- no error log for create tfjob fail #1026
- Creating tfjob in dashboard usability issues #1024
- Deleting tf-job through the dashboard is not working #1019
- Create common CRD validate and mutating webhook for all operator #1016
- error with kubeflow installation #996
- Shall we consider upgrading k8s to 1.11.3 #985
- TFJob Dashboard is not support pvc #980
- ERROR handle object: patching object from cluster: merging object with existing state: unable to recognize "/var/folders/tl/zzfcr4zs53vgnpqqjq4n08sh0000gn/T/ksonnet-mergepatch020443124": no matches for kind "TFJob" in version "kubeflow.org/v1beta1" #976
- Create CRD conversion webhook #967
- Performance issue when there is a lot of completed jobs #965
- Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob #964
- Proposal for a Common Operator #960
- Delete pod with unknown status in reconcilePods #956
- Create distributed training example for TF 2.0 #953
- Consider using KubeBuilder to reduce boilerplate code #925
- e2e test for dashboard/backend/handler/api_handler.go #921
- Use pod group instead of PDB for gang scheduling #916
- shareProcessNamespace not working with TFJob #902
- [feasibility-research] Handle machine failure #900
- Should limit the size of logs of tf_operator container #888
- Log message severity isn't properly reported in stackdriver #864
- E2E test for invalid spec errors #810
- [v1alpha2] Delete resources according to cleanuppolicy exactly once #804
- refactor the code of TFJobController for unittest #757
- e2e test for cleanupTFJob #756
- [build] Replace Python with Make or Bazel #739
- Export TF/Tensorboard/TF Summaries to prometheus #722
- [discussion] Maintain Helm Chart #716
- [discussion] Capacity planning #708
- [v1alpha2] Generate CRD validation in Kubernetes 1.11 #622
- Set labels and annotations for svc created by tf_operator #609
- mnist test isn't part of CI #597
- [v1alpha2] Push the example docker image to google or dockerhub registry #590
- feat: use fake client-set and informer add controller unittest. #540
- Run submit_release_job.sh in CI #519
- Add environment name in ControllerConfig #450
- [dashboard] How to handle storage? #449
- [dashboard] GPU limits are not taken into account #448
- [dashboard] Ability to create a TensorBoard instance #447
- [examples] Add termination policy in examples/tf_job.yaml #438
- add boilerplate header #430
- [logging] Extra flag problem #427
- [CI] Add hack/verify-codegen.sh in Travis CI #426
- E2E workflows should ignore failures #423
- [enhancement] Add OWNERS in subdirectories #415
- [enhancement] Fix the warnings reported by goreportcard.com #394
- [discussion] Separate the operator and UI dashboard #389
- [enhancement] Separate release image and test image #385
- [enhancement][CI] Replace Travis CI with Prow #382
- use Python3 for all python code? #377
- What to do about example TFJob YAML specs? #375
- E2E test for non-default namespace #170
- OpenAPI Client Generation for Java, Python #167
- Prevent scheduling deadlocks #165
- TfDebugger support #132
- Refactor code in py into a proper python package #114
- Update instructions and code to work with Kubernetes 1.8 #108
- Build sample container as part of release process #81
- Run lint (Python, Go) as a presubmit test #53
- Optimize scheduling of TF Processes #35
- E2E test that verifies invalid jobs are failed #30
- E2E test(s) to verify that permanent and retryable errors are handled correctly. #29
Merged pull requests:
- chore: Remove the kanban update workflow #1201 (gaocegege)
- chore: Refactor cmd #1199 (gaocegege)
- bugfix for multi_worker_strategy-with-keras.py #1198 (jiaqianjing)
- feat: Add CD using GitHub Actions #1196 (gaocegege)
- b/168938304 - Inclusive Language Fix-It, repo has non-inclusive language #1190 (sculd)
- Fix error when conditions is empty. #1185 (Corea)
- Fix setup cluster issue and Pylint issue in CI tests #1179 (jinchihe)
- Fix the typo #1178 (pingsutw)
- chore: Update OWNERS #1177 (gaocegege)
- Update developer_guide.md #1176 (pingsutw)
- Update swagger-codegen-cli URL #1172 (jinchihe)
- Migrate controller implementation to kubeflow/common fashion #1171 (ChanYiLin)
- Support success policy for TFJob #1165 (terrytangyuan)
- add distributed training example of using TF 2.1 Strategy API #1164 (jazzsir)
- Make tf_operator use static compilation in container #1160 (MrXinWang)
- Update tf_job_client.py remove unused variable. #1157 (NikeNano)
- Update e2e_testing.md #1155 (NikeNano)
- Fix the link to run_e2e_workflow.py script #1154 (terrytangyuan)
- Set completion time when job exceed specified deadline. #1150 (SimonCqk)
- Support ClusterSpec Propagation Feature in TF 1.14 #1149 (zhujl1991)
- Disable istio sidecar injection in simple tfjob test #1148 (Bobgy)
- OWNERS: Add ChanYiLin as approver #1147 (ChanYiLin)
- Fix evaluator runconfig #1146 (richardsliu)
- Remove unused function arg #1145 (zhujl1991)
- Use go mod #1144 (xychu)
- Fix sdk test issue that's caused by Kubernetes Client bug. #1143 (jinchihe)
- docs: Add roadmap #1140 (gaocegege)
- fix comment, add +optional flag to comment. #1137 (EDGsheryl)
- simple_tfjob_tests py3 version #1134 (gabrielwen)
- add tf-operator test in py3 #1133 (gabrielwen)
- SDK support getting the TFJob training logs #1130 (jinchihe)
- Copy third party vendor source code to Docker image #1128 (richardsliu)
- Add third party licenses #1127 (richardsliu)
- Distroless image for TF operator #1124 (krishnadurai)
- Add watch function for TFJob python Client API #1122 (jinchihe)
- fix(controller): calculate satisfied with && instead of || #1120 (GuoHaiqing)
- remove tfjob dashboard #1119 (ChanYiLin)
- fix(ConvertTFJobToUnstructured): ConvertTFJobToUnstructured uses function ToUnstructured to convert TFJob to Unstructured #1118 (leileiwan)
- Update checking status API name #1117 (jinchihe)
- Add more APIs for TFJob done #1116 (jinchihe)
- Enhance tfjobs sdk docs #1114 (jinchihe)
- fix the reconcile flow #1111 (ChanYiLin)
- Generate TFJob Python SDK #1103 (jinchihe)
- feat: Support pprof when monitoring is specified #1102 (gaocegege)
- Add support for aarch64 #1098 (MrXinWang)
- feat: Add adopters in README #1092 (gaocegege)
- feat: Use kubeflow/common #1088 (gaocegege)
- add ppc64le support for the example dist-mnist #1084 (alongzhi)
- add the dockerfile for ppc64le #1083 (alongzhi)
- Support for ppc64le #1082 (zoyun)
- feat: Replace gometalinter with golangci-lint #1081 (gaocegege)
- feat: Do not set TF_CONFIG for local training #1080 (gaocegege)
- Delete v1beta2 api #1075 (johnugeorge)
- Updating issue bot configs #1074 (rbrishabh)
- Fix example Mnist With Summaries #1073 (andreyvelich)
- use multi-stage build to build tf-operator image #1072 (hmtai)
- Set tfjob defaults in test utils #1071 (ohmystack)
- Add verify-codegen in travis CI #1070 (ohmystack)
- Update codegen #1069 (ohmystack)
- Add controller-name label for Pod and service #1067 (hougangliu)
- Renaming labels to common types #1064 (johnugeorge)
- Add qps and burst options #1063 (ScorpioCPH)
- rewrite dockerfile #1062 (hmtai)
- add total suffix in counter metrics #1055 (yeya24)
- Update k8s libraries to 1.12.3 #1054 (johnugeorge)
- add ldflag version #1052 (yeya24)
- Avoid unnecessary update when tfjob is complete #1051 (cheyang)
- feat(pod): Support custom gang scheduler via CLI argument #1050 (gaocegege)
- add flag kubeconfig #1049 (yeya24)
- Easily detect the GOPATH in current development environment. #1047 (xauthulei)
- fix bug: When executing tf-operator.v1 -version, GitSHA is always 'not provided' #1046 (asdfsx)
- fix(UI): show correct namespace and name when deleting job through dashboard #1044 (gbin10533)
- Set worker 0 completed if pod's phase goto succeeded #1042 (ScorpioCPH)
- update release script #1040 (kunmingg)
- fix(docs): Fix link for simple_TFJob_test #1038 (gaocegege)
- Minor fix to add CoreV1 to scheme #1037 (johnugeorge)
- Removing unnecessary Rbac authorization #1036 (johnugeorge)
- refactor: add GenPodGroupName method to extract podGroupName in diffe… #1034 (zlcnju)
- Update gang scheduler name #1028 (goodluckbot)