github kubeflow/training-operator v1.0.1-rc.0

pre-release, 3 years ago

v1.0.1-rc.0 (2020-12-22)

Full Changelog

Closed issues:

  • tf-operator panic without worker role #1192
  • TFJob completion with active services/endpoints resources #1191
  • Having trouble viewing logs using Kubernetes dashboard #1189
  • [feature] Support SuccessPolicy/FailurePolicy Based on % of Succeeded/Failed Workers #1188
  • TFJob cannot utilize GPUs in the node. #1184
  • [bug] With Python SDK, TFJob won't stop running #1183
  • [bug] [Python SDK] tfjob_client.get_logs broken #1182
  • How to create a python sdk for mxnet-operator #1181
  • [feature] python sdk should report errors in created TFJobs #1180
  • Could not introduce k8s.io/kube-openapi@master #1174
  • can tf-operator used in distribute scene, such as Multi-node #1173
  • Multi-worker training with Keras only use one GPU #1169
  • NCCL WARN Failed to open libibverbs.so[.1] #1168
  • tf-job-operator pod restarts #1167
  • swagger-codegen-cli-2.4.6.jar not found #1166
  • Cut release for tf-operator project #1163
  • Replace reconciler implementation with kubeflow/common JobController #1161
  • Error while replicating mnist_with_summaries #1159
  • Non-OK-status: GpuLaunchKernel(FillPhiloxRandomKernelLaunch<Distribution>, num_blocks, block_size, 0, d.stream(), gen, data, size, dist) status: Internal: out of memory #1158
  • TFjob pods hang without explanation #1156
  • [Proposal] Support ClusterSpec Propagation Feature in TF 1.14 #1141
  • evaluator should be set in TF_CONFIG when using Estimator distribute strategy #1139
  • Is there any case to run the different command in tfReplicaSpecs? #1138
  • should gpu resource be released when tfjob failed because of image pull problem? #1136
  • tf-job-operator CrashLoopBackOff #1135
  • How to change the log level of tf-job-operator #1132
  • Support getting the training process via Python SDK #1129
  • Popgroup is not created automatically. #1121
  • TFConfig should be demonstrated more specifically. #1115
  • [chore] Remove tfjob dashboard #1113
  • read TF_CONFIG env from configMap #1112
  • Long job names result in jobs stuck forever #1101
  • [Question] can't the base image "registry.access.redhat.com/ubi8/ubi:latest" in Dockerfile be replaced with "debian:buster" ? #1099
  • can i install tf-operator alone without kubeflow? #1096
  • c #1095
  • TFJob test is failing on master and v0.7 branch for kubeflow/kubeflow #1094
  • TFJob tests should use pytest #1093
  • Multiple Evaluator replicas gives InvalidTFJobSpec #1091
  • Java client for current version of TFjob #1090
  • [enhancement] Replace common with kubeflow/common #1087
  • Lack of documents for deployment #1086
  • Performance problem about pod informer #1079
  • [bug] Cannot initialize the training job with TF Estimator when the user uses 1 worker and 0 PS #1078
  • Separate cluster scoped and namespace scoped resources #1077
  • TFJob 1.0 #1076
  • [bug] Keep tf-job-role as deprecated label in this version #1068
  • GenLabels may select wrong Pods #1066
  • Can I create a tf-operator pod without using GO? #1065
  • tf-job-dashboard cannot work #1060
  • [discussion] Should We Add CleanPodPolicy PS? #1059
  • Refactor dockerfile #1058
  • remove v1beta1 in v0.5.3 cause incompatible issue when using go mod #1057
  • Invalid value: "v1beta1": must appear in spec.versions #1056
  • Example on EKS: Device or resource busy #1053
  • can we add PriorityClassName when we create TF-job Podgroup? #1048
  • TFjob still running while chief pod is completed #1045
  • Is there any document for how to run TFJob in AllReduce Strategy #1039
  • tf-operator version conficts #1035
  • Add E2E test for gang-scheduling #1033
  • gang schedule annotation #1031
  • [feature] Can we use one headless service for one job? #1030
  • Will tf-operator upgrading k8s to 1.13? #1029
  • no error log for create tfjob fail #1026
  • Creating tfjob in dashboard usability issues #1024
  • Deleting tf-job through the dashboard is not working #1019
  • Create common CRD validate and mutating webhook for all operator #1016
  • error with kubeflow instalation #996
  • Shall we consider upgrading k8s to 1.11.3 #985
  • TFJob Dashboard is not support pvc #980
  • ERROR handle object: patching object from cluster: merging object with existing state: unable to recognize "/var/folders/tl/zzfcr4zs53vgnpqqjq4n08sh0000gn/T/ksonnet-mergepatch020443124": no matches for kind "TFJob" in version "kubeflow.org/v1beta1" #976
  • Create CRD conversion webhook #967
  • Performance issue when there is a lot of completed jobs #965
  • Failed to marshal the object to TFJob; the spec is invalid: Failed to marshal the object to TFJob #964
  • Proposal for a Common Operator #960
  • Delete pod with unknown status in reconcilePods #956
  • Create distributed training example for TF 2.0 #953
  • Consider using KubeBuilder to reduce boilerplate code #925
  • e2e test for dashboard/backend/handler/api_handler.go #921
  • Use pod group instead of PDB for gang scheduling #916
  • shareProcessNamespace not working with TFJob #902
  • [feasibility-research] Handle machine failure #900
  • Should limit the size of logs of tf_operator container #888
  • Log message severity isn't properly reported in stackdriver #864
  • E2E test for invalid spec errors #810
  • [v1alpha2] Delete resources according to cleanuppolicy exactly once #804
  • refactor the code of TFJobController for unittest #757
  • e2e test for cleanupTFJob #756
  • [build] Replace Python with Make or Bazel #739
  • Export TF/Tensorboard/TF Summaries to prometheus #722
  • [discussion] Maintain Helm Chart #716
  • [discussion] Capacity planning #708
  • [v1alpha2] Generate CRD validation in Kubernetes 1.11 #622
  • Set labels and annotations for svc created by tf_operator #609
  • mnist test isn't part of CI #597
  • [v1alpha2] Push the example docker image to google or dockerhub registry #590
  • feat: use fake client-set and informer add controller unittest. #540
  • Run submit_release_job.sh in CI #519
  • Add environment name in ControllerConfig #450
  • [dashboard] How to handle storage? #449
  • [dashboard] GPU limits are not taken into account #448
  • [dashboard] Ability to create a TensorBoard instance #447
  • [examples] Add termination policy in examples/tf_job.yaml #438
  • add boilerplate header #430
  • [logging] Extra flag problem #427
  • [CI] Add hack/verify-codegen.sh in Travis CI #426
  • E2E workflows should ignore failures #423
  • [enhancement] Add OWNERS in subdirectories #415
  • [enhancement] Fix the warnings reported by goreportcard.com #394
  • [discussion] Separate the operator and UI dashboard #389
  • [enhancemnet] Separate release image and test image #385
  • [enhancement][CI] Replace Travis CI with Prow #382
  • use Python3 for all python code? #377
  • What to do about example TFJob YAML specs? #375
  • E2E test for non-default namespace #170
  • OpenAPI Client Generation for Java, Python #167
  • Prevent scheduling deadlocks #165
  • TfDebugger support #132
  • Refactor code in py into a proper python package #114
  • Update instructions and code to work with Kubernetes 1.8 #108
  • Build sample container as part of release process #81
  • Run lint (Python, Go) as a presubmit test #53
  • Optimize scheduling of TF Processes #35
  • E2E test that verifies invalid jobs are failed #30
  • E2E test(s) to verify that permanent and retryable errors are handled correctly. #29

Merged pull requests:
