github kubeflow/training-operator v0.2.0-rc1

6 years ago

tf-operator release v0.2.0, part of Kubeflow release v0.2.0.

Features and improvements:

  • [v1alpha2] Set event for tfjob when spec is not valid #620
  • [enhancement] Fix the gofmt support #586
  • [go] Use dep instead of glide to reduce the size of vendor #556
  • [v1alpha2] Enhance the logic about sync #547
  • [v1alpha2] Use structured log #537
  • [log] investigate zap #534
  • [v1alpha2] Try not to always claim pods #533
  • [v1alpha2] Support customized port #532
  • [v1alpha2] start using kubeconfig #522
  • v1alpha2 integration #521
  • TFJob operator surface queue metrics #503
  • [api] Remove pending pods from active pods #484
  • [enhancement] Set StartTime for TFJob status #475
  • [Feature] Support "eval" worker in tf-operator #444
  • Add appropriate logging fields to the tf-operator log messages #424
  • [enhancement] Refactor docs #379
  • Deprecate TfPort and set default port for users #327
  • [enhancement] Add e2e test cases for recorder #317
  • Make the TfJob controller more event driven #314
  • Potential data race, maybe #302
  • Don't leave pods running just to get logs #128
  • Add hyperparameter tuning? #112
  • Use headless services for Training jobs #40
  • More validation of TfJob #25

Fixed bugs:

  • [v1alpha2] RealServiceControl does not set owner reference #616
  • TfJob operator stops working on invalid spec #561
  • [v1alpha2] tfjob restartPolicy for Never #555
  • [v1alpha2] Potential bugs when there is one worker succeeded #538
  • [v1alpha2][test] Avoid potential data race problem #530
  • Phase is wrong unexpected TfJob phase: Done #110

Closed issues:

  • [v1alpha2] Make restart policy a pointer #692
  • [v1alpha2] Need conditions Succeeded and Failed indicating when job is done #673
  • [v1alpha2] add pod label with job name (without namespace) #672
  • [v1alpha2] Pods not deleted when job finishes #671
  • [v1alpha2] conditions not updated #668
  • [v1alpha2] Move control interface to separate package #665
  • [v1alpha2] Move test util to separate package #664
  • Speedup E2E test by running build and setup cluster in parallel #659
  • In TFjob, when the workers Completed, i want the ps Completed too, how can i do? #657
  • [v1alpha2] service names are prefixed with namespace #654
  • [v1alpha2] Create a simple python server to be used for E2E tests of controller behavior #653
  • dep ensure give warning on k8s.io/apiserver #647
  • [v1alpha2] pod names don't include random salt #644
  • [v1alpha2] Unable to create pod #641
  • GPU tests failing; ks env doesn't exist #640
  • TFJob not marked as success when master exits but not workers #634
  • v1alpha2 - pod names don't include replica type #633
  • tensorflow on kubernetes how to pass in worker_host and ps_host to container if I use tf-operator #630
  • tf_job_client blocks forever #606
  • [v1alpha2] Need to add the v1alpha2 binaries to our Docker image #600
  • [v1alpha2] Need ksonnet package #599
  • Support deploying v1alpha2 and v1alpha1 controllers simultaneously #598
  • [v1alpha2] Remove controller_utils.go #591
  • [v1alpha2] Add CI test #589
  • [question] dist_mnist example failed to run #588
  • can not set labels #580
  • v1alpha2 should use headless services #574
  • TFJob operator should pass through annotations to the pod #573
  • [test] Test failed because of ImagePullBackOff #567
  • Servable not found for request: Latest(mnist) #552
  • [v1alpha2] The state of distributed model training. #544
  • [test] copy labels and annotations to pod from tfjob #543
  • Unable to deploy the example TfJob in the user guide #535
  • [v1alpha2] Do not set default to always for restartpolicy #524
  • E2E test steps should exit with non zero exit code if test fails #514
  • [v1alpha2] Sync commits with v1alpha1 #490
  • Use OpenAPI validation for CRDs in k8s 1.9 #437
  • default install of kubeflow no longer install tf-job-dashboard #435
  • Use DAG functionality of Argo in our E2E tests #422
  • Post submits are failing with Argo #370
  • tf-job-operator pod hangs and doesn't restart if it can't delete one of the TfJob pods #366
  • Refactor TFJobStatus in CRD API #333
  • Deprecate the TfImage field #330
  • [discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283
  • Does TfJob controller need to do master election? #263
  • Setup Prow PR Dashboard #255
  • API: some comments about API changes from PR #215 review #249
  • e2e test for the case that the chief is not master #235
  • Use conditions instead of phase #223
  • Submitted tfjobs cease to start running under unknown conditions #203
  • Tutorials #195
  • Copy chart to kubernetes/charts #93
  • Create a web page to list releases #70
  • tensorflow 1.4 and estimator support #61
  • Set a default value for restartPolicy #55

Merged pull requests:
