tf-operator release v0.2.0, part of Kubeflow release v0.2.0.
Features and improvements:
- [v1alpha2] Set event for tfjob when spec is not valid #620
- [enhancement] Fix the gofmt support #586
- [go] Use dep instead of glide to reduce the size of vendor #556
- [v1alpha2] Enhance the logic about sync #547
- [v1alpha2] Use structured log #537
- [log] investigate zap #534
- [v1alpha2] Try to not to always claim pods #533
- [v1alpha2] Support customized port #532
- [v1alpha2] start using kubeconfig #522
- v1alpha2 integration #521
- TFJob operator surface queue metrics #503
- [api] Remove pending pods from active pods #484
- [enhancement] Set StartTime for TFJob status #475
- [Feature] Support "eval" worker in tf-operator #444
- Add appropriate logging fields to the tf-operator log messages #424
- [enhancement] Refactor docs #379
- Deprecate TfPort and set default port for users #327
- [enhancement] Add e2e test cases for recorder #317
- Make the TfJob controller more event driven #314
- Potential data race, maybe #302
- Don't leave pods running just to get logs #128
- Add hyperparameter tuning? #112
- Use headless services for Training jobs #40
- More validation of TfJob #25
Fixed bugs:
- [v1alpha2] RealServiceControl does not set owner reference #616
- TfJob operator stops working on invalid spec #561
- [v1alpha2]tfjob restartPolicy for Never #555
- [v1alpha2] Potential bugs when there is one worker succeeded #538
- [v1alpha2][test] Avoid potential data race problem #530
- Phase is wrong unexpected TfJob phase: Done #110
Closed issues:
- [v1alpha2] Make restart policy a pointer #692
- [v1alpha2] Need conditions Succeeded and Failed indicating when job is done #673
- [v1alpha2] add pod label with job name (without namespace) #672
- [v1alpha2] Pods not deleted when job finishes #671
- [v1alpha2] conditions not updated #668
- [v1alpha2] Move control interface to separate package #665
- [v1alpha2] Move test util to separate package #664
- Speedup E2E test by running build and setup cluster in parallel #659
- In TFjob, when the workers Completed, i want the ps Completed too, how can i do? #657
- [v1alpha2] service names are prefixed with namespace #654
- [v1alpha2] Create a simple python server to be used for E2E tests of controller behavior #653
- dep ensure give warning on k8s.io/apiserver #647
- [v1alpha2] pod names don't include random salt #644
- [v1alpha2]Unable to create pod #641
- GPU tests failing; ks env doesn't exist #640
- TFJob not marked as success when master exits but not workers #634
- v1alpha2 - pod names don't include replica type #633
- tensorflow on kubernetes how to pass in worker_host and ps_host to container if I use tf-operator #630
- tf_job_client blocks forever #606
- [v1alpha2] Need to add the v1alpha2 binaries to our Docker image #600
- [v1alpha2] Need ksonnet package #599
- Support deploying v1alpha2 and v1alpha1 controllers simultaneously #598
- [v1alpha2] Remove controller_utils.go #591
- [v1alpha2] Add CI test #589
- [question] dist_mnist example failed to run #588
- can not set labels #580
- v1alpha2 should use headless services #574
- TFJob operator should pass through annotations to the pod #573
- [test] Test failed because of ImagePullBackOff #567
- Servable not found for request: Latest(mnist) #552
- [v1alpha2] The state of distributed model training. #544
- [test] copy labels and annotations to pod from tfjob #543
- Unable to deploy the example TfJob in the user guide #535
- [v1alpha2] Do not set default to always for restartpolicy #524
- E2E test steps should exit with non zero exit code if test fails #514
- [v1alpha2] Sync commits with v1alpha1 #490
- Use OpenAPI validation for CRDs in k8s 1.9 #437
- default install of kubeflow no longer install tf-job-dashboard #435
- Use DAG functionality of Argo in our E2E tests #422
- Post submits are failing with Argo #370
- tf-job-operator pod hangs and doesn't restart if it can't delete one of the TfJob pods #366
- Refactor TFJobStatus in CRD API #333
- Deprecate the TfImage field #330
- [discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283
- Does TfJob controller need to do master election? #263
- Setup Prow PR Dashboard #255
- API: some comments about API changes from PR #215 review #249
- e2e test for the case that the chief is not master #235
- Use conditions instead of phase #223
- Submitted tfjobs cease to start running under unknown conditions #203
- Tutorials #195
- Copy chart to kubernetes/charts #93
- Create a web page to list releases #70
- tensorflow 1.4 and estimator support #61
- Set a default value for restartPolicy #55
Merged pull requests:
- *: Add cleanpod policy for v1alpha2 #691 (gaocegege)
- status: Fail the TFJob if PS is failed #690 (gaocegege)
- Use tf_job_name not tf_job_key as the label name. #689 (jlewi)
- pkg: Delete pods and services after finished #686 (gaocegege)
- informer: Add comments and TODO #684 (gaocegege)
- and some safety check #683 (u2takey)
- Remove code that is no longer used. #681 (jlewi)
- change comment with more related link #679 (u2takey)
- return err if the spec area is nil after unmarshal for tfjob v1alpha2 #678 (jiaxuanzhou)
- fix typo #677 (u2takey)
- fix restart policy with comment #676 (u2takey)
- label: Remove namespace from labels #675 (gaocegege)
- controller: Move control interface to control package #670 (gaocegege)
- defaults: Rename the type #669 (gaocegege)
- Enable the E2E tests for v1alpha2. #667 (jlewi)
- *: Move test util to separate package #666 (gaocegege)
- Update dep and vendor #663 (xychu)
- server: Make threadiness configurable #662 (gaocegege)
- dist-mnist: Move to examples #660 (gaocegege)
- tfjob: Add test for copy labels and annotations #658 (gaocegege)
- *: Remove namespace from service name #656 (gaocegege)
- pod: Add test for exit code #652 (gaocegege)
- [v1alpha2] Estimator support - Do not include evaluator in cluster spec #650 (xychu)
- pods: Add cluster spec test #649 (gaocegege)
- pod: Submit an event when the user specifies the restartpolicy for pod template #648 (gaocegege)
- v1alpha2 E2E tests for termination policy #646 (jlewi)
- status: Add test cases for failure #643 (gaocegege)
- Add proper error handling for deploying the tests. #642 (jlewi)
- pods: Add restart policy #638 (gaocegege)
- status: Support chief #637 (gaocegege)
- *: Set name for the pod tfjob.name-type-index #636 (gaocegege)
- Modify presubmits to support testing with v1alpha2 #632 (jlewi)
- Updates to enable e2e test for v1alpha2 #629 (ankushagarwal)
- Use logging.exception to capture stack traces in logs #627 (ankushagarwal)
- Pass TFJob API version instead of hardcoding it #626 (ankushagarwal)
- [v1alpha2] Add distributed state management #625 (yph152)
- pkg: Send events when receive invalid spec #623 (gaocegege)
- pkg: Support customized port #621 (gaocegege)
- Add apiVersion parameter to simple_tfjob component #619 (ankushagarwal)
- api_handler: Fix import order #618 (gaocegege)
- service_control: Set owner ref for service and add test cases #617 (gaocegege)
- test: Add test cases for service ref manager and control interface #615 (gaocegege)
- controller: Refactor and add test cases for helper #614 (gaocegege)
- [dashboard] Upgrade to v1alpha2 #613 (wbuchwalter)
- controller: Improve coding styles #612 (gaocegege)
- Informer: Use unstructured #610 (gaocegege)
- TFJob client should not block forever trying to get the namespace object #607 (jlewi)
- crd: Add validation using OpenAPI 3.0 #605 (gaocegege)
- dist_mnist: Add unused_argv #604 (gaocegege)
- service: Refactor to the slice structure #603 (gaocegege)
- Delete the old releaser code which is no longer used. #602 (jlewi)
- Add tf-operator.v2 to release.py so that we build a Docker image containing the v1alpha2 controller #601 (jlewi)
- Add a new command-line argument for release.py #595 (chaoleili)
- controller: Remove dup code and use k8s.io/kubernetes/controller #594 (gaocegege)
- test: Fix data race problem #593 (gaocegege)
- .travis.yml: Fix cmd errors #592 (gaocegege)
- Fix the gometalinter support #587 (wgliang)
- Format go code and fix spelling errors #585 (wgliang)
- docs: Add quick start for v1alpha2 #584 (gaocegege)
- mnist: Add corresponding yaml config #583 (gaocegege)
- pod: Add update logic #582 (gaocegege)
- .pylinrc: Add dist_mnist #581 (gaocegege)
- .travis.yml: Add failure notification in GitHub #579 (gaocegege)
- controller_status: Remove pending pods from active pods #578 (gaocegege)
- api: OpenAPI support #577 (gaocegege)
- controller_service: Headless service #576 (gaocegege)
- replace glide with dep in the developer guide #572 (ChanYiLin)
- [v1alpha2]fix bug int to string for index #571 (yph152)
- Fix missing string for logging placeholder #570 (zacharyzhao)
- Update py_lint and py_test #569 (ankushagarwal)
- Update test worker image to kubeflow-ci #568 (ankushagarwal)
- chart: Remove #566 (gaocegege)
- add OwnerReferences to pdb #565 (ChanYiLin)
- Correct typos in README #559 (ntenenz)
- vendor: Use dep instead of glide and prune it #557 (gaocegege)
- set completion time on success #554 (u2takey)
- README: Add tf-operator v1alpha2 design doc #553 (gaocegege)
- Add dist mnist model for e2e test #549 (ScorpioCPH)
- controller: Refactor controller_pod #548 (gaocegege)
- Replace kubeflow-images-staging with kubeflow-images-public #546 (ankushagarwal)
- copy labels and annotations to pod from pod template #542 (u2takey)
- add workqueue and reflect metrics #541 (zjj2wry)
- fix the bug of keeping creating new pdb #539 (ChanYiLin)
- signals: Add #531 (gaocegege)
- OWNERS: Add @ddysher and @willb as reviewers #529 (gaocegege)
- developer_guide: Add instructions for v1alpha2 #528 (gaocegege)
- tests: Fix #527 (gaocegege)
- v1alpha2: Add implementation #526 (gaocegege)
- v1alpha2: update flag kubeconfig #525 (yph152)
- v1alpha2: Add API and codegen #523 (gaocegege)
- Reenable cluster teardown. #520 (jlewi)
- Only identify specific exit codes as retryable error #518 (0olwzo0)
- update OWNERS #516 (mitake)
- Create a script to release the TFJob operator image #515 (jlewi)
- Fix output on test failure #511 (jose5918)
- Adds gcloudignore #510 (jose5918)
- RFC: Add a new command for generating example TFjobs #509 (mitake)
- Use a CentOS 7 base image for the tf-operator image #469 (tmckayus)