kubeflow/training-operator v0.2.0-rc1 on GitHub

tf-operator release v0.2.0, part of Kubeflow release v0.2.0.

Features and improvements:

[v1alpha2] Set event for tfjob when spec is not valid #620
[enhancement] Fix the gofmt support #586
[go] Use dep instead of glide to reduce the size of vendor #556
[v1alpha2] Enhance the logic about sync #547
[v1alpha2] Use structured log #537
[log] investigate zap #534
[v1alpha2] Try not to always claim pods #533
[v1alpha2] Support customized port #532
[v1alpha2] start using kubeconfig #522
v1alpha2 integration #521
TFJob operator surface queue metrics #503
[api] Remove pending pods from active pods #484
[enhancement] Set StartTime for TFJob status #475
[Feature] Support "eval" worker in tf-operator #444
Add appropriate logging fields to the tf-operator log messages #424
[enhancement] Refactor docs #379
Deprecate TfPort and set default port for users #327
[enhancement] Add e2e test cases for recorder #317
Make the TfJob controller more event driven #314
Potential data race, maybe #302
Don't leave pods running just to get logs #128
Add hyperparameter tuning? #112
Use headless services for Training jobs #40
More validation of TfJob #25

Fixed bugs:

[v1alpha2] RealServiceControl does not set owner reference #616
TfJob operator stops working on invalid spec #561
[v1alpha2] tfjob restartPolicy for Never #555
[v1alpha2] Potential bugs when there is one worker succeeded #538
[v1alpha2][test] Avoid potential data race problem #530
Phase is wrong: unexpected TfJob phase: Done #110

Closed issues:

[v1alpha2] Make restart policy a pointer #692
[v1alpha2] Need conditions Succeeded and Failed indicating when job is done #673
[v1alpha2] add pod label with job name (without namespace) #672
[v1alpha2] Pods not deleted when job finishes #671
[v1alpha2] conditions not updated #668
[v1alpha2] Move control interface to separate package #665
[v1alpha2] Move test util to separate package #664
Speedup E2E test by running build and setup cluster in parallel #659
In TFJob, when the workers are Completed, I want the ps to be Completed too; how can I do that? #657
[v1alpha2] service names are prefixed with namespace #654
[v1alpha2] Create a simple python server to be used for E2E tests of controller behavior #653
dep ensure gives a warning on k8s.io/apiserver #647
[v1alpha2] pod names don't include random salt #644
[v1alpha2] Unable to create pod #641
GPU tests failing; ks env doesn't exist #640
TFJob not marked as success when master exits but not workers #634
v1alpha2 - pod names don't include replica type #633
tensorflow on kubernetes how to pass in worker_host and ps_host to container if I use tf-operator #630
tf_job_client blocks forever #606
[v1alpha2] Need to add the v1alpha2 binaries to our Docker image #600
[v1alpha2] Need ksonnet package #599
Support deploying v1alpha2 and v1alpha1 controllers simultaneously #598
[v1alpha2] Remove controller_utils.go #591
[v1alpha2] Add CI test #589
[question] dist_mnist example failed to run #588
can not set labels #580
v1alpha2 should use headless services #574
TFJob operator should pass through annotations to the pod #573
[test] Test failed because of ImagePullBackOff #567
Servable not found for request: Latest(mnist) #552
[v1alpha2] The state of distributed model training. #544
[test] copy labels and annotations to pod from tfjob #543
Unable to deploy the example TfJob in the user guide #535
[v1alpha2] Do not set default to always for restartpolicy #524
E2E test steps should exit with non zero exit code if test fails #514
[v1alpha2] Sync commits with v1alpha1 #490
Use OpenAPI validation for CRDs in k8s 1.9 #437
default install of kubeflow no longer installs tf-job-dashboard #435
Use DAG functionality of Argo in our E2E tests #422
Post submits are failing with Argo #370
tf-job-operator pod hangs and doesn't restart if it can't delete one of the TfJob pods #366
Refactor TFJobStatus in CRD API #333
Deprecate the TfImage field #330
[discussion] Differences between tensorflow/k8s and caicloud/kubeflow-controller #283
Does TfJob controller need to do master election? #263
Setup Prow PR Dashboard #255
API: some comments about API changes from PR #215 review #249
e2e test for the case that the chief is not master #235
Use conditions instead of phase #223
Submitted tfjobs cease to start running under unknown conditions #203
Tutorials #195
Copy chart to kubernetes/charts #93
Create a web page to list releases #70
tensorflow 1.4 and estimator support #61
Set a default value for restartPolicy #55

Merged pull requests:

*: Add cleanpod policy for v1alpha2 #691 (gaocegege)
status: Fail the TFJob if PS is failed #690 (gaocegege)
Use tf_job_name not tf_job_key as the label name. #689 (jlewi)
pkg: Delete pods and services after finished #686 (gaocegege)
informer: Add comments and TODO #684 (gaocegege)
and some safety check #683 (u2takey)
Remove code that is no longer used. #681 (jlewi)
change comment with more related link #679 (u2takey)
return err if the spec area is nil after unmarshal for tfjob v1alpha2 #678 (jiaxuanzhou)
fix typo #677 (u2takey)
fix restart policy with comment #676 (u2takey)
label: Remove namespace from labels #675 (gaocegege)
controller: Move control interface to control package #670 (gaocegege)
defaults: Rename the type #669 (gaocegege)
Enable the E2E tests for v1alpha2. #667 (jlewi)
*: Move test util to separate package #666 (gaocegege)
Update dep and vendor #663 (xychu)
server: Make threadiness configurable #662 (gaocegege)
dist-mnist: Move to examples #660 (gaocegege)
tfjob: Add test for copy labels and annotations #658 (gaocegege)
*: Remove namespace from service name #656 (gaocegege)
pod: Add test for exit code #652 (gaocegege)
[v1alpha2] Estimator support - Do not include evaluator in cluster spec #650 (xychu)
pods: Add cluster spec test #649 (gaocegege)
pod: Submit an event when the user specifies the restartpolicy for pod template #648 (gaocegege)
v1alpha2 E2E tests for termination policy #646 (jlewi)
status: Add test cases for failure #643 (gaocegege)
Add proper error handling for deploying the tests. #642 (jlewi)
pods: Add restart policy #638 (gaocegege)
status: Support chief #637 (gaocegege)
*: Set name for the pod tfjob.name-type-index #636 (gaocegege)
Modify presubmits to support testing with v1alpha2 #632 (jlewi)
Updates to enable e2e test for v1alpha2 #629 (ankushagarwal)
Use logging.exception to capture stack traces in logs #627 (ankushagarwal)
Pass TFJob API version instead of hardcoding it #626 (ankushagarwal)
[v1alpha2] Add distributed state management #625 (yph152)
pkg: Send events when receiving an invalid spec #623 (gaocegege)
pkg: Support customized port #621 (gaocegege)
Add apiVersion parameter to simple_tfjob component #619 (ankushagarwal)
api_handler: Fix import order #618 (gaocegege)
service_control: Set owner ref for service and add test cases #617 (gaocegege)
test: Add test cases for service ref manager and control interface #615 (gaocegege)
controller: Refactor and add test cases for helper #614 (gaocegege)
[dashboard] Upgrade to v1alpha2 #613 (wbuchwalter)
controller: Improve coding styles #612 (gaocegege)
Informer: Use unstructured #610 (gaocegege)
TFJob client should not block forever trying to get the namespace object #607 (jlewi)
crd: Add validation using OpenAPI 3.0 #605 (gaocegege)
dist_mnist: Add unused_argv #604 (gaocegege)
service: Refactor to the slice structure #603 (gaocegege)
Delete the old releaser code which is no longer used. #602 (jlewi)
Add tf-operator.v2 to release.py so that we build a Docker image containing the v1alpha2 controller #601 (jlewi)
Add a new command-line argument for release.py #595 (chaoleili)
controller: Remove dup code and use k8s.io/kubernetes/controller #594 (gaocegege)
test: Fix data race problem #593 (gaocegege)
.travis.yml: Fix cmd errors #592 (gaocegege)
Fix the gometalinter support #587 (wgliang)
Format go code and fix spelling errors #585 (wgliang)
docs: Add quick start for v1alpha2 #584 (gaocegege)
mnist: Add corresponding yaml config #583 (gaocegege)
pod: Add update logic #582 (gaocegege)
.pylintrc: Add dist_mnist #581 (gaocegege)
.travis.yml: Add failure notification in GitHub #579 (gaocegege)
controller_status: Remove pending pods from active pods #578 (gaocegege)
api: OpenAPI support #577 (gaocegege)
controller_service: Headless service #576 (gaocegege)
replace glide with dep in the developer guide #572 (ChanYiLin)
[v1alpha2] fix bug int to string for index #571 (yph152)
Fix missing string for logging placeholder #570 (zacharyzhao)
Update py_lint and py_test #569 (ankushagarwal)
Update test worker image to kubeflow-ci #568 (ankushagarwal)
chart: Remove #566 (gaocegege)
add OwnerReferences to pdb #565 (ChanYiLin)
Correct typos in README #559 (ntenenz)
vendor: Use dep instead of glide and prune it #557 (gaocegege)
set completion time on success #554 (u2takey)
README: Add tf-operator v1alpha2 design doc #553 (gaocegege)
Add dist mnist model for e2e test #549 (ScorpioCPH)
controller: Refactor controller_pod #548 (gaocegege)
Replace kubeflow-images-staging with kubeflow-images-public #546 (ankushagarwal)
copy labels and annotations to pod from pod template #542 (u2takey)
add workqueue and reflect metrics #541 (zjj2wry)
fix the bug of keeping creating new pdb #539 (ChanYiLin)
signals: Add #531 (gaocegege)
OWNERS: Add @ddysher and @willb as reviewers #529 (gaocegege)
developer_guide: Add instructions for v1alpha2 #528 (gaocegege)
tests: Fix #527 (gaocegege)
v1alpha2: Add implementation #526 (gaocegege)
v1alpha2: update flag kubeconfig #525 (yph152)
v1alpha2: Add API and codegen #523 (gaocegege)
Reenable cluster teardown. #520 (jlewi)
Only identify specific exit codes as retryable error #518 (0olwzo0)
update OWNERS #516 (mitake)
Create a script to release the TFJob operator image #515 (jlewi)
Fix output on test failure #511 (jose5918)
Adds gcloudignore #510 (jose5918)
RFC: Add a new command for generating example TFjobs #509 (mitake)
Use a CentOS 7 base image for the tf-operator image #469 (tmckayus)