Enhancements
- Added support for Red Hat OpenShift Container Platform (RH OCP) 4.1 and 4.2
- Added additional installation methods:
  - Using kustomize and kubeflow/manifests
  - Using Helm Chart
- Added support for Go Modules and removed vendor directories
- Added default ephemeral storage for init container
- Overwrote NVIDIA environment variables to avoid using GPUs on the launcher
- Added health check and callbacks around various leader election phases
- Honored user-specified worker command
- Exposed main container name as a configurable field
- Added RunPolicy to MPIJobSpec that reuses kubeflow/common spec
- Allowed specifying the name of the gang scheduler and the priority for the pod group
- Added error log when pod spec does not have any containers
- Switched to use distroless images
- Refactored kubectl-delivery to improve launcher performance
- Added Prometheus metrics for job monitoring
- Added experimental version of v1 MPIJob controller and APIs
- Added support for Volcano as a scheduler
- Switched to use pods for launcher job and statefulset workers
- Switched to use klog for logging
- Made labels more consistent with other Kubeflow operators
Fixes
- Fixed nil pointer dereferences that could unexpectedly restart the pod
- Updated status to running only when the launcher is active and all workers are ready
- Fixed the incorrect namespace used when initializing informers and leader election endpoints
- Fixed issue in v1 controller's CRD existence check
Documentation
- Added the list of adopters
- Added roadmap document
- Revamped contributing guidelines
- Added MPIJob API reference page on Kubeflow website
- Added a blog post introducing MPI Operator and its industry adoption
- Added a CPU-only example
- Added licenses used by the dependencies