New features
- Train/Fine-tune API Proposal for LLMs #1945 (deepanker13)
- Adding Training image needed for train api #1963 (deepanker13)
- [SDK] Train API #1962 (deepanker13)
- Train api dataset download changes #1959 (deepanker13)
- Train api init container creation #1958 (deepanker13)
- Publish trainer hugging face image #1985 (deepanker13)
- Support arm64 for Hugging Face trainer #2028 (tariq-hasan)
- Modify LLM Trainer to support BERT and Tiny LLaMA #2031 (andreyvelich)
- Implement webhook validations for the PyTorchJob #2035 (tenzen-y)
- Implement webhook validations for the XGBoostJob #2052 (tenzen-y)
- Implement webhook validation for the TFJob #2051 (tenzen-y)
- Implement webhook warnings for the MXJob #2058 (tenzen-y)
- Implement webhook validations for the PaddleJob #2057 (tenzen-y)
- Fail job for non-retryable exit codes #2071 (kellyaa)
- Adding fine tune example with s3 as the dataset store #2006 (deepanker13)
Bug fixes
- fix nproc env in elastic mode for pytorchjob #1948 (kuizhiqing)
- IsMasterRole fix in pytorchjob controller #1969 (deepanker13)
- fix: volcano podgroup should have a non-empty queue name #1977 (lowang-bh)
- Fix Master Label for PyTorchJob #1974 (andreyvelich)
- [SDK] Fix Worker and Master templates for PyTorchJob #1988 (andreyvelich)
- Fix import for HuggingFace Dataset Provider #2085 (andreyvelich)
- Upgrade controller-gen to v0.14.0 #2026 (champon1020)
- Fix Distributed Data Samplers in PyTorch Examples #2012 (andreyvelich)
- Fix URL in python SDK setup.py #2011 (garymm)
Misc
- Adding parallel support for coveralls #1956 (johnugeorge)
- torchrun example with cpu version pytorch #1965 (kuizhiqing)
- [SDK] Get Kubernetes Events for Job #1975 (andreyvelich)
- [SDK] Add information about TrainingClient logging #1973 (andreyvelich)
- PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode #2067 (tenzen-y)
- SDK: Upgrade the minimum required Kubernetes version to v1.27.2 #2066 (tenzen-y)
- Test: Simplify and Identify pod-controller envtest #2084 (tenzen-y)
- E2E: Replace outdated images with latest ones #2083 (tenzen-y)
- Upgrade scheduler-plugins to v0.28.9 #2065 (tenzen-y)