Core Features:
- feat: add HAMi-core mode for Ascend devices
- feat: Optimize HAMi-core performance
- feat: HAMi-DRA(NVIDIA) is ready for use
- feat: Volcano-vgpu-device-plugin synced with version 0.19, and now supports CDI
- feat: HAMi-skills for debugging and developing
- Document: Add latest benchmark to evaluate the performance of HAMi-core
- Fix: Initialization error when using tensor parallelism (tp) on vLLM > 0.18
Details
✨ New Features
- feat: add vGPUmonitor --metrics-bind-address flag by @dongjiang1989 in #1613
- feat: add Prometheus serviceMonitor in helm-charts by @dongjiang1989 in #1614
- feat: add serviceMonitor for device plugin by @dongjiang1989 in #1633
- feat: check resource quota in webhook by @DSFans2014 in #1605
- feat: support module-pair allocation for Ascend 910C devices in SuperPod environments by @ashergaga in #1610
- feat(skill): Add k8s-debug-gpu-pod skill for HAMi GPU troubleshooting by @haitwang-cloud in #1654
- feat: add support for vastai device by @DSFans2014 in #1645
- feat(helm): add config namespaceSelector & objectSelector for webhook by @haitwang-cloud in #1653
- feat(metrics): align prometheus metric and label names with best practices by @MyoungHaSong in #1644
- feat(logging): optimize log verbosity and add unit tests by @haitwang-cloud in #1710
- feat(workflow): stale issues and PRs older than one year by @Shouren in #1725
- feat: increase operations-per-run for stale action by @Shouren in #1756
- feat: add local-deploy target for deploying to minikube/kind clusters by @anandj91 in #1760
- feat: add hami_vgpu_metrics_summarizer skill by @haitwang-cloud in #1755
- feat: add Ascend ResourceCoreName to support hami-vnpu-core virtualization by @ashergaga in #1771
- feat: add Ascendxxx-core resource by @DSFans2014 in #1804
- feat: support node filtering based on hami-vnpu-core annotation by @ashergaga in #1812
- refactor: filter device at nvml-manager by @DSFans2014 in #1825
- feat: add DeepCopy function for DeviceUsage and its nested types by @Shouren in #1818
- feat: support multi-device requests with hami-vnpu-core enabled by @ashergaga in #1837
- feat: update skills and add CLAUDE.md for codebase guidance by @haitwang-cloud in #1842
- feat: add enableGetPreferredAllocation flag by @DSFans2014 in #1824
🐛 Bug Fixes
- fix device typos by @DSFans2014 in #1608
- fix: precedence bug in schedulerName check by @hoteye in #1627
- fix: add nil checks to prevent panics in leaderelection by @haitwang-cloud in #1603
- fix: panic on nil resourceReqs in scheduler calcScore by @yxxhero in #1626
- Fix reversed binpack and spread scheduling policies for Iluvatar devices by @qiangwei1983 in #1631
- Fix contact link and add slack channel to README.md by @archlitchi in #1635
- fix: resolve cardinality explosion in Device_memory_desc_of_container by @maishivamhoo123 in #1628
- fix: handle GetMemoryInfo ERROR_NOT_SUPPORTED for unified memory GPUs by @jsl9208 in #1637
- fix typos by @DSFans2014 in #1657
- Fix: optimize nodelock scalability with exponential backoff and listers by @maishivamhoo123 in #1663
- fix apply kubescheduler config version by @CoderTH in #1666
- fix: failing readiness probe when replica > 1 by @Shouren in #1677
- fix: ci depth by @Atroxgod in #1690
- fix(scheduler): correct slot usage prediction and add device type fil… by @maishivamhoo123 in #1700
- fix vastai fit loop iteration direction by @DSFans2014 in #1715
- fix: retain terminating pod in cache to prevent premature eviction by @maishivamhoo123 in #1719
- fix(chart): derive ld.so.preload from devicePlugin.libPath to fix non-default path deployments by @ilia-medvedev in #1714
- fix: support device allocation for multi-container with init containers by @haitwang-cloud in #1650
- fix: Respond with correct content-type by @Shouren in #1604
- fix: suppress scheduler cleanup noise for unrelated vendors by @Yonsun-w in #1749
- fix(scheduler): add missing return in ondelpod by @CFH2436 in #1759
- fix: global image tag always covers per-component image tag by @FouoF in #1774
- fix(device-plugin): align kubelet allocation with scheduler annotations (#1741) by @xrwang8 in #1743
- fix(chart): sanitize managedNodeSelector keys for environment variables by @almazkhalikov in #1783
- fix: filter device does not work by @DSFans2014 in #1817
- Fix: Handle Kernel 6.17 handshake edge cases in NVIDIA health checks by @maishivamhoo123 in #1810
- fix: git errors when building with latest hami-core by @Shouren in #1782
- fix: mig does not work by @DSFans2014 in #1819
- fix(scheduler): guard against zero-value division in ComputeScore (#1… by @lin121291 in #1820
- fix(ascend): check unmarshal error before iterating node devices by @mesutoezdil in #1831
- fix(device): parse handshake annotation timestamp in local timezone by @mesutoezdil in #1816
- fix: recover scheduling on nodes with stale Deleted_ handshake by @saiyam1814 in #1843
- fix A100-80G template by @DSFans2014 in #1847
- fix: allocation failed when using mig in CDI mode by @DSFans2014 in #1826
📚 Documentation
- docs: remove stale tasklist.md, migrate content to separate issues by @ManishSharma1609 in #1801
- docs: add CNCF copyright disclaimer to README files by @mesutoezdil in #1830
- docs: complete tasklist.md migration - add hardware entries, resource… by @maishivamhoo123 in #1829
- docs: add v2.9.0 documentation audit report by @mesutoezdil in #1761
🔨 Other Changes
- add device type label in metrics by @xiyichan in #1612
- security: add io.LimitReader to scheduler routes to prevent DoS #554 by @maishivamhoo123 in #1620
- add fouof to approvers by @FouoF in #1642
- Update Maintainers information by @archlitchi in #1646
- [Snyk] Security upgrade tensorflow/tensorflow from 2.20.0rc0-gpu to 2.21.0rc0-gpu by @wawa0210 in #1652
- chore: remove deprecated scheduler policy configmap by @haitwang-cloud in #1651
- [Snyk] Security upgrade tensorflow/tensorflow from 2.21.0rc0-gpu to 2.21.0rc1-gpu by @wawa0210 in #1681
- vGPUmonitor: skip devices with invalid UTF-8 UUID during container init by @charford in #1703
- Helm - Render nvidia.overwriteEnv from values with a default of false. by @jcustenborder in #1706
- Update dashboard.md fix nvidia.com/gpucores comment by @Nov11 in #1712
- update nvidia_dp and nvidia_container_runtime module by @archlitchi in #1731
- chore: resolve staticcheck and modernize linter warnings by @maishivamhoo123 in #1728
- chore: using LeaderElectionConfiguration for kubernetes v1.23 and newer version by @Shouren in #1737
- Add general Technical review to documents by @archlitchi in #1752
- test: improve coverage for pkg/version/version.go by @Yonsun-w in #1748
- Configure ASCEND_VISIBLE_DEVICES env for container and RuntimeClassName for pods by @peachest in #1738
- test: improve nvidia device.go UT coverage to 98.5% by @kenwoodjw in #1757
- security: upgrade golang for security issue by @Shouren in #1772
- Upgrade Go to 1.26.2 by @luohua13 in #1791
- update benchmarks by @maverick123123 in #1790
- add DSFans2014 to approvers by @DSFans2014 in #1805
- Disable host network for device plugin by @luohua13 in #1789
- bump HAMi-DRA version to v0.2.0 by @FouoF in #1845
- Preparations to release v2.9.0 by @archlitchi in #1850
New Contributors
- @maishivamhoo123 made their first contribution in #1620
- @hoteye made their first contribution in #1627
- @jsl9208 made their first contribution in #1637
- @ashergaga made their first contribution in #1610
- @Atroxgod made their first contribution in #1690
- @MyoungHaSong made their first contribution in #1644
- @charford made their first contribution in #1703
- @jcustenborder made their first contribution in #1706
- @Nov11 made their first contribution in #1712
- @ilia-medvedev made their first contribution in #1714
- @Yonsun-w made their first contribution in #1749
- @CFH2436 made their first contribution in #1759
- @kenwoodjw made their first contribution in #1757
- @anandj91 made their first contribution in #1760
- @ManishSharma1609 made their first contribution in #1801
- @maverick123123 made their first contribution in #1790
- @almazkhalikov made their first contribution in #1783
- @lin121291 made their first contribution in #1820
- @mesutoezdil made their first contribution in #1831
Full Changelog: v2.8.0...v2.9.0