What's Changed
✨ Key Features
- Metax sGPU topology aware by @Kyrie336 in #1193
- NVIDIA Resourcequota by @FouoF in #1359
- Kunlunxin topology-aware scheduling by @FouoF in #1141
- Kunlunxin vxpu support #1016 by @ouyangluwei163 in #1337
- Enflame GCU topology-awareness (#1040) by @zhaikangqi331 in #1334
- AWS-neuron device and device-core allocation by @archlitchi in #1238
- Aggregated Scheduling Failure Events by @Wangmin362 in #1333
✨ Other Features
- Optimize Fit-in-device logic to make it device-specific by @archlitchi in #1097
- feat(scheduler): make node lock timeout configurable by @Kevinz857 in #1117
- feature: mig mode-change #1116 by @ouyangluwei163 in #1124
- feat: Add new labels in .github/release.yml by @Shouren in #1066
- feat(scheduler-role): use a scoped-down role for scheduler by @Antvirf in #1152
- feat(helm): optionally disable admission webhook by @Antvirf in #1145
- remove redundant metrics for vgpu allocation by @FouoF in #1169
- refactor: clean up code and improve maintainability by @Wangmin362 in #1195
- refactor: Ranging over SplitSeq is more efficient by @Shouren in #1239
- feat: NodeLockTimeout set from env by @miaobyte in #1244
- refactor: move watchAndFeedback function to feedback.go by @miaobyte in #1248
- feat: add informer-based pod cache to reduce API server load by @miaobyte in #1250
- feat: Add option to disable device plugin at values.yaml. by @FouoF in #1274
- refactor(util/nodelock): replace manual polling with k8s.io/client-go/util/retry by @mayooot in #1252
- refactor: Remove annotation in Devices interfaces by @Shouren in #1343
- feat: update the Ascend910 scheduling policy by @DSFans2014 in #1344
- feat(nvidia): default gpucores=100 when memory is exclusive and cores… by @xrwang8 in #1354
🐛 Bug Fixes
- fix: Before executing MIG partitioning, suppress NVML usage in o… by @Goend in #1095
- Fix golint-CI by @archlitchi in #1127
- fix: override node score failure for kunlun #1137 by @ouyangluwei163 in #1138
- fix: Multi-node scoring nodes are inaccurate by @ouyangluwei163 in #1147
- fix: An error occurred while creating an Iluvatar pod by @ouyangluwei163 in #1149
- Fix e2e CI by @archlitchi in #1165
- fix: Add option for overwrite schedulerName by @Shouren in #1163
- fix: using go-safecast to fix incorrect conversion of numbers by @Shouren in #1183
- fix: deal with security issues reported by Trivy in image by @Shouren in #1189
- fix: wrong Pod's UID and empty Pod's name in log of webhook.go by @Shouren in #1092
- fix: concurrent map writes error in scheduler.calcScore #1269 by @Shouren in #1270
- fix: release dangling node lock by @peachest in #1271
- fix: fix error that retrieved incorrect NUMA node information, issue #1275 by @abstractmj in #1276
- fix(security): resolve issues reported by Code scanning in Security by @Shouren in #1280
- fix: fix golangci-lint error by @DSFans2014 in #1319
- Fix: device allocation missing containers with no device request by @FouoF in #1299
- fix: update int8Slice to uint8Slice for better type clarity and consistency by @yxxhero in #1357
📚 Documentation
- documentation: add Known Issues for dynamic mig support by @Goend in #1122
- docs: fix broken link by @lixd in #1125
- clearly list supported devices doc references at README by @FouoF in #1155
- docs: update ascend910b-support docs by @DSFans2014 in #1321
🔨 Other Changes
- Prerelease-v2.6 by @archlitchi in #1108
- add new reviewers Shouren and ouyangluwei163 by @wawa0210 in #1131
- Support topology-awareness for Kunlunxin device by @archlitchi in #1121
- Support Metax sGPU Qos Policy by @Kyrie336 in #1123
- add global image for chart by @calvin0327 in #1133
- fix: Skip admission webhook when Pod's scheduler is already assigned. by @ghostloda in #1041
- Add node configs to docs by @wylswz in #1159
- build(deps): upgrade golang to 1.24.4 by @Shouren in #1172
- build(deps): Upgrade golang image in ci to 1.24.4 by @Shouren in #1176
- build(deps): Upgrade controller-runtime to 0.21.0 by @Shouren in #1171
- build(deps): Bump github.com/NVIDIA/nvidia-container-toolkit by @Shouren in #1170
- Add unit tests for Fit Function for enflame, hygon, metax, mthreads, nvidia by @Wangmin362 in #1199
- [Misc] update hami-core version by @chaunceyjiang in #1201
- Improve the impl of DevicePluginConfigs.Nodeconfig overwriting NvidiaConfig by @FouoF in #1158
- Add unit tests for cambricon's Fit Function by @Wangmin362 in #1198
- Add unit tests for Ascend's Fit Function by @Wangmin362 in #1197
- Fix unnecessary repeated computation when generating pod resource requests by @litaixun in #1215
- Fix the log message shown when updating node annotations by @litaixun in #1214
- If the mem applied for the Mig device is the same as the template value, it will result in CardNotFoundCustom Filter Rule. by @zgqqiang in #1179
- updated dri section to combine text for better readability by @mpetason in #1216
- feat: Add nvidia gpu topology scheduler by @fyp711 in #1028
- add issue translate robot by @wawa0210 in #1232
- add issue translate robot by @wawa0210 in #1234
- perf(util/nodelock): Use clientset Patch instead of Update. by @mayooot in #1192
- Update hami-core and fix readme documents by @archlitchi in #1240
- Update hami-core version to fix by @archlitchi in #1256
- [Snyk] Security upgrade tensorflow/tensorflow from latest-gpu to 2.20.0rc0-gpu by @wawa0210 in #1243
- feat: Add an action of 'Close stale issue and PRs' in github workflow by @Shouren in #1083
- Welcome fyp711 to become a HAMi member by @wawa0210 in #1288
- Add values readme by @clcc2019 in #1267
- Support Metax sGPU device health check by @Kyrie336 in #1295
- Optimize pkg/util.go and distribute logics to corresponding logics by @archlitchi in #1296
- cleanup: Clear and correct ascend device name by @FouoF in #1315
- bugfix: Nvidia card abnormal pod will still continue to schedule by @zgqqiang in #1336
- Fix CI, add 910B4-1 template and fix vGPUmonitor metrics error by @archlitchi in #1345
- add httpTargetPort to values.yaml by @flpanbin in #1356
- Update kunlunxin documents by @archlitchi in #1366
- update chart version and hami-core by @archlitchi in #1369
New Contributors
- @Kevinz857 made their first contribution in #1117
- @FouoF made their first contribution in #1141
- @Antvirf made their first contribution in #1152
- @wylswz made their first contribution in #1159
- @litaixun made their first contribution in #1215
- @zgqqiang made their first contribution in #1179
- @mpetason made their first contribution in #1216
- @fyp711 made their first contribution in #1028
- @mayooot made their first contribution in #1192
- @miaobyte made their first contribution in #1244
- @peachest made their first contribution in #1271
- @abstractmj made their first contribution in #1276
- @clcc2019 made their first contribution in #1267
- @DSFans2014 made their first contribution in #1319
- @xrwang8 made their first contribution in #1354
Full Changelog: v2.6.1...v2.7.0