Project-HAMi/HAMi v2.7.0 on GitHub

What's Changed

✨ Key Features

Metax sGPU topology aware by @Kyrie336 in #1193
NVIDIA Resourcequota by @FouoF in #1359
Kunlunxin topology-aware scheduling by @FouoF in #1141
Kunlunxin vxpu support #1016 by @ouyangluwei163 in #1337
Enflame GCU topology-awareness (#1040) by @zhaikangqi331 in #1334
AWS-neuron device and device-core allocation by @archlitchi in #1238
Aggregated Scheduling Failure Events by @Wangmin362 in #1333

✨ Other Features

Optimize Fit-in-device logic to make it device-specific by @archlitchi in #1097
feat(scheduler): make node lock timeout configurable by @Kevinz857 in #1117
feature: mig mode-change #1116 by @ouyangluwei163 in #1124
feat: Add new labels in .github/release.yml by @Shouren in #1066
feat(scheduler-role): use a scoped-down role for scheduler by @Antvirf in #1152
feat(helm): optionally disable admission webhook by @Antvirf in #1145
remove redundant metrics for vgpu allocation by @FouoF in #1169
refactor: clean up code and improve maintainability by @Wangmin362 in #1195
refactor: Ranging over SplitSeq is more efficient by @Shouren in #1239
feat:NodeLockTimeout set from env by @miaobyte in #1244
refactor: move watchAndFeedback function to feedback.go by @miaobyte in #1248
feat: add informer-based pod cache to reduce API server load by @miaobyte in #1250
feat: Add option to disable device plugin at values.yaml. by @FouoF in #1274
refactor(util/nodelock): replace manual polling with k8s.io/client-go/util/retry by @mayooot in #1252
refactor: Remove annotation in Devices interfaces by @Shouren in #1343
feat: update the Ascend910 scheduling policy by @DSFans2014 in #1344
feat(nvidia): default gpucores=100 when memory is exclusive and cores… by @xrwang8 in #1354

🐛 Bug Fixes

fix: Before executing MIG partitioning, suppress NVML usage in o… by @Goend in #1095
Fix golint-CI by @archlitchi in #1127
fix: override node score failure for kunlun #1137 by @ouyangluwei163 in #1138
fix: Multi-node scoring nodes are inaccurate by @ouyangluwei163 in #1147
fix: An error occurred while create Iluvatar pod by @ouyangluwei163 in #1149
Fix e2e CI by @archlitchi in #1165
fix: Add option for overwrite schedulerName by @Shouren in #1163
fix: using go-safecast to fix incorrect conversion of numbers by @Shouren in #1183
fix: deal with security issues reported by Trivy in image by @Shouren in #1189
fix: wrong Pod's UID and empty Pod's name in log of webhook.go by @Shouren in #1092
fix: concurrent map writes error in scheduler.calcScore #1269 by @Shouren in #1270
fix: release dangling node lock by @peachest in #1271
fix: fix err which retrieved incorrect NUMA node information issue #1275 by @abstractmj in #1276
fix(security): resolve issues reported by Code scanning in Security by @Shouren in #1280
fix: fix golangci-lint error by @DSFans2014 in #1319
Fix: device allocation missing containers with no device request by @FouoF in #1299
fix: update int8Slice to uint8Slice for better type clarity and consistency by @yxxhero in #1357
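
For context on the "concurrent map writes" fix listed above (#1270): Go's runtime aborts the program when several goroutines write to the same map without synchronization. The sketch below shows one common remedy, guarding the map with a `sync.Mutex`. It is an illustration under assumptions, not the actual change made in `scheduler.calcScore`; the `scoreTable` type, `scoreNodes` function, and the scores are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// scoreTable is a hypothetical stand-in for a per-node score map.
type scoreTable struct {
	mu     sync.Mutex
	scores map[string]int
}

// set stores a score while holding the lock, so concurrent callers
// never write to the underlying map at the same time.
func (t *scoreTable) set(node string, score int) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.scores[node] = score
}

// scoreNodes scores every node in its own goroutine and returns the
// resulting map once all goroutines have finished.
func scoreNodes(nodes []string) map[string]int {
	t := &scoreTable{scores: make(map[string]int)}
	var wg sync.WaitGroup
	for i, n := range nodes {
		wg.Add(1)
		go func(i int, n string) {
			defer wg.Done()
			t.set(n, i*10) // placeholder score, for illustration only
		}(i, n)
	}
	wg.Wait()
	return t.scores
}

func main() {
	fmt.Println(len(scoreNodes([]string{"node-a", "node-b", "node-c"})))
}
```

Alternatives such as `sync.Map` or collecting results over a channel would also avoid the race; which one fits depends on the access pattern.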

📚 Documentation

documentation: add Known Issues for dynamic mig support by @Goend in #1122
docs: fix broken link by @lixd in #1125
clearly list supported devices doc references at README by @FouoF in #1155
docs: update ascend910b-support docs by @DSFans2014 in #1321

🔨 Other Changes

Prerelease-v2.6 by @archlitchi in #1108
add new reviewers Shouren and ouyangluwei163 by @wawa0210 in #1131
Support topology-awareness for Kunlunxin device by @archlitchi in #1121
Support Metax sGPU Qos Policy by @Kyrie336 in #1123
add global image for chart by @calvin0327 in #1133
fix: Skip admission webhook when Pod's scheduler is already assigned. by @ghostloda in #1041
Add node configs to docs by @wylswz in #1159
build(deps): upgrade golang to 1.24.4 by @Shouren in #1172
build(deps): Upgrade golang image in ci to 1.24.4 by @Shouren in #1176
build(deps): Upgrade controller-runtime to 0.21.0 by @Shouren in #1171
build(deps): Bump github.com/NVIDIA/nvidia-container-toolkit by @Shouren in #1170
Add unit tests for Fit Function for enflame,hygon, metax, mthreads, nvidia by @Wangmin362 in #1199
[Misc] update hami-core version by @chaunceyjiang in #1201
Improve the impl of DevicePluginConfigs.Nodeconfig overwriting NvidiaConfig by @FouoF in #1158
Add unit tests for cambricon's Fit Function by @Wangmin362 in #1198
Add unit tests for Ascend's Fit Function by @Wangmin362 in #1197
Fix unnecessary repeated computation when generating pod resource requests by @litaixun in #1215
Fix the log message wording when updating node annotations by @litaixun in #1214
If the mem applied for the Mig device is the same as the template value, it will result in CardNotFoundCustom Filter Rule by @zgqqiang in #1179
updated dri section to combine text for better readability by @mpetason in #1216
feat: Add nvidia gpu topology scheduler by @fyp711 in #1028
add issue translate robot by @wawa0210 in #1232
add issue translate robot by @wawa0210 in #1234
perf(util/nodelock): Use clientset Patch instead of Update. by @mayooot in #1192
Update hami-core and fix readme documents by @archlitchi in #1240
Update hami-core version to fix by @archlitchi in #1256
[Snyk] Security upgrade tensorflow/tensorflow from latest-gpu to 2.20.0rc0-gpu by @wawa0210 in #1243
feat: Add an action of 'Close stale issue and PRs' in github workflow by @Shouren in #1083
Welcome fyp711 to become a HAMi member by @wawa0210 in #1288
Add values readme by @clcc2019 in #1267
Support Metax sGPU device health check by @Kyrie336 in #1295
Optimize pkg/util.go and distribute logics to corresponding logics by @archlitchi in #1296
cleanup: Clear and correct ascend device name by @FouoF in #1315
bugfix: Nvidia card abnormal pod will still continue to schedule by @zgqqiang in #1336
Fix CI, add 910B4-1 template and fix vGPUmonitor metrics error by @archlitchi in #1345
add httpTargetPort to values.yaml by @flpanbin in #1356
Update kunlunxin documents by @archlitchi in #1366
update chart version and hami-core by @archlitchi in #1369

New Contributors

@Kevinz857 made their first contribution in #1117
@FouoF made their first contribution in #1141
@Antvirf made their first contribution in #1152
@wylswz made their first contribution in #1159
@litaixun made their first contribution in #1215
@zgqqiang made their first contribution in #1179
@mpetason made their first contribution in #1216
@fyp711 made their first contribution in #1028
@mayooot made their first contribution in #1192
@miaobyte made their first contribution in #1244
@peachest made their first contribution in #1271
@abstractmj made their first contribution in #1276
@clcc2019 made their first contribution in #1267
@DSFans2014 made their first contribution in #1319
@xrwang8 made their first contribution in #1354

Full Changelog: v2.6.1...v2.7.0