Key features:
- Optimize scheduler log
- Support Enflame gcu-share
- Support Metax GPU and Metax sGPU
- Helm chart: add a checksum annotation so that HAMi components restart after a ConfigMap modification
- Support using RuntimeClass with NVIDIA devices
- Add support for profiling via net/http/pprof package
- Add NVIDIA GPU topology score registry to node
- vGPUmonitor: support MigInfo metrics
Bug fixes:
- Fix getting stuck with driver 570+
- Fix device memory not counted properly in ComfyUI tasks
- Fix Cambricon devices not allocated properly
- Fix wrong log and container request device count error
- Fix inconsistent vgpu-devices-allocated annotations
- Fix removing node devices from node manager
- Fix dynamic GPU partitioning lacking single-GPU-level granularity
- Fix device memory count error on cuMallocAsync
- Fix scheduler crash when a 'mig' task accidentally runs on a 'hami-core' GPU
- Fix multi-process device memory count
What's Changed
⬆️ Dependencies
- Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0 by @dependabot in #631
- Bump nvidia/cuda from 12.4.1-base-ubuntu22.04 to 12.6.3-base-ubuntu22.04 in /docker by @dependabot in #676
- Bump actions/upload-artifact from 4.4.3 to 4.5.0 by @dependabot in #717
- Bump docker/build-push-action from 6.9.0 to 6.10.0 by @dependabot in #644
- Bump docker/build-push-action from 6.10.0 to 6.11.0 by @dependabot in #792
- Bump golang.org/x/net from 0.26.0 to 0.33.0 by @dependabot in #839
- Bump docker/build-push-action from 6.11.0 to 6.13.0 by @dependabot in #837
- Bump golang.org/x/net from 0.26.0 to 0.35.0 by @dependabot in #859
- Bump aquasecurity/trivy-action from 0.29.0 to 0.30.0 by @dependabot in #941
- Bump docker/login-action from 3.3.0 to 3.4.0 by @dependabot in #942
- Bump docker/build-push-action from 6.13.0 to 6.15.0 by @dependabot in #899
- build(deps): bump docker/build-push-action from 6.15.0 to 6.16.0 by @dependabot in #1024
- build(deps): bump docker/build-push-action from 6.16.0 to 6.17.0 by @dependabot in #1052
- build(deps): bump docker/build-push-action from 6.17.0 to 6.18.0 by @dependabot in #1091
🔨 Other Changes
- Fix Kubernetes version string handling by stripping metadata by @Nimbus318 in #623
- Update vGPUmonitor to add dynamic adjustment on core and memory limit by @archlitchi in #624
- feat: support device plugin daemonset update strategy by @devenami in #628
- add ut about schedule policy by @yt-huang in #638
- Fix: Refactor the license based on the approaches used in OpenSearch and ElasticSearch. by @haitwang-cloud in #626
- add ut for the scheduler by @shijinye in #645
- docs(issue-tmpl): add FAQ link to issue templates by @Nimbus318 in #647
- fix: filter device registry to node by @lengrongfu in #639
- Add self-hosted runner by @archlitchi in #659
- fix-example-yaml by @WQL782795 in #667
- update docs by @yangshiqi in #668
- add ut for ascend by @shijinye in #664
- optimization map init in test by @lengrongfu in #678
- Optimize monitor by @for800000 in #683
- fix code lint failed by @lengrongfu in #685
- fix(helm): Add NODE_NAME env var to the vgpu-monitor container from spec.nodeName by @Nimbus318 in #687
- fix vGPUmonitor deviceidx is always 0 by @lengrongfu in #684
- add ut for pkg/scheduler/event.go by @Penguin-zlh in #688
- add ut for nodes by @shijinye in #695
- add license for pkg/scheduler/event_test.go by @Penguin-zlh in #706
- fix: exception happen when creating multiple ascend-gpu pods concurrently by @lijm87 in #575
- add ut for device/nvidia by @shijinye in #657
- add ut for pkg/monitor/nvidia/v0/spec.go by @yt-huang in #670
- Enable Dynamic-mig feature for HAMi by @archlitchi in #708
- Fix chart can not be deployed properly by @archlitchi in #711
- Fix NodeLock issue by @archlitchi in #714
- fix example yaml by @lixd in #709
- add ut for device/cambricon by @shijinye in #712
- Update dynamic mig documents and examples by @archlitchi in #718
- random time may be zero by @shijinye in #697
- fix grafana dashboard and clarify dashboard usage more clearly. by @jiangsanyin in #543
- doc(README): add examples for GPU sharing and update-examples by @xiaoyao in #665
- add ut for github.com/Project-HAMi/HAMi/pkg/scheduler/pod.go by @yt-huang in #673
- Add design document to 'dynamic-mig' feature by @archlitchi in #725
- fix(doc): fix a typo and resolve markdown warnings in the tasklist by @elrondwong in #724
- add ut for pkg/util/nodelock/nodelock.go by @learner0810 in #719
- test: add ut for pkg/version/version.go by @Penguin-zlh in #677
- Update on mig mode by @archlitchi in #726
- Update documents for config & config_cn by @archlitchi in #729
- set PASS_DEVICE_SPECS ENV to device-plugin by @jingzhe6414 in #690
- fix device-plugin-version by @learner0810 in #743
- feat: Return the nodes that failed to be scheduled back to the scheduler by @chaunceyjiang in #746
- fix(log): fix missing log output in nvidiadeviceplugin server by @elrondwong in #735
- support configuration resources limits and requests by @flpanbin in #739
- feat(test): add TestMarshalNodeDevices scenarios by @elrondwong in #747
- print flags for device-plugin and scheduler by @flpanbin in #756
- Fix typos, add more contributors and maintainers. by @yangshiqi in #765
- Add a mind map(Chinese and English) to help understand this project by @oceanweave in #764
- [Docs] update config pages by @windsonsea in #760
- add ut for device-map by @KubeKyrie in #762
- refactor(ci): use go.mod file for Go version in workflows by @yxxhero in #766
- support set log level for device plugin by @flpanbin in #771
- feat: Restart/Upgrade device-plugin will not affect services. by @chaunceyjiang in #767
- add ut nvml devices by @KubeKyrie in #773
- add ut for device-map by @KubeKyrie in #772
- Optimize the time format layout by @learner0810 in #741
- fix: nvidia-device-plugin no version info by @chaunceyjiang in #779
- HAMi supports e2e by @Rei1010 in #775
- Proposal: enable E2E test by @Rei1010 in #633
- add ut for device/iluvatar by @shijinye in #795
- add ut for device/hygon by @shijinye in #787
- add ut for pkg/monitor/nvidia/v1 by @shijinye in #780
- refactor(logging): enhance log messages for device resource counting by @haitwang-cloud in #778
- Enrich pod health check by @Rei1010 in #801
- docs: fix broken link by @lixd in #802
- Optimize the E2E execution logic by @Rei1010 in #803
- optimize MetricsBindAddress to MetricsBindPort by @phoenixwu0229 in #796
- fix: handle the node nil issue & E2E test failure by @haitwang-cloud in #804
- add ut for device/mthreads by @shijinye in #808
- fix: Resolve formatting issue in ConfigMap causing display anomalies by @lixd in #814
- [docs] Update ascend910b-support.md by @windsonsea in #816
- Refine metrics logs by @haitwang-cloud in #817
- Update mig-related logics and refine logs by @archlitchi in #833
- Add 910B4 config to device-configmap for ascend by @lijm87 in #828
- [docs] fix: glibc version requirement in README by @chinaran in #826
- Update HAMi-core for v2.5.0 by @archlitchi in #834
- Fix multi-process device memory count issue by @archlitchi in #835
- bump version to v2.5.0 by @wawa0210 in #836
- Fix CI by @archlitchi in #838
- Fix CI release by @archlitchi in #840
- Fix release ci by @archlitchi in #841
- Fix Dockerfile to make CI pass by @archlitchi in #846
- Fix E2E failure with pod status check by @Rei1010 in #847
- Fix scheduler crash if a 'mig' task running accidentally on a 'hami-core' GPU by @archlitchi in #848
- fix: Enhance GPU metrics collection and error handling in vGPU monitor by @haitwang-cloud in #827
- refactor: update service configurations for device plugin and scheduler by @haitwang-cloud in #799
- add ut for scheduler/score by @shijinye in #853
- add ut for device/metax by @shijinye in #850
- Remove duplicate log fields by @learner0810 in #860
- [docs] Fix default nvidia.resourceCoreName value in config.md by @chinaran in #842
- Fix: Update handling of version strings in Helm template and helpers.tpl by @HJJ256 in #845
- Update libvgpu.so by @archlitchi in #876
- update example.png by @rockpanda in #874
- fix: Set passDeviceSpecsEnabled to false by default in device plugin by @Nimbus318 in #872
- support ascend 910B2 by @ouyangluwei163 in #885
- fix docs typos by @JinVei in #869
- Accelerate node score calculations using multiple goroutines by @learner0810 in #824
- fix: scheduler ignore KUBECONFIG env even if this environment variable is set by @Shouren in #681
- fix: correct device filter initialization order by @Nimbus318 in #857
- fix parseNvidiaNumaInfo index out of range by @flpanbin in #889
- Support Metax SGPU to sharing GPU by @Kyrie336 in #895
- docs: fix broken community links by @agilgur5 in #907
- add config gpu core isolation policy for webhook by @lengrongfu in #901
- feat: support scheduler replicas > 1 by @Azusa-Yuan in #898
- docs: add syntax highlighting to various code blocks by @agilgur5 in #906
- Fix UT not be properly executed during CI phase by @archlitchi in #911
- typo: fix typos in log and comment by @popsiclexu in #917
- feat: Add kube-qps and kube-burst parameters. by @chaunceyjiang in #769
- docs: Update MAINTAINERS file with current contributor information by @Nimbus318 in #918
- Nominate chaunceyjiang to reviewer by @chaunceyjiang in #926
- build: update dependencies and remove unused cdiapi by @yxxhero in #903
- add lengrongfu to reviewers by @lengrongfu in #937
- Fix cambricon pods not been recognized by HAMi scheduler by @archlitchi in #947
- chore: add namespace override for multi-namespace deployments by @chinaran in #924
- fix: hygon dcu concurrent creation conflict by @joy717 in #921
- Fix the wrong describe of device registry in protocol.md by @hurricane1988 in #910
- fix ubuntu base image in Dockerfile.withlib by @flpanbin in #944
- chore: helm chart support scheduler webhook cert-manager by @chinaran in #951
- refactor(scheduler): replace init methods with constructor functions by @yxxhero in #905
- add Dependencies policy and Security policy by @yangshiqi in #934
- fix: Add error handling for nvml.Init in NvidiaDevicePlugin by @yxxhero in #982
- scheduler: fix blocked the nodeNotify channel when node changes by @Iceber in #964
- docs: Update Ascend910 support documentation by @zhaikangqi331 in #988
- update iluvatar's docs by @yangshiqi in #995
- refactor: replace interface{} with any in various files by @yxxhero in #1000
- scheduler: fix duplicate handling of the node label selector by @Iceber in #965
- refactor(.github/workflows/ci.yaml): Update golangci-lint to v2.0 and modify .golangci.yaml by @yxxhero in #1002
- update hami arch by @wawa0210 in #1007
- Update README.md by @yowenter in #1005
- refactor: simplify code by using modern constructs by @Shouren in #978
- scheduler: fix removing node devices from node manager by @Iceber in #966
- feat: Add support for profiling via net/http/pprof package by @Shouren in #963
- Support Enflame gcushare for enflame devices by @archlitchi in #1013
- docs: Remove ACTIVE_OOM_KILLER environment variable description by @chinaran in #1015
- refactor(vGPUmonitor): change Run to RunE and return errors by @yxxhero in #999
- refactored the filter logs and event messages to enhance their clarity, by @Wangmin362 in #1023
- feat: Support for using RuntimeClass with nvidia devices by @chinaran in #1021
- fix wrong log and container request device count error by @Wangmin362 in #1020
- feat: helm chart add checksum annotation for restarting hami component after ConfigMap modification by @chinaran in #1022
- fix vgpu-devices-allocated annotations are inconsistent #991 by @ouyangluwei163 in #1012
- add Enflame GCU S60 into roadmap. by @winston-zhang-orz in #1030
- Fix device memory count error on cuMallocAsync by @archlitchi in #1029
- Fix 31993 not launched properly when using 'master' branch of HAMi by @archlitchi in #1031
- add nvidia-smi command show cuda version info by @lengrongfu in #953
- Separate options from client to make the responsibility more clear. by @yangshiqi in #938
- Add nvidia gpu topology score registry to node by @lengrongfu in #1018
- fix(cicd): update ci.yaml to upload coverage to Codecov by @Shouren in #1056
- feat(Actions): Add an action to label pr automatically by @Shouren in #1053
- fix: Improve Metax GPU usability and fix related issues by @Kyrie336 in #1063
- fix(chart): support GKE pre-release versions via kubeVersion '-0' by @Nimbus318 in #1072
- fix: Dynamic GPU partitioning lacks single-GPU-level granularity. (#1… by @Goend in #1061
- update maintainer information by @wawa0210 in #1079
- add LIBCUDA_LOG_LEVEL env to device-plugin by @lengrongfu in #1087
- fix: missing apiVersion in serviceMonitor dashboard docs by @ntheanh201 in #1077
- test(pkg/util): Add some unit tests for pkg/util by @Shouren in #1067
- feat: vGPUmonitor support MigInfo metrics by @ouyangluwei163 in #1048
- update hami-core version by @lengrongfu in #1082
New Contributors
- @yt-huang made their first contribution in #638
- @shijinye made their first contribution in #645
- @WQL782795 made their first contribution in #667
- @yangshiqi made their first contribution in #668
- @for800000 made their first contribution in #683
- @Penguin-zlh made their first contribution in #688
- @lixd made their first contribution in #709
- @jiangsanyin made their first contribution in #543
- @xiaoyao made their first contribution in #665
- @elrondwong made their first contribution in #724
- @learner0810 made their first contribution in #719
- @jingzhe6414 made their first contribution in #690
- @flpanbin made their first contribution in #739
- @oceanweave made their first contribution in #764
- @windsonsea made their first contribution in #760
- @KubeKyrie made their first contribution in #762
- @yxxhero made their first contribution in #766
- @Rei1010 made their first contribution in #775
- @phoenixwu0229 made their first contribution in #796
- @chinaran made their first contribution in #826
- @HJJ256 made their first contribution in #845
- @rockpanda made their first contribution in #874
- @ouyangluwei163 made their first contribution in #885
- @JinVei made their first contribution in #869
- @Shouren made their first contribution in #681
- @Kyrie336 made their first contribution in #895
- @agilgur5 made their first contribution in #907
- @Azusa-Yuan made their first contribution in #898
- @popsiclexu made their first contribution in #917
- @hurricane1988 made their first contribution in #910
- @Iceber made their first contribution in #964
- @zhaikangqi331 made their first contribution in #988
- @yowenter made their first contribution in #1005
- @Wangmin362 made their first contribution in #1023
- @winston-zhang-orz made their first contribution in #1030
- @Goend made their first contribution in #1061
- @ntheanh201 made their first contribution in #1077
Full Changelog: v2.4.1...v2.6.0