Features
- add cloud provider disruption reasons (#1574) #1574 (Nick Tran)
- add cloudprovider specific Eventual disruption methods (#1588) #1588 (Amanuel Engeda)
- implement graceful error handling in operator package (#1683) #1683 (Daniel Wang)
- add log for scheduling progress for long running scheduling sim… (#1788) #1788 (Nick Tran)
- add instance type drift for instance types that are no longer discoverable (#1787) #1787 (Nick Tran)
- Add status condition controller for node objects (#1808) #1808 (Amanuel Engeda)
- add pod acknowledged time metric (#1803) #1803 (Nick Tran)
- Node Repair implementation (#1793) #1793 (Amanuel Engeda)
- only operate on cloudprovider managed resources (#1818) #1818 (Jason Deal)
Bug Fixes
- Wait for pods to be fully terminated in groups (#1478) #1478 (Jonathan Innis)
- Fix unintentionally importing envtest when running NewTestingQueue (#1660) #1660 (Jonathan Innis)
- add default disruption stanza (#1662) #1662 (Nick Tran)
- Ensure all patch calls can conflict when resource version doesn't match (#1658) #1658 (Jonathan Innis)
- Fix race condition with using newer NodePool for hash annotation (#1666) #1666 (Jonathan Innis)
- handle NodeClaim termination in the lifecycle controller (#1721) #1721 (Jason Deal)
- correct error type for NodeClaim helpers (#1739) #1739 (Jason Deal)
- clarify state node logging (#1766) #1766 (Reed Schalo)
- check for nil pointer dereference (#1763) #1763 (Reed Schalo)
- dedupe expiration reconciliations (#1794) #1794 (Jason Deal)
- ensure node leases aren't leaked (#1807) #1807 (Jason Deal)
- spurious disruption budget eventing (#1854) #1854 (Jason Deal)
- hydration race for terminating nodes (#1853) #1853 (Jason Deal)
Documentation
- RFC for disruption.terminationGracePeriod feature (#834) #834 (wmgroot)
- update kwok nodepool example (#1587) #1587 (helen)
- add Alibaba Cloud repo link (#1707) #1707 (jwcesign)
Performance Improvements
- don't include terminated nodes in budget (#1735) #1735 (Nick Tran)
- Unregister the topology domain when failing NodeClaim creation (#1819) #1819 (Jonathan Innis)
- Drop unneeded recalculation for resources and available (#1823) #1823 (Jonathan Innis)
- Pre-filter instance types on nodepool requirements (#1824) #1824 (Jonathan Innis)
- Don't reparse selector on each topology select check (#1822) #1822 (Jonathan Innis)
- Improve TopologyGroup node domain iteration (#1820) #1820 (Jonathan Innis)
- Cache the pod request calculation in memory (#1825) #1825 (Jonathan Innis)
- Cache taints for existing nodes (#1827) #1827 (Jonathan Innis)
Tests
- ensure NodeClaim root condition remains known during termination (#1845) #1845 (Jason Deal)
Continuous Integration
- Enable copyloopvar in linter (#1589) #1589 (Jonathan Innis)
- Drop running kind jobs on main until they are green (#1834) #1834 (Jonathan Innis)
Chores
- deps: bump sigs.k8s.io/controller-runtime from 0.18.4 to 0.18.5 in the k8s-go-deps group (#1567) #1567 (dependabot[bot])
- deps: bump the go-deps group with 2 updates (#1568) #1568 (dependabot[bot])
- deps: bump the go-deps group with 3 updates (#1583) #1583 (dependabot[bot])
- non-constant format string in call to fmt.Errorf (#1579) #1579 (helen)
- Bump go to use 1.23 (#1584) #1584 (Jonathan Innis)
- Drop node lease garbage collection controller (#1586) #1586 (Jigisha Patil)
- deps: bump the go-deps group with 2 updates (#1600) #1600 (dependabot[bot])
- Drop conversion webhooks (#1598) #1598 (Jigisha Patil)
- Add readiness check on cache sync (#1593) #1593 (Jigisha Patil)
- Remove v1beta1 references (#1602) #1602 (Jigisha Patil)
- Drop knative imports (#1606) #1606 (Jigisha Patil)
- Drop custom workqueue metrics provider (#1607) #1607 (Jigisha Patil)
- added cluster utilization metric and a couple of node lifetime metrics (#1603) #1603 (Youssef Beltagy)
- deps: bump the go-deps group with 3 updates (#1626) #1626 (dependabot[bot])
- deps: bump actions/setup-python from 5.1.1 to 5.2.0 in the actions-deps group (#1627) #1627 (dependabot[bot])
- deps: bump the k8s-go-deps group with 9 updates (#1582) #1582 (dependabot[bot])
- don't emit disruption events for non-karpenter nodes (#1644) #1644 (Nick Tran)
- Bump github.com/awslabs/operatorpkg to latest (#1656) #1656 (Jonathan Innis)
- Bump github.com/awslabs/operatorpkg to latest (#1661) #1661 (Jonathan Innis)
- deps: bump the k8s-go-deps group with 7 updates (#1670) #1670 (dependabot[bot])
- deps: bump the go-deps group with 2 updates (#1671) #1671 (dependabot[bot])
- Fix possible Registration TTL race condition (#1665) #1665 (Jonathan Innis)
- Allow setting leader election namespace when running out of cluster (#1704) #1704 (Jonathan Innis)
- deps: bump the go-deps group with 2 updates (#1706) #1706 (dependabot[bot])
- Additional upstream metrics (#1672) #1672 (Jigisha Patil)
- Bump operatorpkg (#1701) #1701 (Jigisha Patil)
- upgrade CI to use 1.31 (#1722) #1722 (Nick Tran)
- deps: bump actions/checkout from 4.1.7 to 4.2.0 in /.github/actions/install-prometheus in the action-deps group (#1724) #1724 (dependabot[bot])
- deps: bump actions/checkout from 4.1.7 to 4.2.0 in /.github/actions/install-pyroscope in the action-deps group (#1725) #1725 (dependabot[bot])
- deps: bump actions/checkout from 4.1.7 to 4.2.0 in the actions-deps group (#1726) #1726 (dependabot[bot])
- deps: bump actions/cache from 4.0.2 to 4.1.0 in /.github/actions/install-deps in the action-deps group (#1744) #1744 (dependabot[bot])
- deps: bump actions/checkout from 4.2.0 to 4.2.1 in the actions-deps group (#1741) #1741 (dependabot[bot])
- deps: bump actions/checkout from 4.2.0 to 4.2.1 in /.github/actions/install-pyroscope in the action-deps group (#1742) #1742 (dependabot[bot])
- deps: bump actions/checkout from 4.2.0 to 4.2.1 in /.github/actions/install-prometheus in the action-deps group (#1745) #1745 (dependabot[bot])
- Add pod metrics (#1738) #1738 (Jigisha Patil)
- deps: bump the go-deps group with 2 updates (#1743) #1743 (dependabot[bot])
- remove dead disruptable code (#1751) #1751 (Reed Schalo)
- deps: bump actions/cache from 4.1.0 to 4.1.1 in /.github/actions/install-deps in the action-deps group (#1752) #1752 (dependabot[bot])
- Scope role down to needed permissions (#1758) #1758 (Jonathan Innis)
- deps: bump github.com/prometheus/client_golang from 1.20.4 to 1.20.5 in the go-deps group (#1764) #1764 (dependabot[bot])
- remove unnecessary for delete errors (#1770) #1770 (helen)
- deps: bump kubernetes-sigs/release-actions from 0.2.0 to 0.3.0 in the actions-deps group (#1765) #1765 (dependabot[bot])
- deps: bump actions/checkout from 4.2.1 to 4.2.2 in /.github/actions/install-prometheus in the action-deps group (#1783) #1783 (dependabot[bot])
- deps: bump the k8s-go-deps group with 8 updates (#1781) #1781 (dependabot[bot])
- deps: bump the actions-deps group with 2 updates (#1782) #1782 (dependabot[bot])
- deps: bump the action-deps group in /.github/actions/install-deps with 2 updates (#1779) #1779 (dependabot[bot])
- deps: bump actions/checkout from 4.2.1 to 4.2.2 in /.github/actions/install-pyroscope in the action-deps group (#1780) #1780 (dependabot[bot])
- upgrade to go 1.23.2 (#1784) #1784 (Andrew J. Brown)
- deps: bump the go-deps group with 2 updates (#1795) #1795 (dependabot[bot])
- make leader election ID configurable (#1797) #1797 (Saurav Agarwalla)
- Update max delay for NodeClaim lifecycle controller (#1773) #1773 (edibble21)
- deps: bump the go-deps group with 2 updates (#1806) #1806 (dependabot[bot])
- Add unfinished_work_seconds metric (#1809) #1809 (Jigisha Patil)
- create new error type to wrap message and error (#1811) #1811 (Jigisha Patil)
- Convert karpenter_scheduler_unfinished_work_seconds to seconds (#1812) #1812 (Jonathan Innis)
- Bump operatorpkg to include object based metrics (#1814) #1814 (Amanuel Engeda)
- fix unfinishedWorkSeconds metric (#1816) #1816 (Jigisha Patil)
- increase bucket sizes for pod scheduling metrics (#1817) #1817 (Nick Tran)
- Update kwok chart to latest (#1821) #1821 (Jonathan Innis)
- Remove a controller that was used for testing (#1830) #1830 (Amanuel Engeda)
- Fix undecided time naming (#1833) #1833 (Jonathan Innis)
- Remove Eventual Disruption cloud provider interface (#1832) #1832 (Amanuel Engeda)
- fix nodeclaim going unknown during instance termination (#1835) #1835 (Reed Schalo)
- Add pod scheduled prometheus label (#1836) #1836 (Jonathan Innis)
- deps: bump the go-deps group with 2 updates (#1841) #1841 (dependabot[bot])
- deps: bump the k8s-go-deps group with 8 updates (#1840) #1840 (dependabot[bot])
- Limit Node Repair based on NodePool (#1831) #1831 (Amanuel Engeda)
Commits
- make the import order follow the best practice (#1520) #1520 (jwcesign)
- make kwok docs better for local development (#1609) #1609 (jwcesign)
- Capacity Reservations (#1760) #1760 (Jonathan Innis)
- f7abd62: Implemented UnschedulablePodsCount metric (#1698) (edibble21) #1698
- 48c14fa: Fix pod metrics that don't get deleted when the pod is deleted (#1796) (Jigisha Patil) #1796
- update alibabacloud repo link (#1798) #1798 (Wei)
- Node Auto Repair (#1768) #1768 (Amanuel Engeda)
- 2f80354: Abstract prometheus metrics into interfaces (#1801) (Jonathan Innis) #1801
- 965ff61: add cluster-api provider to readme (#1815) (Michael McCune) #1815
- bump operatorpkg (#1843) #1843 (Jason Deal)