This is a new major release packed with new features, improvements, and fixes. Read the upgrade notes below carefully before upgrading. Thanks to the contributors!
New
- Added support for external datastores (such as Postgres or MySQL) for HA clusters as an alternative to embedded etcd; an external database scales better than embedded etcd for larger clusters
- Added support for Cilium as CNI, offering significantly improved performance and scalability compared to the default Flannel. Please note that this has been tested with the latest Ubuntu (24.04); there are some known issues when using Cilium with k3s on Ubuntu 22.04 and perhaps other Linux flavours/versions
- Added a configuration option to disable the private network, allowing for much larger clusters (Hetzner private networks are limited to 100 servers per network); see the sketch after this list
- Updated all manifests (CCM, CSI, Autoscaler, System Upgrade Controller)
- Spegel is a new optional software component (enabled by default) that hetzner-k3s can install in the cluster. It enables peer-to-peer distribution of container images between nodes: if an image is already present on other nodes, it is fetched from those nodes instead of the registry. This works around a known issue with some Hetzner IPs being banned by some registries, and also speeds up image pulling
- Added support for Visual Studio Code dev container to make developing the tool easier
- Added support for the Hillsboro, Oregon region, which was not available in the previous version due to a conflict with network zones
- Enabled the local path storage class, for workloads like databases that benefit from maximum IOPS (see this for more info)
- With HA clusters, a load balancer for the API is no longer created. Instead, a multi-context kubeconfig file is generated so you can interact with the cluster via a selected master. This saves costs and is more secure, since direct connections to the masters can be restricted to the networks you specify in the config file.
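For reference, here is a minimal sketch of the networking-related options mentioned above. It only uses key names that appear in the upgrade notes further down in these release notes, with illustrative values; keys for choosing Cilium or an external datastore are not shown here, and exact names and defaults should be checked against the project docs.

```yaml
# Hedged sketch, not a complete config file: only keys named in the
# upgrade notes below are shown, with illustrative values.
networking:
  private_network:
    enabled: false   # disable the private network to go beyond 100 servers per network
  public_network:
    ipv4: true
    ipv6: true
  cni:
    enabled: true    # replaces the old disable_flannel setting (false = bring your own CNI)
    encryption: true # replaces the old enable_encryption setting
```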
Improvements
- Made creation of cloud resources more reliable, with automatic recovery in some scenarios (operations are retried multiple times when something fails, e.g. due to high concurrency or a temporary issue with the Hetzner API)
- Implemented automatic handling of Hetzner API rate limits: the tool waits when the rate limit has been hit and resumes automatically when possible. This makes it easier to create larger clusters that might require a lot of API calls
- Reduced the number of API calls required to handle existing instances when rerunning the tool. To save on API calls, the tool now uses kubectl to check whether a node is already a member of the cluster; if it is, and it is reachable at the external IP address reported by kubectl, the tool doesn't need to make API calls to look up the instance and can proceed with updating it directly. This makes it easier to add many nodes to an existing cluster, since the total number of API calls required for a second run is lower than it would have been before
- Massively improved cluster creation speed: during tests with the private network disabled (private networks support a maximum of 100 servers per network), I was able to create a 200-node cluster in less than 4 minutes
- Improved logging of all actions so it's easier to identify which resource a given log line refers to
- Smarter handling of placement groups: each Hetzner Cloud project allows a maximum of 50 placement groups, with a maximum of 10 servers per placement group. This means a cluster using placement groups would normally be limited to 500 servers (with the private network disabled), but with this update hetzner-k3s makes better use of placement groups and can create additional servers without one once that limit has been reached
- Improved structure of the configuration file to group related settings together in a more coherent way
- The instance type is no longer included in the instance names as this was causing confusion when changing instance type directly from the Hetzner console.
- Raised the timeout for SSH connections to 5 seconds, since a connection can sometimes take longer than 1 second (the previous limit), causing the check of whether a server is ready to hang
- Added information on contributing with the VSCode dev container (by @jpetazzo)
- Use the IP of the load balancer in the kubeconfig instead of a hostname: since the IP of the LB cannot be known in advance, a DNS record may not yet resolve to the correct IP, potentially causing problems with the first interactions with the API server (by @axgkl)
Fixes
- Addressed a known issue with DNS ("Nameserver Limits Exceeded" warning) by forcing k3s to use a custom resolv.conf with a single nameserver (Google's)
- Made it possible to reliably replace the "seed" master, that is the first master used to initialize the cluster. Prior to this change, if the first master needed to be replaced due to faulty hardware or other reasons, the risk of compromising the whole cluster was significant
- Placement groups are now deleted automatically when unused (e.g. after deleting a node pool or the whole cluster)
- Fixed an issue with the detection of the private network interface (by @cwilhelm)
- The default node port range is now automatically opened in the firewall
- Ensured we wait for Cloud Init to complete the initialization process before setting up k3s (by @axgkl)
- When public IPs are disabled in the configuration, now they are also disabled for autoscaled nodes (by @Funzinator)
- Fixed support for multiline post create commands
- Fixed an issue when using a custom SSH port with newer distros that use socket activation (by @jpetazzo)
- Autoscaled nodes are now automatically deleted like static pool nodes
Upgrading from v1.1.5
Important: Read these upgrade notes carefully and test the upgrade with a test cluster first, if possible.
Before upgrading:
- Delete the existing kubeconfig file
- Create the file `/etc/k8s-resolv.conf` on ALL instances (both masters and workers); the file should contain a single line: `nameserver 8.8.8.8`
- Update the config file following the new structure you can see here. For example, move the setting `use_ssh_agent` from the root of the config file to `networking.ssh.use_agent`.
Follow the same pattern for these settings:
ssh_port -> networking.ssh.port
public_ssh_key_path -> networking.ssh.public_key_path
private_ssh_key_path -> networking.ssh.private_key_path
ssh_allowed_networks -> networking.allowed_networks.ssh
api_allowed_networks -> networking.allowed_networks.api
private_network_subnet -> networking.private_network.subnet
disable_flannel -> networking.cni.enabled = false
enable_encryption -> networking.cni.encryption = true
cluster_cidr -> networking.cluster_cidr
service_cidr -> networking.service_cidr
cluster_dns -> networking.cluster_dns
enable_public_net_ipv4 -> networking.public_network.ipv4
enable_public_net_ipv6 -> networking.public_network.ipv6
existing_network -> networking.private_network.existing_network_name
cloud_controller_manager_manifest_url -> manifests.cloud_controller_manager_manifest_url
csi_driver_manifest_url -> manifests.csi_driver_manifest_url
system_upgrade_controller_deployment_manifest_url -> manifests.system_upgrade_controller_deployment_manifest_url
system_upgrade_controller_crd_manifest_url -> manifests.system_upgrade_controller_crd_manifest_url
cluster_autoscaler_manifest_url -> manifests.cluster_autoscaler_manifest_url
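Putting these mappings together, the networking and manifests sections of a migrated config file might look roughly like the sketch below. This is based only on the key names listed above: values are placeholders, optional keys can be omitted, and the rest of the config file is not shown.

```yaml
# Sketch of the new structure for the settings listed above (placeholder values).
networking:
  ssh:
    port: 22
    use_agent: false
    public_key_path: "~/.ssh/id_rsa.pub"
    private_key_path: "~/.ssh/id_rsa"
  allowed_networks:
    ssh:
      - 0.0.0.0/0
    api:
      - 0.0.0.0/0
  public_network:
    ipv4: true
    ipv6: true
  private_network:
    enabled: true
    subnet: 10.0.0.0/16        # placeholder value
  cni:
    enabled: true
    encryption: false
  cluster_cidr: 10.244.0.0/16  # placeholder value
  service_cidr: 10.43.0.0/16   # placeholder value
  cluster_dns: 10.43.0.10      # placeholder value
manifests:
  cloud_controller_manager_manifest_url: "..."
  csi_driver_manifest_url: "..."
  system_upgrade_controller_deployment_manifest_url: "..."
  system_upgrade_controller_crd_manifest_url: "..."
  cluster_autoscaler_manifest_url: "..."
```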
- Set `networking.private_network.enabled` to `true`, since all existing clusters were using a private network, while the new default is `false` to allow creating larger clusters more easily
- Set `include_instance_type_in_instance_name` to `true`; historically the instance type was included in the names of the instances, causing confusion when changing instance type from the Hetzner console. Since clusters created prior to v2 used that old naming scheme, this setting must be set to `true` to preserve that behavior with v2.
- If the cluster is HA, delete the load balancer created by previous versions of hetzner-k3s for the Kubernetes API, as it's no longer needed (see the Improvements section)
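As a hedged example, a cluster created with v1.x would add something like the following to the migrated config file (the `include_instance_type_in_instance_name` key is shown at the root of the file here, but check the docs for its exact location in the new structure):

```yaml
# Settings to preserve v1.x behavior on existing clusters (sketch).
networking:
  private_network:
    enabled: true                                # v1.x clusters always used a private network
include_instance_type_in_instance_name: true     # keep the old node naming scheme
```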
Contributing:
If you are a Visual Studio Code user and would like to contribute to the project, you can now work on it more easily by using a dev container in VS Code. Crystal and all the other dependencies are already included in the container. See the docs for more details.