## Potentially breaking changes
### Terraform 0.13.3 or later required
This release requires Terraform 0.13.3 or later because it is affected by these bugs that are fixed in 0.13.3:
- hashicorp/terraform#26226
- hashicorp/terraform#26252
- hashicorp/terraform#26166
- hashicorp/terraform#26180
It remains potentially affected by hashicorp/terraform#25631, but we hope we have worked around that for now.
### Securing the Cluster Autoscaler
Previously, setting `enable_cluster_autoscaler = true` turned on tagging sufficient for the Kubernetes Cluster Autoscaler to discover and manage the node group, and also added a policy to the node group worker role that allowed the workers to perform the autoscaling function. Since pods by default use the EC2 instance role, which in EKS node groups is the node group worker role, this allowed the Kubernetes Cluster Autoscaler to work from any node, but it also allowed any rogue pod to perform autoscaling actions.
With this release, `enable_cluster_autoscaler` is deprecated and its functions are replaced with 2 new variables:

- `cluster_autoscaler_enabled`, when `true`, causes this module to perform the labeling and tagging needed for the Kubernetes Cluster Autoscaler to discover and manage the node group
- `worker_role_autoscale_iam_enabled`, when `true`, causes this module to add the IAM policy to the worker IAM role to enable the workers (and, by default, any pods running on the workers) to perform autoscaling operations
Going forward, we recommend not using `enable_cluster_autoscaler` (it will eventually be removed) and leaving `worker_role_autoscale_iam_enabled` at its default value of `false`. If you want to use the Kubernetes Cluster Autoscaler, set `cluster_autoscaler_enabled = true` and use EKS IAM roles for service accounts to give the Cluster Autoscaler service account the IAM permissions it needs to perform autoscaling operations. Our Terraform module terraform-aws-eks-iam-role is available to help with this.
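As a sketch, the recommended configuration looks like the following. The module block name, `source`, and version pinning are illustrative assumptions (check the module's registry page); the two feature flags are the variables introduced in this release:

```hcl
# Illustrative sketch, not a complete node group configuration.
module "eks_node_group" {
  source = "cloudposse/eks-node-group/aws" # assumed registry source; pin a version in practice

  # Label and tag the node group so the Kubernetes Cluster Autoscaler
  # can discover and manage it.
  cluster_autoscaler_enabled = true

  # Leave the autoscaling IAM policy off the worker role (the default).
  # Instead, grant the Cluster Autoscaler's Kubernetes service account the
  # needed IAM permissions via IAM roles for service accounts, e.g. with
  # the terraform-aws-eks-iam-role module.
  worker_role_autoscale_iam_enabled = false

  # Do not set the deprecated enable_cluster_autoscaler variable at all.
}
```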
## Known issues
There remains a bug in amazon-vpc-cni-k8s (a.k.a. `amazon-k8s-cni:v1.6.3`) where, after deleting a node group, some ENIs for that node group may be left behind. Any ENIs left behind will prevent the security groups they are attached to (such as the security group created by this module to enable remote SSH access) from being deleted, and Terraform will report an error like:

```
Error deleting security group: DependencyViolation: resource sg-067899abcdef01234 has a dependent object
```
There is a feature request that should resolve this issue for our use case. Meanwhile, the good news is that the trigger is deleting a security group, which does not happen often, and we have been able to reduce the chance of the problem occurring even when a security group is deleted. When it does happen, there are some workarounds:
### Workarounds
- Once the leftover ENIs are deleted (by any means), a subsequent `terraform apply` will succeed in deleting the security group.
- The leftover ENIs are tagged with `Name=node.k8s.amazonaws.com/instance_id,Value=<instance-id>`, where `<instance-id>` is the EC2 instance ID of the instance the ENI is supposed to be associated with. A cleanup script could find ENIs with state `available` that are tagged as belonging to instances that are terminated or do not exist, and delete them.
- You can delete the security group manually. The security group created by this module has a name ending with `-remoteAccess` so you can easily identify it. If you delete it inappropriately, Terraform will re-create it on the next plan/apply cycle, so this is a relatively safe operation.
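A minimal sketch of such a cleanup script, assuming AWS CLI v2 is installed and credentialed. Note that this simplified version deletes every unattached ENI carrying the instance-id tag; a production script should first confirm that the tagged instance is terminated or no longer exists:

```shell
#!/usr/bin/env bash
# Sketch: delete ENIs orphaned by amazon-vpc-cni-k8s after a node group is
# deleted. Review carefully before running against a real account.
set -euo pipefail

# Tag that amazon-vpc-cni-k8s puts on the ENIs it creates.
TAG_KEY="node.k8s.amazonaws.com/instance_id"

# List ENIs that are unattached (status "available") and carry the tag.
list_orphaned_enis() {
  aws ec2 describe-network-interfaces \
    --filters "Name=status,Values=available" "Name=tag-key,Values=${TAG_KEY}" \
    --query 'NetworkInterfaces[].NetworkInterfaceId' \
    --output text
}

if command -v aws >/dev/null 2>&1 && aws sts get-caller-identity >/dev/null 2>&1; then
  for eni in $(list_orphaned_enis); do
    # A stricter script would verify here that the instance named in the
    # tag's value is terminated or gone before deleting the ENI.
    echo "deleting ${eni}"
    aws ec2 delete-network-interface --network-interface-id "${eni}"
  done
else
  echo "aws CLI not found or not credentialed; nothing to do"
fi
```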
## Reminder from 0.11.0: `create_before_destroy`
Starting with 0.11.0, you have the option of enabling `create_before_destroy` behavior for the node groups. We recommend doing so, because destroying a node group before creating its replacement can result in a significant cluster outage, but it is not without its downsides. Read the description and discussion in PR #31 for more details.
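As a sketch, opting in looks like the following; the input name `create_before_destroy` is taken from the feature's name here, so check it against the module's documented inputs before relying on it:

```hcl
module "eks_node_group" {
  source = "cloudposse/eks-node-group/aws" # assumed registry source; pin a version in practice

  # Create the replacement node group before destroying the old one,
  # avoiding a cluster outage at the cost of the caveats discussed in PR #31.
  create_before_destroy = true
}
```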
## Additional Release Notes
### Remove autoscaler permissions from worker role @Nuru (#34)
#### what
- Disable, by default, the permission for workers to perform autoscaling operations
- Work around hashicorp/terraform#25631 by not keeping a reference to the remote access security group ID in `random_pet` "keepers"
- Attempt to work around the failure of AWS EKS and/or the AWS Terraform provider to automatically detach instances from a security group when deleting the security group, by forcing the node group to be deleted before the security group. Not entirely successful (see "Known issues")
#### why
- General security principle of least privilege, plus the Cloud Posse convention that boolean feature flags have names ending with `_enabled`
- Without the workaround for hashicorp/terraform#25631, `terraform apply` would fail with an error like:

```
Error: Provider produced inconsistent final plan

When expanding the plan for
module.region_node_group["main"].module.node_group["us-west-2b"].module.eks_node_group.random_pet.cbd[0]
to include new values learned so far during apply, provider
"registry.terraform.io/hashicorp/random" produced an invalid new value for
.keepers["source_security_group_ids"]: was cty.StringVal(""), but now
cty.StringVal("sg-0465427f44089a888").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.
```