fizz.today

Your EKS nodes are dying and the ASG doesn’t care

Two nodes went NotReady in the same afternoon. The first had been flapping for two days — ten NodeNotReady events, kubelet stopped posting status, all pods on the node stuck in limbo. I terminated it and the ASG launched a replacement. Twenty-five minutes later the replacement also went NotReady.

The ASG never noticed either one. Its health check type was EC2, the default: status checks that verify hardware reachability and network connectivity, nothing more. The kubelet was hung but the instance was fine. The ASG saw two healthy instances. Kubernetes saw two dead nodes.

$ kubectl describe node ip-10-0-13-167.ec2.internal
Conditions:
  Ready     Unknown   Kubelet stopped posting node status.
$ aws autoscaling describe-auto-scaling-instances --instance-ids i-074a086a880b3f713
  HealthStatus: HEALTHY

Same instance, two opinions.

The root cause was memory pressure: t4g.medium nodes (4GB) running apiserver pods that loaded weasyprint in every gunicorn worker. Four workers per tenant, three tenants, 12 workers fighting for 4GB. The kubelet got OOM-killed before the pods did because the pod memory limits summed to more than the node could actually provide. Kubernetes schedules on requests, not limits, so it happily overcommitted the node; when the workers grew into their limits, the kernel killed the kubelet to reclaim memory and the node dropped off the cluster.
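The overcommit is easy to reproduce: the scheduler only checks requests when placing pods, while limits are enforced at runtime, so the limits on one node can sum to more than the node has. A minimal sketch — the pod name, image, and numbers below are illustrative, not the actual manifests:

```yaml
# Hypothetical worker pod on a t4g.medium (~3.5Gi allocatable after
# system reservations). The scheduler only checks the 1Gi request,
# so three of these fit on one node, but their limits sum to 6Gi,
# far more than the node can actually provide.
apiVersion: v1
kind: Pod
metadata:
  name: apiserver-tenant-a          # hypothetical name
spec:
  containers:
    - name: gunicorn
      image: example.com/apiserver:latest   # hypothetical image
      resources:
        requests:
          memory: 1Gi   # what scheduling sees
        limits:
          memory: 2Gi   # what the node must cover under load
```

If the kubelet's own memory isn't protected with --system-reserved / --kube-reserved, the kernel's OOM killer is free to pick it when the overcommit comes due.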

I terminated both instances manually via aws ec2 terminate-instances. The ASG launched replacements within minutes. Bigger nodes (t4g.large, 8GB) and fewer gunicorn workers fixed the memory pressure.

EKS managed node groups have auto-repair: pass --node-repair-config enabled=true and a node that stays NotReady gets terminated and replaced after 30 minutes. I haven't enabled it yet. Auto-repair on a cluster that's OOMing creates an infinite loop of replacement nodes that die the same way. Fix the root cause first.
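Once the root cause is fixed, enabling it is one call per node group. A sketch — the cluster and node group names are placeholders, and the flag requires a recent AWS CLI:

```shell
# Enable node auto-repair on an existing managed node group.
# Cluster/nodegroup names are placeholders; substitute your own.
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --node-repair-config enabled=true
```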

For detection, kube-state-metrics gives you kube_node_status_condition{condition="Ready",status="true"} == 0. Route that through Alertmanager to Slack and your phone buzzes before the 30-minute repair timer starts. Auto-repair handles the remediation. The alert handles the awareness. Neither one helps if you haven’t fixed the thing that killed the node in the first place.
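Wired into a Prometheus rule, that query becomes an alert. A sketch — the rule name, duration, and labels are illustrative, not a copy of my config:

```yaml
# Hypothetical Prometheus rule: fire when any node reports
# Ready != true for 5 minutes, well before the 30-minute repair timer.
groups:
  - name: node-health
    rules:
      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Node {{ $labels.node }} is NotReady"
```

Point the severity: page label at a Slack receiver in Alertmanager and the phone buzzes on its own.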

#aws #eks #kubernetes #platformengineering