fizz.today

TIL: three layers between headscale and a private EKS endpoint

kubectl get nodes timed out. The headscale subnet router was running in the same VPC as the EKS cluster, advertising routes, connected to the coordination server. Everything looked correct. Three things weren’t.

Advertised routes

The subnet router advertises specific CIDRs to headscale clients — /32 routes for individual services, not a blanket VPC range. The EKS private API endpoint resolves to two IPs behind an ENI in the VPC (10.0.120.132 and 10.0.69.139). Neither was in the advertised_routes list. The router was reachable, but it wasn’t offering to carry EKS traffic.

variable "advertised_routes" {
  type = list(string)
  default = [
    "172.31.5.32/32",  # Prod PostgreSQL RDS
    "10.0.57.47/32",   # Prod MySQL RDS
    "10.0.120.132/32", # Prod EKS API endpoint
    "10.0.69.139/32",  # Prod EKS API endpoint
  ]
}
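In userdata, a list like this typically gets joined into the --advertise-routes flag on tailscale up. A sketch, assuming the router runs the stock Tailscale client pointed at headscale; the login-server URL and auth key are placeholders:

```shell
# Hypothetical userdata fragment: routes not listed in this flag are
# never offered to headscale, no matter what the router can reach.
tailscale up \
  --login-server="https://headscale.example.com" \
  --authkey="${AUTH_KEY}" \
  --advertise-routes="172.31.5.32/32,10.0.57.47/32,10.0.120.132/32,10.0.69.139/32"
```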

Adding the IPs and running terraform apply replaced the instance, since the routes are baked into userdata. New instance, new headscale node registration.

Route approval

Headscale doesn’t auto-approve routes on new nodes. The subnet router registered, announced its four CIDRs, and waited. Until someone runs headscale nodes approve-routes, the routes exist in the coordination server’s database but aren’t distributed to clients. No error, no warning in the client logs — traffic just has nowhere to go.

headscale nodes list
headscale nodes approve-routes --identifier 11

After approval, the headscale client on my laptop learned the routes. Packets now flowed to the subnet router. Still timed out.
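On a Linux client you can confirm the approved routes actually arrived before blaming anything downstream. Tailscale installs subnet routes in routing table 52 on Linux (other platforms differ); a sketch:

```shell
# Did the approved route reach this client?
ip route show table 52 | grep 10.0.120.132

# Does a packet to the endpoint actually leave via the tailscale interface?
ip route get 10.0.120.132
```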

Security group

The EKS cluster security group allowed inbound 443 from 172.31.0.0/16 (the default VPC range) and 100.64.0.0/10 (EKS pod networking). The subnet router sits in a 10.0.0.0/16 subnet — the eksctl-created VPC. Its traffic arrived at the EKS API endpoint’s ENI and got dropped at the security group boundary.

resource "aws_security_group_rule" "eks_api_from_subnet_router" {
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.subnet_router.id
  security_group_id        = var.eks_cluster_security_group_id
}

SG-to-SG reference instead of a CIDR — the router’s IP changes on every instance replacement, but the security group follows it.
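A TCP-level probe over the tailnet separates a security group drop from the layers above it. A sketch, run from the laptop after the rule is in place:

```shell
# A timeout means something is still dropping packets before the TCP
# handshake; an immediate "succeeded" means the SG now admits the traffic.
nc -zv -w 5 10.0.120.132 443
```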

What I assumed

I treated “the subnet router is in the VPC” as sufficient for reachability. It isn’t. There are three independent gates: the router has to advertise the route, headscale has to approve it, and the destination has to accept the traffic. Each one fails silently — no rejected connection, no ICMP unreachable, just a timeout that looks identical at all three layers. I diagnosed them in sequence because nothing told me which layer was dropping packets.
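The first gate, at least, can be checked offline: given the advertised list and a destination IP, confirm coverage before suspecting approval or the security group. A pure-bash sketch, IPv4 only; ip_to_int and covered are my own helper names, not headscale tooling:

```shell
#!/usr/bin/env bash

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# covered IP CIDR [CIDR...] -> reports whether IP falls in any CIDR.
covered() {
  local ip=$1; shift
  local n; n=$(ip_to_int "$ip")
  local cidr
  for cidr in "$@"; do
    local base=${cidr%/*} bits=${cidr#*/}
    local mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    if (( (n & mask) == ($(ip_to_int "$base") & mask) )); then
      echo "$ip covered by $cidr"
      return 0
    fi
  done
  echo "$ip NOT in advertised routes"
  return 1
}

# The pre-fix list from this post: the EKS endpoint IP was not covered.
covered 10.0.120.132 172.31.5.32/32 10.0.57.47/32 || true
```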

Three fixes, one kubectl get nodes, one node returned.

#headscale #eks #aws #kubernetes #til