I was the cross-AZ villain all along
March’s AWS bill came in at $3,100 — up from $2,000 in February. I dug through Cost Explorer by service and usage type at daily granularity. The RDS extended support surcharge explained $245. A Bedrock spike on a single day explained $89. Then I found $108 in cross-AZ data transfer that shouldn’t exist.
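The digging itself is a single Cost Explorer query. Roughly what I ran, as a sketch — the dates are illustrative, and the exact usage type name varies by region (in us-east-1 cross-AZ traffic typically shows up as `DataTransfer-Regional-Bytes`; other regions carry a prefix like `USE2-`):

```shell
# Daily cost, grouped by usage type, filtered to cross-AZ (regional) transfer.
# Time period is illustrative; adjust the usage type to your region's naming.
aws ce get-cost-and-usage \
  --time-period Start=2025-03-01,End=2025-04-01 \
  --granularity DAILY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE \
  --filter '{"Dimensions":{"Key":"USAGE_TYPE","Values":["DataTransfer-Regional-Bytes"]}}'
```

Dropping the `--filter` and scanning the grouped output is how the $108 line stood out in the first place.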
10.8 TB of traffic between us-east-1a and us-east-1b in a single billing period. Not spread evenly — concentrated in a burst from March 3 through March 12 that peaked at $17/day. Three dates did most of the damage. If the tenant had been running all month at that rate, the bill would have been three times worse.
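The dollar figure lines up with the standard cross-AZ rate. A back-of-the-envelope check, assuming $0.01/GB per direction (the 10.8 TB is the total metered volume from Cost Explorer):

```python
# Sanity-check the cross-AZ charge against the standard rate.
# Assumes $0.01/GB per direction; 10.8 TB is the metered total from the bill.
RATE_PER_GB = 0.01        # USD per metered GB of cross-AZ transfer
metered_gb = 10.8 * 1000  # Cost Explorer reports GB

monthly_cost = metered_gb * RATE_PER_GB
print(f"${monthly_cost:.0f}")  # $108 — matches the bill

# The $17 peak day implies roughly 1.7 TB metered that day.
peak_day_gb = 17 / RATE_PER_GB
print(f"{peak_day_gb / 1000:.1f} TB")  # 1.7 TB
```

The $324 projection for a full month at the burst rate is where the “three times worse” comes from.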
I traced it to three VPC-attached Lambda functions running entity-matching inference for a Kubeflow tenant. Each one is a 10 GB container making API calls to Data Service, LiteLLM, MLflow, and a tracing endpoint. All of those services run on EKS nodes in us-east-1a. The ALB sits in front, but the pods are all in one AZ.
The Lambdas were configured across two subnets: SubnetPrivateUSEAST1A and SubnetPrivateUSEAST1B. Every invocation that landed in 1b sent every API call across the AZ boundary through the ALB, and every response back across it — and cross-AZ transfer is metered in both directions.
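This class of misconfiguration is easy to detect once you compare each function’s subnets against the AZs its backends actually occupy. A minimal sketch — the subnet map, function names, and inventory are illustrative; in practice you’d pull them from `lambda:GetFunctionConfiguration` and `ec2:DescribeSubnets`:

```python
# Flag Lambda subnets sitting in AZs where none of the function's
# backend services run. All names and IDs here are illustrative.
SUBNET_AZ = {
    "SubnetPrivateUSEAST1A": "us-east-1a",
    "SubnetPrivateUSEAST1B": "us-east-1b",
}

def cross_az_subnets(function_subnets, backend_azs):
    """Return the subnets whose traffic can only reach backends cross-AZ."""
    return [s for s in function_subnets if SUBNET_AZ[s] not in backend_azs]

# Three functions spread across both subnets, backends only in us-east-1a:
lambdas = {
    "entity-matcher-1": ["SubnetPrivateUSEAST1A", "SubnetPrivateUSEAST1B"],
    "entity-matcher-2": ["SubnetPrivateUSEAST1A", "SubnetPrivateUSEAST1B"],
    "entity-matcher-3": ["SubnetPrivateUSEAST1A", "SubnetPrivateUSEAST1B"],
}
for name, subnets in lambdas.items():
    risky = cross_az_subnets(subnets, backend_azs={"us-east-1a"})
    if risky:
        print(f"{name}: traffic from {risky} crosses the AZ boundary")
```

Run against a real account inventory, this would have flagged all three functions the day the config was copied.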
I knew exactly how this happened, because I did it.
Earlier that month, I’d added a VPC attachment to one of these Lambdas so it could reach services that had moved behind a private ALB. It worked, and everyone was happy. I added both private subnets because that’s what you do: two AZs for availability. Then a teammate copied my VPC configuration to the other two Lambdas in the same tenant.
One misconfiguration, copied twice. Three Lambdas burning $108/month in cross-AZ fees to talk to services that only exist in one AZ.
I pinned all three to SubnetPrivateUSEAST1A only. If 1a goes down, the EKS pods are down too, so the second AZ never provided the resilience I thought it did. It just routed half my traffic across a toll bridge.
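The change itself is one call per function, something like the following — the function name, subnet ID, and security group ID are placeholders:

```shell
# Pin a function's VPC config to the single AZ where its backends live.
# All identifiers below are placeholders.
aws lambda update-function-configuration \
  --function-name entity-matcher-1 \
  --vpc-config SubnetIds=subnet-0a1b2c3d4e5f6a7b8,SecurityGroupIds=sg-0f9e8d7c6b5a43210
```

If the backends ever do go multi-AZ, re-adding the second subnet is the same one-line change in reverse.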
Two subnets means high availability — until everything you’re talking to is in one of them.