fizz.today

Your deploy workflow shows green. Your pods didn’t roll.

A colleague merged seven PRs overnight. Every build succeeded. Every deploy workflow ran and reported success. The pods never updated. The old commit hash was still in /api/v2/config the next morning, and there were no startup logs in CloudWatch for any of the new builds.

I checked the deployment. The spec had the new image — terraform had applied it. But updatedReplicas was 0 and the deployment had a ReplicaFailure condition:

exceeded quota: tenant-quota, requested: limits.memory=4Gi,
used: limits.memory=5Gi, limited: limits.memory=6Gi

The tenant ResourceQuota capped memory limits at 6Gi. The apiserver pod has a 4Gi memory limit. A rolling update creates the new pod before killing the old one, so for a moment both exist: 8Gi for the two apiserver pods alone, and 9Gi once the 5Gi already in use is counted. The quota said no. Kubernetes rejected the new pod, the rollout timed out, and the old pod kept running.
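The arithmetic behind that rejection can be sketched in a few lines of Python. This is a simplified model of the quota check, not the actual Kubernetes admission code: the function names are hypothetical, the numbers come from the error message above, and a surge of one extra pod is an assumption based on the "new pod before old pod" behaviour described.

```python
def fits_quota(used_gi: float, requested_gi: float, hard_gi: float) -> bool:
    """Simplified quota rule: usage after the request must not exceed the hard limit."""
    return used_gi + requested_gi <= hard_gi


def min_quota_for_surge(used_gi: float, pod_limit_gi: float, surge_pods: int = 1) -> float:
    """Smallest hard limit that lets `surge_pods` extra pods exist during a rollout."""
    return used_gi + pod_limit_gi * surge_pods


# Values from the error message: used=5Gi, requested=4Gi, limited=6Gi.
print(fits_quota(5, 4, 6))        # False: 9Gi would exceed the 6Gi quota
print(min_quota_for_surge(5, 4))  # 9: the smallest quota that admits the new pod
print(fits_quota(5, 4, 10))       # True: a 10Gi quota leaves room for the surge
```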

Why the workflow didn’t catch it

The deploy pipeline has two steps: build and deploy. The build pushes a new image to ECR — that succeeded. The deploy runs terraform apply to write the new image to the Kubernetes deployment spec — that also succeeded. Terraform’s job is to reconcile the declared state with the API. It wrote the new image to the deployment object. Done. Green checkmark.

What terraform doesn’t do in this pipeline is wait for Kubernetes to act on the change. The deployment spec says “run this image.” The deployment controller tries to create a pod with that image, and the API server rejects it on the quota. Terraform never sees this: it has already exited successfully.

Seven builds, seven deploys, seven green checkmarks. The deployment spec was overwritten seven times. The pods never moved.
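The gap between "spec written" and "pods rolled" is visible in the deployment's status fields. Below is a minimal Python sketch of a check over a deployment object as returned by kubectl get deployment -o json. The function name is hypothetical, and the sample dict is hand-built to mirror the state described in this post, not captured output; the "apiserver" name and image string are placeholders.

```python
from typing import Optional


def rollout_problem(deployment: dict) -> Optional[str]:
    """Return a reason string if the rollout has not completed, else None."""
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)
    updated = status.get("updatedReplicas", 0)

    # A ReplicaFailure condition means pod creation was rejected (here, by quota).
    for cond in status.get("conditions", []):
        if cond.get("type") == "ReplicaFailure" and cond.get("status") == "True":
            return f"ReplicaFailure: {cond.get('message', '')}"

    # The spec can carry the new image while no pod is running it yet.
    if updated < desired:
        return f"only {updated}/{desired} replicas updated"
    return None


# Hand-built example: new image in the spec, zero updated replicas.
stuck = {
    "spec": {"replicas": 1, "template": {"spec": {"containers": [
        {"name": "apiserver", "image": "example/apiserver:new"}]}}},
    "status": {
        "updatedReplicas": 0,
        "conditions": [{
            "type": "ReplicaFailure",
            "status": "True",
            "message": "exceeded quota: tenant-quota, requested: limits.memory=4Gi, "
                       "used: limits.memory=5Gi, limited: limits.memory=6Gi",
        }],
    },
}

print(rollout_problem(stuck))
```

A check like this would have turned each of the seven green runs red, because it looks at what the cluster did rather than what was written to the spec.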

What the green checkmark actually means

I bumped the quota from 6Gi to 10Gi in the terraform module and applied it. The pods rolled immediately: the pending image update from the last deploy was still in the deployment spec, waiting for a pod the quota would admit.

The deeper fix is a kubectl rollout status check after terraform apply in the deploy workflow. If the rollout doesn’t complete within a timeout, the workflow should fail. The green checkmark should mean the code is running, not that the config was written.
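One way to wire that in is a small gate script that runs after terraform apply and passes the kubectl exit code through. The sketch below is a possible shape, not the workflow actually used here: rollout_gate and its parameters are hypothetical, and the deployment and namespace names in the usage note are placeholders. It relies on kubectl rollout status exiting non-zero when the rollout does not finish within --timeout.

```python
import subprocess
from typing import Callable


def rollout_gate(
    deployment: str,
    namespace: str,
    timeout: str = "300s",
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> int:
    """Block until the rollout completes or times out; return kubectl's exit code."""
    cmd = [
        "kubectl", "rollout", "status",
        f"deployment/{deployment}",
        "--namespace", namespace,
        f"--timeout={timeout}",
    ]
    # `run` is injectable so the gate can be exercised without a cluster.
    result = run(cmd)
    return result.returncode
```

In a workflow step this could be called as sys.exit(rollout_gate("apiserver", "tenant")), so that a stalled rollout fails the job instead of leaving a green checkmark.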

I had filed a Jira ticket for the quota issue the day before. Nobody broke anything overnight: the quota was always too small for a rolling update. It just didn’t matter until someone started deploying frequently.

#kubernetes #terraform #ci-cd #platformengineering