fizz.today

Docker buildx served my CI a stale COPY layer and nobody noticed

The auth-handler was crashing on a missing AWS_REGION attribute in its pydantic Settings class. I checked the source code at the exact commit the image was built from. AWS_REGION was there — AWS_REGION: Optional[str] = "us-east-1", right where it should be. The file on disk inside the running container didn’t have it.

The image was built by GitHub Actions using docker/build-push-action@v5 with cache-from: type=gha. The Dockerfile has two stages:

# builder
FROM python:3.11-slim AS builder
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen
COPY . /app

# runtime
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
ENV PATH="/app/.venv/bin:$PATH"
COPY . /app
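For reference, the build step in the workflow looked roughly like this (the tag and context are placeholders; the two cache lines are the part that matters here):

```yaml
- name: Build and push
  uses: docker/build-push-action@v5
  with:
    context: .
    platforms: linux/arm64
    push: true
    tags: myapp:latest        # placeholder tag
    cache-from: type=gha
    cache-to: type=gha,mode=max
```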

The second COPY . /app is supposed to copy the current build context — the checked-out source code — into the runtime image. The GHA cache served a cached layer for this instruction from a previous build. The source files changed between builds, but the cache key didn’t invalidate.

How buildx GHA cache keys work

Docker’s build cache keys each layer by its instruction and a content hash of the files that instruction pulls in from the build context. A COPY . /app layer should therefore bust whenever any file in the context changes. With the local cache, it does.
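The expected behavior is easy to sketch. BuildKit's real cache keys are internal, but conceptually a COPY . layer is keyed by a deterministic digest of the context files, so any content change produces a new key. The tar-based hash below is my illustration of that idea, not Docker's actual algorithm:

```shell
# Approximate a content-addressed key for a build context:
# archive the files deterministically, then hash the archive.
context_hash() {
  tar --sort=name --mtime='UTC 1970-01-01' --owner=0 --group=0 \
      -cf - -C "$1" . | sha256sum | cut -d' ' -f1
}

ctx=$(mktemp -d)
echo 'AWS_REGION = "us-east-1"' > "$ctx/settings.py"
before=$(context_hash "$ctx")

echo 'AWS_REGION = "eu-west-1"' > "$ctx/settings.py"   # a source file changed
after=$(context_hash "$ctx")

[ "$before" != "$after" ] && echo "context changed: COPY layer should rebuild"
```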

The GHA cache (--cache-from type=gha) stores layers in GitHub’s Actions cache storage, keyed by the layer’s chain ID. The problem is that the cache metadata for a COPY layer can match even when the source files have changed — especially with cross-platform builds (--platform linux/arm64 on an amd64 runner via QEMU). I don’t know the exact mechanism. What I know is that the layer matched, the old files were used, and the image shipped without the code change.

How I found it

The BUILD_COMMIT environment variable in the running container said 0e421ef — the correct commit. But cat /app/common/settings.py inside the container was missing the AWS_REGION field that existed at that commit. The build commit was baked into the image as an env var during an earlier stage, and the COPY . /app in the runtime stage served stale files. The commit was right but the code was wrong.
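The actual commands were nothing exotic; this is roughly the session, with names as I had them in the incident (adapt container name and paths to yours):

```shell
# Which commit does the container claim it was built from?
docker exec auth-handler printenv BUILD_COMMIT
# -> 0e421ef

# What the source looked like at that commit:
git show 0e421ef:common/settings.py | grep AWS_REGION
# -> AWS_REGION: Optional[str] = "us-east-1"

# What actually shipped inside the container:
docker exec auth-handler grep AWS_REGION /app/common/settings.py
# -> (no output: the field is missing)
```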

The cache bust and the real fix

I pushed a commit that added a .build-bust file containing a timestamp to the build context. The next build produced a fresh COPY layer, and the missing field was back in the container, confirming the previous build had been serving stale cached content.
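The bust itself is just making the context differ. The sketch below runs in a throwaway repo so it is self-contained; in practice it was a commit to the real checkout:

```shell
set -e
repo=$(mktemp -d); cd "$repo"; git init -q          # stand-in for the real checkout
date -u +%Y-%m-%dT%H:%M:%SZ > .build-bust           # any new content changes the context
git add .build-bust
git -c user.name=ci -c user.email=ci@example.com \
    commit -qm "chore: bust stale COPY layer"
```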

That fixed the immediate problem. But the CI was green the whole time. The image pushed. The deployment succeeded. The app crashed at runtime because a pydantic Settings field was missing. The only reason I caught it was that the error message named the specific field and I happened to know it had been added in the most recent commit. If the stale cache had shipped a subtler difference — a bug fix that didn’t change the interface, a config default that only matters under load — it would have gone unnoticed.

Comparing image digests doesn’t help — they change between builds regardless (different metadata, different timestamps). The check that catches it is hashing the application files inside the built image and comparing against the source tree. I built a post-build verification step that does exactly this: docker-image-integrity-check. Run the image, hash the app directory, compare against the checkout. If the hashes don’t match, the cache served stale code and the build fails.

./verify-image-integrity.sh myapp:latest /app .
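The script isn't reproduced here, so this is a sketch of its shape; the function names, the *.py filter, and the .venv exclusion are my choices, not a spec. The core is a deterministic tree hash applied to both sides:

```shell
# Sketch of the check behind verify-image-integrity.sh (details illustrative).
# Two halves: a deterministic hash of an app's source files, and a comparison
# between the checkout and the files inside the built image.

# Hash every .py file under a directory, skipping the venv.
hash_tree() {
  (cd "$1" && find . -path './.venv' -prune -o -type f -name '*.py' -print \
      | LC_ALL=C sort | xargs -r sha256sum) | sha256sum | cut -d' ' -f1
}

# Usage: verify_image IMAGE APP_DIR SRC_DIR  (needs docker; not run here)
verify_image() {
  cid=$(docker create "$1") || return 1
  tmp=$(mktemp -d)
  docker cp "$cid:$2" "$tmp/app"
  docker rm -f "$cid" >/dev/null
  [ "$(hash_tree "$tmp/app")" = "$(hash_tree "$3")" ] \
    || { echo "stale layer: image files do not match source tree" >&2; return 1; }
  echo "image matches source"
}
```

docker create/cp pulls the files out without running the image's entrypoint, which keeps the check safe to run against anything CI just built.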

It adds a few seconds to the build. I’d rather have a few seconds of hashing than another night of debugging a Settings field that exists in the source but not in the container.

#docker #github-actions #ci-cd #platformengineering