Docker buildx served my CI a stale COPY layer and nobody noticed

2026-03-25

A service I was deploying was crashing on a missing AWS_REGION attribute in its pydantic Settings class. I checked the source code at the exact commit the image was built from. AWS_REGION was there — AWS_REGION: Optional[str] = "us-east-1", right where it should be. The file on disk inside the running container didn’t have it.

The image was built by GitHub Actions using docker/build-push-action@v5 with --cache-from type=gha. The Dockerfile has two stages:

# builder
FROM python:3.11-slim AS builder
WORKDIR /app
COPY . /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen

# runtime
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
COPY . /app

The second COPY . /app is supposed to copy the current build context — the checked-out source code — into the runtime image. The GHA cache served a cached layer for this instruction from a previous build. The source files changed between builds, but the cache key didn’t invalidate.

How buildx GHA cache keys work

Docker’s build cache keys layers by the instruction and the content hash of the build context. A COPY . /app layer should bust whenever any file in the context changes. With the local cache, it does.

The GHA cache (--cache-from type=gha) stores layers in GitHub’s Actions cache storage, keyed by the layer’s chain ID. The problem is that the cache metadata for a COPY layer can match even when the source files have changed — especially with cross-platform builds (--platform linux/arm64 on an amd64 runner via QEMU). I don’t know the exact mechanism. What I know is that the layer matched, the old files were used, and the image shipped without the code change.

How I found it

The BUILD_COMMIT environment variable in the running container said 0e421ef — the correct commit. But cat /app/common/settings.py inside the container was missing the AWS_REGION field that existed at that commit. The build commit was baked into the image as an env var during an earlier stage, and the COPY . /app in the runtime stage served stale files. The commit was right but the code was wrong.

The cache bust and the real fix

I pushed a commit that added a file to the build context — a .build-bust file with a timestamp. The next build produced a new COPY layer with the correct source files. The image digest changed, confirming the previous build had been serving cached content.

That fixed the immediate problem. But the CI was green the whole time. The image pushed. The deployment succeeded. The app crashed at runtime because a pydantic Settings field was missing. The only reason I caught it was that the error message named the specific field and I happened to know it had been added in the most recent commit. If the stale cache had shipped a subtler difference — a bug fix that didn’t change the interface, a config default that only matters under load — it would have gone unnoticed.

Comparing image digests doesn’t help — they change between builds regardless (different metadata, different timestamps). The check that catches it is hashing the application files inside the built image and comparing against the source tree. I built a post-build verification step that does exactly this: docker-image-integrity-check. Run the image, hash the app directory, compare against the checkout. If the hashes don’t match, the cache served stale code and the build fails.

./verify-image-integrity.sh myapp:latest /app .

It adds a few seconds to the build. I’d rather have a few seconds of hashing than another night of debugging a Settings field that exists in the source but not in the container.

Update — two weeks later, I went back for the evidence

I wrote the first version of this post the night it happened, and I said something fuzzy about “the exact mechanism.” I didn’t love leaving it there, so I went back with gh run view.

Production run 23470914827 on 2026-03-24 at 02:57:32Z — the build that shipped the image without AWS_REGION. The buildkit output is unambiguous:

#14 [builder 4/7] COPY . /app    CACHED
#17 [stage-1 5/5] COPY . /app    CACHED

Both COPY . /app instructions hit CACHED. The commit right before that build had added AWS_REGION to common/settings.py. The cache key matched anyway. The image was built from a cached layer whose content did not reflect the current build context.

That’s the proof. Now I wanted to reproduce it.

Trying to reproduce it

I built a minimal repro repo at fizz/docker-image-integrity-check and ran eight versions of the same experiment, each matching more of production’s conditions:

#	Variable tested	Reproduced?
1	Single-stage Dockerfile, amd64	no
2	Multi-stage, amd64	no
3	Multi-stage, arm64 via QEMU	no
4	Multi-stage, arm64/QEMU, buildkit v0.28.0 pinned	no
5	Full production Dockerfile shape (`pyproject.toml` → `uv sync` → `.venv` COPY)	no
6	Priming the cache in one workflow run and re-using it in a second run after committing a file change	no
7	Bulked-up build context (90 files, 332 KB — close to production’s 108 files, 612 KB)	no
8	Three consecutive `build-push-action` calls matching the pre-refactor production pattern, with `push: true` to a real ghcr.io registry	no

By #8 I was matching every controllable variable I could identify. Same buildkit version (v0.28.0, pinned via driver-opts because buildx-stable-1 is a moving tag). Same cross-platform setup (arm64 via QEMU on an amd64 runner). Same Dockerfile topology (two-stage with COPY pyproject.toml uv.lock ./ → RUN uv sync → COPY . /app → COPY --from=builder /build/.venv /opt/venv → COPY . /app). Same cache backend (cache-from type=gha, cache-to type=gha,mode=max). Same build-push-action@v5 SHA. Cache primed across separate workflow runs. Source context bulked to almost match production.

Every single repro correctly invalidated the cache. Every COPY . /app layer showed DONE, not CACHED. The image always contained the current file contents. The verify script always passed.

The three-call pattern — the thing that was removed when the bug stopped

While I was chasing the repro, I went back to the git history of our production workflow — a reusable build-push-action wrapper workflow in the infrastructure repo. On 2026-04-02 — nine days after the bug hit — I’d committed a change titled “eliminate redundant rebuilds.”

Before that commit, the workflow called docker/build-push-action@v5 three times per build:

Build step 1 — push: false, load: true, cache-from: type=gha, cache-to: type=gha,mode=max
Build step 2 (commercial ECR push) — push: true, cache-from: type=gha, cache-to: type=gha,mode=max — a full rebuild that re-imported and re-exported the GHA cache
Build step 3 (GovCloud ECR push) — push: true, cache-from: type=gha — another rebuild that re-imported the cache

Three consecutive build-push-action calls. Two overlapping cache-to type=gha,mode=max writes. Three cache-from type=gha reads. After the refactor, the second and third calls became plain docker push commands — the image is built once and then pushed to both registries without touching the cache backend again.

The build that shipped the stale image on March 23 was running the three-call pattern. The bug stopped happening after the three-call pattern was removed. I didn’t know that’s what I was fixing at the time; the refactor was a performance optimization, not a bug fix.

I rebuilt the repro workflow to match the three-call pattern exactly — pushing real images to ghcr.io with push: true, overlapping cache-from/cache-to on two consecutive calls. The repro still correctly invalidated the cache. The bug didn’t fire.

So the three-call pattern is correlated with the bug’s appearance and the bug’s disappearance, but I can’t prove it’s the mechanism. It might be necessary but not sufficient — opening a window for cache corruption that some other condition (network timing, cache service state, runner load) has to line up with to actually poison the manifest. Or it might be coincidence and the bug was fixed somewhere upstream between March 23 and today.

The closest matching open issues in the upstream bug trackers are docker/build-push-action#766 (“Strange cache misshit when using gha cache”) and docker/build-push-action#829 (multi-stage cache pre-stage issues). Neither has been closed as fixed.

The bug is a ghost

I have irrefutable evidence the bug happened. I have a clean test environment that faithfully matches every controllable production condition, including the three-call pattern that was in place when it fired. The bug refuses to knock twice.

I don’t know why it happened. I know it did happen. I know the three-call pattern is the only production variable that changed between “bug lives in our CI” and “bug stops appearing.” And I know that between the time a build says ✅ and the time an app crashes on a missing field, there’s a gap where a stale cache can live undetected.

What I did instead of catching the bug

I can’t prevent what I can’t reproduce, but I can detect it. The docker-image-integrity-check script hashes files inside the built image and compares against the source tree. If the hashes don’t match, the build fails and the stale layer never ships. It runs after every build in CI as a guard.

It’s not a fix. It’s a trip wire. The bug is still out there, somewhere in the interaction between buildx’s cache key computation, buildkit’s chain ID logic, and GitHub’s cache storage service. If you’re building Docker images in GHA with type=gha cache and your source files ever end up mismatching the shipped image, you are not crazy. I have the log.

#docker #github-actions #ci-cd #platformengineering