fizz.today

your container is phoning home to china on every cold start

I spent thirty minutes debugging a dead container on a bare-metal host before I found the problem: a Python OCR library was downloading model weights from www.modelscope.cn every time the container started.

What happened

We run Docling for document parsing — PDFs, scanned images, the usual CMMC evidence pipeline. The official docling-serve image worked fine. Our custom image, built from python:3.10-slim with pip install docling-serve, did not. The container would start, hang for 30 seconds, then exit.

The health check said timeout. The logs said nothing. docker exec into the running container showed the process stuck in Python module initialization.
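
A hang that logs nothing is easier to find if the interpreter is told to dump its stack when startup takes too long. Here is a minimal sketch using the standard library's faulthandler module. The function names and the 10-second default are my own choices for illustration, not anything Docling or RapidOCR ships.

```python
import faulthandler
import sys


def arm_startup_watchdog(timeout_s: float = 10.0, stream=sys.stderr) -> None:
    """Dump every thread's traceback to `stream` if we are still running
    `timeout_s` seconds from now. The process keeps running (exit=False)."""
    # `stream` must be a real file with a file descriptor; faulthandler
    # writes to it directly from a watchdog thread.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=stream)


def disarm_startup_watchdog() -> None:
    """Cancel the pending dump once initialization has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Call arm_startup_watchdog() as the first line of the entrypoint, before the heavy imports, and disarm_startup_watchdog() once the service is ready. If initialization stalls, the traceback in the container logs points at the call that is blocking.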

I traced it to RapidOCR. The first time RapidOCR() is instantiated, it reaches out to www.modelscope.cn to download its detection and recognition models. The host couldn’t resolve the domain. No models, no initialization, no service.

Official images bake models at build time

The official docling-serve image runs docling-tools models download during the Docker build. Models land in the image layer. At runtime, the library finds them locally and skips the download. Our Dockerfile skipped that step.

Two lines in the Dockerfile

RUN python -c "from rapidocr_onnxruntime import RapidOCR; RapidOCR()" && \
    python -c "from docling.document_converter import DocumentConverter; \
    DocumentConverter().initialize_pipeline('pdf')"

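Two python -c one-liners, chained in a single RUN instruction. Each one instantiates the library once during docker build, which triggers the download and leaves the models in an image layer. Runtime downloads: zero.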
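
"Zero" is worth verifying rather than trusting. One way, sketched below, is a guard that makes Python-level connection attempts and DNS lookups raise immediately, so a stray download shows up as an exception instead of a hang. The names no_network and EgressAttempted are hypothetical, and the guard only covers code that goes through Python's socket module; it does not patch every socket method, and a native library doing its own networking would slip past it.

```python
import contextlib
import socket
from unittest import mock


class EgressAttempted(RuntimeError):
    """Raised when code inside the guard tries to reach the network."""


@contextlib.contextmanager
def no_network():
    """Turn outbound connects and DNS lookups into immediate errors."""

    def blocked_connect(self, address):
        raise EgressAttempted(f"connect() to {address!r} blocked")

    def blocked_getaddrinfo(host, *args, **kwargs):
        raise EgressAttempted(f"DNS lookup for {host!r} blocked")

    # mock.patch.object restores the originals when the block exits,
    # even if the guarded code raises.
    with mock.patch.object(socket.socket, "connect", blocked_connect), \
         mock.patch.object(socket, "getaddrinfo", blocked_getaddrinfo):
        yield
```

Wrap the same initialization the Dockerfile performs in `with no_network():` as a smoke test against the finished image. If it completes, the models really are baked in. Running the container with docker run --network none is the stronger version of the same check, since it cuts off networking below Python as well.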

Compliance doesn’t care about intent

ModelScope is a legitimate, well-run model hub, China’s answer to Hugging Face. Nothing wrong with it. But if you’re building for CMMC, FedRAMP, or any environment with network restrictions, runtime model downloads from any external host are a liability. Your container’s cold start depends on a foreign CDN being reachable. In an air-gapped or egress-filtered environment, it fails silently: no error message, just a hung process.

The rule: if a library downloads anything at import time or on first use, force that download during docker build. Check HF_HOME, TORCH_HOME, XDG_CACHE_HOME. Most ML frameworks have a cache directory, and an empty cache doubles as a download trigger: if nothing is there at runtime, the library reaches out. Bake it in or watch it hang.
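
That cache check can be scripted. The sketch below reports, for each of the three variables named above, whether it is unset, points at a missing directory, points at an empty one, or points at a directory containing at least one file. The function name and the status strings are my own, and "populated" only means files exist, not that they are the right models.

```python
import os
from pathlib import Path

# The cache variables named in this post; extend for other frameworks.
CACHE_ENV_VARS = ("HF_HOME", "TORCH_HOME", "XDG_CACHE_HOME")


def audit_cache_dirs(env=os.environ, names=CACHE_ENV_VARS):
    """Return {VAR: status}, status being one of
    'unset', 'missing', 'empty', or 'populated'."""
    report = {}
    for name in names:
        value = env.get(name)
        if not value:
            report[name] = "unset"
            continue
        root = Path(value)
        if not root.is_dir():
            report[name] = "missing"
            continue
        # Any regular file anywhere under the cache root counts.
        has_files = any(p.is_file() for p in root.rglob("*"))
        report[name] = "populated" if has_files else "empty"
    return report


if __name__ == "__main__":
    for var, status in audit_cache_dirs().items():
        print(f"{var}: {status}")
```

Run it as the last step of the build and again inside the running container. An "unset" result is not a pass: frameworks typically fall back to a default location under the home directory, so that path needs the same check.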

#docker #ml-ops #cmmc