Owner chain explains lineage. Reconciler chain explains behavior.
I burned a Friday morning editing a Deployment image that kept reverting.
The old image was `gcr.io/ml-pipeline/frontend:2.0.5`; the target was `ghcr.io/kubeflow/kfp-frontend:2.5.0`. The edit applied cleanly, then snapped back. Repeatedly.
The first useful clue came from ownership:
- Deployment ownerRef: `kind: Namespace`, `name: admin`
That looked absurd against my default Kubernetes mental model. I treat a Namespace as a container, not an active parent object. But in this cluster, the Namespace was the parent in a Metacontroller flow that rendered child resources.
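A quick way to surface that clue is to read `ownerReferences` straight off the object metadata. A sketch; the deployment name and namespace here are illustrative:

```shell
# Print each owner as kind/name, one per line.
kubectl -n admin get deployment ml-pipeline-ui \
  -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"\n"}{end}'
```

Repeat on each parent to walk the whole chain upward.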
That gave me the principle I wish I had earlier:
Owner chain explains lineage. Reconciler chain explains behavior.
Why owner chain was insufficient
`ReplicaSet -> Deployment -> Namespace` told me where the object sat in the hierarchy. It did not identify every actor that would overwrite it.
A controller can reconcile an object even when it is not the immediate owner in the way you expect. If you only follow ownerRefs, you can still miss the actual source of desired state.
In practice, that means a perfectly executed `kubectl edit deployment` can still be the wrong move.
Fast debugging checklist
- Follow ownerRefs to understand lineage.
- Inspect `managedFields` and controller annotations to identify writers.
- Find controller inputs (`ConfigMap`, CR, Helm values), not just child specs.
- Validate service-account permissions with `kubectl auth can-i`.
Example:
```shell
kubectl --context mlinfra-prod auth can-i \
  list workflows.argoproj.io --all-namespaces \
  --as=system:serviceaccount:kubeflow:argo
```
That command answers in one line what I used to test with ad hoc pods and trial-and-error.
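The writer-identification step is just as mechanical: ownerRefs give lineage, `managedFields` names the actual writers. A sketch, with an illustrative deployment name and namespace:

```shell
# List every field manager that has written to this Deployment,
# along with the operation it used (Apply vs Update).
kubectl -n admin get deployment ml-pipeline-ui \
  --show-managed-fields \
  -o jsonpath='{range .metadata.managedFields[*]}{.manager}{"\t"}{.operation}{"\n"}{end}'
```

A controller showing up here that you did not expect is the overwrite suspect.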
The config trap that looked like metadata
In this stack, the `pipeline-install-config` ConfigMap's `appVersion` was not informational. It fed `KFP_VERSION` into the profile-controller path. If explicit image env overrides were missing, reconciliation could drift child image tags back toward old defaults.
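Reading that intent directly is one command. This assumes the stock Kubeflow Pipelines layout (ConfigMap in the `kubeflow` namespace); adjust for your install:

```shell
# The version the controller will push toward, not the version that is running.
kubectl -n kubeflow get configmap pipeline-install-config \
  -o jsonpath='{.data.appVersion}{"\n"}'
```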
So we had mixed state:
- leaf runtime image already at `2.5.0` in some places
- controller config still carrying `2.0.5` intent in others
That is how incidents feel random when they are actually deterministic.
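Deterministic means checkable. A minimal sketch of the comparison, with the two values hard-coded where a real script would read them from the cluster:

```shell
#!/bin/sh
# Minimal drift check: runtime image tag vs. controller config intent.
# In a real script these come from kubectl; hard-coded here for illustration.
runtime_tag="2.5.0"     # e.g. tag parsed from the Deployment's container image
config_version="2.0.5"  # e.g. appVersion from pipeline-install-config

if [ "$runtime_tag" = "$config_version" ]; then
  echo "aligned: $runtime_tag"
else
  echo "DRIFT: runtime=$runtime_tag config=$config_version"
fi
```

Run it on a schedule and the "random" incident becomes a red line on a dashboard.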
Durable pattern
- Edit controller inputs, not reconciled children.
- Align config intent (`appVersion`, explicit image overrides).
- Trigger reconcile from the parent model when needed.
- Add repeatable health checks so restarts don’t rediscover drift.
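The first three steps can be sketched in two commands. The ConfigMap patch follows the stock KFP layout; the annotation nudge is a generic trick and the key is purely illustrative, since Metacontroller-style setups re-render children on any parent change:

```shell
# 1-2. Fix the source of truth: align the version the controller consumes.
kubectl -n kubeflow patch configmap pipeline-install-config \
  --type merge -p '{"data":{"appVersion":"2.5.0"}}'

# 3. Touch the parent so the reconciler re-renders its children.
#    (Annotation key is illustrative, not a Metacontroller convention.)
kubectl annotate namespace admin reconcile.example.com/touch="$(date +%s)" --overwrite
```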
I published the two scripts we now run for that:
- `kubeflow-version-snapshot.sh`: https://gist.github.com/fizz/307ce198f24c78b55a721f80971e491e
- `kubeflow-rbac-smoke.sh`: https://gist.github.com/fizz/2e64204a5fd8767ced6a4ac247aa4b5f
If a controller manages it, child edits are local anesthesia. The surgery is at source-of-truth.