Kubeflow version is a matrix, so snapshot runtime plus reconciler inputs
I asked a simple platform question: what version of Kubeflow are we on?
The useful answer was not one number.
It was a matrix.
What we saw in one cluster
At the same time, all of these were true:
- core Kubeflow components on v1.8.0
- KFP frontend running 2.5.0
- multiple KFP services still on 2.0.5
- controller config values that could still influence reconcile behavior
If you only report the control-plane version, you miss runtime drift.
If you only report one upgraded component, you miss fallback defaults that can reapply older behavior on the next reconcile.
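To make the matrix concrete, here is a minimal sketch that groups components by the image tag they run. The component names and registry paths are illustrative placeholders; only the three tags come from the cluster described above.

```python
from collections import defaultdict


def image_tag(image: str) -> str:
    """Return the tag of an image reference, or '<untagged>' if it has none."""
    # Drop any digest, then look for ':' only in the last path segment so a
    # registry port (e.g. registry:5000/app) is not mistaken for a tag.
    name = image.split("@", 1)[0]
    last_segment = name.rsplit("/", 1)[-1]
    if ":" in last_segment:
        return last_segment.rsplit(":", 1)[1]
    return "<untagged>"


def version_matrix(images: dict[str, str]) -> dict[str, list[str]]:
    """Group component names by the tag they are running."""
    matrix = defaultdict(list)
    for component, image in sorted(images.items()):
        matrix[image_tag(image)].append(component)
    return dict(matrix)


# Hypothetical names and paths; only the tags mirror the post.
observed = {
    "centraldashboard": "example.registry/kubeflow/centraldashboard:v1.8.0",
    "ml-pipeline-ui": "example.registry/kfp/frontend:2.5.0",
    "ml-pipeline": "example.registry/kfp/api-server:2.0.5",
    "ml-pipeline-persistenceagent": "example.registry/kfp/persistenceagent:2.0.5",
}

matrix = version_matrix(observed)
for tag, components in matrix.items():
    print(f"{tag}: {', '.join(components)}")
```

More than one key in the result means the honest answer to "what version are we on?" is the whole table, not any single row.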
Why this is an incident risk, not a dashboard nit
In controller-heavy platforms, version is distributed policy.
The system doesn’t ask, “what number do humans prefer to say?”
It asks, “what values are wired into reconcile logic right now?”
That distinction matters under restart and rollout pressure.
A cluster can look healthy and still carry downgrade pressure in controller inputs. The risk only becomes visible when reconcile runs again and resolves conflicting truths.
This is how teams end up saying “we upgraded this” and “it reverted” in the same incident while both statements are locally correct.
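A toy model of that "upgraded, then reverted" sequence, under the loud assumption that a reconcile simply re-renders managed components from the controller's configured values. This is not how any specific Kubeflow controller is implemented; the `reconcile` function and the component names are hypothetical.

```python
def reconcile(controller_inputs: dict[str, str], running: dict[str, str]) -> dict[str, str]:
    """Toy reconcile: every managed component is reset to its configured version."""
    result = dict(running)
    result.update(controller_inputs)
    return result


# The frontend was upgraded directly; the controller inputs were never bumped.
running = {"frontend": "2.5.0", "api-server": "2.0.5"}
controller_inputs = {"frontend": "2.0.5", "api-server": "2.0.5"}

before = dict(running)
after = reconcile(controller_inputs, running)

print("before reconcile:", before)
print("after reconcile: ", after)
```

In this model, "we upgraded this" is true of `before` and "it reverted" is true of `after`. The downgrade was sitting in `controller_inputs` the whole time.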
What to snapshot every time
You need both:
- Runtime truth: what images are running now.
- Reconciler truth: what config/env/defaults will be rendered next.
In practice, our minimum snapshot includes:
- core component image tags
- KFP component image tags
- pipeline-install-config.appVersion
- controller override env config maps
- inferred release line (best effort, explicitly caveated)
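As a rough illustration of what such a snapshot can look like (this is a sketch, not the published script), the functions below extract both lanes from JSON shaped like the output of `kubectl get deployments -o json` and `kubectl get configmap pipeline-install-config -o json`. The sample payloads, including the `appVersion` value, are made up for the example.

```python
import json
from typing import Optional


def runtime_images(deployment_list: dict) -> dict[str, list[str]]:
    """Map each Deployment name to the container images in its pod template."""
    images = {}
    for item in deployment_list.get("items", []):
        name = item["metadata"]["name"]
        containers = item["spec"]["template"]["spec"].get("containers", [])
        images[name] = [c["image"] for c in containers]
    return images


def reconciler_inputs(configmap: dict, keys: list[str]) -> dict[str, Optional[str]]:
    """Pull selected keys from a ConfigMap; a missing key is recorded as None."""
    name = configmap["metadata"]["name"]
    data = configmap.get("data") or {}
    return {f"{name}.{key}": data.get(key) for key in keys}


# Illustrative payloads; in a real cluster these would be parsed kubectl output.
deployments = {
    "items": [
        {
            "metadata": {"name": "ml-pipeline-ui"},
            "spec": {"template": {"spec": {"containers": [
                {"name": "ui", "image": "example.registry/kfp/frontend:2.5.0"},
            ]}}},
        },
    ],
}
config = {
    "metadata": {"name": "pipeline-install-config"},
    "data": {"appVersion": "2.0.5"},
}

snapshot = {
    "runtime": runtime_images(deployments),
    "reconciler": reconciler_inputs(config, ["appVersion"]),
}
print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Keeping the two lanes as separate keys is deliberate: the point of the snapshot is to show where they disagree, not to collapse them into one number.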
I published the script we now run:
kubeflow-version-snapshot.sh: https://gist.github.com/fizz/307ce198f24c78b55a721f80971e491e
The minimal policy that helps
Before calling a platform “upgraded,” require all lanes to agree:
- critical runtime image tags
- reconcile-driving config values
- post-reconcile child specs in managed namespaces
If one lane disagrees, you’re partially converged, not done.
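The gate can be sketched as a comparison of expected and observed values per lane. The lane names, item names, and the `convergence_status` helper are hypothetical; anything observed as missing is treated as a disagreement.

```python
from typing import Optional

Lanes = dict[str, dict[str, str]]
Diffs = dict[str, dict[str, tuple[str, Optional[str]]]]


def convergence_status(expected: Lanes, observed: Lanes) -> tuple[str, Diffs]:
    """Return ('converged' | 'partially converged', disagreements per lane)."""
    disagreements: Diffs = {}
    for lane, wanted in expected.items():
        seen = observed.get(lane, {})
        # Record (expected, observed) for every item that does not match.
        diffs = {
            item: (want, seen.get(item))
            for item, want in wanted.items()
            if seen.get(item) != want
        }
        if diffs:
            disagreements[lane] = diffs
    status = "converged" if not disagreements else "partially converged"
    return status, disagreements


# Hypothetical lanes: runtime tags, reconcile-driving config, child specs.
expected = {
    "runtime": {"frontend": "2.5.0", "api-server": "2.5.0"},
    "config": {"pipeline-install-config.appVersion": "2.5.0"},
    "child_specs": {"tenant-a/ml-pipeline-ui-artifact": "2.5.0"},
}
observed = {
    "runtime": {"frontend": "2.5.0", "api-server": "2.0.5"},
    "config": {"pipeline-install-config.appVersion": "2.5.0"},
    "child_specs": {},
}

status, disagreements = convergence_status(expected, observed)
print(status)
for lane, diffs in disagreements.items():
    for item, (want, got) in diffs.items():
        print(f"  {lane}: {item} expected {want}, observed {got}")
```

Returning the disagreements alongside the status keeps the follow-up work attached to the verdict instead of leaving it to memory.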
That wording alone improves operational decisions, because “partially converged” triggers follow-up work instead of release-notes optimism.
What this catches in practice
- UI path looks upgraded, but reconciler defaults still point to old tags.
- Config values are bumped, but managed child specs have not converged yet.
- Runtime converges, but controller RBAC is still wrong for the next restart.
Longform context: https://ferkakta.dev/kubeflow-is-a-version-matrix-not-a-version/