Kubeflow version is a matrix, so snapshot runtime plus reconciler inputs

2026-02-27

I asked a simple platform question: what version of Kubeflow are we on?

The useful answer was not one number.

It was a matrix.

What we saw in one cluster

At the same time, all of these were true:

core Kubeflow components on v1.8.0
KFP frontend running 2.5.0
multiple KFP services still on 2.0.5
controller config values that could still influence reconcile behavior

If you only report the control-plane version, you miss runtime drift.

If you only report one upgraded component, you miss fallback defaults that can reapply older behavior on the next reconcile.
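To make the matrix idea concrete, here is a minimal sketch that groups observed versions by component family and flags any family reporting more than one version. The rows mirror the list above, but the family and component names are hypothetical labels, not the names our cluster uses.

```python
from collections import defaultdict

# Hypothetical observations: (family, component, version).
# Versions match the list above; the names are illustrative only.
OBSERVED = [
    ("kubeflow-core", "core", "v1.8.0"),
    ("kfp", "frontend", "2.5.0"),
    ("kfp", "api-server", "2.0.5"),
    ("kfp", "persistence-agent", "2.0.5"),
]


def versions_by_family(observations):
    """Map each component family to the sorted distinct versions seen in it."""
    seen = defaultdict(set)
    for family, _component, version in observations:
        seen[family].add(version)
    return {family: sorted(versions) for family, versions in seen.items()}


def drifted_families(observations):
    """Return families that report more than one distinct version."""
    matrix = versions_by_family(observations)
    return sorted(family for family, versions in matrix.items() if len(versions) > 1)


if __name__ == "__main__":
    print(versions_by_family(OBSERVED))
    print("drifted:", drifted_families(OBSERVED))
```

A single-number answer would collapse the kfp row into whichever component someone happened to look at first.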

Why this is an incident risk, not a dashboard nit

In controller-heavy platforms, version is distributed policy.

The system doesn’t ask, “what number do humans prefer to say?”

It asks, “what values are wired into reconcile logic right now?”

That distinction matters under restart and rollout pressure.

A cluster can look healthy and still carry downgrade pressure in controller inputs. The risk only becomes visible when reconcile runs again and resolves conflicting truths.

This is how teams end up saying “we upgraded this” and “it reverted” in the same incident, while both statements are locally correct.
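A toy model of that revert, assuming a controller that renders its spec from built-in defaults and lets explicit overrides win. Real controllers are more involved; the function names, component names, and tags here are hypothetical and only illustrate the shape of the problem.

```python
def reconcile(controller_defaults, overrides):
    """Render the spec a simplified controller would apply: overrides beat defaults."""
    rendered = dict(controller_defaults)
    rendered.update(overrides)
    return rendered


def reverted_on_reconcile(running, controller_defaults, overrides):
    """List components whose running tag would change on the next reconcile."""
    rendered = reconcile(controller_defaults, overrides)
    return sorted(name for name, tag in running.items() if rendered.get(name) != tag)


if __name__ == "__main__":
    # Someone patched the frontend live, but the controller inputs never changed.
    running = {"frontend": "2.5.0", "api-server": "2.0.5"}
    defaults = {"frontend": "2.0.5", "api-server": "2.0.5"}

    print(reverted_on_reconcile(running, defaults, overrides={}))
    print(reverted_on_reconcile(running, defaults, overrides={"frontend": "2.5.0"}))
```

In the first call the frontend is both "upgraded" (runtime) and "about to revert" (reconciler inputs). In the second, the override makes the two views agree.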

What to snapshot every time

You need both:

Runtime truth: what images are running now.
Reconciler truth: what config/env/defaults will be rendered next.

In practice, our minimum snapshot includes:

core component image tags
KFP component image tags
appVersion from the pipeline-install-config ConfigMap
ConfigMaps that supply override env values to controllers
inferred release line (best effort, explicitly caveated)
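The two halves of the snapshot can be sketched as follows. This is not the published script. It assumes you have already saved the output of `kubectl get pods -n kubeflow -o json` and `kubectl get configmap pipeline-install-config -n kubeflow -o json` and parsed both into dicts; the sample pod names and images in the usage block are illustrative.

```python
def image_tag(image):
    """Return the tag of an image reference, or None if it has no tag."""
    name = image.split("@", 1)[0]      # drop any @sha256:... digest
    last = name.rsplit("/", 1)[-1]     # ignore a registry host:port prefix
    if ":" in last:
        return last.rsplit(":", 1)[1]
    return None


def runtime_tags(pod_list):
    """Map 'pod/container' to the image tag from a parsed Pod list."""
    tags = {}
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for container in pod["spec"]["containers"]:
            tags[f"{pod_name}/{container['name']}"] = image_tag(container["image"])
    return tags


def snapshot(pod_list, install_config):
    """Combine runtime truth and one reconciler input into a single record."""
    return {
        "runtime": runtime_tags(pod_list),
        "reconciler": {"appVersion": install_config.get("data", {}).get("appVersion")},
    }


if __name__ == "__main__":
    # Illustrative sample data, shaped like kubectl's JSON output.
    pods = {"items": [
        {"metadata": {"name": "ui-0"},
         "spec": {"containers": [{"name": "ui", "image": "gcr.io/ml-pipeline/frontend:2.5.0"}]}},
        {"metadata": {"name": "api-0"},
         "spec": {"containers": [{"name": "api", "image": "gcr.io/ml-pipeline/api-server:2.0.5"}]}},
    ]}
    config = {"data": {"appVersion": "2.0.5"}}
    print(snapshot(pods, config))
```

Reading pods rather than Deployments is deliberate in this sketch: a Deployment template describes intent, while the pod list is closer to what is running now.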

I published the script we now run:

kubeflow-version-snapshot.sh: https://gist.github.com/fizz/307ce198f24c78b55a721f80971e491e

The minimal policy that helps

Before calling a platform “upgraded,” require all lanes to agree:

critical runtime image tags
reconcile-driving config values
post-reconcile child specs in managed namespaces

If one lane disagrees, you’re partially converged, not done.
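As a rough sketch, the rule can be written as a check over the three lanes. The lane keys and the shape of the expected and observed dicts are assumptions made for this example; how you collect each lane is up to your tooling.

```python
# Hypothetical lane keys for the three lanes listed above.
LANES = ("runtime_images", "reconcile_config", "child_specs")


def convergence_verdict(expected, observed):
    """Return (status, disagreeing_lanes) for an upgrade check.

    A lane disagrees if it was not observed at all or if its observed
    values differ from the expected values for that lane.
    """
    disagreeing = [
        lane for lane in LANES
        if lane not in observed or observed[lane] != expected.get(lane)
    ]
    status = "converged" if not disagreeing else "partially converged"
    return status, disagreeing


if __name__ == "__main__":
    expected = {
        "runtime_images": {"frontend": "2.5.0"},
        "reconcile_config": {"appVersion": "2.5.0"},
        "child_specs": {"frontend": "2.5.0"},
    }
    observed = {
        "runtime_images": {"frontend": "2.5.0"},
        "reconcile_config": {"appVersion": "2.0.5"},  # config was never bumped
        "child_specs": {"frontend": "2.5.0"},
    }
    print(convergence_verdict(expected, observed))
```

Treating a missing lane as a disagreement is a deliberate choice here: an unobserved lane is not evidence of convergence.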

That wording alone improves operational decisions because “partially converged” triggers follow-up work instead of release-notes optimism.

What this catches in practice

UI path looks upgraded, but reconciler defaults still point to old tags.
Config values are bumped, but managed child specs have not converged yet.
Runtime converges, but controller RBAC is still wrong for the next restart.

Longform context: https://ferkakta.dev/kubeflow-is-a-version-matrix-not-a-version/

#kubeflow #kubernetes #reliability #platform-engineering